
DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

A Machine Learning Approach to Predictively Determine Filter Clogging in a Ballast Water Treatment System

KRISTOFFER SLIWINSKI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT

Abstract

Since the introduction of the Ballast Water Management Convention, ballast water treatment systems are required on ships to process the ballast water in order to avoid spreading bacteria or other microbes which can destroy foreign ecosystems. One way of pre-processing the water for treatment is by straining the water through a filtration unit. When the filter mesh retains particles it begins to clog, and it can potentially clog rapidly if the concentration of particles in the water is high. Clogging jeopardises the system. The thesis aims to investigate whether machine learning through neural networks can be implemented with the system to predictively determine filter clogging, by investigating two popular network structures for time series analysis.

The problem initially came down to determining different grades of clogging for the filter element based on sampled sensor data from the ballast water treatment system. The data were then put through regression analysis with two neural networks for parameter prediction, one LSTM and one CNN. The LSTM predicted values of the variables and clogging labels for the next 5 seconds, and the CNN predicted values of the variables and clogging labels for the next 30 seconds. The predicted data were then verified through classification analysis by an LSTM network and a CNN.

The LSTM regression network achieved an r²-score of 0.981 and the LSTM classification network achieved a classification accuracy of 99.5%. The CNN regression network achieved an r²-score of 0.876 and the CNN classification network achieved a classification accuracy of 93.3%. The results show that ML can be used for identifying different grades of clogging, but that further research is required to determine if all clogging states can be classified.

Sammanfattning

Since the Ballast Water Management Convention was introduced, ships have had to use ballast water treatment systems to treat the ballast water in an effort to curb the spread of bacteria and other microbes that can be harmful to foreign ecosystems. One way of pre-treating the water before treatment is to let it pass through a filter. As the filter collects particles it begins to clog, and it can potentially clog up quickly when the concentration of particles in the water is high. Clogging can jeopardise the safety of the system. This degree project aims to investigate whether machine learning through neural networks can be implemented in the system to predictively determine the clogging grade of the filter, by examining the suitability of two popular network structures for time series analysis.

The problem initially consisted of assessing different clogging grades for the filter element based on sampled sensor data from the ballast water treatment system. The data were then run through regression analysis with two neural networks, one LSTM and one CNN, to predictively determine the parameters. The LSTM network estimated variable values and clogging grade for the coming 5 seconds, while the CNN estimated variable values and clogging grade for the coming 30 seconds. The estimated data were then verified through classification by an LSTM network and two CNNs.

The LSTM network for regression achieved an r²-score of 0.981 and the LSTM network for classification achieved a classification accuracy of 99.5%. The CNN for regression achieved an r²-score of 0.876 and the CNN for classification achieved a classification accuracy of 93.3%. The results show that ML can be used to identify different clogging grades, but further research is required to determine whether all clogging states can be classified.

Nomenclature

ARIMA Autoregressive Integrated Moving Average

AUC Area Under Curve

BWTS Ballast Water Treatment System

CNN Convolutional Neural Network

FOR Frame of Reference

LSTM Long Short Term Memory

ML Machine Learning

MAE Mean Absolute Error

MSE Mean Squared Error

NN Neural Network

ReLU Rectified Linear Unit

RMSE Root Mean Squared Error

TSS Total Suspended Solids

Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Purpose, Definitions & Research Questions
  1.4 Scope and Delimitations
  1.5 Method Description

2 Frame of Reference
  2.1 Filtration & Clogging Indicators
    2.1.1 Basket Filter
    2.1.2 Self-Cleaning Basket Filters
    2.1.3 Manometer
    2.1.4 The Clogging Phenomena
    2.1.5 Physics-based Modelling
  2.2 Predictive Analytics
    2.2.1 Classification Error Metrics
    2.2.2 Regression Error Metrics
    2.2.3 Stochastic Time Series Models
  2.3 Neural Networks
    2.3.1 Overview
    2.3.2 The Perceptron
    2.3.3 Activation Functions
    2.3.4 Neural Network Architectures

3 Experimental Development
  3.1 Data Gathering and Processing
  3.2 Model Generation
    3.2.1 Regression Processing with the LSTM Model
    3.2.2 Regression Processing with the CNN Model
    3.2.3 Label Classification
  3.3 Model Evaluation
  3.4 Hardware Specifications

4 Results
  4.1 LSTM Performance
  4.2 CNN Performance

5 Discussion & Conclusion
  5.1 The LSTM Network
    5.1.1 Regression Analysis
    5.1.2 Classification Analysis
  5.2 The CNN
    5.2.1 Regression Analysis
    5.2.2 Classification Analysis
  5.3 Comparison Between Both Networks
  5.4 Conclusion

6 Future Work

Bibliography

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ship's water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for the main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfil the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating the actual system health when using a static model, as there are discrepancies introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered, and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed to ensure that a clear methodology for future processing is executed and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. The focus is on basket-type filters and on the filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal through which the liquid flows; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water, measured by two pressure transducers, gives the differential pressure ∆p over the filter.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the centre axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a self-cleaning basket filter.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

As briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \qquad (2.1)$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \qquad (2.2)$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)$$

Equation 2.3 is flawed in the sense that it does not take the inertial effects in the flow into account. These are considered by the later Ergun equation [11]:

$$\Delta p = \frac{150\,V_s \mu\,(1-\varepsilon)^2 L}{D_p^2\,\varepsilon^3} + \frac{1.75\,(1-\varepsilon)\,\rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                            Unit
∆p         Pressure drop                          Pa
L          Total height of the filter cake        m
V_s        Superficial (empty-tower) velocity     m/s
µ          Viscosity of the fluid                 kg/(m·s)
ε          Porosity of the filter cake            -
D_p        Diameter of the spherical particle     m
ρ          Density of the liquid                  kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                         Prediction
                         Positive               Negative
Actual   Positive        True Positive (TP)     False Negative (FN)
         Negative        False Positive (FP)    True Negative (TN)

The accuracy is defined as the percentage of instances where a sample is classified correctly and can be obtained, as done by König [18], through

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al [20], who present that ignoring the severity of individual problems in order to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while specificity denotes the true negative rate, so the false positive rate equals 1 − specificity. The two rates are given by Equations 2.6 and 2.7, respectively:

$$sensitivity = \frac{TP}{TP + FN} \qquad (2.6)$$

$$specificity = \frac{TN}{TN + FP} \qquad (2.7)$$

Plotting the sensitivity on the y-axis against the false positive rate (1 − specificity) on the x-axis then gives the ROC curve, where every correctly classified positive generates a step in the y-direction and every misclassified negative (a false positive) generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a better performing model.
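As a small illustration (not taken from the thesis), the quantities above can be computed directly from a set of true labels and predicted scores; the label and score vectors below are made up, and scikit-learn's roc_auc_score is assumed to be available for the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels (1 = clogging) and model scores, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])
y_pred = (y_score >= 0.5).astype(int)   # threshold the scores at 0.5

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)       # Equation 2.5
sensitivity = tp / (tp + fn)             # Equation 2.6 (true positive rate)
specificity = tn / (tn + fp)             # Equation 2.7 (true negative rate)
auc = roc_auc_score(y_true, y_score)     # area under the ROC curve

print(accuracy, sensitivity, specificity, auc)
```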

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of predicted positives that are correctly classified, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and the F1 score are obtained through

$$precision = \frac{TP}{TP + FP} \qquad (2.8)$$

$$recall = \frac{TP}{TP + FN} \qquad (2.9)$$

$$F1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)$$


Higher precision but lower recall means very accurate predictions, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
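A minimal sketch of Equations 2.8-2.10 from confusion-matrix counts; the counts passed in the example are hypothetical.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score from confusion-matrix counts (Equations 2.8-2.10)."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

print(f1_score_from_counts(tp=3, fp=1, fn=1))   # example counts, illustrative only
```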

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that the observation o belongs to the class c [23]. The Log Loss can be calculated through

$$LogLoss = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \qquad (2.11)$$
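A hedged sketch of Equation 2.11, here averaged over observations as is common in practice; the label and probability arrays are illustrative only.

```python
import numpy as np

def multiclass_log_loss(y_onehot: np.ndarray, p: np.ndarray, eps: float = 1e-15) -> float:
    """Equation 2.11, averaged over observations; y_onehot and p are (n, M) arrays."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(p), axis=1)))

# Illustrative three-class example (values are made up):
y = np.array([[1, 0, 0], [0, 1, 0]])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(multiclass_log_loss(y, p))
```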

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)$$

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)$$

The major difference between MSE and RMSE concerns the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged:

$$\frac{\partial\,RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}}\,\frac{\partial\,MSE}{\partial \hat{y}_i} \qquad (2.15)$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

$$MSPE = \frac{100\%}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \qquad (2.16)$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2.17)$$


Coefficient of Determination, r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \qquad (2.18)$$

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1-r^2)(n-1)}{n-k-1}\right] \qquad (2.19)$$

Adjusted r² can therefore more accurately show the percentage of variation in the dependent variable that is explained by the independent variables that actually affect it. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
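The regression metrics of Equations 2.12-2.19 can be collected in one small helper; this is an illustrative NumPy sketch (r² is computed here as the squared correlation between actual and predicted values), not code from the thesis.

```python
import numpy as np

def regression_metrics(y: np.ndarray, y_hat: np.ndarray, k: int) -> dict:
    """MAE, MSE, RMSE, MSPE, MAPE, r2 and adjusted r2 (Equations 2.12-2.19).
    y are actual values, y_hat predicted values, k the number of predictors."""
    n = len(y)
    err = y - y_hat
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mspe = 100.0 / n * np.sum((err / y) ** 2)
    mape = 100.0 / n * np.sum(np.abs(err / y))
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2          # squared correlation
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MSPE": mspe,
            "MAPE": mape, "r2": r2, "r2_adj": r2_adj}

# Illustrative values only:
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
print(regression_metrics(y, y_hat, k=1))
```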

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or it can be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, which is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as the ability to predict future data points for univariate time series. In a comparison published by Adhikari et al [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
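As a hedged illustration of how such a model might be fitted, the sketch below uses the SARIMAX class from statsmodels on a made-up univariate series; the (p, d, q)(P, D, Q, s) orders are placeholders, not values used in the thesis.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical univariate series (e.g. a pressure signal sampled at fixed intervals).
series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200)

model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fitted = model.fit(disp=False)          # maximum-likelihood estimation
forecast = fitted.forecast(steps=6)     # predict the next six observations
print(forecast)
```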

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)$$

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
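A minimal sketch of the perceptron rule in Equation 2.20; the weights, bias and input below are made-up numbers.

```python
import numpy as np

def perceptron_output(x: np.ndarray, w: np.ndarray, b: float) -> int:
    """Perceptron rule of Equation 2.20: weighted sum plus bias, thresholded at 0."""
    return int(np.dot(w, x) + b > 0)

# Illustrative weights, bias and input (not from the thesis):
print(perceptron_output(x=np.array([1.0, 0.5]), w=np.array([0.4, -0.2]), b=-0.1))
```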

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)$$

for

$$z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

$$f(x) = x^{+} = \max(0, x) \qquad (2.23)$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, its output and gradient are zero for negative inputs, which can cause neurons to stop updating entirely, also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on the widely used ImageNet dataset by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
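For reference, the three activation functions above can be written in a few lines of NumPy; this is an illustrative sketch, not code from the thesis.

```python
import numpy as np

def sigmoid(z):
    """Equation 2.21."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Equation 2.23: the positive part of the argument."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Equation 2.24 with beta as a constant (it may also be trained)."""
    return x * sigmoid(beta * x)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), swish(z))
```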

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimate is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions:

$$f(x) = f^{(n)}\big(\cdots f^{(2)}\big(f^{(1)}(x)\big)\big) \qquad (2.25)$$

where each function represents a layer, and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented above, and it is further illustrated by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state allows the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes vanishingly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights become saturated over time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problems, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the LSTM block's output at the previous time step h_{t−1}, the input at the current time step x_t, and the respective gate bias b_x, as

$$\begin{aligned} i_t &= \sigma\big(\omega_i \left[h_{t-1}, x_t\right] + b_i\big) \\ o_t &= \sigma\big(\omega_o \left[h_{t-1}, x_t\right] + b_o\big) \\ f_t &= \sigma\big(\omega_f \left[h_{t-1}, x_t\right] + b_f\big) \end{aligned} \qquad (2.26)$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
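A minimal sketch of the gate activations in Equation 2.26 for a single time step; the dimensions and numbers are made up, and the cell-state update itself is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Gate activations of Equation 2.26 for one time step.
    Each weight vector acts on the concatenation [h_(t-1), x_t]."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(np.dot(w_i, z) + b_i)   # input gate
    o_t = sigmoid(np.dot(w_o, z) + b_o)   # output gate
    f_t = sigmoid(np.dot(w_f, z) + b_f)   # forget gate
    return i_t, o_t, f_t

# Illustrative one-dimensional example (all numbers are made up):
print(lstm_gates(np.array([0.1]), np.array([0.3]),
                 np.array([0.5, 0.5]), np.array([0.2, 0.8]), np.array([0.4, 0.6]),
                 0.0, 0.0, 0.0))
```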

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer there is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. With average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
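As an illustration of the convolution, pooling and flattening steps in Figures 2.4-2.6, the sketch below applies a size-3 kernel and a pool size of 2 to a made-up one-dimensional signal; it is a toy example, not the thesis implementation.

```python
import numpy as np

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a kernel over a 1-D input (stride 1, no padding), as in Figure 2.4."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x: np.ndarray, pool: int) -> np.ndarray:
    """Keep the maximum of each non-overlapping window, as in Figure 2.5."""
    return np.array([x[i:i + pool].max() for i in range(0, len(x) - pool + 1, pool)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0])                 # made-up input signal
feature_map = conv1d(x, kernel=np.array([0.5, 1.0, 0.5]))    # kernel of size 3
pooled = max_pool1d(feature_map, pool=2)                     # pool size of 2
flattened = pooled.ravel()                                   # flattening step
print(feature_map, pooled, flattened)
```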


Chapter 3

Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
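The labelling script itself is not included here; the sketch below only illustrates the rule described above, with hypothetical slope and flow-drop thresholds that are not taken from the thesis.

```python
import numpy as np

def label_clogging(dp: np.ndarray, flow: np.ndarray,
                   dp_slope_lin: float = 0.001, dp_slope_exp: float = 0.01,
                   flow_drop: float = 0.2) -> int:
    """Assign a clogging label (1, 2 or 3) to a window of differential-pressure
    and flow samples; the thresholds are illustrative, not from the thesis."""
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]       # linear trend of dp
    flow_change = (flow[-1] - flow[0]) / max(flow[0], 1e-9)   # relative flow change
    if dp_slope > dp_slope_exp and flow_change < -flow_drop:
        return 3   # fully clogged: sharp dp increase, drastic flow decrease
    if dp_slope > dp_slope_lin:
        return 2   # beginning to clog: steady dp increase, roughly constant flow
    return 1       # no or little clogging

# Illustrative window of samples:
print(label_clogging(dp=np.array([0.50, 0.52, 0.55, 0.59]),
                     flow=np.array([10.0, 10.0, 9.9, 9.9])))
```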


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)$$

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
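A hedged sketch of the two transforms, assuming scikit-learn's MinMaxScaler for the scaling and a manual one hot encoding; the label and feature values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One hot encoding of the clogging labels (a minimal manual version):
labels = np.array([1, 2, 2, 1])                 # illustrative clogging labels
classes = np.unique(labels)
one_hot = (labels[:, None] == classes[None, :]).astype(float)

# Min-max scaling of the sensor features to the range [0, 1] (Equation 3.1):
features = np.array([[0.5, 10.0], [0.7, 9.5], [1.2, 8.0]])   # made-up sensor values
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)
original = scaler.inverse_transform(scaled)     # the transform is easy to invert
print(one_hot, scaled, original, sep="\n")
```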

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference caused by the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.2)$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.3)$$
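A minimal sketch of such a sequencing function; the array shapes and the random data are illustrative, but the windowing follows the description above (5 past time steps per prediction).

```python
import numpy as np

def make_sequences(data: np.ndarray, n_past: int):
    """Turn a (samples, features) array into (X, y) pairs where each X entry holds
    n_past consecutive time steps and y is the observation one step ahead."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])    # window of past measurements
        y.append(data[i])               # value to predict one step (5 s) ahead
    return np.array(X), np.array(y)

# Illustrative data: 100 samples with 4 sensor features, using 5 past steps (25 s).
data = np.random.rand(100, 4)
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)   # (95, 5, 4) and (95, 4): note that the dataset shrinks
```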


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before they are passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
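Assuming a Keras-style framework (the thesis does not name its software stack), the described two-layer LSTM with early stopping might be sketched as follows; the optimizer, the number of features and the commented-out training call are assumptions, not details from the thesis.

```python
from tensorflow.keras import layers, models, callbacks

n_features = 5   # hypothetical number of sensor features per time step

model = models.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(5, n_features)),    # 5 past time steps (25 s window)
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mse")      # MSE/MAE were used for regression

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
#           callbacks=[early_stop])              # X_train/y_train etc. are placeholders
```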

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
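Under the same Keras assumption, a sketch of the described CNN (64 filters of kernel size 4, pool size 2, a 50-node dense layer and 6 outputs); the optimizer, loss and feature count are again assumptions.

```python
from tensorflow.keras import layers, models, callbacks

n_features = 5   # hypothetical number of sensor features per time step

model = models.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(12, n_features)),   # 12 past observations (60 s)
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                               # 6 future observations (30 s)
])
model.compile(optimizer="adam", loss="mse")

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
#           callbacks=[early_stop])                # placeholders, as above
```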

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided in the network directly which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification.


Likewise, the training and validation data were split into parts of 80 % and 20 %, respectively. The testing set was split into the same fractions, but only the fraction of 20 % was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than a regression network, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes results in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.
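The difference in outlier sensitivity is easy to see in a small numerical example (the numbers below are made up for illustration and are unrelated to the thesis data):

    import numpy as np

    actual    = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
    predicted = np.array([1.1, 0.9, 1.0, 1.0, 5.0])   # one large outlier error

    mae = np.mean(np.abs(actual - predicted))   # 0.84  - the outlier plays a small role
    mse = np.mean((actual - predicted) ** 2)    # 3.204 - the outlier dominates the score
    print(mae, mse)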

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimal places. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimal places.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5 %     0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                     Prediction
                     Label 1   Label 2
Actual    Label 1    109       1
          Label 2    3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4 %     0.826   0.907   3.01
MSE                  1195          93.3 %     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                     Prediction
                     Label 1   Label 2
Actual    Label 1    82        29
          Label 2    38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                     Prediction
                     Label 1   Label 2
Actual    Label 1    69        41
          Label 2    11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unexpected, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate.


The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as loss function, or that the difference is due to the data being more normally distributed after the convolutional computations.


A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead.


Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.


[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



Nomenclature

ARIMA Autoregressive Integrated Moving Average

AUC Area Under Curve

BWTS Ballast Water Treatment System

CNN Convolutional Neural Network

FOR Frame of Reference

LSTM Long Short Term Memory

ML Machine Learning

MAE Mean Absolute Error

MSE Mean Squared Error

NN Neural Network

ReLU Rectified Linear Unit

RMSE Root Mean Squared Error

TSS Total Suspended Solids

Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Purpose, Definitions & Research Questions
  1.4 Scope and Delimitations
  1.5 Method Description

2 Frame of Reference
  2.1 Filtration & Clogging Indicators
    2.1.1 Basket Filter
    2.1.2 Self-Cleaning Basket Filters
    2.1.3 Manometer
    2.1.4 The Clogging Phenomena
    2.1.5 Physics-based Modelling
  2.2 Predictive Analytics
    2.2.1 Classification Error Metrics
    2.2.2 Regression Error Metrics
    2.2.3 Stochastic Time Series Models
  2.3 Neural Networks
    2.3.1 Overview
    2.3.2 The Perceptron
    2.3.3 Activation functions
    2.3.4 Neural Network Architectures

3 Experimental Development
  3.1 Data Gathering and Processing
  3.2 Model Generation
    3.2.1 Regression Processing with the LSTM Model
    3.2.2 Regression Processing with the CNN Model
    3.2.3 Label Classification
  3.3 Model evaluation
  3.4 Hardware Specifications

4 Results
  4.1 LSTM Performance
  4.2 CNN Performance

5 Discussion & Conclusion
  5.1 The LSTM Network
    5.1.1 Regression Analysis
    5.1.2 Classification Analysis
  5.2 The CNN
    5.2.1 Regression Analysis
    5.2.2 Classification Analysis
  5.3 Comparison Between Both Networks
  5.4 Conclusion

6 Future Work

Bibliography

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship is not fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to a ship's water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS) and a UV reactor for the main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfil the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further as:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On the one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and applied, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade, or rate of clogging, of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or that they have been investigated.


With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on the filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.1

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

1 Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.2

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.
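The control logic amounts to a simple comparison of the measured differential pressure against p_set. A minimal sketch of that comparison is given below; the function name and the numerical values are made up for illustration.

    def backwash_needed(p_before, p_after, p_set):
        # True when the differential pressure over the filter exceeds
        # the filter-specific threshold p_set.
        delta_p = p_before - p_after
        return delta_p > p_set

    # Example readings from the two pressure transducers (illustrative values, bar)
    print(backwash_needed(p_before=2.4, p_after=1.9, p_set=0.4))   # True -> start backwash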

2 Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to obtain a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified, to begin generating a dataset containing the necessary information.
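As an illustration of how such a labelling rule could be encoded, the sketch below assigns a clogging state to a pumping sequence from the trends of ∆p and Q. The least-squares slope estimate and all threshold values are assumptions made for the example and are not taken from the thesis.

    import numpy as np

    def clogging_state(delta_p, flow, dp_lin=0.001, dp_exp=0.01, q_drop=0.2):
        # Label a pumping sequence from the trend in differential pressure
        # (delta_p) and the relative drop in flow rate (flow).
        dp_slope = np.polyfit(np.arange(len(delta_p)), delta_p, 1)[0]
        q_loss = (flow[0] - flow[-1]) / flow[0]
        if dp_slope > dp_exp and q_loss > q_drop:
            return "fully clogged"
        if dp_slope > dp_lin:
            return "moderate clogging"
        return "no/little clogging"

    dp = np.linspace(0.30, 0.40, 60)   # slowly rising differential pressure
    q = np.full(60, 250.0)             # steady system flow rate
    print(clogging_state(dp, q))       # -> "moderate clogging"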

Figure 2.3: Visualization of the clogging states.3

3 Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
V_s        Superficial (empty-tower) velocity   m/s
μ          Viscosity of the fluid               kg/(m·s)
ε          Porosity of the filter cake          –
D_p        Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³
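For reference, Equation 2.4 is straightforward to evaluate numerically. The sketch below does so for an illustrative set of values; the numbers are made up and not related to the thesis measurements.

    def ergun_pressure_drop(V_s, mu, rho, D_p, eps, L):
        # Pressure drop over a filter cake according to the Ergun equation (2.4):
        # a viscous term plus an inertial term.
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    # Illustrative values: water flowing through a thin cake of fine particles
    dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=998.0,
                             D_p=1.0e-4, eps=0.4, L=0.005)
    print(f"Pressure drop: {dp:.0f} Pa")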


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

            Prediction
            Positive               Negative
Actual
Positive    True Positive (TP)     False Negative (FN)
Negative    False Positive (FP)    True Negative (TN)

The accuracy is defined as the percentage of instances where a sample is classified correctly, and can be obtained as done by Konig [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad j_i = \begin{cases} 1 & \text{if } y_i = \hat{y}_i \\ 0 & \text{if } y_i \neq \hat{y}_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate is referred to as specificity. The two rates are given by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement of how many samples the classifier classifies correctly and how robust it is to not misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classification [22]. Precision, recall and the F1 score are obtained through:

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited to the range 0 to 1.
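All of these quantities follow directly from the confusion-matrix counts. A minimal sketch in plain Python, with made-up counts, is shown below.

    def classification_metrics(tp, fp, fn, tn):
        # Accuracy, precision, recall and F1 score from confusion-matrix counts
        # (Equations 2.5 and 2.8-2.10).
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1

    print(classification_metrics(tp=90, fp=10, fn=5, tn=95))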

Logarithmic Loss (Log Loss)

For multi-class classification Log Loss is especially useful as it penalises false clas-sification A lower value of Log Loss means an increase of classification accuracyfor the multi-class dataset The Log Loss is determined through a binary indicatory of whether the class label c is the correct classification for an observation o andthe probability p which is the modelrsquos predicted probability that an observation obelongs to the class c [23] The log loss can be calculated through

$$\mathrm{LogLoss} = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \tag{2.11}$$
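As an illustration (not code from the thesis), the Log Loss of Equation 2.11, averaged over n observations, could be computed with NumPy roughly as follows; the clipping constant is an assumption added to avoid taking the logarithm of zero:

```python
import numpy as np

def log_loss(y_true_onehot, y_pred_proba, eps=1e-15):
    """Average Log Loss (Equation 2.11) over all observations.

    y_true_onehot: (n, M) binary class indicators.
    y_pred_proba:  (n, M) predicted class probabilities.
    """
    p = np.clip(y_pred_proba, eps, 1 - eps)   # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))
```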

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2.12}$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.13}$$

Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE. The introduction of the square root scales the error to the same scale as the targets.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14}$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged:

$$\frac{\partial\,\mathrm{RMSE}}{\partial \hat{y}_i} = \frac{1}{2\sqrt{\mathrm{MSE}}} \cdot \frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i} \tag{2.15}$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a sufficient number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective squared target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$\mathrm{MSPE} = \frac{100}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \tag{2.16}$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2.17}$$


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

$$r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \tag{2.18}$$

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase simply because the new fit will have more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms, or predictors, in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

$$r^2_{\mathrm{adj}} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \tag{2.19}$$

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affects the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
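For reference, the regression metrics above can be computed directly from the actual and predicted values. The sketch below is illustrative rather than the thesis implementation, and it uses the common 1 − SS_res/SS_tot form of r2 rather than the squared-correlation form of Equation 2.18:

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """MAE, MSE, RMSE, MAPE, r2 and adjusted r2 for k predictors."""
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))                          # Equation 2.12
    mse = np.mean((y_true - y_pred) ** 2)                           # Equation 2.13
    rmse = np.sqrt(mse)                                             # Equation 2.14
    mape = 100.0 / n * np.sum(np.abs((y_true - y_pred) / y_true))   # Equation 2.17
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                                      # 1 - SS_res/SS_tot form
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)               # Equation 2.19
    return mae, mse, rmse, mape, r2, r2_adj
```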

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error and a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
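A minimal forecasting sketch with a SARIMA model is shown below; the series is a placeholder signal and the model orders are illustrative assumptions, not values used in the thesis:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder univariate series; in practice this would be a sensor signal.
series = pd.Series(np.sin(np.linspace(0, 20, 200)))

# The (p, d, q) and seasonal orders below are illustrative, not tuned values.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
forecast = result.forecast(steps=6)   # forecast the next six observations
```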

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20}$$

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
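As a small illustration of Equation 2.20 (not code from the thesis), the perceptron rule can be written directly in Python:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Binary perceptron output according to Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: two inputs with weights 0.6 and 0.4 and bias -0.5.
perceptron_output(np.array([1, 1]), np.array([0.6, 0.4]), -0.5)   # -> 1
```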

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21}$$

for

$$z = \sum_{j} w_j \cdot x_j + b \tag{2.22}$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$$f(x) = x^{+} = \max(0, x) \tag{2.23}$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \mathrm{sigmoid}(\beta x) \tag{2.24}$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets and models such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
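The three activation functions above are straightforward to express numerically; the following sketch is for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation (Equation 2.21)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Rectified linear activation (Equation 2.23)."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Swish activation (Equation 2.24); beta is a constant or trainable parameter."""
    return x * sigmoid(beta * x)
```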

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions:

$$f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \tag{2.25}$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well-matched architectures and existing training procedures of the deep networks.

Mentioned in Section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in Section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons can feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (it), output (ot) and forget (ft). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (wx), the LSTM block's output at the previous time step (ht−1), input at the current time step (xt), and respective gate bias (bx), as

$$\begin{aligned} i_t &= \sigma(\omega_i [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o [h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f [h_{t-1}, x_t] + b_f) \end{aligned} \tag{2.26}$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
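For illustration, the three gate activations of Equation 2.26 can be computed as in the sketch below; the weight matrices and biases are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    """Input, output and forget gate activations of one LSTM cell (Equation 2.26)."""
    hx = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ hx + b_i)         # input gate
    o_t = sigmoid(W_o @ hx + b_o)         # output gate
    f_t = sigmoid(W_f @ hx + b_f)         # forget gate
    return i_t, o_t, f_t
```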

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using a GRU instead of an LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
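The difference between the two pooling modes can be seen in a small numerical sketch (illustrative only):

```python
import numpy as np

def pool_1d(signal, pool_size=2, mode='max'):
    """Reduce a 1-D signal by taking the max or mean of non-overlapping windows."""
    trimmed = signal[: len(signal) // pool_size * pool_size]
    windows = trimmed.reshape(-1, pool_size)
    return windows.max(axis=1) if mode == 'max' else windows.mean(axis=1)

x = np.array([0.1, 0.9, 0.2, 0.8, 0.0, 0.7])
pool_1d(x, mode='max')   # -> [0.9, 0.8, 0.7], small noisy activations are suppressed
pool_1d(x, mode='avg')   # -> [0.5, 0.5, 0.35], the noise is only averaged out
```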


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing


the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
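A minimal sketch of the two transforms, assuming hypothetical integer clogging labels and a small placeholder array of sensor features, is shown below; it is not the exact thesis implementation:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical clogging labels and sensor features (placeholders).
labels = np.array([1, 2, 2, 1, 2])
X_raw = np.array([[0.12, 210.0], [0.45, 195.0], [0.80, 170.0],
                  [0.10, 215.0], [0.60, 180.0]])

# One hot encoding: label 1 -> [1, 0], label 2 -> [0, 1].
one_hot = np.eye(labels.max())[labels - 1]

# Min-max scaling of every feature into [0, 1] (Equation 3.1).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_raw)
X_restored = scaler.inverse_transform(X_scaled)   # the transform is easy to invert
```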

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The effect of the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2}$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3}$$
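A sketch of such a sequencing function is given below, assuming a 2-D array with one row per 5-second sample and one column per variable; the implementation details are illustrative rather than the thesis code:

```python
import numpy as np

def sequence_data(data, n_past=5):
    """Pair each sample with the n_past previous samples, as in Equations 3.2 and 3.3.

    data: array of shape (samples, features). Returns X of shape
    (samples - n_past, n_past, features) and y of shape (samples - n_past, features).
    """
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i, :])   # the 25 s window of past observations
        y.append(data[i, :])              # the observation 5 s ahead to predict
    return np.array(X), np.array(y)
```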


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
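A Keras sketch of the described architecture is shown below; the number of input features, the optimizer and the commented-out training call are assumptions, with n_features as a placeholder for the number of variables per time step:

```python
import tensorflow as tf

n_features = 5   # placeholder: number of observed variables per time step

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, activation='relu', return_sequences=True,
                         input_shape=(5, n_features)),   # 5 past time steps (25 s)
    tf.keras.layers.LSTM(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),      # one-step parameter prediction
])
model.compile(optimizer='adam', loss='mae')              # or loss='mse'

# Stop training when the validation loss has not improved for 150 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```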

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.
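A corresponding Keras sketch of the described CNN is given below; the number of input features, the optimizer and the hidden-layer activation are assumptions:

```python
import tensorflow as tf

n_features = 5   # placeholder: number of observed variables per time step

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=64, kernel_size=4, activation='relu',
                           input_shape=(12, n_features)),   # 12 past time steps (60 s)
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(6),   # 6 future values (30 s) of the predicted variable
])
model.compile(optimizer='adam', loss='mae')                  # or loss='mse'
```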

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks as classifiers than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks for predicting future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are more heavily penalised, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       109        1
          Label 2       3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       82         29
          Label 2       38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       69         41
          Label 2       11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which isn't unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE,


while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the positive number and negative number of examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus side of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/w/index.php?title=F1_score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


Sammanfattning

Problemet handlade initialt om att bedöma olika kloggningsgrader för filterelementet baserat på samplade sensordata från barlastvattensystemet. Datan kördes sedan för regressionsanalys genom två neurala nätverk, ett av typen LSTM och ett av typen CNN, för att prediktivt bestämma parametrarna. LSTM-nätverket uppskattade variabelvärden och kloggningsgrad för de kommande 5 sekunderna, medan CNN:et uppskattade variabelvärden och kloggningsgrad för de kommande 30 sekunderna. Den uppskattade datan verifierades sedan genom klassificering av ett LSTM-nätverk och två CNN.

LSTM-nätverket för regression uppnådde ett r2-resultat på 0,981 och LSTM-nätverket för klassificering uppnådde en klassificeringsgrad på 99,5 %. CNN:et för regression uppnådde ett r2-resultat på 0,876 och CNN:et för klassificering uppnådde en klassificeringsgrad på 93,3 %. Resultatet visar att ML kan användas för att identifiera olika kloggningsgrader, men ytterligare forskning krävs för att bestämma om alla kloggningsstadier kan klassificeras.

Nomenclature

ARIMA Autoregressive Integrated Moving Average

AUC Area Under Curve

BWTS Ballast Water Treatment System

CNN Convolutional Neural Network

FOR Frame of Reference

LSTM Long Short Term Memory

ML Machine Learning

MAE Mean Absolute Error

MSE Mean Squared Error

NN Neural Network

ReLU Rectified Linear Unit

RMSE Root Mean Squared Error

TSS Total Suspended Solids

Contents

1 Introduction
1.1 Background
1.2 Problem Description
1.3 Purpose, Definitions & Research Questions
1.4 Scope and Delimitations
1.5 Method Description

2 Frame of Reference
2.1 Filtration & Clogging Indicators
2.1.1 Basket Filter
2.1.2 Self-Cleaning Basket Filters
2.1.3 Manometer
2.1.4 The Clogging Phenomena
2.1.5 Physics-based Modelling
2.2 Predictive Analytics
2.2.1 Classification Error Metrics
2.2.2 Regression Error Metrics
2.2.3 Stochastic Time Series Models
2.3 Neural Networks
2.3.1 Overview
2.3.2 The Perceptron
2.3.3 Activation functions
2.3.4 Neural Network Architectures

3 Experimental Development
3.1 Data Gathering and Processing
3.2 Model Generation
3.2.1 Regression Processing with the LSTM Model
3.2.2 Regression Processing with the CNN Model
3.2.3 Label Classification
3.3 Model evaluation
3.4 Hardware Specifications

4 Results
4.1 LSTM Performance
4.2 CNN Performance

5 Discussion & Conclusion
5.1 The LSTM Network
5.1.1 Regression Analysis
5.1.2 Classification Analysis
5.2 The CNN
5.2.1 Regression Analysis
5.2.2 Classification Analysis
5.3 Comparison Between Both Networks
5.4 Conclusion

6 Future Work

Bibliography

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as preventing the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS) and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop for the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as there are discrepancies introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the original proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of interest of the BWTS is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision of developing an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding about the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation of constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed to ensure that a clear methodology for future processing is executed and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and where the filtration is done with regards to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water measures the differential pressure ∆p over the filter through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need of having to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. For the latter case, backwashing of the filter may only be done when there are enough particles in the water so that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressure obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \qquad (2.1)$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \qquad (2.2)$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$$\Delta p = \frac{150\,V_s \mu (1-\varepsilon)^2 L}{D_p^2\,\varepsilon^3} + \frac{1.75\,(1-\varepsilon)\rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation for the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
Vs         Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/m·s
ε          Porosity of the filter cake          m²
Dp         Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                      Prediction Positive     Prediction Negative
Actual Positive       True Positive (TP)      False Negative (FN)
Actual Negative       False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of samples n. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while specificity corresponds to the true negative rate. Both rates are represented by Equations 2.6 and 2.7 respectively.

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)$$

$$\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)$$

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC curve area is limited by the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is to not misclassify a number of samples [21]. For the F1 score, precision is referred to as the percentage of correctly classified samples and recall is referred to as the percentage of actual correct classification [22]. Precision, recall and F1 score are obtained through

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)$$

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)$$


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited by the range 0 to 1.
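The classification metrics above follow directly from the four confusion matrix counts. The Python sketch below is an illustration rather than code used in this thesis; the function name and the example label arrays are placeholders.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision and F1 score from binary labels,
    following Equations 2.6-2.10."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)                                   # Eq. 2.6, equals recall
    specificity = tn / (tn + fp)                                   # Eq. 2.7
    precision = tp / (tp + fp)                                     # Eq. 2.8
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. 2.10
    return sensitivity, specificity, precision, f1

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```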

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

$$LogLoss = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \qquad (2.11)$$

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over predicted or under predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)$$

Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE. The introduction of the square-root scales the error to be on the same scale as the targets.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)$$

The major difference between MSE and RMSE is the flow over the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers with enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$MSPE = \frac{100\%}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \qquad (2.16)$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures for forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the prediction values. Like r², MAPE is scale free and is obtained through

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2.17)$$


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \qquad (2.18)$$

r² has some drawbacks. It does not take into account if the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² will adjust for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \qquad (2.19)$$

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
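As an illustration of how the regression error metrics in this section relate to each other, the short Python sketch below computes MAE, MSE, RMSE, MAPE, r² and adjusted r² for a pair of arrays. It is not code from the thesis; r² is here computed as the squared correlation of Equation 2.18, and the helper name and sample values are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors=1):
    """MAE, MSE, RMSE, MAPE, r2 and adjusted r2 as in Equations 2.12-2.19."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))                        # Eq. 2.12
    mse = np.mean((y_true - y_pred) ** 2)                         # Eq. 2.13
    rmse = np.sqrt(mse)                                           # Eq. 2.14
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))    # Eq. 2.17
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2                   # Eq. 2.18
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)      # Eq. 2.19
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "MAPE": mape, "r2": r2, "r2_adj": r2_adj}

print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```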

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary on variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log transformed data makes the entire series stationary on both mean and variance and allows for the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of necessary order to remove non-stationarity from the time series. ARIMA's and SARIMA's strength is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
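A minimal sketch of how an ARIMA model can be fitted and used for forecasting is shown below, assuming the statsmodels library is available. The synthetic series and the (p, d, q) order are placeholders chosen only for illustration; they are not settings used anywhere in this thesis.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic, strictly positive, trending series standing in for a sensor signal.
series = pd.Series(50.0 + np.cumsum(np.abs(np.random.randn(200))))

# Log-transform stabilises the variance; d=1 differencing removes the trend.
model = ARIMA(np.log(series), order=(2, 1, 1))   # (p, d, q) picked only for illustration
fit = model.fit()

# Forecast the next 5 steps and invert the log-transform.
forecast = np.exp(fit.forecast(steps=5))
print(forecast)
```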

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configurations of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)$$

In the above equation, x is the input vector, w the weight vector and b is the perceptron's individual bias.
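A perceptron following Equation 2.20 can be written in a few lines of Python. The weights and bias below are arbitrary example values, not parameters from any trained network.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Perceptron rule from Equation 2.20: output 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative values: two inputs with hand-picked weights and bias.
x = np.array([1, 0])
w = np.array([0.6, 0.4])
b = -0.5
print(perceptron_output(x, w, b))  # -> 1, since 0.6 - 0.5 > 0
```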

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)$$

for

$$z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

$$f(x) = x^+ = \max(0, x) \qquad (2.23)$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)$$

where β is a trainable parameter or simply a constant. Swish has proved to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
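For reference, the activation functions discussed in this section can be sketched with NumPy as follows; this is an illustrative implementation, not code used in the thesis.

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)            # perceptron step, Eq. 2.20

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # Eq. 2.21

def relu(z):
    return np.maximum(0.0, z)               # Eq. 2.23

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)            # Eq. 2.24

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, relu, swish):
    print(f.__name__, f(z))
```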

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. Explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)$$

where each function represents a layer and they all together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily have to mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be because of well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true for the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss in information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_(t−1)), input at the current time step (x_t) and respective gate bias (b_x), as

$$\begin{aligned} i_t &= \sigma(\omega_i [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o [h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f [h_{t-1}, x_t] + b_f) \end{aligned} \qquad (2.26)$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need of forgetting information could be presented as odd at first, but for sequencing it could be of value when learning something like a book. When a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
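The gate computations of Equation 2.26 can be illustrated with a small NumPy sketch. The dimensions, random weights and helper name below are arbitrary and only serve to show how the concatenated [h_(t−1), x_t] vector passes through each gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Input, output and forget gate activations from Equation 2.26.
    Each weight matrix acts on the concatenation [h_(t-1), x_t]."""
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(w_i @ hx + b_i)
    o_t = sigmoid(w_o @ hx + b_o)
    f_t = sigmoid(w_f @ hx + b_f)
    return i_t, o_t, f_t

# Toy dimensions: hidden size 3, input size 2.
rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=3), rng.normal(size=2)
w = lambda: rng.normal(size=(3, 5))   # weights for the 5-element concatenated vector
b = np.zeros(3)
print(lstm_gates(h_prev, x_t, w(), w(), w(), b, b, b))
```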

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is being kept from the past state and how much information is being let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation in the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers to allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling-layer is applied, the maximum value that is contained within the kernel will be the returned value, whereas for an average pooling-layer the average value of all values within the kernel would be returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores the noisy activations by only extracting the maximum value, as well as removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, however it reduces somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
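The convolution, pooling and flattening steps illustrated in Figures 2.4-2.6 can be mimicked with plain NumPy as below. The kernel, input signal and pool size are example values only; in a real CNN the kernel weights would be learned.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a 1-D kernel over the input (no padding, stride 1), as in Figure 2.4."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window, as in Figure 2.5."""
    trimmed = x[:len(x) // pool_size * pool_size]
    return trimmed.reshape(-1, pool_size).max(axis=1)

signal = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0])
feature_map = conv1d(signal, kernel=np.array([0.5, 1.0, 0.5]))  # kernel size 3
pooled = max_pool1d(feature_map, pool_size=2)
flattened = pooled.ravel()                                       # flattening layer, Figure 2.6
print(feature_map, pooled, flattened)
```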


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
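A sketch of such a labelling script is shown below. The slope and flow-drop thresholds are hypothetical placeholders used only to illustrate the three-state logic of section 2.1.4; the actual labelling in this work followed the criteria above and was verified visually.

```python
import numpy as np

def label_clogging(dp, flow, dp_slope_lin=0.002, dp_slope_exp=0.02, flow_drop=0.25):
    """Assign clogging labels 1-3 per sample from differential pressure (dp) and
    system flow. Threshold values are illustrative, not the thesis settings."""
    labels = np.ones(len(dp), dtype=int)
    slope = np.gradient(dp)                      # local trend of the differential pressure
    flow_loss = (flow[0] - flow) / flow[0]       # relative decrease in system flow
    labels[slope > dp_slope_lin] = 2             # steady dp increase, flow roughly constant
    labels[(slope > dp_slope_exp) & (flow_loss > flow_drop)] = 3  # rapid dp rise, flow collapse
    return labels

dp = np.array([0.10, 0.10, 0.11, 0.13, 0.16, 0.30])
flow = np.array([10.0, 10.0, 9.9, 9.8, 9.7, 6.0])
print(label_clogging(dp, flow))                  # -> [1 2 2 2 2 3]
```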


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to equally predict all the actual classification labels rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
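A minimal one hot encoder matching the transformation shown above can be written as follows; it is an illustration rather than the implementation used in the thesis, which could equally rely on existing library encoders.

```python
import numpy as np

def one_hot(labels):
    """Binary one hot representation of integer or string categories
    (columns ordered by sorted category value)."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, value in enumerate(labels):
        encoded[row, index[value]] = 1
    return encoded

print(one_hot([1, 2, 3, 2]))
print(one_hot(["red", "blue", "green"]))
```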

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)$$

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to inverse, which makes it possible to revert back to the original values after processing.
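Assuming scikit-learn is available, the min-max transform of Equation 3.1 and its inverse can be applied as in the sketch below; the example array stands in for the sensor features and is not real test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.12, 35.0],
                 [0.25, 30.0],
                 [0.80, 12.0]])            # placeholder [dp, flow] samples

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)          # Equation 3.1 applied per feature
restored = scaler.inverse_transform(scaled)  # easy to invert after processing
print(scaled)
print(restored)
```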

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) =[V1(t) V2(t) Vnminus1(t) Vn(t)

](32)

X(t) =[V1(tminus 5) V2(tminus 5) Vnminus1(t) Vn(t)

](33)
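A minimal sketch of such a sequencing function, assuming NumPy; the window length of 5 follows the text, while the array contents and target layout are illustrative.

# Minimal sketch of a sequencing function: pack the previous `window` samples
# of every variable into one input sample and use the next sample as the target.
# Assumes NumPy; `data` is a (timesteps, features) array of scaled sensor values.
import numpy as np

def make_sequences(data: np.ndarray, window: int = 5):
    X, y = [], []
    for i in range(window, len(data)):
        X.append(data[i - window:i])   # shape (window, n_features)
        y.append(data[i])              # value one time step (5 s) ahead
    return np.array(X), np.array(y)

# X then has shape (samples, time steps, features), as required by the LSTM.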


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture
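Assuming a Keras/TensorFlow-style implementation (the framework is not stated here), the architecture in Figure 3.5 could be sketched roughly as follows; the optimizer choice and the feature count are assumptions.

# Sketch of the described architecture: two LSTM layers with 32 neurons and
# ReLU activation, followed by a single sigmoid output neuron.
# Assumes TensorFlow/Keras; n_steps and n_features come from the sequencing step.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 5, 5   # hypothetical window length and feature count

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mae")   # MAE or MSE, as discussed in 3.3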

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
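The training regime could be sketched with the standard Keras early-stopping callback as below; the training and validation arrays are assumed to come from the 80/20 split described above.

# Sketch: train for up to 1500 epochs, stop early when the validation loss has
# not improved for 150 consecutive epochs. Assumes the Keras model sketched above
# and pre-split arrays X_train, y_train, X_val, y_val.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=150,
                           restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=1500, callbacks=[early_stop])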

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds (a minimal sketch of such a splitting function is given below). The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
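A minimal sketch of such a sequence splitting function, assuming NumPy; which column is treated as the prediction target is an assumption made for illustration.

# Minimal sketch of the sequence splitting function: 12 past observations in,
# 6 future observations out (12 x 5 s = 60 s of history, 6 x 5 s = 30 s ahead).
# Assumes NumPy; `data` is a (timesteps, features) array.
import numpy as np

def split_sequences(data: np.ndarray, n_in: int = 12, n_out: int = 6):
    X, y = [], []
    for i in range(n_in, len(data) - n_out + 1):
        X.append(data[i - n_in:i])      # past window, shape (n_in, n_features)
        y.append(data[i:i + n_out, 0])  # next n_out values of one target column
    return np.array(X), np.array(y)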

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture
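Assuming the same Keras-style setting as in the LSTM sketch, the architecture in Figure 3.6 could look roughly as follows; the hidden-layer activations and the optimizer are assumptions.

# Sketch of the described CNN: Conv1D with 64 filters and kernel size 4,
# max pooling with pool size 2, flatten, Dense(50), and 6 output nodes
# (one per predicted future observation). Assumes TensorFlow/Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_steps_in, n_features = 12, 5   # hypothetical input window and feature count

model = Sequential([
    Conv1D(64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                     # 6 future values, 30 s ahead in total
])
model.compile(optimizer="adam", loss="mse")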

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)
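As an illustration of the quantity being minimised for these labels, a small sketch of the binary cross-entropy (log loss) computation is given below; the labels and predicted probabilities are made up.

# Sketch: binary cross-entropy (log loss) for predicted clogging-label
# probabilities. Assumes NumPy; the labels and probabilities are illustrative.
import numpy as np

y_true = np.array([0, 0, 1, 1])           # 0 = label 1, 1 = label 2
y_prob = np.array([0.1, 0.3, 0.8, 0.95])  # predicted probability of label 2

eps = 1e-15                               # avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))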

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the positive number and negative number of examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target, the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN given data containing all clogging labels.

It would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


Nomenclature

ARIMA Autoregressive Integrated Moving Average

AUC Area Under Curve

BWTS Ballast Water Treatment System

CNN Convolutional Neural Network

FOR Frame of Reference

LSTM Long Short Term Memory

ML Machine Learning

MAE Mean Absolute Error

MSE Mean Squared Error

NN Neural Network

ReLU Rectified Linear Unit

RMSE Root Mean Squared Error

TSS Total Suspended Solids

Contents

1 Introduction 1
1.1 Background 1
1.2 Problem Description 1
1.3 Purpose, Definitions & Research Questions 2
1.4 Scope and Delimitations 2
1.5 Method Description 3

2 Frame of Reference 5
2.1 Filtration & Clogging Indicators 5
2.1.1 Basket Filter 5
2.1.2 Self-Cleaning Basket Filters 6
2.1.3 Manometer 7
2.1.4 The Clogging Phenomena 8
2.1.5 Physics-based Modelling 9
2.2 Predictive Analytics 10
2.2.1 Classification Error Metrics 11
2.2.2 Regression Error Metrics 12
2.2.3 Stochastic Time Series Models 14
2.3 Neural Networks 15
2.3.1 Overview 15
2.3.2 The Perceptron 16
2.3.3 Activation functions 16
2.3.4 Neural Network Architectures 17

3 Experimental Development 23
3.1 Data Gathering and Processing 23
3.2 Model Generation 26
3.2.1 Regression Processing with the LSTM Model 27
3.2.2 Regression Processing with the CNN Model 28
3.2.3 Label Classification 29
3.3 Model evaluation 30
3.4 Hardware Specifications 31

4 Results 33
4.1 LSTM Performance 33
4.2 CNN Performance 36

5 Discussion & Conclusion 41
5.1 The LSTM Network 41
5.1.1 Regression Analysis 41
5.1.2 Classification Analysis 42
5.2 The CNN 42
5.2.1 Regression Analysis 42
5.2.2 Classification Analysis 43
5.3 Comparison Between Both Networks 44
5.4 Conclusion 44

6 Future Work 45

Bibliography 47

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilise the ship for different shipping loads. When a ship is not fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ship's water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfil the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyse and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further as:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding on the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On the one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is carried out and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.1

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variant. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the centre axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

1 Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.2

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

2 Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified, to begin generating a dataset containing the necessary information (a minimal labelling sketch is given after Figure 2.3).

Figure 2.3: Visualization of the clogging states.3

3 Source: Eker et al. [6]
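As an illustration of how such rules might be turned into labels, the sketch below assigns a clogging state from the recent trends of ∆p and Q; the thresholds and the use of a linear trend are illustrative assumptions, not values from the thesis.

# Sketch: rule-based labelling of a pumping sequence from the trends of the
# differential pressure (dp) and flow rate (q). Thresholds are made up.
import numpy as np

def clogging_label(dp: np.ndarray, q: np.ndarray) -> int:
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]   # linear trend of dp
    q_slope = np.polyfit(np.arange(len(q)), q, 1)[0]      # linear trend of q
    if dp_slope < 0.001 and abs(q_slope) < 0.01:
        return 1   # no/little clogging
    if dp_slope >= 0.001 and abs(q_slope) < 0.01:
        return 2   # moderate clogging
    return 3       # fully clogged: dp rising sharply and q dropping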


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and can be described by Darcy's equation [9] as

\[ Q_L = \frac{KA}{\mu L}\,\Delta p \tag{2.1} \]

rewritten as

\[ \Delta p = \frac{\mu L}{KA}\,Q_L \tag{2.2} \]

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\[ \Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \tag{2.3} \]

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\[ \Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \tag{2.4} \]

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
Vs         Superficial (empty-tower) velocity    m/s
µ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           m2
Dp         Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m3
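As a small worked example of Equation 2.4, the sketch below evaluates the Ergun pressure drop for one set of made-up cake properties; all numerical values are illustrative assumptions, not measurements from the BWTS.

# Sketch: pressure drop over a filter cake according to the Ergun equation (2.4).
# All parameter values below are illustrative assumptions.
L   = 0.002     # total height of filter cake [m]
Vs  = 0.05      # superficial velocity [m/s]
mu  = 1.0e-3    # dynamic viscosity of water [kg/(m*s)]
eps = 0.4       # porosity of the filter cake [-]
Dp  = 50e-6     # particle diameter [m]
rho = 1000.0    # density of the liquid [kg/m^3]

viscous  = 150.0 * Vs * mu * (1 - eps) ** 2 * L / (Dp ** 2 * eps ** 3)
inertial = 1.75 * (1 - eps) * rho * Vs ** 2 * L / (eps ** 3 * Dp)
delta_p  = viscous + inertial   # pressure drop [Pa]
print(f"Ergun pressure drop: {delta_p:.0f} Pa")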


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML, in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs, which can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                      Prediction
                      Positive               Negative
Actual   Positive     True Positive (TP)     False Negative (FN)
         Negative     False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

\[ ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \tag{2.5} \]

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric introduces two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.
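A small sketch of how the accuracy in Equation 2.5 and the confusion matrix counts could be computed, assuming scikit-learn; the label vectors are made up.

# Sketch: overall accuracy and confusion matrix for a binary classifier.
# Assumes scikit-learn; the label vectors are illustrative only.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 2, 2, 2, 1, 2]   # actual clogging labels
y_pred = [1, 2, 2, 2, 2, 1, 2]   # predicted clogging labels

acc = accuracy_score(y_true, y_pred)    # Equation 2.5
cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
print(acc, cm, sep="\n")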


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

\[ sensitivity = \frac{TP}{TP + FN} \tag{2.6} \]

\[ specificity = \frac{TN}{TN + FP} \tag{2.7} \]

With the sensitivity on the y-axis and the specificity on the x-axis, the AUC plot is obtained, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classification [22]. Precision, recall and F1 score are obtained through:

\[ precision = \frac{TP}{TP + FP} \tag{2.8} \]

\[ recall = \frac{TP}{TP + FN} \tag{2.9} \]

\[ F_1 = 2 \times \frac{precision \times recall}{precision + recall} \tag{2.10} \]


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
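A small sketch of Equations 2.8-2.10, assuming scikit-learn; the label vectors are made up and label 2 is arbitrarily taken as the positive class.

# Sketch: precision, recall and F1 score for the binary clogging labels.
# Assumes scikit-learn; the label vectors are illustrative only.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 2, 2, 2, 1, 2]
y_pred = [1, 2, 2, 2, 2, 1, 2]

precision = precision_score(y_true, y_pred, pos_label=2)   # Equation 2.8
recall = recall_score(y_true, y_pred, pos_label=2)         # Equation 2.9
f1 = f1_score(y_true, y_pred, pos_label=2)                 # Equation 2.10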

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful, as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through:

\[ LogLoss = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \tag{2.11} \]

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \quad (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on reducing the larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error back to the same scale as the targets:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (2.14)

The major difference between MSE and RMSE is how the gradients flow. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i} \quad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers with enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \quad (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures for forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \quad (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \quad (2.18)

r2 has some drawbacks. It does not take into account if the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 will adjust for the number of terms or predictors in the model. If more variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \quad (2.19)

Adjusted r2 can therefore more accurately show the percentage of variation in the dependent variable that is explained by the independent variables. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
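To make the regression metrics above concrete, the following is a minimal numpy sketch of Equations 2.12–2.19; the arrays are toy values, not data from the thesis, and k is an assumed number of predictors.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])       # actual values (toy data)
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.3, 5.9, 7.2, 7.7])   # predicted values (toy data)
n, k = len(y), 3                                              # k: assumed number of predictors

mae = np.mean(np.abs(y - y_hat))                              # Eq. 2.12
mse = np.mean((y - y_hat) ** 2)                               # Eq. 2.13
rmse = np.sqrt(mse)                                           # Eq. 2.14
mspe = 100.0 / n * np.sum(((y - y_hat) / y) ** 2)             # Eq. 2.16
mape = 100.0 / n * np.sum(np.abs((y - y_hat) / y))            # Eq. 2.17

# Squared correlation between actual and predicted values (Eq. 2.18)
r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)                 # Eq. 2.19

print(mae, mse, rmse, mspe, mape, r2, r2_adj)
```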

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. A particular strength of ARIMA and SARIMA is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
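A minimal sketch of fitting a SARIMA model to a univariate series with statsmodels is given below; the synthetic series and the chosen (p,d,q)(P,D,Q,s) orders are illustrative assumptions, not values used in the thesis.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic univariate series with a trend and a seasonal component (period 12)
rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 200)

# SARIMA(p,d,q)(P,D,Q,s); (1,1,1)(1,1,1,12) is an assumed example order
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

forecast = result.forecast(steps=6)   # predict the next 6 points
print(forecast)
```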

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \quad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
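A minimal numpy sketch of the perceptron rule in Equation 2.20 is shown below; the weights, bias and inputs are arbitrary illustrative values.

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron rule from Equation 2.20: step function on w·x + b."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Arbitrary example: two binary inputs with hand-picked weights and bias
w = np.array([0.6, 0.4])
b = -0.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x), w, b))
```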

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The step function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \quad (2.21)

for

z = \sum_{j} w_j x_j + b \quad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises from using it as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x) \quad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because the ReLU outputs zero for any negative input, a unit can end up always outputting zero and no longer receiving gradient updates, which is known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x) \quad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
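The activation functions discussed above can be sketched in a few lines of numpy; β = 1 is an assumed default for Swish.

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)          # perceptron step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)             # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)          # Equation 2.24

z = np.linspace(-4, 4, 9)
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")
```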

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. every neuron in one layer is connected to every neuron in the next layer. Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to processing strictly 2D arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function through a chain of functions,

f(x) = f^{(n)}\big(\dots f^{(2)}\big(f^{(1)}(x)\big)\big) \quad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has been shown to continuously improve performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially illustrated by the differences in utilisation of the two SNNs presented above, and is further exemplified by a number of NN configurations in the following.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes vanishingly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informative value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

\begin{aligned}
i_t &= \sigma(\omega_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(\omega_o [h_{t-1}, x_t] + b_o) \\
f_t &= \sigma(\omega_f [h_{t-1}, x_t] + b_f)
\end{aligned} \quad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
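A numpy sketch of one LSTM time step is given below to make Equation 2.26 concrete; the candidate cell state and the cell/hidden-state updates follow the standard formulation (they are not shown in Equation 2.26), and all weights are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W holds gate weights for the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate  (Eq. 2.26)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate (Eq. 2.26)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate (Eq. 2.26)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state (standard formulation)
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    h_t = o_t * np.tanh(c_t)                 # new hidden state / block output
    return h_t, c_t

# Placeholder dimensions and random weights, purely illustrative
n_in, n_hidden = 4, 3
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(n_hidden, n_hidden + n_in)) for k in "iofc"}
b = {k: np.zeros(n_hidden) for k in "iofc"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h, c)
```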

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling often performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
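A sketch of how such rule-based labelling could be scripted is shown below; the smoothing window and the slope/flow thresholds are hypothetical placeholders, since the thesis does not state the exact rules encoded in the script.

```python
import numpy as np

def label_clogging(dp, flow, window=12, dp_slope_thr=0.001, flow_drop_thr=0.05):
    """Assign clogging labels 1-3 from differential pressure (dp) and system flow.
    Thresholds are illustrative placeholders, not the values used in the thesis."""
    labels = np.ones(len(dp), dtype=int)
    baseline_flow = flow[:window].mean()
    for i in range(window, len(dp)):
        # Local slope of the differential pressure over the last `window` samples
        dp_slope = np.polyfit(np.arange(window), dp[i - window:i], 1)[0]
        flow_drop = (baseline_flow - flow[i]) / baseline_flow
        if dp_slope > 10 * dp_slope_thr and flow_drop > flow_drop_thr:
            labels[i] = 3     # rapidly increasing pressure and clearly receding flow
        elif dp_slope > dp_slope_thr:
            labels[i] = 2     # steadily increasing differential pressure
    return labels
```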


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples   Points labelled clog-1   Points labelled clog-2
I        685       685                      0
II       220       25                       195
III      340       35                       305
IV       210       11                       199
V        375       32                       343
VI       355       7                        348
VII      360       78                       282
VIII     345       19                       326
IX       350       10                       340
X        335       67                       268
XI       340       43                       297
Total    3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{or} \quad
\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] has shown the precision of one hot encoding to be equal to that of other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \quad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
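A minimal sketch of both transforms, using scikit-learn and assuming the clogging labels are stored as a column of integers, is given below; the toy arrays stand in for the real sensor data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-ins for the sensor matrix (rows = samples) and the clogging labels
X = np.array([[0.12, 55.0], [0.35, 54.0], [0.80, 49.0]])
labels = np.array([[1], [2], [2]])

# One hot encoding of the clogging labels (label transform)
onehot = OneHotEncoder()
y = onehot.fit_transform(labels).toarray()       # label 1 -> [1, 0], label 2 -> [0, 1]

# Min-max scaling of every feature to the range [0, 1] (Equation 3.1)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)  # the transform is easy to invert

print(y, X_scaled, X_restored, sep="\n")
```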

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case the function dictates the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window, as described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \quad (3.2)

X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \quad (3.3)


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
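The thesis does not name the framework used; assuming a Keras (TensorFlow) implementation, the described network and training setup could look roughly like the sketch below, where n_features is a placeholder for the number of input variables.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 5, 4   # 5 past time steps; n_features is an assumed placeholder

# Two LSTM layers with 32 neurons and ReLU, one sigmoid output neuron
model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mae")   # MAE or MSE, as discussed in Section 3.3

# Early stop after 150 epochs without validation-loss improvement
early_stop = EarlyStopping(monitor="val_loss", patience=150,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```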

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that will extract samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
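Again assuming a Keras implementation, the described CNN could be sketched as follows; n_features and the hidden-layer activations are assumptions, since the thesis only specifies the layer sizes.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_past, n_future, n_features = 12, 6, 4   # 12 past steps, 6 predicted steps; n_features assumed

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_past, n_features)),   # 64 filters, kernel of 4 time steps
    MaxPooling1D(pool_size=2),                  # halves the feature map
    Flatten(),                                  # flattening layer
    Dense(50, activation="relu"),               # fully connected layer with 50 nodes
    Dense(n_future),                            # one output per predicted observation
])
model.compile(optimizer="adam", loss="mae")     # MAE or MSE, as discussed in Section 3.3
model.summary()
```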

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the networks' capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, the MSE tends to produce a prediction at the mean of two modes, which is a poor fit, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      109       1
         Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE,


while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.



Contents

1 Introduction 1
  1.1 Background 1
  1.2 Problem Description 1
  1.3 Purpose, Definitions & Research Questions 2
  1.4 Scope and Delimitations 2
  1.5 Method Description 3

2 Frame of Reference 5
  2.1 Filtration & Clogging Indicators 5
    2.1.1 Basket Filter 5
    2.1.2 Self-Cleaning Basket Filters 6
    2.1.3 Manometer 7
    2.1.4 The Clogging Phenomena 8
    2.1.5 Physics-based Modelling 9
  2.2 Predictive Analytics 10
    2.2.1 Classification Error Metrics 11
    2.2.2 Regression Error Metrics 12
    2.2.3 Stochastic Time Series Models 14
  2.3 Neural Networks 15
    2.3.1 Overview 15
    2.3.2 The Perceptron 16
    2.3.3 Activation functions 16
    2.3.4 Neural Network Architectures 17

3 Experimental Development 23
  3.1 Data Gathering and Processing 23
  3.2 Model Generation 26
    3.2.1 Regression Processing with the LSTM Model 27
    3.2.2 Regression Processing with the CNN Model 28
    3.2.3 Label Classification 29
  3.3 Model evaluation 30
  3.4 Hardware Specifications 31

4 Results 33
  4.1 LSTM Performance 33
  4.2 CNN Performance 36

5 Discussion & Conclusion 41
  5.1 The LSTM Network 41
    5.1.1 Regression Analysis 41
    5.1.2 Classification Analysis 42
  5.2 The CNN 42
    5.2.1 Regression Analysis 42
    5.2.2 Classification Analysis 43
  5.3 Comparison Between Both Networks 44
  5.4 Conclusion 44

6 Future Work 45

Bibliography 47

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out, and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered, and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade, or rate of clogging, of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type, and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter. (Source: http://www.filter-technics.be)

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure Δp over the filter, measured by two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.


Figure 2.2: An overview of a basket filter with self-cleaning. (Source: http://www.directindustry.com)

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter the pressure transducers are connected to an electric control system that switches the backwash on and off.


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, Δp, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in Δp over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in Δp

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state Δp and Q → no/little clogging

2. linear increase in Δp and steady Q → moderate clogging

3. exponential increase in Δp and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information (a minimal labelling sketch is given after Figure 2.3).

Figure 2.3: Visualization of the clogging states. (Source: Eker et al. [6])
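To make the three states concrete, the rules above can be expressed as a simple labelling routine. The sketch below is illustrative only and is not the labelling script used in this thesis; the window length and the slope, curvature and flow-drop thresholds are assumptions that would have to be tuned against real test cycles.

```python
import numpy as np

def label_clogging(dp, q, window=12, lin_thresh=1e-3, exp_thresh=1e-2, q_drop=0.2):
    """Assign a clogging state (1, 2 or 3) to the last sample of a test cycle.

    dp, q : arrays of differential pressure and flow rate, sampled at a fixed rate.
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    dp_win, q_win = dp[-window:], q[-window:]

    # First- and second-order trends of the differential pressure over the window
    slope = np.polyfit(np.arange(window), dp_win, 1)[0]
    curvature = np.polyfit(np.arange(window), dp_win, 2)[0]

    # Relative drop in flow over the window
    flow_drop = (q_win[0] - q_win[-1]) / max(q_win[0], 1e-9)

    if curvature > exp_thresh and flow_drop > q_drop:
        return 3   # exponential dp increase and drastic flow decrease -> fully clogged
    if slope > lin_thresh:
        return 2   # steady linear dp increase, roughly constant flow -> moderate clogging
    return 1       # steady dp and flow -> no/little clogging
```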


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

\[ Q_L = \frac{K A}{\mu L}\,\Delta p \tag{2.1} \]

rewritten as

\[ \Delta p = \frac{\mu L}{K A}\,Q_L \tag{2.2} \]

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\[ \Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1-\varepsilon)^2 L}{\varepsilon^3} \tag{2.3} \]

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. These are considered by the later Ergun equation [11]:

\[ \Delta p = \frac{150\, V_s \mu (1-\varepsilon)^2 L}{D_p^2\, \varepsilon^3} + \frac{1.75\, (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \tag{2.4} \]

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

  Variable | Description                        | Unit
  Δp       | Pressure drop                      | Pa
  L        | Total height of the filter cake    | m
  V_s      | Superficial (empty-tower) velocity | m/s
  μ        | Viscosity of the fluid             | kg/(m·s)
  ε        | Porosity of the filter cake        | –
  D_p      | Diameter of the spherical particle | m
  ρ        | Density of the liquid              | kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
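To illustrate how Equation 2.4 responds to changes in the cake and flow parameters, the short sketch below evaluates the two terms for a set of made-up values; the numbers are assumptions for illustration only and do not describe the PB3 filter.

```python
def ergun_pressure_drop(v_s, mu, rho, eps, d_p, cake_height):
    """Pressure drop over a filter cake according to the Ergun equation (Eq. 2.4)."""
    viscous = 150.0 * v_s * mu * (1.0 - eps) ** 2 * cake_height / (d_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * v_s ** 2 * cake_height / (eps ** 3 * d_p)
    return viscous + inertial

# Illustrative (assumed) values: water at ~20 C passing a thin cake of fine particles
dp_pa = ergun_pressure_drop(v_s=0.05, mu=1.0e-3, rho=998.0,
                            eps=0.4, d_p=100e-6, cake_height=1e-3)
print(f"Estimated pressure drop: {dp_pa:.0f} Pa")
```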

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach has also been investigated for predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                       Prediction
                       Positive              Negative
  Actual   Positive    True Positive (TP)    False Negative (FN)
           Negative    False Positive (FP)   True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained, as done by König [18], as

\[ ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \tag{2.5} \]

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who show that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while the false positive rate equals 1 minus the specificity. Sensitivity and specificity are represented by Equations 2.6 and 2.7, respectively:

\[ sensitivity = \frac{TP}{TP + FN} \tag{2.6} \]

\[ specificity = \frac{TN}{TN + FP} \tag{2.7} \]

Plotting the sensitivity (true positive rate) on the y-axis against the false positive rate (1 − specificity) on the x-axis then gives the ROC plot, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a better performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples among those predicted positive, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

\[ precision = \frac{TP}{TP + FP} \tag{2.8} \]

\[ recall = \frac{TP}{TP + FN} \tag{2.9} \]

\[ F_1 = 2 \times \frac{precision \times recall}{precision + recall} \tag{2.10} \]


Higher precision but lower recall means a very precise prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means a higher classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The Log Loss can be calculated through

\[ LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{2.11} \]
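As a hedged illustration of how these classification metrics can be computed in practice, the snippet below uses scikit-learn on a small made-up set of labels and predicted probabilities; the numbers are assumptions and not results from this thesis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

# Illustrative ground truth and model output for a binary clogging/no-clogging problem
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])  # predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)                          # thresholded class labels

print("Accuracy:", accuracy_score(y_true, y_pred))   # Eq. 2.5
print("AUC:     ", roc_auc_score(y_true, y_prob))    # area under the ROC curve
print("F1 score:", f1_score(y_true, y_pred))         # Eq. 2.10
print("Log loss:", log_loss(y_true, y_prob))         # Eq. 2.11
```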

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2.12} \]

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the average of the square of the difference. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on reducing large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.13} \]

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14} \]

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the RMSE is equal to travelling along the gradient of the MSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\[ \frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \tag{2.15} \]

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with absolute squared errors while MSPE considers the relative error [27]:

\[ MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \tag{2.16} \]

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale free and is obtained through

\[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{2.17} \]


Coefficient of Determination (r²)

To determine the proportion of the variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r² is scale-free, in comparison to MSE and RMSE, and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

\[ r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \tag{2.18} \]

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

\[ r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \tag{2.19} \]

Adjusted r² can therefore more accurately show the proportion of the variation in the dependent variable that is explained by the independent variables. Furthermore, adding independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
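The regression metrics above are equally straightforward to compute; the sketch below does so with scikit-learn and NumPy on made-up values (illustrative assumptions only). Adjusted r² is not part of scikit-learn and is computed directly from Equation 2.19.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([102.0, 110.0, 123.0, 131.0, 140.0])   # illustrative targets
y_pred = np.array([100.0, 112.0, 120.0, 135.0, 138.0])   # illustrative predictions

mae  = mean_absolute_error(y_true, y_pred)                  # Eq. 2.12
mse  = mean_squared_error(y_true, y_pred)                   # Eq. 2.13
rmse = np.sqrt(mse)                                         # Eq. 2.14
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # Eq. 2.17
r2   = r2_score(y_true, y_pred)  # scikit-learn uses the 1 - SS_res/SS_tot definition

n, k = len(y_true), 3                                       # k: assumed number of predictors
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)               # Eq. 2.19

print(mae, rmse, mape, r2, r2_adj)
```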

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance, and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The particular strength of ARIMA and SARIMA is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
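For reference, a SARIMA model of the kind discussed above can be fitted with the statsmodels package as sketched below; the order and seasonal order are assumptions chosen only to show the API, not values used or validated in this thesis.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative univariate series, e.g. a differential-pressure signal sampled at fixed steps
series = np.cumsum(np.random.default_rng(0).normal(0.1, 0.05, size=200))

# (p, d, q) and seasonal (P, D, Q, s) orders are assumptions for illustration only
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

forecast = result.forecast(steps=6)   # predict the next 6 time steps
print(forecast)
```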

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers, and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\[ output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20} \]

In the above equation, x is the input vector, w the weight vector, and b is the perceptron's individual bias.
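Equation 2.20 translates directly into code; the sketch below is a minimal NumPy illustration of a single perceptron, where the chosen weights and bias are arbitrary assumptions (here picked so the unit behaves like a logical AND).

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: binary output according to Eq. 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Arbitrary example: two binary inputs weighted so the unit acts as a logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x), w, b))
```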

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of the activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable; thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

\[ f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21} \]

for

\[ z = \sum_{j} w_j x_j + b \tag{2.22} \]


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its use as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

\[ f(x) = x^{+} = \max(0, x) \tag{2.23} \]

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot meaningfully process inputs that are negative or that approach zero, which is also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al. [35] as a replacement for ReLU is the Swish function. The Swish function activates a neuron through

\[ f(x) = x \cdot sigmoid(\beta x) \tag{2.24} \]

where β is a trainable parameter or simply a constant. Swish has been shown to improve top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
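The three activation functions above are compact enough to define directly; the sketch below is a NumPy illustration of Equations 2.21, 2.23 and 2.24 on an arbitrary range of inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # Eq. 2.21

def relu(x):
    return np.maximum(0.0, x)          # Eq. 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)       # Eq. 2.24, beta either constant or trainable

x = np.linspace(-4, 4, 9)
print(sigmoid(x))
print(relu(x))
print(swish(x))
```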

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, e.g. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained by Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function by composing simpler functions,

\[ f(x) = f^{(n)}\!\left( \cdots f^{(2)}\!\left( f^{(1)}(x) \right) \right) \tag{2.25} \]

where each function represents a layer, and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that each individual function does not have to be all-descriptive but instead only captures certain behaviour. The strategy of stacking up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented above, and is further illustrated by a number of NN configurations below.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

\[ x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix} \]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result of this is a rapid loss of information, as the weights become saturated over time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the previous LSTM-block output at the previous time step h_{t−1}, the input at the current time step x_t, and the respective gate bias b_x:

\[ \begin{aligned} i_t &= \sigma\!\left( \omega_i \left[ h_{t-1}, x_t \right] + b_i \right) \\ o_t &= \sigma\!\left( \omega_o \left[ h_{t-1}, x_t \right] + b_o \right) \\ f_t &= \sigma\!\left( \omega_f \left[ h_{t-1}, x_t \right] + b_f \right) \end{aligned} \tag{2.26} \]

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
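As a hedged numerical illustration of Equation 2.26, the sketch below computes the three gate activations of a single LSTM cell in NumPy; the weight matrices, sizes and input values are random assumptions, and a full LSTM cell would additionally include the candidate cell state and the cell/hidden state updates.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed (random) parameters; each gate acts on the concatenation [h_(t-1), x_t] plus a bias
W_i, W_o, W_f = (rng.normal(size=(hidden_size, hidden_size + input_size)) for _ in range(3))
b_i, b_o, b_f = (np.zeros(hidden_size) for _ in range(3))

h_prev = np.zeros(hidden_size)          # previous block output h_(t-1)
x_t = np.array([0.2, -0.1, 0.5])        # current input x_t
concat = np.concatenate([h_prev, x_t])  # [h_(t-1), x_t]

i_t = sigmoid(W_i @ concat + b_i)       # input gate
o_t = sigmoid(W_o @ concat + b_o)       # output gate
f_t = sigmoid(W_f @ concat + b_f)       # forget gate
print(i_t, o_t, f_t)
```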

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
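A minimal NumPy sketch of the two pooling variants discussed above is given below, pooling a 1-dimensional input with pool size 2 as in Figure 2.5; the input values are arbitrary.

```python
import numpy as np

def pool1d(x, pool_size=2, mode="max"):
    """Non-overlapping 1-D pooling; trailing elements that do not fill a window are dropped."""
    n = len(x) // pool_size
    windows = x[:n * pool_size].reshape(n, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

x = np.array([1.0, 3.0, 2.0, 9.0, 0.0, 4.0])
print(pool1d(x, mode="max"))      # [3. 9. 4.]
print(pool1d(x, mode="average"))  # [2.  5.5 2. ]
```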

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush, in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data are clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

  Test   Samples   Points labelled clog-1   Points labelled clog-2
  I        685            685                        0
  II       220             25                      195
  III      340             35                      305
  IV       210             11                      199
  V        375             32                      343
  VI       355              7                      348
  VII      360             78                      282
  VIII     345             19                      326
  IX       350             10                      340
  X        335             67                      268
  XI       340             43                      297
  Total   3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\[ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} red \\ blue \\ green \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

\[ \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1} \]

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing. Both transforms are sketched in code below.
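A hedged sketch of the two transforms using scikit-learn is shown below; the toy labels and feature values are assumptions used only to show the API, not data from the test cycles.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# One hot encoding of the clogging labels (illustrative labels)
labels = np.array([[1], [2], [1], [2]])
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()   # label 1 -> [1, 0], label 2 -> [0, 1]

# Min-max scaling of the sensor features (illustrative values), Eq. 3.1
features = np.array([[0.30, 120.0], [0.45, 110.0], [0.90, 80.0]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)            # every column mapped into [0, 1]

original = scaler.inverse_transform(scaled)        # the transform is easy to invert
print(onehot, scaled, original, sep="\n")
```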

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The effect of the expansion of the features is described by Equation 3.2 and Equation 3.3 (a sketch of such a sequencing function is given below). It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

\[ X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2} \]

\[ X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3} \]


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
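A hedged Keras sketch of the regression network described above is shown below. The layer sizes, activations, epoch count and early-stopping patience follow this section, while the loss choice, optimizer and the names of the data arrays (X_train, y_train, X_val, y_val) are assumptions for illustration only.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_past, n_features = 5, 4   # 5 past time steps, assumed 4 sensor variables per step

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_past, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # single output neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mse")  # loss and optimizer are illustrative assumptions

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150,
                                           restore_best_weights=True)

# X_train, y_train, X_val, y_val are assumed to come from the sequencing and the 80/20 split
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
model.summary()
```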

322 Regression Processing with the CNN Model

As with the LSTM the input data require some additional processing before it canbe fed through the CNN The dataset is fed through a sequence splitting function(SSF) that will extract samples from the dataset to give the data the correct di-mensions Just like the LSTM the dimensions are samples time steps and featuresSpecified in the SSF is the time window of past observations to be used for pre-diction as well as the amount of observations to be predicted The time windowfor past observations encompasses 12 observations and therefore uses observationsfrom the past 60 seconds whereas the time window for future predictions is set to 6


The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output data.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.
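A Keras-style sketch of the described CNN; the activations in the convolutional and dense layers, the optimizer and the loss are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_steps_in, n_steps_out = 12, 6           # 60 s of history, 30 s of predictions
n_features = 4                            # number of sensor variables (assumed)

cnn_model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu',
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),            # pool size of 2 reduces the feature map
    Flatten(),
    Dense(50, activation='relu'),
    Dense(n_steps_out),                   # one output per predicted time step
])
cnn_model.compile(optimizer='adam', loss='mae')    # or loss='mse'
```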

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, since it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively.


The testing set was split into the same fractions, but only the 20% fraction was kept, in order to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
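A sketch of that adjustment, where the column layout of the array is a hypothetical example:

```python
import numpy as np

# Hypothetical array: scaled variable values in the leading columns and the
# clogging label (here encoded as 0/1) in the last column.
dataset = np.array([[0.12, 0.35, 0.80, 0.10, 0],
                    [0.14, 0.36, 0.79, 0.11, 0],
                    [0.55, 0.40, 0.60, 0.30, 1]])

X_cls = dataset[:, :-1]    # input data: values of the variables only
y_cls = dataset[:, -1]     # output data: clogging labels only
```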

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed given the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes results in a bad score for the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                         Prediction
                    Label 1    Label 2
Actual   Label 1    109        1
         Label 2    3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                         Prediction
                    Label 1    Label 2
Actual   Label 1    82         29
         Label 2    38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                         Prediction
                    Label 1    Label 2
Actual   Label 1    69         41
         Label 2    11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely as this regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate.


The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would thus prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations.


A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead.


Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work, it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship is not fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ship's water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system, there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further as:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods, or a plethora of NNs, within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered, and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade, or rate of clogging, of the filter, state-of-the-art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging.


Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass, and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. The focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter manifests as follows:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \qquad (2.1)$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \qquad (2.2)$$

A more recent and commonly used equation for absolute permeability is the Kozeny–Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$$\Delta p = \frac{150\, V_s \mu (1-\varepsilon)^2 L}{D_p^2\, \varepsilon^3} + \frac{1.75\,(1-\varepsilon)\rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
Vs         Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/m·s
ε          Porosity of the filter cake          m²
Dp         Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
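For illustration, Ergun's Equation 2.4 can be evaluated directly with the variables of Table 2.1; the numerical values in the example call below are arbitrary and not taken from the thesis:

```python
def ergun_pressure_drop(L, Vs, mu, eps, Dp, rho):
    """Pressure drop over a filter cake according to Ergun's equation (Eq. 2.4)."""
    viscous = 150.0 * Vs * mu * (1.0 - eps) ** 2 * L / (Dp ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * Vs ** 2 * L / (eps ** 3 * Dp)
    return viscous + inertial   # Pa

# Example: 1 mm cake, 0.1 m/s superficial velocity, water at roughly 20 °C,
# porosity 0.4, 50 µm particles.
dp = ergun_pressure_drop(L=1e-3, Vs=0.1, mu=1e-3, eps=0.4, Dp=50e-6, rho=998.0)
```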

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

                         Prediction
                    Positive               Negative
Actual   Positive   True Positive (TP)     False Positive (FP)
         Negative   False Negative (FN)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric gives rise to two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in the ML literature commonly referred to as sensitivity, and the false positive rate is referred to as specificity. The two rates are represented by Equations 2.6 and 2.7, respectively:

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)$$

$$\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)$$

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well-performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through:

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)$$

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)$$


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are hard to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that the observation o belongs to the class c [23]. The log loss can be calculated through:

$$LogLoss = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \qquad (2.11)$$
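The classification metrics above all have scikit-learn counterparts; a small sketch with hypothetical label vectors (the library choice and the values are assumptions):

```python
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

y_true = [1, 0, 1, 1, 0, 1]                    # hypothetical actual labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4, 0.7]        # predicted probabilities
y_pred = [int(p > 0.5) for p in y_prob]        # thresholded predictions

acc = accuracy_score(y_true, y_pred)           # Equation 2.5
auc = roc_auc_score(y_true, y_prob)            # area under the ROC curve
f1 = f1_score(y_true, y_pred)                  # Equation 2.10
ll = log_loss(y_true, y_prob)                  # Equation 2.11
```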

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Taking the average of the absolute differences between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away the predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as:

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through:


$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)$$

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4), the two metrics cannot simply be interchanged:

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

$$MSPE = \frac{100}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \qquad (2.16)$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through:

$$MAPE = \frac{100}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2.17)$$


Coefficient of Determination r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r² is scale-free, in contrast to MSE and RMSE, and bounded between −∞ and 1, so it does not matter whether the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data:

r2 =

sumni=1((yi minus yi)(yi minus yi))2radicsumn

i=1(yi minus yi)2sumni=1(yi minus yi)2

2

(218)

r2 has some drawbacks It does not take into account if the fit coefficient estimatesand predictions are biased and when additional predictors are added to the modelthe r2-score will always increase simply because the new fit will have more termsThese issues are handled by adjusted r2

Adjusted r2

Adjusted r2 just like r2 indicates how well terms fit a curve or a line The differenceis that adjusted r2 will adjust for the number of terms or predictors in the model Ifmore variables are added that prove to be useless the score will decrease while thescore will increase if useful variables are added This leads adjusted r2 to alwaysbe less than or equal to r2 For n observations and k variables the adjusted r2 iscalculated through

r2adj = 1minus

[(1minusr2)(nminus1)

nminuskminus1

](219)

Adjusted r2 can therefore accurately show the percentage of variation of the in-dependent variables that affect the dependent variables Furthermore adding ofadditional independent variables that do not fit the model will penalize the modelaccuracy by lowering the score [29]
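A short illustrative sketch (not from the thesis) of how these regression metrics could be computed with NumPy and scikit-learn; the arrays are made-up values and the number of predictors k = 4 is an assumption for the adjusted r2. Note that r2_score uses the 1 - SS_res/SS_tot definition of the coefficient of determination.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 1.2, 1.5, 1.9, 2.4, 2.8])   # hypothetical actual values
y_pred = np.array([1.1, 1.2, 1.4, 2.0, 2.3, 2.9])   # hypothetical predictions

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))
r2   = r2_score(y_true, y_pred)

n, k = len(y_true), 4                               # k: assumed number of predictors
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mae, rmse, mape, r2, r2_adj)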

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time, to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods, by differencing the log-transformed data, makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
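As a brief, hypothetical sketch of how such a model could be fitted to a single sensor series: statsmodels is assumed here, the seasonal and non-seasonal orders are untuned assumptions, and the series is synthetic rather than the thesis data.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# synthetic stand-in for a univariate series such as the differential pressure
series = pd.Series(np.random.randn(200).cumsum())

model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
forecast = result.forecast(steps=6)   # the next 6 observations (30 s at 5 s sampling)
print(forecast)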

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields, such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
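A minimal NumPy sketch of the perceptron rule in Equation 2.20 (the input, weights and bias below are arbitrary example values):

import numpy as np

def perceptron(x, w, b):
    # binary step activation applied to the weighted sum plus bias
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])          # example binary inputs
w = np.array([0.5, -0.6, 0.2])   # example weights
print(perceptron(x, w, b=-0.4))  # -> 1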

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)

Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^+ = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly handle inputs that are negative or approach zero, since the output and gradient then become zero; this is also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
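A small NumPy sketch of the three activation functions discussed above (β = 1 is assumed for Swish):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), swish(z))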

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM-block output at the previous time step (h_{t-1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
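A small NumPy sketch of non-overlapping 1-D max and average pooling (the input values are arbitrary example data):

import numpy as np

def pool1d(x, pool_size=2, mode="max"):
    # non-overlapping 1-D pooling; trailing values that do not fill a window are dropped
    n = len(x) // pool_size
    windows = x[:n * pool_size].reshape(n, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(pool1d(x, 2, "max"))    # [3. 5. 4.]
print(pool1d(x, 2, "mean"))   # [2.  3.5 2. ]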


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685            685                        0
II        220             25                      195
III       340             35                      305
IV        210             11                      199
V         375             32                      343
VI        355              7                      348
VII       360             78                      282
VIII      345             19                      326
IX        350             10                      340
X         335             67                      268
XI        340             43                      297

Total    3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

1          1 0 0          red            1 0 0
2    →     0 1 0    or    blue     →     0 1 0
3          0 0 1          green          0 0 1

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
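A small illustrative sketch of the two transforms with scikit-learn (the labels and sensor readings below are made up, and scikit-learn itself is an assumption rather than a tool named in the thesis):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

labels = np.array([[1], [2], [2], [1]])             # hypothetical clogging labels
X = np.array([[0.20, 14.0, 2.1, 0.00],
              [0.35, 13.0, 2.0, 0.30],
              [0.50, 12.0, 2.0, 0.30],
              [0.25, 14.5, 2.1, 0.00]])             # hypothetical sensor readings

encoder = OneHotEncoder()
y = encoder.fit_transform(labels).toarray()         # label 1 -> [1, 0], label 2 -> [0, 1]

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)                  # every feature mapped into [0, 1]
X_restored = scaler.inverse_transform(X_scaled)     # the transform is easily inverted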

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
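A minimal sketch of such a sequencing function (the implementation is an assumption rather than the thesis code; the window of 5 past steps and the 4 sensor variables follow the descriptions above):

import numpy as np

def make_sequences(data, n_past=5):
    # data: 2-D array (time steps x features); returns windows of n_past steps
    # paired with the observation at the following time step
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the past 5 samples (a 25 s window)
        y.append(data[i])              # the value 5 s ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)          # hypothetical: 100 samples, 4 sensor variables
X, y = make_sequences(data)
print(X.shape, y.shape)                # (95, 5, 4) (95, 4)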


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before they are passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
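A Keras sketch matching this description (two 32-neuron LSTM layers, a single sigmoid output neuron, MAE loss and an early stop after 150 epochs without improvement); the exact thesis implementation is not reproduced here, and the choice of Keras and the Adam optimizer are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 5, 4   # 5 past time steps (25 s window), 4 sensor variables

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),           # 1 neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")   # MAE or MSE loss, per section 3.3

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])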

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
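A corresponding Keras sketch of the 1-D CNN described above (64 filters, kernel size 4, max pooling of size 2, a 50-node dense layer and 6 outputs); again an assumed reconstruction rather than the thesis code, with the activations and optimizer being assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_past, n_features = 12, 4   # 12 past observations (60 s window), 4 sensor variables

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                 # 6 predicted future values, i.e. a 30 second horizon
])
model.compile(optimizer="adam", loss="mae")   # MAE or MSE loss, per section 3.3
model.summary()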

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks as classifiers than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks for predicting future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors become more penalising, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                  Label 1   Label 2
Actual  Label 1     109        1
        Label 2       3      669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                  Label 1   Label 2
Actual  Label 1      82       29
        Label 2      38      631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                  Label 1   Label 2
Actual  Label 1      69       41
        Label 2      11      659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work, it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and the CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Faith Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Contents

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae, or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for the main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system, there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it, and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out, and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating the actual system health when using a static model, as there are discrepancies introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis, and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered, and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS, and the future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested, and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as a reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations, and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system


can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass, and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process, and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → No/little clogging

2. linear increase in ∆p and steady Q → Moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → Fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.

Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L} \Delta p \tag{2.1}$$

rewritten as

$$\Delta p = \frac{\mu L}{KA} Q_L \tag{2.2}$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1-\varepsilon)^2 L}{\varepsilon^3} \tag{2.3}$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$$\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \tag{2.4}$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
V_s        Superficial (empty-tower) velocity    m/s
μ          Viscosity of the fluid                kg/ms
ε          Porosity of the filter cake           m²
D_p        Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
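To make this comparison concrete, Ergun's equation can be scripted and evaluated for a range of cake heights. The sketch below is purely illustrative: the function follows Equation 2.4, but the numerical values are placeholders and not measured properties of the filter cake or the water in the BWTS.

```python
def ergun_pressure_drop(V_s, mu, rho, D_p, epsilon, L):
    """Pressure drop over a filter cake according to Ergun's equation (2.4).

    The first term captures the viscous losses, the second the inertial losses.
    """
    viscous = 150.0 * V_s * mu * (1.0 - epsilon) ** 2 * L / (D_p ** 2 * epsilon ** 3)
    inertial = 1.75 * (1.0 - epsilon) * rho * V_s ** 2 * L / (epsilon ** 3 * D_p)
    return viscous + inertial

# Placeholder values, only to show how a thicker cake (larger L) raises the pressure drop.
for L in (0.001, 0.002, 0.004):  # cake height in metres
    dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, D_p=1.0e-4, epsilon=0.4, L=L)
    print(f"L = {L} m -> delta p = {dp:.1f} Pa")
```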

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining, and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13], and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                          Prediction
                          Positive                Negative
Actual    Positive        True Positive (TP)      False Negative (FN)
          Negative        False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \tag{2.5}$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

$$sensitivity = \frac{TP}{TP + FN} \tag{2.6}$$

$$specificity = \frac{TN}{TN + FP} \tag{2.7}$$

Plotting the sensitivity on the y-axis and the false positive rate (1 − specificity) on the x-axis then gives the ROC plot, where every true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is in not misclassifying a number of samples [21]. For the F1 score, precision is referred to as the percentage of correctly classified samples and recall is referred to as the percentage of actual correct classifications [22]. Precision, recall, and F1 score are obtained through

$$precision = \frac{TP}{TP + FP} \tag{2.8}$$

$$recall = \frac{TP}{TP + FN} \tag{2.9}$$

$$F1 = 2 \times \frac{precision \times recall}{precision + recall} \tag{2.10}$$


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

$$LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{2.11}$$
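The classification metrics above do not need to be implemented by hand; the snippet below is a small illustration of how they could be computed with scikit-learn, using made-up labels and predicted probabilities for a binary problem.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

# Hypothetical true labels and model outputs, only for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.92, 0.10, 0.75, 0.60, 0.30, 0.85, 0.45, 0.05]   # predicted P(class 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]              # thresholded class labels

print(confusion_matrix(y_true, y_pred))           # the four outcomes of Table 2.2
print("ACC =", accuracy_score(y_true, y_pred))    # Equation 2.5
print("AUC =", roc_auc_score(y_true, y_prob))     # area under the ROC curve
print("F1  =", f1_score(y_true, y_pred))          # Equation 2.10
print("LogLoss =", log_loss(y_true, y_prob))      # Equation 2.11, binary case
```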

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over predicted or under predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2.12}$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.13}$$

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14}$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i} \tag{2.15}$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \tag{2.16}$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{2.17}$$


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \tag{2.18}$$

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \tag{2.19}$$

Adjusted r² can therefore more accurately show the proportion of variation in the dependent variable that is explained by the independent variables that actually affect it. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
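The regression metrics can be computed in the same way as the classification metrics. The sketch below uses scikit-learn where a ready-made function exists and falls back to Equation 2.19 for adjusted r²; note that scikit-learn's r2_score follows the standard 1 − SS_res/SS_tot definition rather than the squared-correlation form of Equation 2.18, and the sample values and the choice of k = 1 predictor are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 2.5, 3.1, 3.9, 4.2, 5.0])   # placeholder actual values
y_pred = np.array([2.1, 2.4, 3.0, 4.1, 4.0, 5.3])   # placeholder predicted values

mae = mean_absolute_error(y_true, y_pred)            # Equation 2.12
mse = mean_squared_error(y_true, y_pred)             # Equation 2.13
rmse = np.sqrt(mse)                                  # Equation 2.14
r2 = r2_score(y_true, y_pred)                        # coefficient of determination

n, k = len(y_true), 1                                # k = number of predictors (assumed)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # Equation 2.19

print(mae, mse, rmse, r2, r2_adj)
```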

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log transformed data makes the entire series stationary in both mean and variance, and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
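As an aside, a SARIMA model of the kind referred to above could be fitted to a single sensor signal with the statsmodels package. The sketch below is only illustrative: the generated series and the (p, d, q)(P, D, Q, s) orders are placeholders, not values tuned for the filtration data.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# A made-up univariate series standing in for, e.g., the differential pressure signal.
series = np.cumsum(np.random.normal(0.01, 0.05, size=200))

# SARIMA(p, d, q)(P, D, Q, s); the orders here are arbitrary placeholders.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=6)   # predict the next 6 time steps
print(forecast)
```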

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis, and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configurations of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20}$$

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
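The perceptron rule in Equation 2.20 translates directly into a few lines of code; the weights and bias below are arbitrary example values.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary output of a single perceptron, following Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.5, -0.6, 0.2])   # arbitrary example weights
b = -0.1                         # arbitrary example bias
print(perceptron(np.array([1, 0, 1]), w, b))  # -> 1, since 0.5 + 0.2 - 0.1 > 0
print(perceptron(np.array([0, 1, 0]), w, b))  # -> 0, since -0.6 - 0.1 <= 0
```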

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable; thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21}$$

for

$$z = \sum_j w_j \cdot x_j + b \tag{2.22}$$


Only by using the sigmoid function as the activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$$f(x) = x^+ = \max(0, x) \tag{2.23}$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it outputs zero for inputs that are negative or approach zero, which can cause neurons to stop learning, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \tag{2.24}$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used benchmarks such as ImageNet, with Mobile NASNet-A, by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
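For comparison, the three activation functions can be written out with NumPy as follows; β is treated as a constant here, matching the simpler variant of the Swish function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)           # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)        # Equation 2.24, beta kept constant here

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
print(swish(z))
```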

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e., it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function by composing multiple functions,

$$f(x) = f^{(n)}\left( \cdots f^{(2)}\left( f^{(1)}(x) \right) \right) \tag{2.25}$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t), and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t), and respective gate bias (b_x), as

$$\begin{aligned} i_t &= \sigma(\omega_i [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o [h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f [h_{t-1}, x_t] + b_f) \end{aligned} \tag{2.26}$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information could appear odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4, and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
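The labelling script itself is not reproduced in the thesis. A minimal sketch of the labelling rule, assuming that local trends of the differential pressure and the system flow have already been estimated for each sample, could look as follows; the threshold values are hypothetical and would have to be tuned against the visual inspection.

```python
def clogging_label(dp_trend, flow_trend, dp_rise_threshold=0.0, flow_drop_threshold=-0.5):
    """Assign a clogging label from local trends of differential pressure and system flow.

    dp_trend and flow_trend are assumed to be slopes estimated over a short window of samples;
    the threshold values are hypothetical and would need tuning against visual inspection.
    """
    if dp_trend <= dp_rise_threshold:
        return 1  # no/little clogging: differential pressure flat or below its starting value
    if flow_trend >= flow_drop_threshold:
        return 2  # moderate clogging: pressure rising steadily, flow constant or slightly receding
    return 3      # fully clogged: pressure rising sharply while the flow drops drastically

print(clogging_label(dp_trend=-0.01, flow_trend=0.0))   # -> 1
print(clogging_label(dp_trend=0.05, flow_trend=-0.1))   # -> 2
print(clogging_label(dp_trend=0.40, flow_trend=-2.0))   # -> 3
```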


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to equally predict all the actual classification labels rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
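Both transforms are available in scikit-learn, which is one way the described pre-processing could be carried out; the arrays below are placeholders standing in for the actual sensor features and clogging labels.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Placeholder data: four sensor features per sample and one clogging label per sample.
features = np.array([[0.12, 200.0, 1.5, 0.0],
                     [0.35, 190.0, 1.6, 0.2],
                     [0.80, 150.0, 1.8, 0.4]])
labels = np.array([[1], [2], [2]])

onehot = OneHotEncoder()                              # label transform (one hot encoding)
labels_encoded = onehot.fit_transform(labels).toarray()  # e.g. 1 -> [1, 0], 2 -> [0, 1]

scaler = MinMaxScaler()                               # scaler transform (Equation 3.1)
features_scaled = scaler.fit_transform(features)      # every feature mapped to [0, 1]

print(labels_encoded)
print(features_scaled)
print(scaler.inverse_transform(features_scaled))      # easy to invert back to original values
```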

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that, by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2}$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3}$$
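A sequencing function of this kind is not listed in the thesis, but a minimal sketch of the idea, turning an array of samples × features into overlapping windows of 5 past time steps with the following time step as target, could look as follows.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Stack n_past consecutive rows of `data` as the input for predicting the next row.

    data: 2D array of shape (samples, features).
    Returns X of shape (samples - n_past, n_past, features) and y of shape (samples - n_past, features).
    """
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the 25 second window of past measurements
        y.append(data[i])              # the measurement 5 seconds ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)          # placeholder: 100 samples, 5 variables
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)                # (95, 5, 5) (95, 5)
```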


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before they are passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
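The exact training code is not given in the thesis; a sketch of how the described two-layer LSTM could be set up in Keras is shown below. The sequenced input arrays, the Adam optimizer, and the variable count are assumptions made for illustration, while the layer sizes, activations, epoch limit, and early-stopping patience follow the text.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

n_timesteps, n_features = 5, 5            # 5 past time steps, 5 observed variables (assumed)

model = Sequential([
    # return_sequences=True is needed so the second LSTM layer receives a sequence
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_timesteps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),       # one output neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")    # MAE or MSE, as in the evaluation

early_stop = EarlyStopping(monitor="val_loss", patience=150)   # stop after 150 idle epochs

# X_train, y_train, X_val, y_val are assumed to come from the 80/20 split described above.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```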

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
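Analogously, a Keras sketch of the described CNN could look as follows; the 64 filters, kernel size 4, pool size 2, 50-node dense layer, 6 outputs, epoch limit, and early stopping follow the text, while the optimizer, the ReLU activations, and the feature count are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

n_timesteps, n_features = 12, 5           # 12 past observations (60 s), 5 variables (assumed)

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_timesteps, n_features)),
    MaxPooling1D(pool_size=2),            # pooling layer reducing the feature map
    Flatten(),                            # flattening layer before the fully connected part
    Dense(50, activation="relu"),
    Dense(6),                             # 6 outputs: the predictions for the next 30 seconds
])
model.compile(optimizer="adam", loss="mse")

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# X_train, y_train, X_val, y_val are assumed to come from the 80/20 split described above.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```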

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors would be more penalizing as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
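The binary cross-entropy itself can be illustrated with the short NumPy sketch below, where y_true is the binary label indicator and p the predicted probability, following the log loss definition given earlier in the thesis.

    import numpy as np

    def binary_cross_entropy(y_true, p, eps=1e-12):
        p = np.clip(p, eps, 1.0 - eps)               # avoid log(0)
        return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

    # Example: two confident correct predictions and one poor one.
    print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.3])))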

Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).

The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.

Chapter 4

Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.

Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.

Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.

Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                     Prediction
                     Label 1    Label 2
Actual   Label 1     109        1
         Label 2     3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.

Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.

Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                     Prediction
                     Label 1    Label 2
Actual   Label 1     82         29
         Label 2     38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                     Prediction
                     Label 1    Label 2
Actual   Label 1     69         41
         Label 2     11         659

Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which isn't unlikely as the regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be capable of learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set only 4 were wrongly classified.
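As a quick sanity check, the headline numbers can be recovered from the confusion matrix in Table 4.3. The small sketch below treats Label 2 as the positive class, which is an assumption since the thesis does not state which label the F1 score was computed for, so the resulting F1 differs slightly from the reported one.

    # Counts from Table 4.3 (rows = actual, columns = predicted).
    tp, fn = 669, 3        # actual Label 2: predicted Label 2 / Label 1
    tn, fp = 109, 1        # actual Label 1: predicted Label 1 / Label 2

    accuracy  = (tp + tn) / (tp + tn + fp + fn)           # ~0.995
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)    # ~0.997 with this choice of positive class
    print(accuracy, precision, recall, f1)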

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is one time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
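A minimal sketch of the implied windowing is shown below; it assumes the pre-processed data are available as NumPy arrays and is an illustration of the scheme described above rather than the thesis' exact pre-processing code.

    import numpy as np

    def make_windows(values, labels, window=5):
        """values: (n_samples, n_features) array, labels: (n_samples,) array.
        Each input covers t-5 ... t-1 and the target is the label at t."""
        X, y = [], []
        for t in range(window, len(values)):
            X.append(values[t - window:t])
            y.append(labels[t])
        return np.array(X), np.array(y)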

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.
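To illustrate the imbalance argument with numbers of the same magnitude as Tables 4.6 and 4.7 (roughly 110 Label 1 samples versus 670 Label 2 samples), the toy calculation below shows how a classifier that only ever predicts the majority class still reaches a seemingly decent accuracy; this is an illustration, not the thesis' classifier.

    # Majority-class baseline on an imbalanced 110/670 split.
    minority, majority = 110, 670
    baseline_accuracy = majority / (minority + majority)
    print(round(baseline_accuracy, 3))   # ~0.859, without learning anything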

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The plus-side with using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model or network, and when deciding on the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.

Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] OF Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] OF Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, 03, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 175 60806 and 2408, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of Determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se

Page 8: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 1 INTRODUCTION

These problems make it impossible to achieve optimal operability while ensuringthat the BWTS runs safely in every environment A desired solution is to developa method which can analyze and react to changes imposed to the system on thefly One such solution could be to use the large amounts of data generated by thesystem to create a neural network (NN) to estimate the state of the BWTS

13 Purpose Definitions amp Research Questions

The use of machine learning (ML) in this type of system is a new application of theotherwise popular statistical tool As there is no existing public information on howsuch an implementation can be done the focus of the thesis and the main researchquestion is

bull To investigate and evaluate the possibility of using ML for predictively esti-mating filter clogging with focus on maritime systems

An NN model will be developed and evaluated in terms of how accurately it canpredict the clogging of the filter in the BWTS using the systems sensor data Theimplementation of an NN into a system like this is the first of its kind and willrequire analysis and understanding of the system to ensure a working realizationTherefore with the BWTS in mind the original proposed research question can bespecified further to be

bull How can an NN be integrated with a BWTS to estimate clogging of a basketfilter

14 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A some delimita-tions have been made First and foremost the only part of interest of the BWTSis the filter When designing the NN the data used will be from in-house testingas well as from the cloud service Connectivity so no alternations to the existinghardware and software can be made For the purpose of focusing on the ML modeland the system sensors all data are assumed to be constantly available

It is also not possible to test different kinds of ML methods or a plethora of NNswithin the time frame of the thesis For that reason the decision of developing anNN is based on its success in predictive analytics in other fields[1 2] and the frameof reference for deciding the most appropriate NN will depend on the characteristicsof the data and the future requirements of the NN

2

15 METHOD DESCRIPTION

15 Method Description

The task at hand is a combination of two On one hand it is an engineering taskinvestigating if an NN can be developed for the BWTS On the other hand it is ascientific task that focuses on applying and evaluating current research in the areasof NNs water filtration techniques time series analysis and predictive analytics tothe problem To complete both tasks within the time frame a methodology has tobe developed to ensure that the engineering task is done the research questions areanswered and that there is a clear foundation for future research The methodologyis visualized in Figure 11

The basis of the methodology starts with the problem description The problemdescription makes way for establishing the first frame of reference (FOR) and rais-ing the initial potential research questions Following that a better understandingabout the research field in question and continuous discussions with Alfa Laval willhelp adapt the FOR further Working through this process iteratively a final framecan be established where the research area is clear allowing for finalization of theresearch questions

With the frame of reference established the focus shifts to gathering informationthrough appropriate resources such as scientific articles and papers Interviews ex-isting documentation of the BWTS and future use of the NN also helps in narrowingdown the solution space to only contain relevant solutions that are optimal for thethesis With a smaller set of NNs to chose from the best suited network structurewill be developed tested and evaluated

In preparation of constructing and implementing the NN the data have to beprocessed according to the needs of the selected network Pre-processing strate-gies on how datasets are best prepared for NN processing will be investigated andexecuted to ensure that a clear methodology for future processing is executed anddocumented For correctly classifying the current clogging grade or rate of cloggingof the filter state of the art research will be used as reference when commencingwith the labelling of the data

When the classificationlabelling is done implementation through training and test-ing of the NN can begin Sequentially improvement to the NN structure will bemade by comparing the results of different initial weight conditions layer configu-rations and data partitions The experimental results will be graded in terms ofpredictive accuracy and the estimated errors of each individual parameter will betaken into consideration

Lastly the validation process can begin to ensure that the initial requirementsfrom the problem description are either met or that they have been investigatedWith the results at hand a conclusion can be presented describing how the system

3

CHAPTER 1 INTRODUCTION

can be adapted to detect clogging Suggestions on how the system can be furtherimproved upon and other future work will also be mentioned

Figure 11 Proposed methodology for the thesis

4

Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introducesthe reader to the science and terminology used throughout this thesis The systemand its components thatrsquos being used today is analysed and evaluated

21 Filtration amp Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrateIn water filtration the water is typically passed through a fine mesh strainer or aporous medium for the removal of total suspended solids TSS Removal of particlesin this fashion leads to the formation of a filter cake that diminishes the permeablecapability of the filter As the cake grows larger the water can eventually no longerpass and the filter ends up being clogged

To better understand how the choice of filter impacts the filtration process andhow filter clogging can be modelled the following section explores research and lit-erature relevant to the BWTS Focus is on filters of the basket type and where thefiltration is done with regards to water

211 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel forfiltering and is shown in Figure 21 The strainer is either composed of reinforcedwire mesh or perforated sheet metal which the liquid flows through Sometimes acombination of the two is used During filtration organisms and TSS accumulatein the basket strainer and can only be removed by physically removing the strainerand scraping off the particles using a scraper or a brush [3] An estimate of howmany particles that have accumulated in the filter can typically be obtained fromthe readings of a manometer which measures the differential pressure over the filter(see 213)

5

CHAPTER 2 FRAME OF REFERENCE

Figure 21 An overview of a basket filter1

The pressure vessel has one inlet for incoming water and one outlet for the filtrateThe pressure difference between the incoming and the outgoing water measures thedifferential pressure ∆p over the filter through two pressure transducers

212 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning A self-cleaning bas-ket filter features a backwashing (also referred to as backflush) mechanism whichautomatically cleans the filter avoiding the need of having to physically remove thefilter in order to clean it and is shown in Figure 22 The backwashing mechanismcomes with the inclusion of a rotary shaft through the center axis of the basket filterthat is connected to a motor for rotation of the shaft [3] The rotary shaft holdsa mouthpiece that is connected to a second outlet which allows for the removal ofparticles caught by the filter

1Source httpwwwfilter-technicsbe

6

21 FILTRATION amp CLOGGING INDICATORS

Figure 22 An overview of a basket filter with self-cleaning2

The backwashing flow can either be controlled by a pump or be completely depen-dent on the existing overpressure in the pressure vessel which in turn depends onhow clogged the filter is For that latter case backwashing of the filter may only bedone when there is enough particles in the water so that the filter begins to clog

213 Manometer

Briefly mentioned in 211 the manometer is an analogue display pressure gaugethat shows the differential pressure over the filter The displayed value is the differ-ence of the pressure obtained by the transducers before and after the filter Eachfilter comes with an individually set threshold pset

When the measured differential pressure is greater than pset the filter has to becleaned For a regular basket filter the operator or the service engineer has to keepan eye on the manometer during operation However for a self-cleaning basket filterthe pressure transducers are connected to an electric control system that switchesthe backwash on and off

2Source httpwwwdirectindustrycom

7

CHAPTER 2 FRAME OF REFERENCE

214 The Clogging PhenomenaTo predict the clogging phenomena some indicators of clogging have to be identifiedIndicators of clogging in fuel filters have been investigated and discussed in a seriesof papers by Eker et al [4ndash6] A fuel filter shares a lot of similarities with a basketfilter in the sense that they both remove particles in the supplied liquid in order toget a filtrate Two indicators were especially taken into consideration namely thedifferential pressure over the filter ∆p and the flow rate after the filter Q Theresults from the papers show that clogging of a filter occurs due to the following

1 a steady increase in ∆p over time due to an increase over time in incomingpressure pin

2 a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possibleBy observing the two variables from the start of a pumping process the followingclogging states can be identified

1 steady state ∆p and Qrarr Nolittle clogging

2 linear increase in ∆p and steady Qrarr Moderate clogging

3 exponential increase in ∆p and drastic decrease in Qrarr Fully clogged

With the established logic of classification in place each individual pumping se-quence can be classified to begin generating a dataset containing the necessaryinformation

Figure 23 Visualization of the clogging states3

3Source Eker et al [6]

8

21 FILTRATION amp CLOGGING INDICATORS

215 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of cloggingTo better understand what effect certain parameters have on the filter a model hasto be created Roussel et al [7] identify the filter clogging as a probability of thepresence of particles Furthermore they identify the clogging process as a functionof a set of variables the ratio of particle to mesh hole size the solid fraction andthe number of grains arriving at each mesh hole during one test

Filtration of liquids and filter clogging have been tested for various types of flowsLaminar flow through a permeable medium has been investigated by Wakeman [8]and it can be described by Darcyrsquos equation [9] as

QL = KA

microL∆p (21)

rewritten as

∆p = microL

KAQL (22)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation The equation was derived by Kozeny and Carman [10] and reads

∆p = kVsmicro

Φ2D2p

(1minus ε)2L

ε3(23)

Equation 23 is flawed in the sense that it does not take into account the inertialeffect in the flow This is considered by the later Ergun equation [11]

∆p = 150Vsmicro(1minus ε)2L

D2pε

3 + 175(1minus ε)ρV 2s L

ε3Dp(24)

where the first term in Equation 24 represents the viscous effects and the secondterm represents the inertial effect An explanation for the variables can be found inTable 21

Table 21 Variable explanation for Ergunrsquos equation

Variable Description Unit∆p Pressure drop PaL Total height of filter cake mVs Superficial (empty-tower) velocity msmicro Viscosity of the fluid kgmsε Porosity of the filter cake m2

Dp Diameter of the spherical particle mρ Density of the liquid kgm3

9

CHAPTER 2 FRAME OF REFERENCE

Comparing Darcyrsquos Equation 22 to Ergunrsquos Equation 24 the latter offers a deeperinsight in how alterations to variables affect the final differential pressure

22 Predictive AnalyticsUsing historical data to make predictions of future events is a field known as pre-dictive analytics Predictive analytics research covers statistical methods and tech-niques from areas such as predictive modelling data mining and ML in order toanalyse current and past information to make predictions on future events Havingbeen applied to other areas such as credit scoring [12] healthcare [13] and retailing[14] a similar approach of prediction has also been investigated in predictive main-tenance [15ndash17]

Predictive maintenance PdM includes methods and techniques that estimate anddetermine the condition of equipment or components to predict when maintenanceis required as opposed to traditional preventive maintenance which is based on theidea of performing routinely scheduled checks to avoid failures or breakdowns Thequality of predictive methods and algorithms is ensured by measuring the accuracyof the model in terms of correctly labelling the input data to its respective outputalso known as classification Every prediction comes with four possible outputs thatcan be visualised in a table also known as a confusion matrix as shown in Table22

Table 22 Outputs of a confusion matrix

PredictionPositive Negative

Act

ual Positive True Positive (TP) False Positive (FP)

Negative False Negative (FN) True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample isclassified correctly and can be obtained as done by Konig [18]

ACC =sumn

i=1 jin

where ji =

1 if yi = yi

0 if yi 6= yi

(25)

by comparing the actual value yi and the predicted value yi for a group of sam-ples n However by using the overall accuracy as an error metric two flaws mayarise Provost et al [19] argue that accuracy as an error metric and classificationtool assumes that the supplied data are the true class distribution data and thatthe penalty of misclassification is equal for all classes Same claims are backed bySpiegel et al [20] which presents that ignoring the severity of individual problemsto achieve higher accuracy of failure classification may have a direct negative impacton the economic cost of maintenance due to ignored FPs and FNs

10

22 PREDICTIVE ANALYTICS

In order to better evaluate all data various error metrics have been developedThe various metrics can be placed in two different categories classification errormetrics and regression error metrics

221 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised andthat basic classification assumptions are rarely true for real world problems

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve also knownas a ROC curve A ROC curve measures the relationship between the true positiverate and the false positive rate and plots them against each other [18] True positiverate is in ML literature commonly referred to as sensitivity and the false positiverate is referred to as specificity Both rates are represented by Equations 26 and27 respectively

sensitivity = TP

TP + FN(26)

specificity = TN

TN + FP(27)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUCplot where every correctly classified true positive generates a step in the y-directionand every correctly classified false positive generates a step in the x-direction TheAUC curve area is limited by the range 0 to 1 where a higher value means a wellperforming model

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifiescorrectly and how robust it is to not misclassify a number of samples [21] For F1score precision is referred to as the percentage of correctly classified samples andrecall is referred to as the percentage of actual correct classification [22] Precisionrecall and F1 score are obtained through

precision = TP

TP + FP(28)

recall = TP

TP + FN(29)

F1 = 2times precisiontimes recallprecision+ recall

(210)

11

CHAPTER 2 FRAME OF REFERENCE

Higher precision but lower recall means a very accurate prediction but the classifierwould miss hard to instances that are difficult to classify F1 score attempts tobalance the precision and the recall and a higher F1 score means that the model isperforming very well The F1 score itself is limited by the range 0 to 1

Logarithmic Loss (Log Loss)

For multi-class classification Log Loss is especially useful as it penalises false clas-sification A lower value of Log Loss means an increase of classification accuracyfor the multi-class dataset The Log Loss is determined through a binary indicatory of whether the class label c is the correct classification for an observation o andthe probability p which is the modelrsquos predicted probability that an observation obelongs to the class c [23] The log loss can be calculated through

LogLoss = minusMsum

c=1yoclog(poc) (211)

222 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actualvalue based on the idea that there is a relationship or a pattern between the a setof inputs and an outcome

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted valuesgives the mean absolute error The MAE gives a score of how far away all predictionsare from the actual values [24] While not giving any insight in if the data arebeing over predicted or under predicted MAE is still a good tool for overall modelestimation Mathematically the MAE is expressed as

MAE = 1n

nsumi=1|yi minus yi| (212)

Mean Squared Error (MSE)

The MSE is similar to the MAE but rather than taking the average of the differencebetween the predicted and the actual results it takes the square of the differenceinstead Using the squared values the larger errors become more prominent incomparison to smaller errors resulting in a model that can better focus on theprediction error of larger errors However if a certain prediction turns out to bevery bad the overall model error will be skewed towards being worse than what mayactually be true [25] The MSE is calculated through

12

22 PREDICTIVE ANALYTICS

MSE = 1n

nsumi=1

(yi minus yi)2 (213)

Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE The introduction of the square-root scales theerror to be the same scale as the targets

RMSE =

radicradicradicradic 1n

nsumi=1

(yi minus yi)2 (214)

The major difference between MSE and RMSE is the flow over the gradients Trav-elling along the gradient of the MSE is equal to traveling along the gradient of theRMSE times a flow variable that depends on the MSE score This means that whenusing gradient based methods (further discussed in section 234) the two metricscannot be straight up interchanged

partRMSE

partyi= 1radic

MSE

partMSE

partyi(215)

Just like MSE RMSE has a hard time dealing with outliers and has for that reasonbeen considered as a bad metric [18] However Chai et al [26] argue that while nosingle metric can project all the model errors RMSE is still valid and well protectedagainst outliers with enough samples n

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of theMSE Every sample weight is inversely proportional to its respective target squareThe difference between MSE and MSPE is that MSE works with squared errorswhile MSPE considers the relative error [27]

MSPE = 100n

nsumi=1

(yi minus yi

yi

)2(216)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely usedmeasures for forecast and prediction accuracy [28] The measurement is an aver-age of the absolute percentage errors between the actual values and the predictionvalues Like r2 MAPE is scale free and is obtaind through

MAPE = 100n

nsumi=1

∣∣∣∣yi minus yi

yi

∣∣∣∣ (217)

13

CHAPTER 2 FRAME OF REFERENCE

Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluatedmodel the coefficient of determination can be used [18] This allows for comparinghow much better the model is in comparison to a constant baseline r2 is scale-freein comparison to MSE and RMSE and bound between minusinfin and 1 so it does notmatter if the output values are large or small the value will always be within thatrange A low r2 score means that the model is bad at fitting the data

r2 =

sumni=1((yi minus yi)(yi minus yi))2radicsumn

i=1(yi minus yi)2sumni=1(yi minus yi)2

2

(218)

r2 has some drawbacks It does not take into account if the fit coefficient estimatesand predictions are biased and when additional predictors are added to the modelthe r2-score will always increase simply because the new fit will have more termsThese issues are handled by adjusted r2

Adjusted r2

Adjusted r2 just like r2 indicates how well terms fit a curve or a line The differenceis that adjusted r2 will adjust for the number of terms or predictors in the model Ifmore variables are added that prove to be useless the score will decrease while thescore will increase if useful variables are added This leads adjusted r2 to alwaysbe less than or equal to r2 For n observations and k variables the adjusted r2 iscalculated through

r2adj = 1minus

[(1minusr2)(nminus1)

nminuskminus1

](219)

Adjusted r2 can therefore accurately show the percentage of variation of the in-dependent variables that affect the dependent variables Furthermore adding ofadditional independent variables that do not fit the model will penalize the modelaccuracy by lowering the score [29]

223 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions The modelsallow for random variation in one or more input variables at a time to generatedistributions of potential outcomes

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a movingaverage model (MA) An AR model assumes that a future value of a variable canbe predicted as a linear combination of past values of that variable plus a randomerror with a constant term The MA model is like the AR model in that its output

14

23 NEURAL NETWORKS

value depends on current and past values The difference between the two is thatwhile the AR model intends to model and predict the observed variable the MAmodel intends to model the error term as a linear combination of the error termsthat occur simultaneously and at different past times

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA)and can be fitted to time series data in order to obtain a better understanding ofthe data or be used as forecasting methods to predict future data points WhileARMA requires the data to be completely stationary ie the mean and variance donot change over time ARIMA can process non-stationary time series by removingthe non-stationary nature of the data This means that non-stationary time seriesdata must be processed before it can be modelled Removing the trend makes themean value of the data stationary something that is done by simply differencingthe series To make the series stationary on variance one of the best methods is toapply a log transform to the series Combining the two methods by differencing thelog transformed data makes the entire series stationary on both mean and varianceand allows for the dataset to be processed by the model

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA)model The SARIMA model applies a seasonal differencing of necessary order toremove non-stationarity from the time series ARIMAs and SARIMAs strengthis particularly identified as its ability to predict future data points for univariatetime series In a comparison published by Adhikari et al [30] a SARIMA model isseen to outperform both neural networks and support-vector machines in forecastestimation

23 Neural Networks

231 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such ascomputer vision predictive analytics medical diagnosis and more [31] An NN is apowerful tool for data analysis that similar to other ML programmes performs theirtasks based on inference and patterns rather than explicitly set instructions Thecognitive capabilities of NNs have been used for regression analysis and classificationin both supervised and unsupervised learning The NN passes some inputs from aninput layer to one or more hidden layers and then to an output layer The sizes ofthe input and the output layer are dependent on the input and the output dataEach node in the input layer corresponds to the available input data x and eachnode in the output layer corresponds to the desired output y The nodes are oftenreferred to as neurons and while the neurons in the input and output layers alwaysrepresent the supplied data the neurons in the hidden layer may have very different

15

CHAPTER 2 FRAME OF REFERENCE

properties The result of this is a range of different hidden layers with varyingcharacteristics The use and configurations of these hidden layers in turn dependon what the NN should be able to achieve

232 The PerceptronThe simplest neuron is the perceptron The perceptron takes several binary inputsfrom the input layer to create a single binary output What a perceptron outputs isbased on the weighted sum of the perceptrons inputs and respective weight as wellas individual bias There is a corresponding weight to every input that determineshow important the input is for the output of the perceptron Meanwhile the biasdictates how easy it is for the perceptron to output either a 0 or a 1 These conditionsgive the rule of the perceptron as [32]

output =

0 if w middot x+ b le 01 if w middot x+ b gt 0

(220)

In the above equation x is the input vector w the weight vector and b is theperceptronrsquos individual bias

233 Activation functionsThe output of a neuron is determined by an activation function For the perceptronthe activation function is a simple step function as can be seen in Equation 220The step function is the simplest of activation function and is only able to producestrictly binary results The activation function is used for classification of linearlyseparable data in single-layer perceptrons like the one in Equation 220 Its binarynature means that a small change in weight or bias can flip the output of the neuroncompletely resulting in false classification Furthermore as most networks consistof either multiple perceptrons in a layer or multiple layers the data will not belinearly separable thus the step function will not properly separate and classifythe input data The separation problem is solved by training using backpropagationwhich requires a differentiable activation function something that the step functionis also unable of fulfilling

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_j w_j \cdot x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, negative inputs always yield a zero output and a zero gradient, so neurons that end up in this regime can stop updating entirely, which is known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on ImageNet by 0.9% and 0.6% for the Mobile NASNet-A and Inception-ResNet-v2 models, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
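The three activation functions discussed above can be summarised in a few lines of NumPy; this is an illustrative sketch, not code from the thesis.

```python
import numpy as np

def sigmoid(z):
    # Equation 2.21: any input is squashed into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    # Equation 2.23: the positive part of the argument.
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # Equation 2.24: x * sigmoid(beta * x); beta can be trained or kept constant.
    return x * sigmoid(beta * x)

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), relu(z), swish(z), sep="\n")
```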

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(1)} + f^{(2)} + \cdots + f^{(n)} \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that no single function has to be all-descriptive but instead only needs to capture certain behaviour. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented above, and is further illustrated by the following NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state allows the neurons to feed information from the previous pass of data back to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM block output at the previous time step (h_{t-1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f) \qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
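A minimal NumPy sketch of the gate activations in Equation 2.26 follows (only the gates, not the full cell-state update); the dimensions, random weights and helper names are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Gate activations from Equation 2.26 for one time step.

    Each weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(w_i @ concat + b_i)  # input gate: how much new information is stored
    o_t = sigmoid(w_o @ concat + b_o)  # output gate: how much of the cell state is exposed
    f_t = sigmoid(w_f @ concat + b_f)  # forget gate: how much old information is discarded
    return i_t, o_t, f_t

# Toy dimensions: 2 hidden units, 3 input features.
rng = np.random.default_rng(0)
h_prev, x_t = np.zeros(2), rng.normal(size=3)
w = lambda: rng.normal(size=(2, 5))          # a fresh random weight matrix per gate
print(lstm_gates(h_prev, x_t, w(), w(), w(), *np.zeros((3, 2))))
```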

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
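To make the convolution, max pooling and flattening steps concrete, the following NumPy sketch mimics Figures 2.4-2.6 on a toy 1-dimensional signal; the kernel and signal values are arbitrary example choices, not data from the thesis.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a kernel over a 1-D signal (no padding, stride 1), cf. Figure 2.4."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window, cf. Figure 2.5."""
    trimmed = x[:len(x) // pool_size * pool_size]
    return trimmed.reshape(-1, pool_size).max(axis=1)

signal = np.array([0.1, 0.5, 0.9, 0.4, 0.3, 0.8, 0.2, 0.6])
feature_map = conv1d(signal, kernel=np.array([1.0, 0.0, -1.0]))  # the convolved feature
pooled = max_pool1d(feature_map, pool_size=2)
flattened = pooled.flatten()   # the flattening step before the dense layers (cf. Figure 2.6)
print(feature_map, pooled, flattened, sep="\n")
```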


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Data points were sampled every 5 seconds and contain sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
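A simplified sketch of such a labelling rule is shown below; the slope and flow-drop thresholds are illustrative assumptions, not the values used in the actual labelling script, which also relied on visual inspection.

```python
import numpy as np

def label_clogging(diff_pressure, system_flow, slope_threshold=0.002, flow_drop=0.05):
    """Assign clogging labels 1 (no clogging) or 2 (beginning to clog).

    The thresholds here are illustrative assumptions only."""
    labels = np.ones(len(diff_pressure), dtype=int)
    baseline_flow = system_flow[0]
    for t in range(1, len(diff_pressure)):
        slope = diff_pressure[t] - diff_pressure[t - 1]
        flow_ok = system_flow[t] >= baseline_flow * (1.0 - flow_drop)
        if labels[t - 1] == 2 or (slope > slope_threshold and flow_ok):
            labels[t] = 2  # steady pressure increase while the flow stays roughly constant
    return labels
```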


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM, to initially test the suitability of the data for time series forecasting, and the CNN, for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder (label) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

    1               1 0 0
    2       →       0 1 0
    3               0 0 1

or

    red             1 0 0
    blue    →       0 1 0
    green           0 0 1

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves notably higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
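A minimal scikit-learn sketch of the two transforms follows; the sensor values and labels are placeholder numbers, not data from the test cycles.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = np.array([[0.12, 210.0], [0.35, 190.0], [0.80, 150.0]])   # e.g. diff. pressure, flow
labels = np.array([[1], [2], [2]])                            # clogging labels

# One hot encoding: each clogging label becomes its own binary column.
encoder = OneHotEncoder()
y = encoder.fit_transform(labels).toarray()       # [[1,0],[0,1],[0,1]]

# Min-max scaling (Equation 3.1): every feature is squeezed into [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)   # the transform is easy to invert
```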

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \ldots, V_{n-1}(t), V_n(t)] \qquad (3.2)

X(t) = [V_1(t-5), V_2(t-5), \ldots, V_{n-1}(t), V_n(t)] \qquad (3.3)
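A possible implementation of such a sequencing function is sketched below; the function name and the placeholder data are assumptions made for illustration.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Sliding-window sequencing: n_past past rows predict the following row.

    data has shape (samples, features); X gets shape (samples - n_past, n_past,
    features) and y holds the matching next-step rows."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # the 25 s window of past measurements
        y.append(data[t])              # the value one time step (5 s) ahead
    return np.array(X), np.array(y)

# Example with placeholder data: 100 time steps of 4 scaled variables plus a label column.
data = np.random.rand(100, 5)
X, y = make_sequences(data, n_past=5)   # X: (95, 5, 5), y: (95, 5)
```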


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer, which uses the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
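A Keras sketch consistent with the description above (two LSTM layers of 32 neurons, a sigmoid output neuron, 1500 epochs with early stopping after 150 epochs without improvement) is shown below. The thesis does not state which framework was used, so Keras/TensorFlow, the optimizer and the placeholder arrays are assumptions.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

# Placeholder arrays standing in for the sequenced, scaled dataset.
n_past, n_features = 5, 4
X_train, y_train = np.random.rand(200, n_past, n_features), np.random.rand(200, 1)
X_val, y_val = np.random.rand(50, n_past, n_features), np.random.rand(50, 1)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one output neuron per predicted parameter
])
model.compile(optimizer="adam", loss="mae")  # "mse" was used for the second variant

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)
```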

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
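A corresponding Keras sketch of the multi-step CNN (64 filters of kernel size 4, pool size 2, dense layers of 50 and 6 nodes, early stopping with patience 150) follows; the framework, optimizer and placeholder arrays are assumptions, as above.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

n_past, n_future, n_features = 12, 6, 4
X_train = np.random.rand(200, n_past, n_features)   # placeholder training windows
y_train = np.random.rand(200, n_future)             # placeholder future values

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_future),                                 # 6 predictions, covering 30 s ahead
])
model.compile(optimizer="adam", loss="mae")

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_split=0.2,
          epochs=1500, callbacks=[early_stop], verbose=0)
```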

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than a regression network, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).
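The evaluation metrics referred to in this and the following chapter can be computed with scikit-learn as sketched below; the small arrays are placeholders standing in for the network outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score, log_loss,
                             mean_absolute_error, mean_squared_error, r2_score,
                             roc_auc_score)

# Regression metrics on placeholder values.
y_true, y_pred = np.array([0.2, 0.4, 0.6]), np.array([0.25, 0.38, 0.61])
print(mean_squared_error(y_true, y_pred),
      np.sqrt(mean_squared_error(y_true, y_pred)),   # RMSE
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))

# Classification metrics on placeholder clogging labels (1 or 2).
labels_true = np.array([1, 2, 2, 1, 2])
probs_label2 = np.array([0.1, 0.9, 0.8, 0.4, 0.7])   # predicted probability of label 2
labels_pred = np.where(probs_label2 > 0.5, 2, 1)
print(accuracy_score(labels_true, labels_pred),
      roc_auc_score(labels_true == 2, probs_label2),
      f1_score(labels_true, labels_pred, pos_label=2),
      log_loss(labels_true == 2, probs_label2),
      confusion_matrix(labels_true, labels_pred))
```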

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. Together, the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1        109         1
         Label 2          3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1         82        29
         Label 2         38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1         69        41
         Label 2         11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, regarding how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and the CNN when data containing all clogging labels are available.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki – Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. arXiv: abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Page 9: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

15 METHOD DESCRIPTION

15 Method Description

The task at hand is a combination of two On one hand it is an engineering taskinvestigating if an NN can be developed for the BWTS On the other hand it is ascientific task that focuses on applying and evaluating current research in the areasof NNs water filtration techniques time series analysis and predictive analytics tothe problem To complete both tasks within the time frame a methodology has tobe developed to ensure that the engineering task is done the research questions areanswered and that there is a clear foundation for future research The methodologyis visualized in Figure 11

The basis of the methodology starts with the problem description The problemdescription makes way for establishing the first frame of reference (FOR) and rais-ing the initial potential research questions Following that a better understandingabout the research field in question and continuous discussions with Alfa Laval willhelp adapt the FOR further Working through this process iteratively a final framecan be established where the research area is clear allowing for finalization of theresearch questions

With the frame of reference established the focus shifts to gathering informationthrough appropriate resources such as scientific articles and papers Interviews ex-isting documentation of the BWTS and future use of the NN also helps in narrowingdown the solution space to only contain relevant solutions that are optimal for thethesis With a smaller set of NNs to chose from the best suited network structurewill be developed tested and evaluated

In preparation of constructing and implementing the NN the data have to beprocessed according to the needs of the selected network Pre-processing strate-gies on how datasets are best prepared for NN processing will be investigated andexecuted to ensure that a clear methodology for future processing is executed anddocumented For correctly classifying the current clogging grade or rate of cloggingof the filter state of the art research will be used as reference when commencingwith the labelling of the data

When the classificationlabelling is done implementation through training and test-ing of the NN can begin Sequentially improvement to the NN structure will bemade by comparing the results of different initial weight conditions layer configu-rations and data partitions The experimental results will be graded in terms ofpredictive accuracy and the estimated errors of each individual parameter will betaken into consideration

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on cases where the filtration is done with regards to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need of having to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. For the latter case, backwashing of the filter may only be done when there are enough particles in the water so that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomenon

To predict the clogging phenomenon, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → No/little clogging

2. linear increase in ∆p and steady Q → Moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → Fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \tag{2.1}$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \tag{2.2}$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \tag{2.3}$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$$\Delta p = \frac{150\,V_s \mu (1-\varepsilon)^2 L}{D_p^2\,\varepsilon^3} + \frac{1.75\,(1-\varepsilon)\rho V_s^2 L}{\varepsilon^3 D_p} \tag{2.4}$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
V_s        Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/(m·s)
ε          Porosity of the filter cake          m²
D_p        Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³
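To make the relationship in Equation 2.4 concrete, the short sketch below evaluates the Ergun pressure drop for a set of illustrative cake and flow parameters; the numerical values are placeholders chosen for demonstration and are not measurements from the BWTS.

```python
def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
    """Pressure drop over a filter cake according to the Ergun equation (Eq. 2.4).

    V_s : superficial velocity [m/s]
    mu  : fluid viscosity [kg/(m*s)]
    rho : fluid density [kg/m^3]
    eps : porosity of the cake [-]
    D_p : particle diameter [m]
    L   : cake height [m]
    """
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial  # [Pa]

# Illustrative values only (not BWTS measurements)
print(ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, eps=0.4, D_p=5.0e-5, L=1.0e-3))
```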


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                           Prediction
                     Positive               Negative
Actual  Positive     True Positive (TP)     False Negative (FN)
        Negative     False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \tag{2.5}$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of samples n. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as the sensitivity, while the false positive rate corresponds to 1 − specificity. The two rates are represented by Equations 2.6 and 2.7, respectively:

$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{2.6}$$

$$\text{specificity} = \frac{TN}{TN + FP} \tag{2.7}$$

The sensitivity on the y-axis and the false positive rate (1 − specificity) on the x-axis then give the ROC plot, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is to not misclassify a number of samples [21]. For the F1 score, precision is referred to as the percentage of correctly classified samples and recall is referred to as the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

$$\text{precision} = \frac{TP}{TP + FP} \tag{2.8}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{2.9}$$

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{2.10}$$


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
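As a minimal illustration of Equations 2.5 and 2.8–2.10, the sketch below computes accuracy, precision, recall and F1 directly from confusion-matrix counts; the counts are arbitrary example values, not results from this thesis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts (Eqs. 2.5, 2.8-2.10)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Arbitrary example counts
print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
```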

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

$$\text{LogLoss} = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \tag{2.11}$$

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically the MAE is expressed as

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2.12}$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.13}$$

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14}$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \tag{2.15}$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$MSPE = \frac{100}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \tag{2.16}$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale free and is obtained through

$$MAPE = \frac{100}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2.17}$$


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data:

$$r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \tag{2.18}$$

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \tag{2.19}$$

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
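As a compact reference, the sketch below computes the regression metrics of Equations 2.12–2.14, 2.17 and 2.18 with NumPy; it is an illustrative implementation, not the evaluation code used in the thesis.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, MAPE and r^2 as defined in Equations 2.12-2.14, 2.17 and 2.18."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y_true))   # assumes no zero targets
    corr = np.corrcoef(y_true, y_pred)[0, 1]       # Pearson correlation
    r2 = corr ** 2                                 # squared-correlation form of Eq. 2.18
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "r2": r2}

# Arbitrary example values
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.7]))
```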

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
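For reference, a SARIMA model of the kind described above can be fitted with the statsmodels library as sketched below; the series, the order (p, d, q) and the seasonal order are placeholder choices for illustration, not values tuned for the BWTS data.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder univariate series; in practice this could be e.g. the differential pressure signal
series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(scale=0.1, size=200)

# (p, d, q) and seasonal (P, D, Q, s) are chosen only for illustration
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fitted = model.fit(disp=False)

# Forecast the next 6 data points (30 seconds at a 5-second sampling interval)
print(fitted.forecast(steps=6))
```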

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20}$$

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
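A minimal sketch of the perceptron rule in Equation 2.20 is given below; the weights, bias and inputs are arbitrary illustrative values.

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron rule from Equation 2.20: output 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Arbitrary weights and bias; this particular choice happens to implement a logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```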

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of the activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable; thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21}$$

for

$$z = \sum_{j} w_j \cdot x_j + b \tag{2.22}$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

$$f(x) = x^+ = \max(0, x) \tag{2.23}$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit it outputs zero for inputs that are negative or that approach zero, which can cause neurons to stop learning, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \tag{2.24}$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used benchmarks such as ImageNet and on models such as Mobile NASNet-A by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
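The three activation functions of Equations 2.21, 2.23 and 2.24 can be written compactly as below; this is only an illustrative NumPy sketch of the definitions, with β treated as a fixed constant.

```python
import numpy as np

def sigmoid(z):
    """Equation 2.21."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Equation 2.23: the positive part of the argument."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Equation 2.24 with beta as a constant."""
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z), sep="\n")
```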

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \tag{2.25}$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but can instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is individually also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the LSTM block output at the previous time step h_{t−1}, the input at the current time step x_t and the respective gate bias b_x, as

$$\begin{aligned}
i_t &= \sigma\!\left(\omega_i \left[h_{t-1},\, x_t\right] + b_i\right) \\
o_t &= \sigma\!\left(\omega_o \left[h_{t-1},\, x_t\right] + b_o\right) \\
f_t &= \sigma\!\left(\omega_f \left[h_{t-1},\, x_t\right] + b_f\right)
\end{aligned} \tag{2.26}$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
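To make the gate computations of Equation 2.26 concrete, the NumPy sketch below evaluates the three gates for a single time step; the weight matrices and biases are randomly initialised placeholders, and the cell/hidden state update is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    """Input, output and forget gates of Equation 2.26 for one time step."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)
    o_t = sigmoid(W_o @ concat + b_o)
    f_t = sigmoid(W_f @ concat + b_f)
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
hidden, features = 4, 3                       # placeholder sizes
h_prev, x_t = np.zeros(hidden), rng.normal(size=features)
W = lambda: rng.normal(size=(hidden, hidden + features))
b = np.zeros(hidden)
i_t, o_t, f_t = lstm_gates(h_prev, x_t, W(), W(), W(), b, b, b)
print(i_t, o_t, f_t, sep="\n")
```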

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allows the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
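The kernel sliding and max pooling operations illustrated in Figures 2.4 and 2.5 can be expressed in a few lines of NumPy, as sketched below with an arbitrary 1-dimensional input and kernel.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a kernel over a 1-D input (no padding, stride 1), cf. Figure 2.4."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window, cf. Figure 2.5."""
    n = len(x) // pool_size
    return np.array([x[i * pool_size:(i + 1) * pool_size].max() for i in range(n)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0])
kernel = np.array([0.5, 1.0, 0.5])   # arbitrary kernel of size 3
feature = conv1d(x, kernel)          # convolved feature
print(feature, max_pool1d(feature), sep="\n")
```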


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix}1\\2\\3\end{bmatrix} \rightarrow \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0\\0 & 0 & 1\end{bmatrix} \quad \text{or} \quad \begin{bmatrix}\text{red}\\\text{blue}\\\text{green}\end{bmatrix} \rightarrow \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0\\0 & 0 & 1\end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to equally predict all the actual classification labels rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
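A possible way of applying these two transforms with scikit-learn is sketched below; the example sensor values and labels are illustrative and do not correspond to the actual dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative sensor values (rows = samples) and clogging labels
X = np.array([[0.12, 54.0], [0.35, 51.0], [0.80, 47.0]])
labels = np.array([[1], [2], [2]])

# Equation 3.1 applied column-wise; the transform is invertible
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)

# One hot encoding of the clogging labels
encoder = OneHotEncoder()
y_onehot = encoder.fit_transform(labels).toarray()

print(X_scaled, y_onehot, sep="\n")
```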

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2}$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3}$$
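A sequencing function of the kind described above can be sketched as follows; it turns a multivariate series into (samples, time steps, features) windows with a matching one-step-ahead target. The variable names are illustrative, not those of the actual implementation.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Window a (samples, features) array into LSTM input of shape
    (samples - n_past, n_past, features), with the value at the next
    time step (5 seconds ahead) as the prediction target."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the past n_past measurements
        y.append(data[i])              # the measurement one time step ahead
    return np.array(X), np.array(y)

# Illustrative data: 100 time steps of 4 sensor variables
data = np.random.rand(100, 4)
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)   # (95, 5, 4) (95, 4)
```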


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
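A possible Keras realisation of the LSTM set-up described above (two 32-neuron LSTM layers with ReLU, a single-neuron sigmoid output, MAE or MSE loss, and early stopping after 150 epochs without improvement) is sketched below; the exact input shape, feature count and optimizer are assumptions, as they are not stated in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_past, n_features = 5, 4   # 5 past time steps; the feature count is an assumption

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_past, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # one-step parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # "mse" was evaluated as well; optimizer is assumed

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```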

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
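A Keras sketch matching the described CNN (Conv1D with 64 filters and kernel size 4, max pooling with pool size 2, flattening, a 50-node dense layer and a 6-node output) is given below; the input shape, activation functions and optimizer are assumptions, as they are not stated explicitly in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_past, n_features = 12, 4   # 12 past observations; the feature count is an assumption

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_past, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                      # 6 future observations (30 seconds ahead)
])
model.compile(optimizer="adam", loss="mae")   # "mse" was evaluated as well; optimizer is assumed

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```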

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score under the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                   Label 1   Label 2
Actual  Label 1    109       1
        Label 2    3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                   Label 1   Label 2
Actual  Label 1    82        29
        Label 2    38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                   Label 1   Label 2
Actual  Label 1    69        41
        Label 2    11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Physical Review Letters, 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se

Contents
  • Introduction
    • Background
    • Problem Description
    • Purpose, Definitions & Research Questions
    • Scope and Delimitations
    • Method Description
  • Frame of Reference
    • Filtration & Clogging Indicators
      • Basket Filter
      • Self-Cleaning Basket Filters
      • Manometer
      • The Clogging Phenomena
      • Physics-based Modelling
    • Predictive Analytics
      • Classification Error Metrics
      • Regression Error Metrics
      • Stochastic Time Series Models
    • Neural Networks
      • Overview
      • The Perceptron
      • Activation functions
      • Neural Network Architectures
  • Experimental Development
    • Data Gathering and Processing
    • Model Generation
      • Regression Processing with the LSTM Model
      • Regression Processing with the CNN Model
      • Label Classification
    • Model evaluation
    • Hardware Specifications
  • Results
    • LSTM Performance
    • CNN Performance
  • Discussion & Conclusion
    • The LSTM Network
      • Regression Analysis
      • Classification Analysis
    • The CNN
      • Regression Analysis
      • Classification Analysis
    • Comparison Between Both Networks
    • Conclusion
  • Future Work
  • Bibliography


can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis


Chapter 2

Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1-\varepsilon)^2}{\varepsilon^3} \, L \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
V_s        Superficial (empty-tower) velocity    m/s
µ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           -
D_p        Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³
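As an illustration of how Equation 2.4 can be evaluated numerically, the following is a minimal Python sketch; the variable names mirror Table 2.1, and the values in the usage example are arbitrary placeholders rather than measurements from the BWTS.

    def ergun_pressure_drop(L, V_s, mu, eps, D_p, rho):
        """Differential pressure over a filter cake according to the Ergun equation (Eq. 2.4)."""
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    # Example with arbitrary placeholder values (SI units).
    dp = ergun_pressure_drop(L=0.005, V_s=0.1, mu=1.0e-3, eps=0.4, D_p=1.0e-4, rho=1000.0)
    print(f"Differential pressure: {dp:.1f} Pa")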


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                        Prediction
                        Positive                Negative
Actual   Positive       True Positive (TP)      False Negative (FN)
         Negative       False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.
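Equation 2.5 translates directly into code; a minimal sketch in Python with made-up label vectors (the labels themselves are hypothetical, not thesis data):

    def accuracy(y_true, y_pred):
        """Fraction of samples where the predicted label equals the actual label (Eq. 2.5)."""
        correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
        return correct / len(y_true)

    # Hypothetical clogging labels: 1 = no/little clogging, 2 = moderate clogging.
    y_true = [1, 1, 2, 2, 2, 1]
    y_pred = [1, 2, 2, 2, 2, 1]
    print(accuracy(y_true, y_pred))  # 0.833...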


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while the false positive rate equals one minus the specificity (the true negative rate). Sensitivity and specificity are given by Equations 2.6 and 2.7 respectively:

sensitivity = \frac{TP}{TP + FN} \qquad (2.6)

specificity = \frac{TN}{TN + FP} \qquad (2.7)

The sensitivity on the y-axis and the false positive rate (1 - specificity) on the x-axis then give the ROC plot, where every correctly classified true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a better performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP} \qquad (2.8)

recall = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
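Equations 2.8-2.10 can be computed directly from confusion-matrix counts; the sketch below is an illustration with made-up counts (scikit-learn's precision_score, recall_score and f1_score provide the same functionality on raw label vectors).

    def precision_recall_f1(tp, fp, fn):
        """Precision (Eq. 2.8), recall (Eq. 2.9) and F1 score (Eq. 2.10) from confusion-matrix counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Hypothetical counts for the positive class.
    print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)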

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers with enough samples n.
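A minimal NumPy sketch of Equations 2.12-2.14; the example arrays are arbitrary values, not thesis data.

    import numpy as np

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))   # Eq. 2.12

    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)    # Eq. 2.13

    def rmse(y_true, y_pred):
        return np.sqrt(mse(y_true, y_pred))       # Eq. 2.14

    y_true = np.array([0.20, 0.25, 0.30, 0.50])
    y_pred = np.array([0.22, 0.24, 0.35, 0.40])
    print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))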

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bound between -∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r².

Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r² can therefore accurately show the percentage of variation in the dependent variable that is explained by the independent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
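The two scores can be computed as follows; this is a sketch of Equations 2.18 and 2.19 only, where the number of predictors k and the example arrays are arbitrary assumptions.

    import numpy as np

    def r2_squared_correlation(y_true, y_pred):
        """Coefficient of determination as the squared correlation between y and y-hat (Eq. 2.18)."""
        y_c = y_true - y_true.mean()
        p_c = y_pred - y_pred.mean()
        return (np.sum(y_c * p_c) / np.sqrt(np.sum(y_c ** 2) * np.sum(p_c ** 2))) ** 2

    def adjusted_r2(r2, n, k):
        """Adjusted r2 for n observations and k predictors (Eq. 2.19)."""
        return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

    y_true = np.array([1.0, 1.2, 1.1, 1.4, 1.8])
    y_pred = np.array([1.1, 1.2, 1.0, 1.5, 1.7])
    r2 = r2_squared_correlation(y_true, y_pred)
    print(r2, adjusted_r2(r2, n=len(y_true), k=3))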

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
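For reference, fitting such a model to a univariate series is straightforward with the statsmodels package; the sketch below uses a placeholder random-walk series and illustrative (untuned) orders, so it should be read as an example of the modelling approach rather than a model of the filtration data.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Placeholder univariate series; in practice this could be e.g. the differential pressure.
    rng = np.random.default_rng(0)
    series = np.cumsum(rng.normal(size=200))

    # SARIMA(p,d,q)(P,D,Q,s); the orders below are illustrative, not tuned.
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
    fitted = model.fit(disp=False)
    forecast = fitted.forecast(steps=6)  # the next 6 points, i.e. 30 s at 5 s sampling
    print(forecast)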

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
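Equation 2.20 written out as code; a minimal sketch in which the weights and bias are made-up values.

    import numpy as np

    def perceptron(x, w, b):
        """Binary perceptron output according to Eq. 2.20."""
        return 1 if np.dot(w, x) + b > 0 else 0

    x = np.array([1, 0, 1])         # binary inputs
    w = np.array([0.6, -0.4, 0.3])  # one weight per input
    b = -0.5                        # bias: how easily the perceptron fires
    print(perceptron(x, w, b))      # 1, since 0.6 + 0.3 - 0.5 > 0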

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The step function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot usefully process inputs that are negative or that approach zero, which is also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot sigmoid(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has proved to improve the top-1 classification accuracy on ImageNet by 0.9 % for Mobile NASNet-A and 0.6 % for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
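The three activation functions discussed above can be summarised in a few lines of NumPy; a sketch for illustration only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # Eq. 2.21

    def relu(x):
        return np.maximum(0.0, x)         # Eq. 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)      # Eq. 2.24

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z), relu(z), swish(z))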

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may lie in well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented above, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0\ 0\ 1\ 1\ 0\ 0\ 0] \qquad x_2 = [0\ 0\ 0\ 1\ 1\ 0\ 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, previous LSTM-block output h_{t-1} at the previous time step, input x_t at the current time step and respective gate bias b_x, as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f) \qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could at first seem odd, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
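A direct NumPy transcription of the gate activations in Equation 2.26; the weight matrices and bias vectors here are randomly initialised placeholders rather than trained parameters, and the full cell-state update is omitted.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
        """Input, output and forget gate activations of one LSTM block (Eq. 2.26)."""
        concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
        i_t = sigmoid(W_i @ concat + b_i)
        o_t = sigmoid(W_o @ concat + b_o)
        f_t = sigmoid(W_f @ concat + b_f)
        return i_t, o_t, f_t

    hidden, features = 4, 3
    rng = np.random.default_rng(1)
    W = lambda: rng.normal(size=(hidden, hidden + features))  # placeholder weights
    b = lambda: np.zeros(hidden)                              # placeholder biases
    gates = lstm_gates(np.zeros(hidden), rng.normal(size=features), W(), W(), W(), b(), b(), b())
    print(gates)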

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is being kept from the past state and how much information is being let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel would be returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map
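Expressed with the Keras API, a 1-dimensional convolution, max pooling and flattening stack of the kind described above might look as follows; the filter count, layer sizes and the input shape (5 time steps, 4 features) are illustrative assumptions, not the exact architecture used in this thesis.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    model = Sequential([
        # Kernel of size 3 sliding over the time axis (cf. Figure 2.4).
        Conv1D(filters=16, kernel_size=3, activation="relu", input_shape=(5, 4)),
        # Pool size 2 halves the temporal dimension (cf. Figure 2.5).
        MaxPooling1D(pool_size=2),
        # Flatten the pooled feature map before the dense layers (cf. Figure 2.6).
        Flatten(),
        Dense(4)  # e.g. one output per predicted system variable
    ])
    model.summary()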


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
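The labelling rules can be sketched as a simple function over a window of measurements; this is an illustrative reconstruction of the logic described above, not the actual script, and the slope and flow-drop thresholds are hypothetical placeholders that in practice would be tuned against the visual inspection.

    import numpy as np

    def clogging_label(dp_window, flow_window, dp_lin_slope=0.001, dp_exp_slope=0.01, flow_drop=0.2):
        """Assign a clogging label (1, 2 or 3) to a window of differential pressure and system flow."""
        dp_slope = np.polyfit(np.arange(len(dp_window)), dp_window, 1)[0]   # trend of dp
        flow_change = (flow_window[-1] - flow_window[0]) / flow_window[0]   # relative flow change
        if dp_slope > dp_exp_slope and flow_change < -flow_drop:
            return 3  # fully clogged: rapid dp increase, drastic flow decrease
        if dp_slope > dp_lin_slope and flow_change > -flow_drop:
            return 2  # beginning to clog: steady dp increase, roughly constant flow
        return 1      # no/little clogging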


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples   Points labelled clog-1   Points labelled clog-2
I        685       685                      0
II       220       25                       195
III      340       35                       305
IV       210       11                       199
V        375       32                       343
VI       355       7                        348
VII      360       78                       282
VIII     345       19                       326
IX       350       10                       340
X        335       67                       268
XI       340       43                       297

Total    3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin the pre-processed data had to be pro-cessed further to increase network accuracy and time efficiency Because of the factthat large values in the input data can result in a model forced to learn large weightsthus resulting in an unstable model a label transform and a scaler transform areapplied to the input data The purpose of the encoder transform is to retain thedifference between the determined clogging labels and the scaler transform ensuresthat the data is within an appropriate scale range

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that no category is treated or predicted as more important than another, which is desirable because all the actual classification labels should be predicted equally rather than a certain category being prioritised. Seger [49] has shown the precision of one hot encoding to be on par with other equally simple encoding techniques, while Potdar et al. [50] show that one hot encoding achieves considerably higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.
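As a concrete illustration, a minimal sketch of producing such an encoding with scikit-learn is shown below; the clogging labels are those described in section 3.1, while everything else is an illustrative assumption.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [2], [1]])            # clogging labels as a column vector
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()   # 1 -> [1, 0], 2 -> [0, 1]
restored = encoder.inverse_transform(onehot)       # the encoding is reversible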

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
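A minimal sketch of this scaling step with scikit-learn's MinMaxScaler is shown below; the feature matrix and its values are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.20, 210.0],     # e.g. differential pressure and system flow rate
              [0.35, 195.0],
              [0.90, 140.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)               # every feature now lies in [0, 1]
X_restored = scaler.inverse_transform(X_scaled)  # revert to the original values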

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \ldots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.2)

X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \ldots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.3)
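A minimal sketch of such a sequencing function is shown below, assuming the scaled data are held in a NumPy array with one row per 5-second sample; the function name and window argument are assumptions for illustration.

import numpy as np

def make_sequences(data, n_past=5):
    # data: array of shape (samples, features), one row every 5 seconds
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the 25-second window of past values
        y.append(data[i])              # the value one time step (5 s) ahead
    return np.array(X), np.array(y)    # X has shape (samples, time steps, features)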


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points
• Time steps - The points of observation of the samples
• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way helps to ensure that the network is not overfitted to the training data.
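A minimal Keras sketch of a network matching this description is given below (two 32-neuron LSTM layers with ReLU, a single sigmoid output neuron, and early stopping with a patience of 150 epochs); the optimiser, loss and the number of input features are assumptions for illustration and not taken from the thesis code.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 4   # 5 past values per prediction; 4 system variables assumed

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),   # single output neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])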

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
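A minimal Keras sketch of a CNN matching this description is shown below (64 filters with kernel size 4, max pooling with pool size 2, a 50-node dense layer and a 6-value output); the activation functions, optimiser, loss and number of input features are assumptions for illustration.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_steps_in, n_steps_out, n_features = 12, 6, 1   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),               # one output node per predicted time step
])
model.compile(optimizer="adam", loss="mse")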

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed around the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score under the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      109       1
         Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from the MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result presents a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates when filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if the classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN with data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O. F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O. F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality. 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. arXiv:1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. arXiv:1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1_score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. arXiv:1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv:1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. arXiv:1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv:1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on cases where the filtration is done on water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter is accompanied by the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in
2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging
2. linear increase in ∆p and steady Q → moderate clogging
3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{KA}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{KA} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
V_s        Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/(m·s)
ε          Porosity of the filter cake          -
D_p        Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³
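To make the role of these variables concrete, a small sketch of the Ergun equation as a Python function is shown below; the example values are arbitrary and only meant to illustrate the calculation, not actual conditions in the BWTS.

def ergun_pressure_drop(V_s, mu, rho, D_p, L, eps):
    # Viscous (laminar) term of the Ergun equation
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    # Inertial term, dominant at higher velocities
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial   # pressure drop in Pa

# Illustrative call with arbitrary SI values
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0,
                         D_p=1.0e-4, L=0.005, eps=0.4)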


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outcomes that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

                     Prediction
                     Positive                Negative
Actual   Positive    True Positive (TP)      False Negative (FN)
         Negative    False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, when using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. In ML literature the true positive rate is commonly referred to as sensitivity, while specificity denotes the true negative rate. Both are represented by Equations 2.6 and 2.7, respectively.

sensitivity = \frac{TP}{TP + FN} \qquad (2.6)

specificity = \frac{TN}{TN + FP} \qquad (2.7)

The sensitivity on the y-axis and the false positive rate (1 - specificity) on the x-axis then give the ROC plot, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of predicted positives that are correctly classified, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP} \qquad (2.8)

recall = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)
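A minimal sketch of computing these classification metrics with scikit-learn is shown below; the label and probability vectors are illustrative, with 1 marking the positive class.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

y_true = [0, 1, 1, 0, 1, 1]                  # actual class labels
y_pred = [0, 1, 1, 1, 1, 1]                  # predicted class labels
y_prob = [0.1, 0.8, 0.9, 0.6, 0.85, 0.95]    # predicted probability of class 1

print(accuracy_score(y_true, y_pred))   # overall accuracy
print(f1_score(y_true, y_pred))         # F1 score
print(roc_auc_score(y_true, y_prob))    # area under the ROC curve
print(log_loss(y_true, y_prob))         # logarithmic loss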

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Averaging the absolute difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away the predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the average of the squared difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers with a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free, in comparison to MSE and RMSE, and bound between -∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r2 has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r2-score will always increase simply because the new fit has more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r2 can therefore accurately show the percentage of variation explained by the independent variables that affect the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
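A minimal sketch of computing the regression metrics above with NumPy and scikit-learn is shown below; the value arrays are illustrative.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.20, 0.25, 0.32, 0.41])   # actual values
y_pred = np.array([0.22, 0.24, 0.35, 0.38])   # predicted values

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
r2   = r2_score(y_true, y_pred)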

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA)and can be fitted to time series data in order to obtain a better understanding ofthe data or be used as forecasting methods to predict future data points WhileARMA requires the data to be completely stationary ie the mean and variance donot change over time ARIMA can process non-stationary time series by removingthe non-stationary nature of the data This means that non-stationary time seriesdata must be processed before it can be modelled Removing the trend makes themean value of the data stationary something that is done by simply differencingthe series To make the series stationary on variance one of the best methods is toapply a log transform to the series Combining the two methods by differencing thelog transformed data makes the entire series stationary on both mean and varianceand allows for the dataset to be processed by the model

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of necessary order to remove non-stationarity from the time series. ARIMA's and SARIMA's strength is particularly identified as the ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
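As an illustration of how such a model could be applied to a univariate series, the sketch below fits a SARIMA model with the statsmodels library. The series, the chosen orders and the forecast horizon are assumptions made for demonstration only, not values used in the thesis.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical univariate series, e.g. a pressure signal sampled every 5 s
series = pd.Series(np.cumsum(np.random.randn(500)) + 100.0)

# Log transform stabilises the variance; the differencing is handled by d=1
log_series = np.log(series)

# SARIMA(p, d, q)(P, D, Q, s) - the orders here are illustrative only
model = SARIMAX(log_series, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)

# Forecast the next 6 points (30 s ahead at a 5 s sampling interval)
forecast = np.exp(fitted.forecast(steps=6))
print(forecast)
```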

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configurations of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$ \text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20} $$

In the above equation x is the input vector, w the weight vector and b is the perceptron's individual bias.
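A minimal NumPy sketch of the perceptron rule in Equation 2.20; the weights and bias below are arbitrary example values.

```python
import numpy as np

def perceptron(x, w, b):
    # Fires (outputs 1) only when the weighted sum plus bias is positive
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.6, -0.4, 0.9])  # one weight per binary input
b = -0.5                        # bias sets how easily the neuron fires
print(perceptron(np.array([1, 0, 1]), w, b))  # -> 1
print(perceptron(np.array([0, 1, 0]), w, b))  # -> 0
```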

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also incapable of fulfilling.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$ f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21} $$

for

$$ z = \sum_{j} w_j \cdot x_j + b \tag{2.22} $$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$$ f(x) = x^{+} = \max(0, x) \tag{2.23} $$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$ f(x) = x \cdot \mathrm{sigmoid}(\beta x) \tag{2.24} $$

where β is a trainable parameter or simply a constant. Swish has proved to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
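For reference, the three activation functions above can be written compactly as plain NumPy functions; the sketch below is only an illustration of Equations 2.21, 2.23 and 2.24.

```python
import numpy as np

def sigmoid(z):
    # Equation 2.21: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    # Equation 2.23: passes positive inputs, zeroes out the rest
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # Equation 2.24: x scaled by sigmoid(beta * x); beta may be trainable
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z), sep="\n")
```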

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. Explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$ f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \tag{2.25} $$

where each function represents a layer and they all together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be because of well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true for the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$ x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix} $$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (it), output (ot) and forget (ft). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ωx), previous LSTM-block output at the previous time step (ht−1), input at the current time step (xt) and respective gate bias (bx), as

$$ \begin{aligned} i_t &= \sigma(\omega_i \left[ h_{t-1}, x_t \right] + b_i) \\ o_t &= \sigma(\omega_o \left[ h_{t-1}, x_t \right] + b_o) \\ f_t &= \sigma(\omega_f \left[ h_{t-1}, x_t \right] + b_f) \end{aligned} \tag{2.26} $$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information could be seen as odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
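A minimal NumPy sketch of the gate computations in Equation 2.26; the dimensions and random weights are arbitrary illustrations, and the cell state and hidden state updates that follow the gates in a full LSTM cell are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden = 4, 8
rng = np.random.default_rng(0)

# One weight matrix and bias per gate, acting on the concatenated [h_{t-1}, x_t]
w_i, w_o, w_f = (rng.standard_normal((n_hidden, n_hidden + n_features)) for _ in range(3))
b_i = b_o = b_f = np.zeros(n_hidden)

h_prev = np.zeros(n_hidden)            # previous LSTM-block output
x_t = rng.standard_normal(n_features)  # input at the current time step
hx = np.concatenate([h_prev, x_t])

i_t = sigmoid(w_i @ hx + b_i)  # input gate
o_t = sigmoid(w_o @ hx + b_o)  # output gate
f_t = sigmoid(w_f @ hx + b_f)  # forget gate
print(i_t.shape, o_t.shape, f_t.shape)
```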

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is being kept from the past state and how much information is being let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation in the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers to allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling-layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling-layer the average of all values within the kernel would be returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores the noisy activations by only extracting the maximum value, as well as removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
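To make the three operations concrete, the sketch below slides a kernel of size 3 over a short 1-dimensional input, max-pools the result with pool size 2 and flattens it, mirroring Figures 2.4 to 2.6. It is a plain NumPy illustration with arbitrary example values.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0])  # 1D input signal
kernel = np.array([0.25, 0.5, 0.25])                     # kernel of size 3

# Convolution: slide the kernel over the input and take the weighted sum
conv = np.array([np.dot(x[i:i + 3], kernel) for i in range(len(x) - 2)])

# Max pooling with pool size 2: keep the largest value in each pair
pooled = np.array([conv[i:i + 2].max() for i in range(0, len(conv) - 1, 2)])

# Flattening: already 1D here, but reshape(-1) is what a flattening layer does
flat = pooled.reshape(-1)
print(conv, pooled, flat, sep="\n")
```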


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to equally predict all the actual classification labels rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
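A sketch of how the label transform could be applied with scikit-learn; the label values mirror the clogging labels used in the thesis, but the exact implementation used by the author is not specified.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Clogging labels as a single categorical column
labels = np.array([[1], [2], [2], [1], [2]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(labels).toarray()
print(encoder.categories_)  # [array([1, 2])]
print(one_hot)              # each row is a binary vector; one column per label value
```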

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

$$ \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1} $$

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
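A corresponding sketch with scikit-learn's MinMaxScaler, applied to a hypothetical sensor column; inverse_transform reverts the scaled values back to the original range after processing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical differential pressure readings (one feature column)
dp = np.array([[0.12], [0.15], [0.22], [0.35], [0.80]])

scaler = MinMaxScaler()              # maps each feature to the range [0, 1]
dp_scaled = scaler.fit_transform(dp)
dp_restored = scaler.inverse_transform(dp_scaled)  # easy to revert
print(dp_scaled.ravel(), dp_restored.ravel(), sep="\n")
```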

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

$$ X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2} $$

$$ X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3} $$


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points
• Time steps - The points of observation of the samples
• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network training in such a way ensures that the network is not overfitted to the training data.
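A hedged sketch of how such a network could be assembled with Keras; the layer sizes, activations and early-stopping patience follow the description above, while details such as the optimiser and the exact input dimensions are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_past, n_features = 5, 4  # 5 past time steps, 4 sensor variables (assumed shape)

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_past, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # MAE or MSE, as in section 3.3

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```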

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data, with a kernel size of 4 time steps, to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
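A corresponding sketch of the CNN in Keras, following the filter count, kernel size, pool size and dense layer sizes described above; the optimiser, the input dimensions and the linear output activation are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_past, n_features, n_future = 12, 4, 6  # 60 s of history, 30 s of predictions

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_past, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(n_future),  # one output per predicted future observation
])
model.compile(optimizer="adam", loss="mse")  # MAE or MSE, as in section 3.3

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```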

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided in the network directly which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks for predicting future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors are penalised more heavily as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, the MSE favours a prediction at the mean of two modes, which is a poor prediction, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.
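The difference can be illustrated numerically. The sketch below compares a constant prediction placed between two modes with one placed at a single mode, using made-up values.

```python
import numpy as np

# Bimodal targets: half the samples sit at 1.0, half at 3.0
y_true = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])

# Two constant predictions: between the modes, or at one of the modes
for label, c in [("mean of modes (2.0)", 2.0), ("at one mode (1.0)", 1.0)]:
    y_pred = np.full_like(y_true, c)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    print(f"{label}: MSE={mse:.2f}, MAE={mae:.2f}")
# MSE prefers the in-between value, while MAE scores the two equally,
# so an MAE-trained model is free to predict at the individual modes.
```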

For the clogging labels, the networks used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       109        1
          Label 2       3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       82         29
          Label 2       38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       69         41
          Label 2       11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the positive number and negative number of examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better result for the LSTM than for the CNN. However, that is to be expected as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. Although, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, regarding how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem in optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valérie Bourdès, Stéphane Bonnevay, P.J.G. Lisboa, Rémy Defrance, David Pérol, Sylvie Chabaud, Thomas Bachelot, Thérèse Gargi and Sylvie Négrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 11 pages, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Page 12: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 2 FRAME OF REFERENCE

Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The differential pressure ∆p over the filter is measured as the difference between the incoming and the outgoing water pressure, obtained through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter element in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomenon

To predict the clogging phenomenon, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in the incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.
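A minimal rule-based sketch of this labelling logic is shown below, assuming the trends in ∆p and Q have already been estimated, for example from the slope over a recent time window. The threshold names and values are hypothetical placeholders used only to illustrate the three states, not values taken from the thesis.

```python
def clogging_state(dp_trend, dp_curvature, q_trend,
                   lin_tol=0.01, exp_tol=0.05, q_drop_tol=-0.05):
    """Map trends in differential pressure and flow rate to a clogging state.

    dp_trend     -- first derivative of the differential pressure over time
    dp_curvature -- second derivative, separates linear from exponential growth
    q_trend      -- first derivative of the system flow rate
    The tolerance values are illustrative placeholders, not values from the thesis.
    """
    if dp_curvature > exp_tol and q_trend < q_drop_tol:
        return 3  # exponential increase in dp, drastic decrease in Q -> fully clogged
    if dp_trend > lin_tol and q_trend >= q_drop_tol:
        return 2  # steady/linear increase in dp, roughly constant Q -> moderate clogging
    return 1      # steady state dp and Q -> no/little clogging
```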

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \quad (2.1)$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \quad (2.2)$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2 L}{\varepsilon^3} \quad (2.3)$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. These are considered by the later Ergun equation [11]:

$$\Delta p = \frac{150\, V_s \mu (1-\varepsilon)^2 L}{D_p^2\, \varepsilon^3} + \frac{1.75\,(1-\varepsilon)\rho V_s^2 L}{\varepsilon^3 D_p} \quad (2.4)$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of the filter cake      m
V_s        Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/(m·s)
ε          Porosity of the filter cake          –
D_p        Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
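As an illustration of Equation 2.4, the sketch below evaluates the Ergun pressure drop for one set of assumed values; the numbers are placeholders chosen only to show the viscous and inertial terms separately, not measurements from the filter tests.

```python
def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
    """Pressure drop over a packed bed / filter cake according to the Ergun equation (Eq. 2.4)."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Example with assumed SI values: a water-like fluid passing a thin cake of fine particles.
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0,
                         eps=0.4, D_p=1.0e-4, L=1.0e-3)
print(f"Pressure drop: {dp:.1f} Pa")
```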

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                          Prediction
                    Positive              Negative
Actual  Positive    True Positive (TP)    False Negative (FN)
        Negative    False Positive (FP)   True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \quad (2.5)$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric gives rise to two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions rarely hold true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in the ML literature commonly referred to as sensitivity, and the true negative rate as specificity (the false positive rate equals 1 − specificity). Both rates are represented by Equations 2.6 and 2.7, respectively:

$$\text{sensitivity} = \frac{TP}{TP + FN} \quad (2.6)$$

$$\text{specificity} = \frac{TN}{TN + FP} \quad (2.7)$$

Plotting the sensitivity on the y-axis against 1 − specificity on the x-axis then gives the ROC plot, where every true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

$$\text{precision} = \frac{TP}{TP + FP} \quad (2.8)$$

$$\text{recall} = \frac{TP}{TP + FN} \quad (2.9)$$

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (2.10)$$


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

$$\mathrm{LogLoss} = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \quad (2.11)$$
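A short sketch of how these classification metrics could be computed with scikit-learn is given below; the label and probability arrays are made-up examples, not data from the thesis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

# Hypothetical true labels and model outputs for a binary problem.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.1, 0.55])  # predicted P(class 1)
y_pred = (y_prob >= 0.5).astype(int)                           # thresholded labels

print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_prob))   # uses probabilities, not hard labels
print("F1 score:", f1_score(y_true, y_pred))
print("Log loss:", log_loss(y_true, y_prob))        # Eq. 2.11 averaged over samples
```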

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \quad (2.12)$$

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (2.13)$$

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (2.14)$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

$$\frac{\partial\,\mathrm{RMSE}}{\partial \hat{y}_i} = \frac{1}{2\sqrt{\mathrm{MSE}}}\,\frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i} \quad (2.15)$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$\mathrm{MSPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \quad (2.16)$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \quad (2.17)$$


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \quad (2.18)$$

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² will adjust for the number of terms or predictors in the model. If variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \quad (2.19)$$

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
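The regression metrics above can be computed directly from the actual and predicted values; the sketch below does so with numpy and scikit-learn on made-up arrays, with adjusted r² implemented manually from Equation 2.19 and an assumed number of predictors k.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 1.2, 1.1, 1.4, 1.6, 1.5])   # hypothetical actual values
y_pred = np.array([1.1, 1.1, 1.2, 1.3, 1.5, 1.6])   # hypothetical predictions

mae  = mean_absolute_error(y_true, y_pred)                  # Eq. 2.12
mse  = mean_squared_error(y_true, y_pred)                   # Eq. 2.13
rmse = np.sqrt(mse)                                         # Eq. 2.14
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # Eq. 2.17
r2   = r2_score(y_true, y_pred)

n, k = len(y_true), 4                                       # k = number of predictors, assumed
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)               # Eq. 2.19

print(mae, mse, rmse, mape, r2, r2_adj)
```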

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs is particularly their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
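A minimal forecasting sketch with a SARIMA model from statsmodels is shown below; the model orders and the synthetic series are assumptions made purely for illustration, not choices made in the thesis.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic univariate series with a weak trend and a period-12 seasonal component.
rng = np.random.default_rng(0)
t = np.arange(120)
series = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.2, size=t.size)

# The (p, d, q) and seasonal (P, D, Q, s) orders below are illustrative assumptions.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)

print(fitted.forecast(steps=6))  # forecast the next six data points
```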

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \quad (2.20)$$

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
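Equation 2.20 can be written directly as a small function; the sketch below uses numpy, and the example weights and bias are purely illustrative.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron output according to Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: a perceptron with assumed weights that behaves like a logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x, dtype=float), w, b))
```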

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \quad (2.21)$$

for

$$z = \sum_{j} w_j \cdot x_j + b \quad (2.22)$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$$f(x) = x^+ = \max(0, x) \quad (2.23)$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot meaningfully process inputs that are either negative or approach zero, as such inputs give a zero output and a zero gradient, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \mathrm{sigmoid}(\beta x) \quad (2.24)$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
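For reference, the three activation functions discussed above can be expressed in a few lines of numpy; β is left as a constant here, whereas Equation 2.24 also allows it to be trainable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # Eq. 2.21

def relu(x):
    return np.maximum(0.0, x)          # Eq. 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)       # Eq. 2.24, with beta fixed for simplicity

z = np.linspace(-4, 4, 9)
print(sigmoid(z))
print(relu(z))
print(swish(z))
```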

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$f(x) = f^{(1)} + f^{(2)} + \ldots + f^{(n)} \quad (2.25)$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x) as

$$\begin{aligned} i_t &= \sigma(\omega_i\,[h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o\,[h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f\,[h_{t-1}, x_t] + b_f) \end{aligned} \quad (2.26)$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
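A numpy sketch of the gate computations in Equation 2.26 for a single time step is given below. The weight matrices and biases are initialised randomly, and the candidate/cell-state update, which the thesis does not list, follows the standard LSTM formulation; all sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8                    # assumed input and hidden sizes
x_t = rng.normal(size=n_in)              # input at the current time step
h_prev = np.zeros(n_hidden)              # previous block output h_{t-1}
c_prev = np.zeros(n_hidden)              # previous cell state

z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
W = {g: rng.normal(size=(n_hidden, n_in + n_hidden)) for g in "iofc"}
b = {g: np.zeros(n_hidden) for g in "iofc"}

i_t = sigmoid(W["i"] @ z + b["i"])       # input gate  (Eq. 2.26)
o_t = sigmoid(W["o"] @ z + b["o"])       # output gate (Eq. 2.26)
f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate (Eq. 2.26)
c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state (standard LSTM)
c_t = f_t * c_prev + i_t * c_tilde       # updated memory cell
h_t = o_t * np.tanh(c_t)                 # block output
print(h_t.shape)
```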

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
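The difference between the two pooling operations can be illustrated on a short 1-dimensional sequence; the sketch below applies max and average pooling with a pool size of 2 (cf. Figure 2.5) to a made-up feature vector.

```python
import numpy as np

def pool1d(x, pool_size=2, mode="max"):
    """Non-overlapping 1-D pooling; trailing values that do not fill a window are dropped."""
    n = len(x) // pool_size
    windows = np.asarray(x[: n * pool_size]).reshape(n, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

feature = [1, 3, 2, 9, 0, 4, 5, 1]
print(pool1d(feature, mode="max"))      # [3 9 4 5]
print(pool1d(feature, mode="average"))  # [2.  5.5 2.  3. ]
```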


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered to be 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I          685                      685                        0
II         220                       25                      195
III        340                       35                      305
IV         210                       11                      199
V          375                       32                      343
VI         355                        7                      348
VII        360                       78                      282
VIII       345                       19                      326
IX         350                       10                      340
X          335                       67                      268
XI         340                       43                      297

Total     3915                     1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
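A minimal sketch of one-hot encoding integer clogging labels is shown below; an identity matrix is indexed directly rather than using a library encoder, which gives the same binary representation as above.

```python
import numpy as np

labels = np.array([1, 2, 2, 1, 2])          # example clogging labels
classes = np.unique(labels)                  # [1, 2]
index = np.searchsorted(classes, labels)     # map labels to 0-based column indices
one_hot = np.eye(len(classes))[index]        # one row per sample, one column per class
print(one_hot)
```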

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \quad (3.1)$$

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
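The min-max transform in Equation 3.1 and its inverse can be applied with scikit-learn as sketched below; the feature array is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.2, 10.0], [0.5, 40.0], [0.9, 25.0]])  # hypothetical sensor features

scaler = MinMaxScaler()                          # scales each feature to the range [0, 1]
X_scaled = scaler.fit_transform(X)               # applies Eq. 3.1 column-wise
X_restored = scaler.inverse_transform(X_scaled)  # reverts to the original values

print(X_scaled)
print(np.allclose(X, X_restored))                # True
```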

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \left[V_1(t),\ V_2(t),\ \ldots,\ V_{n-1}(t),\ V_n(t)\right] \quad (3.2)$$

$$X(t) = \left[V_1(t-5),\ V_2(t-5),\ \ldots,\ V_{n-1}(t),\ V_n(t)\right] \quad (3.3)$$
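A sketch of such a sequencing function is shown below: it slides a window of n_steps past observations over the multivariate series and pairs each window with the observation at the next time step, mirroring the 5-step window used for the LSTM. The function name and the exact return shape are illustrative, not taken from the thesis code.

```python
import numpy as np

def make_sequences(data, n_steps=5):
    """Turn a (samples, features) array into (windows, n_steps, features) inputs
    and the feature vector of the following time step as the target."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])      # n_steps past observations
        y.append(data[i + n_steps])        # the observation one step (5 s) ahead
    return np.array(X), np.array(y)

series = np.random.rand(100, 4)            # e.g. dp, system flow, pressure, backflush flow
X, y = make_sequences(series, n_steps=5)
print(X.shape, y.shape)                    # (95, 5, 4) (95, 4)
```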


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
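A Keras sketch consistent with this description is given below; the optimiser and batch size are not specified in the thesis and are assumptions here, and the loss is set to MAE although both MAE and MSE were evaluated.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 4   # 5 past time steps, 4 sensor variables

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # single-neuron output for one parameter
])
model.compile(optimizer="adam", loss="mae")  # optimiser assumed; loss MAE or MSE

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```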

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
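A corresponding Keras sketch of the described CNN is shown below; the filter count, kernel size, pool size and layer sizes follow the text, while the optimiser and the activation of the dense layers are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps_in, n_features, n_steps_out = 12, 4, 6   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),                  # 6 future values of the predicted parameter
])
model.compile(optimizer="adam", loss="mae")      # optimiser assumed; loss MAE or MSE

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```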

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors would be more penalising as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                      Prediction
                  Label 1   Label 2
Actual  Label 1       109         1
        Label 2         3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                  Label 1   Label 2
Actual  Label 1        82        29
        Label 2        38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                  Label 1   Label 2
Actual  Label 1        69        41
        Label 2        11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


Contents

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

Figure 2.2: An overview of a basket filter with self-cleaning (source: http://www.directindustry.com).

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in section 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.
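To make the labelling rule concrete, the following minimal Python sketch assigns a clogging state to one pumping sequence from its sampled ∆p and Q values. The slope and flow-drop thresholds are illustrative assumptions only; they are not values derived from the test data.

```python
# Illustrative rule-based labelling of one pumping sequence; the thresholds are
# made-up assumptions, not values derived from the thesis test data.
import numpy as np

DP_SLOPE_MODERATE = 0.001   # average dp slope indicating a linear increase
DP_SLOPE_SEVERE = 0.01      # average dp slope indicating an exponential-like increase
Q_DROP_SEVERE = 0.2         # fraction of the initial flow rate that has been lost

def clogging_state(dp, q, dt=5.0):
    """Return 1 (no/little), 2 (moderate) or 3 (fully clogged) for one sequence."""
    dp_slope = np.gradient(dp, dt).mean()   # mean rate of change of the differential pressure
    q_drop = (q[0] - q[-1]) / q[0]          # relative decrease of the system flow
    if dp_slope > DP_SLOPE_SEVERE and q_drop > Q_DROP_SEVERE:
        return 3
    if dp_slope > DP_SLOPE_MODERATE:
        return 2
    return 1

dp = np.array([0.30, 0.31, 0.33, 0.36, 0.40])   # differential pressure samples
q = np.array([100.0, 99.0, 98.5, 98.0, 97.0])   # system flow samples
print(clogging_state(dp, q))                    # -> 2 (moderate clogging)
```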

Figure 2.3: Visualization of the clogging states (source: Eker et al. [6]).


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

  Variable   Description                           Unit
  ∆p         Pressure drop                         Pa
  L          Total height of filter cake           m
  V_s        Superficial (empty-tower) velocity    m/s
  µ          Viscosity of the fluid                kg/(m·s)
  ε          Porosity of the filter cake           -
  D_p        Diameter of the spherical particle    m
  ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
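As a small illustration of how Equation 2.4 can be evaluated, the sketch below computes the Ergun pressure drop with variable names following Table 2.1. The numbers in the example call are illustrative assumptions, not measured filter parameters.

```python
# A sketch of Equation 2.4; variable names follow Table 2.1. The values in the
# example call are illustrative assumptions, not measured filter parameters.
def ergun_pressure_drop(V_s, mu, eps, L, D_p, rho):
    """Pressure drop in Pa over a filter cake according to the Ergun equation."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Water-like fluid through a thin cake of small spherical particles
print(ergun_pressure_drop(V_s=0.05, mu=1e-3, eps=0.4, L=2e-3, D_p=50e-6, rho=998.0))
```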

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                           Prediction
                           Positive               Negative
  Actual     Positive      True Positive (TP)     False Negative (FN)
             Negative      False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{1}{n}\sum_{i=1}^{n} j_i, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate corresponds to one minus the specificity. Sensitivity and specificity are given by Equations 2.6 and 2.7 respectively:

\mathrm{sensitivity} = \frac{TP}{TP + FN}    (2.6)

\mathrm{specificity} = \frac{TN}{TN + FP}    (2.7)

Plotting the true positive rate on the y-axis against the false positive rate on the x-axis gives the ROC curve, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

\mathrm{precision} = \frac{TP}{TP + FP}    (2.8)

\mathrm{recall} = \frac{TP}{TP + FN}    (2.9)

F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

\mathrm{LogLoss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
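The classification metrics above are available as ready-made functions in common ML libraries. The sketch below, assuming scikit-learn and made-up label vectors, shows how accuracy, F1 score, AUC and Log Loss can be computed for a binary clogging-label problem.

```python
# A hedged sketch of the classification metrics above, computed with
# scikit-learn; the label and probability vectors are made up for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

y_true = np.array([1, 1, 2, 2, 2, 1, 2, 2])   # actual clogging labels
y_pred = np.array([1, 2, 2, 2, 2, 1, 2, 1])   # predicted clogging labels
p_label2 = np.array([0.2, 0.7, 0.9, 0.8, 0.6, 0.3, 0.95, 0.4])  # predicted P(label = 2)

print("Accuracy:", accuracy_score(y_true, y_pred))            # Equation 2.5
print("F1 score:", f1_score(y_true, y_pred, pos_label=2))     # Equation 2.10
print("AUC:", roc_auc_score(y_true == 2, p_label2))           # area under the ROC curve
print("Log loss:", log_loss(y_true, np.column_stack([1 - p_label2, p_label2])))  # Equation 2.11
```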

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through

MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers with a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2    (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|    (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r2 has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms, or predictors, in the model. If more variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \frac{(1 - r^2)(n - 1)}{n - k - 1}    (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affects the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
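For reference, the regression error metrics above can be computed directly in NumPy, as in the short sketch below; the example arrays are made up for illustration, and r2 is computed in the squared-correlation form of Equation 2.18.

```python
# Regression error metrics computed with NumPy; the arrays are made up for
# illustration and r2 uses the squared-correlation form of Equation 2.18.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

mae = np.mean(np.abs(y_true - y_pred))                       # Equation 2.12
mse = np.mean((y_true - y_pred) ** 2)                        # Equation 2.13
rmse = np.sqrt(mse)                                          # Equation 2.14
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))   # Equation 2.17
r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2                  # Equation 2.18

print(mae, mse, rmse, mape, r2)
```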

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
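A minimal sketch of fitting a SARIMA model with the statsmodels SARIMAX implementation is shown below. The generated series and the chosen (p, d, q)(P, D, Q, s) orders are illustrative assumptions, not a configuration evaluated in this thesis.

```python
# Fitting a SARIMA model with statsmodels; the series and the model orders are
# illustrative assumptions, not a configuration evaluated in this work.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

dp_series = np.cumsum(np.random.rand(500) * 0.01)   # stand-in for a differential pressure log

model = SARIMAX(dp_series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=6))                     # 6 steps of 5 s = 30 s ahead
```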

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\mathrm{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w is the weight vector and b is the perceptron's individual bias.
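A minimal NumPy sketch of the perceptron rule in Equation 2.20 is given below; the weights, bias and input are arbitrary illustration values.

```python
# The perceptron rule of Equation 2.20; the weights, bias and input are
# arbitrary illustration values.
import numpy as np

def perceptron(x, w, b):
    """Return 1 if w . x + b > 0, otherwise 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])            # binary inputs
w = np.array([0.5, -0.6, 0.4])     # weights: the importance of each input
b = -0.3                           # bias: how easily the perceptron outputs a 1

print(perceptron(x, w, b))         # -> 1, since 0.5 + 0.4 - 0.3 > 0
```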

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used benchmarks such as ImageNet, with gains of 0.9% and 0.6% for different network models [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
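For reference, the three activation functions above (Equations 2.21, 2.23 and 2.24) can be written out directly in NumPy:

```python
# The sigmoid, ReLU and Swish activation functions written out in NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)                # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)             # Equation 2.24

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z), sep="\n")
```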

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]    x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t-1}), the input at the current time step (x_t) and the respective gate bias (b_x):

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)    (2.26)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
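The sketch below shows one step of an LSTM cell in NumPy. The gate expressions follow Equation 2.26, while the candidate cell state and the cell-state and output updates follow the standard LSTM formulation and are assumptions here, since only the gates are written out above.

```python
# One LSTM cell step in NumPy. The i, o and f gates follow Equation 2.26; the
# candidate state g and the cell/output updates are the standard formulation
# and are an assumption here, since the thesis only writes out the gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W and b hold weights/biases for the keys 'i', 'o', 'f' and 'g'."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i = sigmoid(W["i"] @ z + b["i"])       # input gate
    o = sigmoid(W["o"] @ z + b["o"])       # output gate
    f = sigmoid(W["f"] @ z + b["f"])       # forget gate
    g = np.tanh(W["g"] @ z + b["g"])       # candidate cell state (assumed)
    c_t = f * c_prev + i * g               # new cell state
    h_t = o * np.tanh(c_t)                 # new block output
    return h_t, c_t

rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = {k: rng.standard_normal((n_h, n_h + n_x)) for k in "iofg"}
b = {k: np.zeros(n_h) for k in "iofg"}
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), W, b)
print(h, c, sep="\n")
```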

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
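A small NumPy sketch of non-overlapping 1-D max pooling and average pooling with pool size 2, as described above, is shown below; the input vector is made up for illustration.

```python
# Non-overlapping 1-D max pooling and average pooling; the input is made up.
import numpy as np

def pool1d(x, pool_size=2, mode="max"):
    """Pool a 1-D array; trailing values that do not fill a window are dropped."""
    x = np.asarray(x)
    n_windows = len(x) // pool_size
    windows = x[: n_windows * pool_size].reshape(n_windows, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

x = [1, 3, 2, 9, 0, 4, 5, 1]
print(pool1d(x, mode="max"))   # [3 9 4 5]
print(pool1d(x, mode="avg"))   # [2.  5.5 2.  3. ]
```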


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.

Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as they cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

  Test    Samples   Points labelled clog-1   Points labelled clog-2
  I       685       685                      0
  II      220       25                       195
  III     340       35                       305
  IV      210       11                       199
  V       375       32                       343
  VI      355       7                        348
  VII     360       78                       282
  VIII    345       19                       326
  IX      350       10                       340
  X       335       67                       268
  XI      340       43                       297
  Total   3915      1012                     2903

When the preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN was used for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
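Both transforms are available in scikit-learn, as in the hedged sketch below; the small example arrays are illustrative and not the thesis data.

```python
# One hot encoding of the clogging labels and min-max scaling of the features
# with scikit-learn; the example arrays are illustrative only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

labels = np.array([[1], [2], [2], [1], [2]])              # clogging labels as a column vector
onehot = OneHotEncoder().fit_transform(labels).toarray()  # label 1 -> [1, 0], label 2 -> [0, 1]

features = np.array([[0.10, 25.0], [0.35, 24.0], [0.80, 20.0]])  # e.g. dp and system flow
scaler = MinMaxScaler()                                   # applies Equation 3.1 per feature
scaled = scaler.fit_transform(features)
restored = scaler.inverse_transform(scaled)               # the transform is easy to invert

print(onehot, scaled, restored, sep="\n")
```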

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
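A minimal sketch of such a sequencing function is shown below: each prediction target is paired with the 5 previous rows of every variable, i.e. a 25 second window at the 5 second sampling rate. The array shapes are assumptions for illustration.

```python
# Sequencing a multivariate time series: pair each target row with the 5
# previous rows (Equations 3.2 and 3.3). Shapes are assumptions for illustration.
import numpy as np

def make_sequences(data, n_past=5):
    """data: (time_steps, features) array; returns X of past windows and targets y."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # the 25 s window of past measurements
        y.append(data[t])              # the measurement 5 s after the window
    return np.array(X), np.array(y)

data = np.random.rand(400, 5)          # stand-in for the sensor variables and label
X, y = make_sequences(data)
print(X.shape, y.shape)                # (395, 5, 5) (395, 5)
```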


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function, which initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network training in such a way ensures that the network is not overfitted to the training data.
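A hedged Keras sketch of the regression LSTM described above is given below: two LSTM layers of 32 neurons with ReLU, a single sigmoid output neuron, and early stopping after 150 epochs without validation improvement. Details such as the optimiser, batch size and exact input shape are assumptions, not taken from the thesis.

```python
# Sketch of the regression LSTM: two LSTM layers of 32 neurons (ReLU), one
# sigmoid output neuron, early stopping with patience 150. Optimiser, batch
# size and input shape are assumptions, not taken from the thesis.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

n_steps, n_features = 5, 5    # assumed: a 25 s window (5 steps) of 5 variables

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # MAE or MSE, as discussed in section 3.3

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```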

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20% respectively of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. A sketch of this architecture is given below.
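A minimal sketch of the CNN described above, again assuming Keras/TensorFlow; the filter count, kernel size, pool size and layer widths follow the text, while the activation choices, optimizer and training call are assumptions. X_train and y_train are assumed to come from the sequence splitting sketch above.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_features = X_train.shape[-1]

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation='relu', input_shape=(12, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation='relu'),
        Dense(6),                       # one output node per predicted future observation
    ])
    model.compile(optimizer='adam', loss='mae')
    model.fit(X_train, y_train, epochs=1500,
              validation_data=(X_val, y_val),
              callbacks=[EarlyStopping(monitor='val_loss', patience=150)])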

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80 % and 20 %, respectively. The testing set was split into the same fractions, but only the fraction of 20 % was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set. A sketch of this input/output adjustment is given below.
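A small sketch of the input/output adjustment, assuming the sequenced data are held in a pandas DataFrame; the file name and the 'clogging_label' column name are hypothetical.

    import pandas as pd

    df = pd.read_csv('sequenced_dataset.csv')            # illustrative file name
    X_cls = df.drop(columns=['clogging_label']).values   # inputs: variable values only
    y_cls = df['clogging_label'].values                  # outputs: clogging labels only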

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks on a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they typically come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers. The short numerical example below illustrates the difference.
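A small illustrative calculation with invented numbers, showing how a single outlier dominates the MSE while only moderately raising the MAE.

    import numpy as np

    y_true = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
    y_pred = np.array([1.1, 0.9, 1.0, 1.1, 4.0])    # last prediction is an outlier

    mae = np.mean(np.abs(y_true - y_pred))           # 0.66  -> modest penalty
    mse = np.mean((y_true - y_pred) ** 2)            # 1.806 -> dominated by the outlier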

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5 %     0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      109       1
         Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4 %     0.826   0.907   3.01
MSE                  1195          93.3 %     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed because of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the f-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (mae) and the root mean square error (rmse) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)? Arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (relu). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


CHAPTER 2 FRAME OF REFERENCE

2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time due to an increase over time in incoming pressure pin

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → No/little clogging

2. linear increase in ∆p and steady Q → Moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → Fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states (source: Eker et al. [6]).


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and it can be described by Darcy's equation [9] as

$Q_L = \frac{KA}{\mu L}\,\Delta p$   (2.1)

rewritten as

$\Delta p = \frac{\mu L}{KA}\,Q_L$   (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1-\varepsilon)^2 L}{\varepsilon^3}$   (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$\Delta p = \frac{150\, V_s \mu\, (1-\varepsilon)^2 L}{D_p^2\, \varepsilon^3} + \frac{1.75\, (1-\varepsilon)\, \rho V_s^2 L}{\varepsilon^3 D_p}$   (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                          Unit
∆p         Pressure drop                        Pa
L          Total height of filter cake          m
Vs         Superficial (empty-tower) velocity   m/s
µ          Viscosity of the fluid               kg/(m s)
ε          Porosity of the filter cake          m2
Dp         Diameter of the spherical particle   m
ρ          Density of the liquid                kg/m3
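As an illustration of how Equation 2.4 maps onto code, a minimal sketch follows; all numeric values are invented example inputs, not measurements from the test rig.

    def ergun_pressure_drop(V_s, mu, rho, D_p, eps, L):
        """Pressure drop over a filter cake via the Ergun equation (Eq. 2.4)."""
        viscous = 150.0 * V_s * mu * (1 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    # Example with made-up values: water at roughly 20 C through a thin cake of fine particles
    dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=998.0, D_p=1.0e-4, eps=0.4, L=0.002)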


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table, also known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

                      Prediction
                      Positive               Negative
Actual   Positive     True Positive (TP)     False Positive (FP)
         Negative     False Negative (FN)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } y_i = \hat{y}_i \\ 0 & \text{if } y_i \neq \hat{y}_i \end{cases}$   (2.5)

by comparing the actual value $y_i$ and the predicted value $\hat{y}_i$ for a group of samples n. However, using the overall accuracy as an error metric has two possible flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs. A short sketch of the accuracy and confusion matrix computation is given below.
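A minimal sketch assuming scikit-learn; the label values are invented and only serve to show how Equation 2.5 and Table 2.2 are obtained in practice.

    from sklearn.metrics import accuracy_score, confusion_matrix

    y_true = [1, 1, 2, 2, 2, 1, 2, 2]   # invented clogging labels
    y_pred = [1, 2, 2, 2, 2, 1, 2, 1]

    acc = accuracy_score(y_true, y_pred)    # fraction of correctly labelled samples (Eq. 2.5)
    cm = confusion_matrix(y_true, y_pred)   # rows: actual classes, columns: predicted classes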


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

$sensitivity = \frac{TP}{TP + FN}$   (2.6)

$specificity = \frac{TN}{TN + FP}$   (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC curve area is limited by the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is to not misclassify a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples, and recall refers to the percentage of actual correct classification [22]. Precision, recall and F1 score are obtained through

$precision = \frac{TP}{TP + FP}$   (2.8)

$recall = \frac{TP}{TP + FN}$   (2.9)

$F1 = 2 \times \frac{precision \times recall}{precision + recall}$   (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited by the range 0 to 1. A short sketch of these metrics is given below.
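A minimal sketch assuming scikit-learn; the labels are invented and label 2 is arbitrarily taken as the positive class.

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 2, 2, 2, 1, 2, 2]   # invented labels, label 2 as the positive class
    y_pred = [1, 2, 2, 2, 2, 1, 2, 1]

    precision = precision_score(y_true, y_pred, pos_label=2)   # Eq. 2.8
    recall = recall_score(y_true, y_pred, pos_label=2)         # Eq. 2.9
    f1 = f1_score(y_true, y_pred, pos_label=2)                 # Eq. 2.10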

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

$LogLoss = -\sum_{c=1}^{M} y_{o,c}\, \log(p_{o,c})$   (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over predicted or under predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$   (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$   (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$   (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged:

$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i}$   (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2$   (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the prediction values. Like r2, MAPE is scale free and is obtained through

$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$   (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

$r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2$   (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r2-score will always increase simply because the new fit will have more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

$r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]$   (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29]. A short sketch of the two scores is given below.
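A small sketch with invented values, assuming scikit-learn; note that scikit-learn's r2_score computes 1 minus the residual-to-total sum-of-squares ratio, which can differ slightly from the squared correlation form of Equation 2.18.

    from sklearn.metrics import r2_score

    y_true = [2.0, 2.5, 3.1, 3.6, 4.2, 4.8]   # invented actual values
    y_pred = [2.1, 2.4, 3.0, 3.8, 4.1, 4.9]   # invented predictions
    n, k = len(y_true), 3                      # k: assumed number of predictors

    r2 = r2_score(y_true, y_pred)                      # coefficient of determination
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)      # Eq. 2.19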

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. ARIMA's and SARIMA's strength is particularly identified as the ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$   (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias. A minimal sketch of this rule is given below.
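A small sketch of Equation 2.20; the weights and bias are invented and simply implement an AND-like decision on two binary inputs.

    import numpy as np

    def perceptron_output(x, w, b):
        """Perceptron rule from Equation 2.20: 1 if the weighted sum plus bias is positive."""
        return 1 if np.dot(w, x) + b > 0 else 0

    out = perceptron_output(x=np.array([1, 1]), w=np.array([0.5, 0.5]), b=-0.7)   # -> 1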

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$   (2.21)

for

$z = \sum_{j} w_j \cdot x_j + b$   (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$f(x) = x^+ = \max(0, x)$   (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$f(x) = x \cdot sigmoid(\beta x)$   (2.24)

where β is a trainable parameter or simply a constant. Swish has proved to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9 % and 0.6 % respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36]. A small sketch of the three activation functions is given below.
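A minimal NumPy sketch of Equations 2.21, 2.23 and 2.24; the input values are invented and β = 1 is used as a simple default.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))        # Eq. 2.21

    def relu(x):
        return np.maximum(0.0, x)              # Eq. 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)           # Eq. 2.24

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(x), relu(x), swish(x))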

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$f(x) = f^{(1)} + f^{(2)} + \ldots + f^{(n)}$   (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be because of well matched architectures and existing training procedures of the deep networks.

Mentioned in Section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented above and is further completed by a number of NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix}$

$x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (\sigma), weights for each respective gate's neurons (\omega_x), LSTM-block output at the previous time step (h_{t-1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)    (2.26)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
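As an illustration of Equation 2.26, the three gate activations can be sketched in a few lines of NumPy. The dimensions, random weights and inputs below are hypothetical stand-ins rather than values from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden = 4, 32          # assumed sizes: 4 sensor inputs, 32 hidden units
rng = np.random.default_rng(0)

# One weight matrix and one bias vector per gate, acting on [h_{t-1}, x_t].
w_i, w_o, w_f = (rng.standard_normal((n_hidden, n_hidden + n_features)) for _ in range(3))
b_i, b_o, b_f = (np.zeros(n_hidden) for _ in range(3))

h_prev = np.zeros(n_hidden)              # LSTM-block output at the previous time step
x_t = rng.standard_normal(n_features)    # input at the current time step
hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

i_t = sigmoid(w_i @ hx + b_i)   # input gate: how much new information is stored
o_t = sigmoid(w_o @ hx + b_o)   # output gate: how much of the cell state is exposed
f_t = sigmoid(w_f @ hx + b_f)   # forget gate: how much old information is dismissed
```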

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using a GRU instead of an LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
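To make the convolve-pool-flatten chain concrete, the following is a minimal NumPy sketch of a 1-dimensional convolution with a kernel of size 3, max pooling with pool size 2 and flattening, mirroring Figures 2.4 to 2.6. The signal and kernel values are hypothetical.

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide the kernel over the signal (valid convolution, stride 1)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

def max_pool1d(feature, pool_size=2):
    """Keep the maximum of each non-overlapping window."""
    n = len(feature) // pool_size * pool_size
    return feature[:n].reshape(-1, pool_size).max(axis=1)

signal = np.array([0.1, 0.5, 0.9, 0.4, 0.3, 0.8, 0.2, 0.6])  # hypothetical input
kernel = np.array([0.2, 0.5, 0.3])                            # kernel of size 3

convolved = conv1d(signal, kernel)      # convolved feature
pooled = max_pool1d(convolved, 2)       # reduced feature map
flattened = pooled.ravel()              # flattened input to the dense layers
```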


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
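For illustration only, a rule-based labelling script of this kind could look like the sketch below. The slope thresholds are hypothetical placeholders, as the thesis does not state the exact criteria used in the script; in practice they would be tuned against the visual inspection.

```python
import numpy as np

def label_clogging(diff_pressure, system_flow, steady_slope=0.01, steep_slope=0.10):
    """Assign clogging labels 1-3 from pressure and flow trends.
    Thresholds are hypothetical and chosen only to illustrate the rules."""
    dp_slope = np.gradient(diff_pressure)
    flow_slope = np.gradient(system_flow)

    labels = np.ones(len(diff_pressure), dtype=int)            # label 1: no clogging
    labels[dp_slope > steady_slope] = 2                        # label 2: steady pressure increase
    labels[(dp_slope > steep_slope) & (flow_slope < -steep_slope)] = 3  # label 3: rapid clogging
    return labels
```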


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685            685                         0
II        220             25                       195
III       340             35                       305
IV        210             11                       199
V         375             32                       343
VI        355              7                       348
VII       360             78                       282
VIII      345             19                       326
IX        350             10                       340
X         335             67                       268
XI        340             43                       297

Total    3915           1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter as well as the corresponding clogging label. Based on the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM, to initially test the suitability of the data for time series forecasting, and the CNN, for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

    [1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

    [red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] has shown the precision of one-hot encoding to be equal to that of other equally simple encoding techniques. Potdar et al. [50] show that one-hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
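A minimal sketch of the one-hot label transform in plain NumPy; the label values below are hypothetical examples, not data from the tests.

```python
import numpy as np

labels = np.array([1, 2, 2, 1, 2])              # hypothetical clogging labels
classes = np.unique(labels)                     # [1, 2]

onehot = (labels[:, None] == classes).astype(float)   # label 1 -> [1, 0], label 2 -> [0, 1]
restored = classes[onehot.argmax(axis=1)]             # the encoding is easily inverted
```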

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
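A minimal sketch of Equation 3.1 applied column-wise to a sensor matrix; the feature order and values are hypothetical.

```python
import numpy as np

# Hypothetical rows of (differential pressure, system flow, system pressure, backflush flow).
data = np.array([[0.10, 230.0, 1.9, 0.0],
                 [0.15, 228.0, 2.0, 3.2],
                 [0.35, 221.0, 2.1, 3.1]])

data_min, data_max = data.min(axis=0), data.max(axis=0)
scaled = (data - data_min) / (data_max - data_min)       # every feature now lies in [0, 1]
restored = scaled * (data_max - data_min) + data_min     # inverse transform back to raw values
```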

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The effect of the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
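A minimal sketch of a sequencing function of this kind, assuming the scaled data sit in a 2-D array of shape (samples, features); the function name and shapes are illustrative, not the exact implementation used in the thesis.

```python
import numpy as np

def make_sequences(data, targets, n_past=5):
    """Pair each target at time t with the n_past previous rows of measurements."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # window of previous measurements
        y.append(targets[t])           # value to predict one step ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)                        # hypothetical scaled dataset
X, y = make_sequences(data, data[:, 0], n_past=5)    # X: (95, 5, 5) = samples, time steps, features
```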


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points
• Time steps - The points of observation of the samples
• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before they are passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network during training in such a way ensures that the network is not overfitted to the training data.
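A minimal Keras sketch of the LSTM regression set-up described above: two 32-neuron LSTM layers with ReLU, a single sigmoid output neuron, and an early stop after 150 epochs without improvement. The optimiser and the exact input shape are assumptions for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 5   # 5 past time steps; assumed number of features per step

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # MAE or MSE, as evaluated in the thesis

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```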

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
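A minimal Keras sketch of the CNN described above: 64 filters with kernel size 4, max pooling with pool size 2, flattening, a 50-node dense layer and a 6-node output for the 30-second multi-step prediction. Input shape, optimiser and loss are illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps_in, n_steps_out, n_features = 12, 6, 5   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),                          # one output node per future observation
])
model.compile(optimizer="adam", loss="mae")      # MAE or MSE, as evaluated in the thesis

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```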

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks on a classification problem than on a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).
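For reference, a minimal NumPy sketch of the binary cross-entropy (log loss) over a batch of predictions; the labels and probabilities below are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

y_true = np.array([0, 1, 1, 0, 1])                    # hypothetical clogging classes (0 or 1)
y_prob = np.array([0.10, 0.80, 0.60, 0.30, 0.90])     # predicted probability of class 1
loss = binary_cross_entropy(y_true, y_prob)
```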

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE                 738       0.001   0.029   0.981   0.016
MSE                 665       0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
    190         99.5%    0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                 Label 1   Label 2
Actual  Label 1     109        1
        Label 2       3      669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE                 756       0.007   0.086   0.876   0.025
MSE                 458       0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                      1203       91.4%     0.826   0.907   3.01
MSE                      1195       93.3%     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                 Label 1   Label 2
Actual  Label 1      82       29
        Label 2      38      631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                 Label 1   Label 2
Actual  Label 1      69       41
        Label 2      11      659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely, as this regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. Also available as abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valérie Bourdès, Stéphane Bonnevay, P. J. G. Lisboa, Rémy Defrance, David Pérol, Sylvie Chabaud, Thomas Bachelot, Thérèse Gargi, and Sylvie Négrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se

Page 15: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

21 FILTRATION amp CLOGGING INDICATORS

215 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of cloggingTo better understand what effect certain parameters have on the filter a model hasto be created Roussel et al [7] identify the filter clogging as a probability of thepresence of particles Furthermore they identify the clogging process as a functionof a set of variables the ratio of particle to mesh hole size the solid fraction andthe number of grains arriving at each mesh hole during one test

Filtration of liquids and filter clogging have been tested for various types of flowsLaminar flow through a permeable medium has been investigated by Wakeman [8]and it can be described by Darcyrsquos equation [9] as

QL = KA

microL∆p (21)

rewritten as

∆p = microL

KAQL (22)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation The equation was derived by Kozeny and Carman [10] and reads

∆p = kVsmicro

Φ2D2p

(1minus ε)2L

ε3(23)

Equation 23 is flawed in the sense that it does not take into account the inertialeffect in the flow This is considered by the later Ergun equation [11]

∆p = 150Vsmicro(1minus ε)2L

D2pε

3 + 175(1minus ε)ρV 2s L

ε3Dp(24)

where the first term in Equation 24 represents the viscous effects and the secondterm represents the inertial effect An explanation for the variables can be found inTable 21

Table 21 Variable explanation for Ergunrsquos equation

Variable Description Unit∆p Pressure drop PaL Total height of filter cake mVs Superficial (empty-tower) velocity msmicro Viscosity of the fluid kgmsε Porosity of the filter cake m2

Dp Diameter of the spherical particle mρ Density of the liquid kgm3

9

CHAPTER 2 FRAME OF REFERENCE

Comparing Darcyrsquos Equation 22 to Ergunrsquos Equation 24 the latter offers a deeperinsight in how alterations to variables affect the final differential pressure

22 Predictive AnalyticsUsing historical data to make predictions of future events is a field known as pre-dictive analytics Predictive analytics research covers statistical methods and tech-niques from areas such as predictive modelling data mining and ML in order toanalyse current and past information to make predictions on future events Havingbeen applied to other areas such as credit scoring [12] healthcare [13] and retailing[14] a similar approach of prediction has also been investigated in predictive main-tenance [15ndash17]

Predictive maintenance PdM includes methods and techniques that estimate anddetermine the condition of equipment or components to predict when maintenanceis required as opposed to traditional preventive maintenance which is based on theidea of performing routinely scheduled checks to avoid failures or breakdowns Thequality of predictive methods and algorithms is ensured by measuring the accuracyof the model in terms of correctly labelling the input data to its respective outputalso known as classification Every prediction comes with four possible outputs thatcan be visualised in a table also known as a confusion matrix as shown in Table22

Table 22 Outputs of a confusion matrix

PredictionPositive Negative

Act

ual Positive True Positive (TP) False Positive (FP)

Negative False Negative (FN) True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample isclassified correctly and can be obtained as done by Konig [18]

ACC =sumn

i=1 jin

where ji =

1 if yi = yi

0 if yi 6= yi

(25)

by comparing the actual value yi and the predicted value yi for a group of sam-ples n However by using the overall accuracy as an error metric two flaws mayarise Provost et al [19] argue that accuracy as an error metric and classificationtool assumes that the supplied data are the true class distribution data and thatthe penalty of misclassification is equal for all classes Same claims are backed bySpiegel et al [20] which presents that ignoring the severity of individual problemsto achieve higher accuracy of failure classification may have a direct negative impacton the economic cost of maintenance due to ignored FPs and FNs

10

22 PREDICTIVE ANALYTICS

In order to better evaluate all data various error metrics have been developedThe various metrics can be placed in two different categories classification errormetrics and regression error metrics

221 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised andthat basic classification assumptions are rarely true for real world problems

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve also knownas a ROC curve A ROC curve measures the relationship between the true positiverate and the false positive rate and plots them against each other [18] True positiverate is in ML literature commonly referred to as sensitivity and the false positiverate is referred to as specificity Both rates are represented by Equations 26 and27 respectively

sensitivity = TP

TP + FN(26)

specificity = TN

TN + FP(27)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUCplot where every correctly classified true positive generates a step in the y-directionand every correctly classified false positive generates a step in the x-direction TheAUC curve area is limited by the range 0 to 1 where a higher value means a wellperforming model

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifiescorrectly and how robust it is to not misclassify a number of samples [21] For F1score precision is referred to as the percentage of correctly classified samples andrecall is referred to as the percentage of actual correct classification [22] Precisionrecall and F1 score are obtained through

precision = TP

TP + FP(28)

recall = TP

TP + FN(29)

F1 = 2times precisiontimes recallprecision+ recall

(210)

11

CHAPTER 2 FRAME OF REFERENCE

Higher precision but lower recall means a very accurate prediction but the classifierwould miss hard to instances that are difficult to classify F1 score attempts tobalance the precision and the recall and a higher F1 score means that the model isperforming very well The F1 score itself is limited by the range 0 to 1

Logarithmic Loss (Log Loss)

For multi-class classification Log Loss is especially useful as it penalises false clas-sification A lower value of Log Loss means an increase of classification accuracyfor the multi-class dataset The Log Loss is determined through a binary indicatory of whether the class label c is the correct classification for an observation o andthe probability p which is the modelrsquos predicted probability that an observation obelongs to the class c [23] The log loss can be calculated through

LogLoss = minusMsum

c=1yoclog(poc) (211)

222 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actualvalue based on the idea that there is a relationship or a pattern between the a setof inputs and an outcome

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted valuesgives the mean absolute error The MAE gives a score of how far away all predictionsare from the actual values [24] While not giving any insight in if the data arebeing over predicted or under predicted MAE is still a good tool for overall modelestimation Mathematically the MAE is expressed as

MAE = 1n

nsumi=1|yi minus yi| (212)

Mean Squared Error (MSE)

The MSE is similar to the MAE but rather than taking the average of the differencebetween the predicted and the actual results it takes the square of the differenceinstead Using the squared values the larger errors become more prominent incomparison to smaller errors resulting in a model that can better focus on theprediction error of larger errors However if a certain prediction turns out to bevery bad the overall model error will be skewed towards being worse than what mayactually be true [25] The MSE is calculated through

12

22 PREDICTIVE ANALYTICS

MSE = 1n

nsumi=1

(yi minus yi)2 (213)

Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE The introduction of the square-root scales theerror to be the same scale as the targets

RMSE =

radicradicradicradic 1n

nsumi=1

(yi minus yi)2 (214)

The major difference between MSE and RMSE is the flow over the gradients Trav-elling along the gradient of the MSE is equal to traveling along the gradient of theRMSE times a flow variable that depends on the MSE score This means that whenusing gradient based methods (further discussed in section 234) the two metricscannot be straight up interchanged

∂RMSE/∂ŷ_i = (1 / (2√MSE)) · ∂MSE/∂ŷ_i    (2.15)

Just like the MSE, the RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.
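For illustration, Equations 2.12 to 2.14 can be computed directly with NumPy; the values of y and y_hat below are made up and not taken from the thesis data.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])        # actual values
y_hat = np.array([1.1, 1.8, 3.3, 3.9])    # predicted values

mae = np.mean(np.abs(y - y_hat))          # Equation 2.12
mse = np.mean((y - y_hat) ** 2)           # Equation 2.13
rmse = np.sqrt(mse)                       # Equation 2.14
print(mae, mse, rmse)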

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that the MSE works with squared absolute errors while the MSPE considers the relative error [27].

MSPE = (100% / n) Σ_{i=1}^{n} ((y_i − ŷ_i) / y_i)²    (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², the MAPE is scale-free and is obtained through

MAPE = (100% / n) Σ_{i=1}^{n} |(y_i − ŷ_i) / y_i|    (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r² is scale-free, in comparison to MSE and RMSE, and bound between −∞ and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r² = [ Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ŷ̄) ]² / [ Σ_{i=1}^{n} (y_i − ȳ)² · Σ_{i=1}^{n} (ŷ_i − ŷ̄)² ]    (2.18)

where ȳ and ŷ̄ denote the means of the actual and the predicted values.

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r².

Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r²_adj = 1 − [ (1 − r²)(n − 1) / (n − k − 1) ]    (2.19)

Adjusted r² can therefore more accurately show the percentage of variation in the dependent variable that is explained by the independent variables that actually affect it. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
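A minimal sketch of Equations 2.18 and 2.19, assuming made-up data and a hypothetical number of predictors k:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # actual values
y_hat = np.array([1.2, 1.9, 3.1, 3.8, 5.2])   # predicted values
n, k = len(y), 2                              # k predictors (hypothetical)

num = np.sum((y - y.mean()) * (y_hat - y_hat.mean()))
den = np.sqrt(np.sum((y - y.mean()) ** 2) * np.sum((y_hat - y_hat.mean()) ** 2))
r2 = (num / den) ** 2                          # Equation 2.18
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # Equation 2.19
print(r2, r2_adj)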

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods, by differencing the log-transformed data, makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
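As a hedged sketch of how a SARIMA model could be fitted in Python, the example below uses statsmodels on a synthetic series; the non-seasonal (p, d, q) and seasonal (P, D, Q, s) orders are placeholders and are not tuned for the filtration data.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200)

model = SARIMAX(series,
                order=(1, 1, 1),               # non-seasonal (p, d, q)
                seasonal_order=(1, 0, 1, 12))  # seasonal (P, D, Q, s)
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=10)           # 10 future data points
print(forecast)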

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields, such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight for every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = 0 if w · x + b ≤ 0, and 1 if w · x + b > 0    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
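A minimal sketch of the perceptron rule in Equation 2.20, with made-up weights, bias and input:

import numpy as np

def perceptron(x, w, b):
    # Output 1 if the weighted sum plus bias is positive, otherwise 0
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])        # binary inputs
w = np.array([0.6, 0.4, 0.2])  # weights
b = -0.5                       # bias
print(perceptron(x, w, b))     # 0.6 + 0.2 - 0.5 > 0, so the output is 1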

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = σ(z) = 1 / (1 + e^(−z))    (2.21)

for

z = Σ_j w_j · x_j + b    (2.22)


Using the sigmoid function as activation function, outputs can be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x⁺ = max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are negative or approach zero, which can leave neurons inactive, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x · sigmoid(βx)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on the widely used ImageNet dataset by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
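For illustration, the three activation functions in Equations 2.21, 2.23 and 2.24 can be written as plain NumPy functions; beta is the trainable or constant Swish parameter.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)             # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)          # Equation 2.24

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z))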

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f⁽¹⁾ + f⁽²⁾ + ... + f⁽ⁿ⁾    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may be due to well-matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the LSTM block output at the previous time step h_{t−1}, the input at the current time step x_t and the respective gate bias b_x, as

i_t = σ(ω_i · [h_{t−1}, x_t] + b_i)
o_t = σ(ω_o · [h_{t−1}, x_t] + b_o)
f_t = σ(ω_f · [h_{t−1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
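A rough sketch of the gate computations in Equation 2.26 for a single LSTM block is given below; the sizes, weights and biases are made up, and the cell state and output updates that follow the gates are omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([0.1, -0.2])        # previous block output h_(t-1)
x_t = np.array([0.5, 0.3, -0.1])      # input at the current time step x_t
hx = np.concatenate([h_prev, x_t])    # [h_(t-1), x_t]

w_i, w_o, w_f = (np.random.randn(hx.size) for _ in range(3))  # gate weights
b_i = b_o = b_f = 0.0                                         # gate biases

i_t = sigmoid(np.dot(w_i, hx) + b_i)  # input gate
o_t = sigmoid(np.dot(w_o, hx) + b_o)  # output gate
f_t = sigmoid(np.dot(w_f, hx) + b_f)  # forget gate
print(i_t, o_t, f_t)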

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
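As a small illustration of the two pooling variants with a pool size of 2 on a made-up 1-dimensional convolved feature:

import numpy as np

feature = np.array([1.0, 3.0, 2.0, 8.0, 4.0, 4.0])   # convolved feature
pool = 2

max_pooled = feature.reshape(-1, pool).max(axis=1)   # [3.0, 8.0, 4.0]
avg_pooled = feature.reshape(-1, pool).mean(axis=1)  # [2.0, 5.0, 4.0]
print(max_pooled, avg_pooled)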


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is a primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data are clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as they cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   or   [red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simpler encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
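A minimal sketch of one hot encoding the clogging labels with scikit-learn; the label column below is made up and is not the thesis dataset.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [1], [2], [2]])       # clogging labels as a column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(labels).toarray()  # one binary column per label
print(encoded)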

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
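A minimal sketch of Equation 3.1 using scikit-learn's MinMaxScaler; the feature matrix below is made up and only stands in for the sensor variables.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.20, 210.0],
              [0.35, 190.0],
              [0.50, 160.0]])                    # e.g. differential pressure, flow

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)               # every feature mapped to [0, 1]
X_original = scaler.inverse_transform(X_scaled)  # easy to revert after processing
print(X_scaled)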

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V1(t), V2(t), ..., Vn−1(t), Vn(t)]    (3.2)

X(t) = [V1(t−5), V2(t−5), ..., Vn−1(t), Vn(t)]    (3.3)
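A hedged sketch of a sequencing function of the kind described above: it turns a multivariate series into windows of n_past observations and a one-step target, mirroring Equations 3.2 and 3.3. The function and variable names are illustrative, not taken from the thesis code.

import numpy as np

def make_sequences(data, n_past=5):
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i, :])   # the previous n_past time steps
        y.append(data[i, :])              # the value at the predicted time step
    return np.array(X), np.array(y)

series = np.random.rand(100, 4)           # 100 samples with 4 features
X, y = make_sequences(series, n_past=5)
print(X.shape, y.shape)                   # (95, 5, 4) (95, 4)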


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
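A hedged Keras sketch of the network described above, assuming 5 past time steps and 4 input features; the optimiser choice and any hyperparameters not stated in the text are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 4                 # 25 second window, 4 sensor variables

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),        # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE network

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])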

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
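A hedged Keras sketch of the CNN described above, with 64 filters of kernel size 4, a pool size of 2, a 50-node dense layer and 6 outputs; the number of input features and the optimiser are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 12, 4                # 60 seconds of past observations

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                              # 6 future observations (30 seconds)
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE network

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])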

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                    Prediction
                    Label 1    Label 2
Actual   Label 1    109        1
         Label 2    3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Prediction
                    Label 1    Label 2
Actual   Label 1    82         29
         Label 2    38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Prediction
                    Label 1    Label 2
Actual   Label 1    69         41
         Label 2    11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The advantage of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification


models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 16: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 2 FRAME OF REFERENCE

Comparing Darcyrsquos Equation 22 to Ergunrsquos Equation 24 the latter offers a deeperinsight in how alterations to variables affect the final differential pressure

22 Predictive AnalyticsUsing historical data to make predictions of future events is a field known as pre-dictive analytics Predictive analytics research covers statistical methods and tech-niques from areas such as predictive modelling data mining and ML in order toanalyse current and past information to make predictions on future events Havingbeen applied to other areas such as credit scoring [12] healthcare [13] and retailing[14] a similar approach of prediction has also been investigated in predictive main-tenance [15ndash17]

Predictive maintenance PdM includes methods and techniques that estimate anddetermine the condition of equipment or components to predict when maintenanceis required as opposed to traditional preventive maintenance which is based on theidea of performing routinely scheduled checks to avoid failures or breakdowns Thequality of predictive methods and algorithms is ensured by measuring the accuracyof the model in terms of correctly labelling the input data to its respective outputalso known as classification Every prediction comes with four possible outputs thatcan be visualised in a table also known as a confusion matrix as shown in Table22

Table 22 Outputs of a confusion matrix

PredictionPositive Negative

Act

ual Positive True Positive (TP) False Positive (FP)

Negative False Negative (FN) True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample isclassified correctly and can be obtained as done by Konig [18]

ACC =sumn

i=1 jin

where ji =

1 if yi = yi

0 if yi 6= yi

(25)

by comparing the actual value yi and the predicted value yi for a group of sam-ples n However by using the overall accuracy as an error metric two flaws mayarise Provost et al [19] argue that accuracy as an error metric and classificationtool assumes that the supplied data are the true class distribution data and thatthe penalty of misclassification is equal for all classes Same claims are backed bySpiegel et al [20] which presents that ignoring the severity of individual problemsto achieve higher accuracy of failure classification may have a direct negative impacton the economic cost of maintenance due to ignored FPs and FNs

10

22 PREDICTIVE ANALYTICS

In order to better evaluate all data various error metrics have been developedThe various metrics can be placed in two different categories classification errormetrics and regression error metrics

221 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised andthat basic classification assumptions are rarely true for real world problems

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve also knownas a ROC curve A ROC curve measures the relationship between the true positiverate and the false positive rate and plots them against each other [18] True positiverate is in ML literature commonly referred to as sensitivity and the false positiverate is referred to as specificity Both rates are represented by Equations 26 and27 respectively

sensitivity = TP

TP + FN(26)

specificity = TN

TN + FP(27)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUCplot where every correctly classified true positive generates a step in the y-directionand every correctly classified false positive generates a step in the x-direction TheAUC curve area is limited by the range 0 to 1 where a higher value means a wellperforming model

F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifiescorrectly and how robust it is to not misclassify a number of samples [21] For F1score precision is referred to as the percentage of correctly classified samples andrecall is referred to as the percentage of actual correct classification [22] Precisionrecall and F1 score are obtained through

precision = TP

TP + FP(28)

recall = TP

TP + FN(29)

F1 = 2times precisiontimes recallprecision+ recall

(210)

11

CHAPTER 2 FRAME OF REFERENCE

Higher precision but lower recall means a very accurate prediction but the classifierwould miss hard to instances that are difficult to classify F1 score attempts tobalance the precision and the recall and a higher F1 score means that the model isperforming very well The F1 score itself is limited by the range 0 to 1

Logarithmic Loss (Log Loss)

For multi-class classification Log Loss is especially useful as it penalises false clas-sification A lower value of Log Loss means an increase of classification accuracyfor the multi-class dataset The Log Loss is determined through a binary indicatory of whether the class label c is the correct classification for an observation o andthe probability p which is the modelrsquos predicted probability that an observation obelongs to the class c [23] The log loss can be calculated through

LogLoss = minusMsum

c=1yoclog(poc) (211)

222 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actualvalue based on the idea that there is a relationship or a pattern between the a setof inputs and an outcome

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted valuesgives the mean absolute error The MAE gives a score of how far away all predictionsare from the actual values [24] While not giving any insight in if the data arebeing over predicted or under predicted MAE is still a good tool for overall modelestimation Mathematically the MAE is expressed as

MAE = 1n

nsumi=1|yi minus yi| (212)

Mean Squared Error (MSE)

The MSE is similar to the MAE but rather than taking the average of the differencebetween the predicted and the actual results it takes the square of the differenceinstead Using the squared values the larger errors become more prominent incomparison to smaller errors resulting in a model that can better focus on theprediction error of larger errors However if a certain prediction turns out to bevery bad the overall model error will be skewed towards being worse than what mayactually be true [25] The MSE is calculated through

12

22 PREDICTIVE ANALYTICS

MSE = 1n

nsumi=1

(yi minus yi)2 (213)

Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE The introduction of the square-root scales theerror to be the same scale as the targets

RMSE =

radicradicradicradic 1n

nsumi=1

(yi minus yi)2 (214)

The major difference between MSE and RMSE is the flow over the gradients Trav-elling along the gradient of the MSE is equal to traveling along the gradient of theRMSE times a flow variable that depends on the MSE score This means that whenusing gradient based methods (further discussed in section 234) the two metricscannot be straight up interchanged

partRMSE

partyi= 1radic

MSE

partMSE

partyi(215)

Just like MSE RMSE has a hard time dealing with outliers and has for that reasonbeen considered as a bad metric [18] However Chai et al [26] argue that while nosingle metric can project all the model errors RMSE is still valid and well protectedagainst outliers with enough samples n

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of theMSE Every sample weight is inversely proportional to its respective target squareThe difference between MSE and MSPE is that MSE works with squared errorswhile MSPE considers the relative error [27]

MSPE = 100n

nsumi=1

(yi minus yi

yi

)2(216)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely usedmeasures for forecast and prediction accuracy [28] The measurement is an aver-age of the absolute percentage errors between the actual values and the predictionvalues Like r2 MAPE is scale free and is obtaind through

MAPE = 100n

nsumi=1

∣∣∣∣yi minus yi

yi

∣∣∣∣ (217)

13

CHAPTER 2 FRAME OF REFERENCE

Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluatedmodel the coefficient of determination can be used [18] This allows for comparinghow much better the model is in comparison to a constant baseline r2 is scale-freein comparison to MSE and RMSE and bound between minusinfin and 1 so it does notmatter if the output values are large or small the value will always be within thatrange A low r2 score means that the model is bad at fitting the data

r2 =

sumni=1((yi minus yi)(yi minus yi))2radicsumn

i=1(yi minus yi)2sumni=1(yi minus yi)2

2

(218)

r2 has some drawbacks It does not take into account if the fit coefficient estimatesand predictions are biased and when additional predictors are added to the modelthe r2-score will always increase simply because the new fit will have more termsThese issues are handled by adjusted r2

Adjusted r2

Adjusted r2 just like r2 indicates how well terms fit a curve or a line The differenceis that adjusted r2 will adjust for the number of terms or predictors in the model Ifmore variables are added that prove to be useless the score will decrease while thescore will increase if useful variables are added This leads adjusted r2 to alwaysbe less than or equal to r2 For n observations and k variables the adjusted r2 iscalculated through

r2adj = 1minus

[(1minusr2)(nminus1)

nminuskminus1

](219)

Adjusted r2 can therefore accurately show the percentage of variation of the in-dependent variables that affect the dependent variables Furthermore adding ofadditional independent variables that do not fit the model will penalize the modelaccuracy by lowering the score [29]

223 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions The modelsallow for random variation in one or more input variables at a time to generatedistributions of potential outcomes

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a movingaverage model (MA) An AR model assumes that a future value of a variable canbe predicted as a linear combination of past values of that variable plus a randomerror with a constant term The MA model is like the AR model in that its output

14

23 NEURAL NETWORKS

value depends on current and past values The difference between the two is thatwhile the AR model intends to model and predict the observed variable the MAmodel intends to model the error term as a linear combination of the error termsthat occur simultaneously and at different past times

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA)and can be fitted to time series data in order to obtain a better understanding ofthe data or be used as forecasting methods to predict future data points WhileARMA requires the data to be completely stationary ie the mean and variance donot change over time ARIMA can process non-stationary time series by removingthe non-stationary nature of the data This means that non-stationary time seriesdata must be processed before it can be modelled Removing the trend makes themean value of the data stationary something that is done by simply differencingthe series To make the series stationary on variance one of the best methods is toapply a log transform to the series Combining the two methods by differencing thelog transformed data makes the entire series stationary on both mean and varianceand allows for the dataset to be processed by the model

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA)model The SARIMA model applies a seasonal differencing of necessary order toremove non-stationarity from the time series ARIMAs and SARIMAs strengthis particularly identified as its ability to predict future data points for univariatetime series In a comparison published by Adhikari et al [30] a SARIMA model isseen to outperform both neural networks and support-vector machines in forecastestimation

23 Neural Networks

231 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such ascomputer vision predictive analytics medical diagnosis and more [31] An NN is apowerful tool for data analysis that similar to other ML programmes performs theirtasks based on inference and patterns rather than explicitly set instructions Thecognitive capabilities of NNs have been used for regression analysis and classificationin both supervised and unsupervised learning The NN passes some inputs from aninput layer to one or more hidden layers and then to an output layer The sizes ofthe input and the output layer are dependent on the input and the output dataEach node in the input layer corresponds to the available input data x and eachnode in the output layer corresponds to the desired output y The nodes are oftenreferred to as neurons and while the neurons in the input and output layers alwaysrepresent the supplied data the neurons in the hidden layer may have very different

15

CHAPTER 2 FRAME OF REFERENCE

properties The result of this is a range of different hidden layers with varyingcharacteristics The use and configurations of these hidden layers in turn dependon what the NN should be able to achieve

232 The PerceptronThe simplest neuron is the perceptron The perceptron takes several binary inputsfrom the input layer to create a single binary output What a perceptron outputs isbased on the weighted sum of the perceptrons inputs and respective weight as wellas individual bias There is a corresponding weight to every input that determineshow important the input is for the output of the perceptron Meanwhile the biasdictates how easy it is for the perceptron to output either a 0 or a 1 These conditionsgive the rule of the perceptron as [32]

output =

0 if w middot x+ b le 01 if w middot x+ b gt 0

(220)

In the above equation x is the input vector w the weight vector and b is theperceptronrsquos individual bias

233 Activation functionsThe output of a neuron is determined by an activation function For the perceptronthe activation function is a simple step function as can be seen in Equation 220The step function is the simplest of activation function and is only able to producestrictly binary results The activation function is used for classification of linearlyseparable data in single-layer perceptrons like the one in Equation 220 Its binarynature means that a small change in weight or bias can flip the output of the neuroncompletely resulting in false classification Furthermore as most networks consistof either multiple perceptrons in a layer or multiple layers the data will not belinearly separable thus the step function will not properly separate and classifythe input data The separation problem is solved by training using backpropagationwhich requires a differentiable activation function something that the step functionis also unable of fulfilling

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function Thesigmoid function can take any value between 0 an 1 and is determined by

f(z) = σ(z) = 11 + eminusz

(221)

for

z =sum

j

wj middot xj + b (222)

16

23 NEURAL NETWORKS

Only by using the sigmoid function as activation function outputs can be properlyand accurately estimated for classification of probabilities in deep neural nets [33]however the sigmoid function is not flawless and an issue that arises with the usageof it as an activation function is the vanishing gradient problem that is furtherdiscussed in section 234

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x+ = max(0 x) (223)

for a neuron input x In comparison to the earlier activation functions a unitutilising the rectified activation function (also known as a rectified linear unit or aReLU unit) is more computationally efficient However because of the structure ofthe ReLU unit it cannot process inputs that are either negative or that approachzero also known as the dying ReLU problem [34]

Swish Function

Proposed in Ramachandran et al [35] is a replacement for ReLU called the Swishfunction The Swish function activates a neuron through

f(x) = x middot sigmoid(βx) (224)

where β is a trainable parameter or simply a constant Swish has proved to improvethe classification accuracy on widely used datasets such as ImageNet and MobileNASNet-A by 09 and 06 respectively [35] However results have also shownthat the Swish in comparison to other activation functions has a severely increasedtraining time and that it occasionally performs worse in terms of accuracy [36]

234 Neural Network ArchitecturesThe following section presents an overview of various existing types of neural net-works what they are primarily used for and their individual differences

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer The hiddenlayer is fully connected eg they connect every neuron in one layer (the input layer)to every neuron in another layer (the output layer) Two varieties of SNNs existthe feed forward neural network (FF) and the radial basis network (RBF) An FFuses the sigmoid function as its activation function while the RBF uses the radialbasis function which estimates how far off the estimation is from the actual targetTo further distinguish the two FFs are mainly used for classification and decisionmaking whereas RBFs are typically used for function approximation and machine

17

CHAPTER 2 FRAME OF REFERENCE

tuning [37 38] As for data preparation an SNN input layer is restricted in strictlyprocessing 2D-arrays as input data

Deep Neural Networks (DNN)

DNNs contains more hidden layers in comparison to an SNN Explained in Goodfel-low et al [39] the multiple layers of the network can be thought of as approximatinga main function using multiple functions

f(x) = f (1) + f (2) + + f (n) (225)

where each function represents a layer and they all together describe a processThe purpose of adding additional layers is to break up the main function in manyfunctions so that certain functions do not have to be all descriptive but insteadonly take into consideration certain behaviour The proposed strategy by addingup towards thousands of layers has proved to continuously improve performanceof DNNs as presented by Zagoruyko et al [40] However adding layers does notnecessarily have to mean that the obtained performance is the best in terms of ac-curacy and efficiency as the same paper shows better results for many benchmarkdatasets using only 16 layers The same contradiction is explored and validated byBa et al [41] The reason for these contradictory results is unclear but the resultssuggests that the strength of deep learning (and DNNs) may be because of wellmatched architectures and existing training procedures of the deep networks

Mentioned in section 231 is that configurations of the hidden layers depend onthe objective of the NN The statement is partially proven true for the differencesin utilization of the two SNNs presented in section 234 and is further completedby a number of NN-configurations

Recurring Neural Networks(RNN)

Unlike the neurons in SNNs and DNNs an RNN has neurons which are state-basedThese state-based neurons allow the neurons to feed information from the previouspass of data to themselves The keeping of previous informations allows the networkto process and evaluate the context of how the information sent through the networkis structured An example is the ability to differentiate the two vectors

x1 =[0 0 1 1 0 0 0

]x2 =

[0 0 0 1 1 0 0

]where x1 and x2 are structurally different Were the sequence of the elements inthe vectors to be ignored there would be no difference about them Looking atcontext and sequencing thereby enables the analysis of events and data over timeThe addition of state-based neurons leads to more weights being updated througheach iteration as the adjustment of each state-based neuron individually is also

18

23 NEURAL NETWORKS

weight dependent The updating of the weights depends on the activation functionand an ill chosen activation function could lead to the vanishing gradient problem[42] which occurs when the gradient becomes incredibly small thus preventing theneuron from updating its weights The result of this is a rapid loss in information asthe weights through time become saturated causing previous state information tobe of no informatory value Similar to the vanishing gradient problem there is alsothe exploding gradient problem where the gradient instead becomes so incrediblyhuge that the weights are impossible to adjust

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishingexploding gradient problem LSTM networkswere developed with a dedicated memory cell The memory cell has three gatesinput (it) output (ot) and forget (ft) Inclusion of these gates allows for safeguard-ing the information that passes through the network either by stopping or allowingthe flow of information through the cell The gates can be represented by Equation226 with activation function (σ) weights for each respective gates neurons (wx)previous LSTM-block output at previous time step (htminus1) input at current timestep (xt) and respective gate bias (bx) as

it = σ(ωi

[htminus1 xt

]+ bi)

ot = σ(ωo

[htminus1 xt

]+ bo)

ft = σ(ωf

[htminus1 xt

]+ bf )

(226)

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convo-lution operation The convolution operation employs a kernel which is a functionthat acts as a filter The kernel slides across the input data while continuouslyapplying the filter to the data effectively reducing the input data The input datathat have been processed by a convolution layer are known as a convolved feature

Figure 24 A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer The pooling layer issimilar to the convolution layer in that it further reduces the size of the data byreducing the spatial size of the convolved feature In doing so the computationalpower required to process the data can be reduced as the dimensionality of the datadecreases Pooling can be done in two ways max pooling or average pooling If amax pooling-layer is applied the maximum value that is contained within the kernelwill be the returned value whereas for an average pooling-layer the average value ofall values within the kernel would be returned The nature of the max pooling allowsit to act as noise suppressant as it ignores the noisy activations by only extractingthe maximum value as well as removes the noise by reducing the dimensionality ofthe data For the average pooling the noise remains within the data however itreduces somewhat through the dimensionality reduction The removal of noise inthe data results in the max pooling to perform a lot better than average pooling[45]

20

23 NEURAL NETWORKS

Figure 25 A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and pooling layercan be done repeatedly to reduce the dimensionality of the data even more Lastlyhowever before feeding the data into the neurons of the neural network the pooledfeature map is flattened in the flattening layer This results in a significantly reducedinput array in comparison to the original data and is the primary reason that CNNsare used for multiple purposes such as image recognition and classification [46]natural language processing [47] and time series analysis [48]

Figure 26 A flattening layer flattening the feature map

21

Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimen-tal platform was set-up

31 Data Gathering and ProcessingThe filtration data were obtained from filter tests done by Alfa Laval at the lakeMalmasjon over a span of 2 weeks A total of 11 test cycles were recorded Datawere gathered for the duration of a complete test cycle of the filter which lastedfor a runtime of at least 40 minutes Each data point was sampled every 5 secondsand contains sensor data for the differential pressure over the filter the fluid flowin the entire system the fluid pressure in the entire system and the fluid flow inthe backflush mechanism All data were then stored in Alfa Lavals cloud serviceConnectivity

Figure 31 A complete test cycle

23

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

During the filter testing and the gathering of the data the backflush was manuallyinterrupted after a runtime of circa 20 minutes causing an increase in both differ-ential pressure and system fluid flow as can be seen in Figure 31 This was doneprimarily to see how long the filter would cope with the dirtiness of the water duringoperation without its self-cleaning capabilities After discussions with Alfa Lavalabout how to label such an external interference it was decided to remove sectionsof the data containing the stopping of the backflush in order to get complete testcycles At this point the data are unlabelled in terms of clogging and the clogginglabelling is done by visual inspection of the differential pressure and the systemfluid flow

Figure 32 A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance to the threeclogging states discussed in section 214 and verified by visual inspection Labellingthe current clogging state as 1 implies that the differential pressure remains linearor that it is yet to pass its initial starting value The clogging label is changed to 2when the differential pressure begins to increase steadily and the system flow eitherremains constant or experiences minor receding effects The label is considered as3 when the change in differential pressure experiences exponential increase and thesystem flow is decreasing drastically No tests were conducted where a clogginglabel of 3 was identified

24

31 DATA GATHERING AND PROCESSING

Figure 33 Test data from one test labelled for no clogging (asterisk) and beginningto clog (dot)

Figure 33 shows the clogging labels and the corresponding differential pressureand system flow rate and existing system pressure for one test cycle A completeoverview of all labelled points in the data set can be seen in Figure 34

Figure 34 Test data from all tests labelled for no clogging (asterisk) and beginningto clog (dot)

As is observable a majority of the unclogged data is clustered around low differentialpressure while beginning clogging is more frequently found at higher differentialpressure Furthermore as the the two groups of labels have overlapping data pointsit can be noted that a linear classifier is not enough to distinguish the true label ofthe data as it cannot be entirely separated into two clusters A summary containing

25

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

the amount of data points and respective clogging labels for each test cycle can befound in Table 31

Table 31 Amount of data available after preprocessing

Test Samples Points labelled clog-1 Points labelled clog-2I 685 685 0II 220 25 195III 340 35 305IV 210 11 199V 375 32 343VI 355 7 348VII 360 78 282VIII 345 19 326IX 350 10 340X 335 67 268XI 340 43 297

Total 3195 1012 2903

When preprocessing was finished the entire dataset contains 3915 samples with 1012samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2

32 Model Generation

In order to obtain the data required for predicting clogging the pre-processed datawere put through two neural networks to be evaluated as a regression problemThe regression analysis allows for gathering and preparing a set of predicted valuesof each parameter as well as the corresponding clogging label From the conceptgeneration phase and the current use of neural networks to evaluate multivariatetime series two network models were used the LSTM and the CNN The LSTM toinitially test the suitability of the data for time series forecasting and the CNN formulti-step time series forecasting

Before the regression analysis could begin the pre-processed data had to be pro-cessed further to increase network accuracy and time efficiency Because of the factthat large values in the input data can result in a model forced to learn large weightsthus resulting in an unstable model a label transform and a scaler transform areapplied to the input data The purpose of the encoder transform is to retain thedifference between the determined clogging labels and the scaler transform ensuresthat the data is within an appropriate scale range

The label transform applied is known as one hot encoding One hot encoding takescategorical variables removes them and generates a binary representation of the

26

32 MODEL GENERATION

variables The encoding can be done for both integers and tags such as123

rarr1 0 0

0 1 00 0 1

or

redbluegreen

rarr1 0 0

0 1 00 0 1

so that each new column corresponds to a different value of the initial variableOne hot encoding ensures that each category is treated and predicted indifferentlywithout assuming that one category is more important because we want to equallypredict all the actual classification labels rather than prioritize a certain categoryThe precision of one hot encoding in comparison to other equally simple encodingtechniques has shown by Seger [49] to be equal Potdar et al [50] show that one hotencoding achieves sufficiently higher accuracy than simple encoding techniques butthat there are also more sophisticated options available that achieve higher accuracy

The scaler transform used is the min-max scaler The min-max scaler shrinksthe range of the dataset so that it is between 0 and 1 by applying the followingtransform to every feature

xi minusmin(x)max(x)minusmin(x) (31)

Using the min-max-scaler to normalize that data is useful because it helps to avoidthe generation of large weights The transform is also easy to inverse which makesit possible to revert back to the original values after processing

321 Regression Processing with the LSTM ModelBefore the data are sent through the LSTM each variable is processed by a sequenc-ing function (SF) The SF decides the amount of past values that should match afuture prediction In this case the function dictates the scale of the time windowof previous measurements to predict the measurement of one time step The LSTMmodel uses 5 previous values per prediction making the time window 25 secondslong and the prediction a 5 second foresight Each categorical variable in the orig-inal dataset is considered a feature in the data That means that by processingthe data through the sequencing function the set of features that correspond toone value is expanded accordingly with the time window The difference from theexpansion of the features can be described by Equation 32 and Equation 33 Itshould be noted that while the set of features per time step increases the size ofthe dataset is decreased proportionally to how many past time steps are used asmore measurements are required per time step

X(t) =[V1(t) V2(t) Vnminus1(t) Vn(t)

](32)

X(t) =[V1(tminus 5) V2(tminus 5) Vnminus1(t) Vn(t)

](33)

27

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced the data is split into a training set consisting of 80 of the dataand a validation set consisting of 20 of the data Both the training set and thevalidation set contains input data and the corresponding true output data Oncesplit the data can finally be reshaped to be put through the LSTM The reshapingensures that the three dimensions of the data are defined by

bull Samples - The amount of data points

bull Time steps - The points of observation of the samples

bull Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function thatinitially processes the input data before they are passed to the output layer with thesigmoid activation function There the data output by the network is compared tothe true output data to adjust the weights for achieving a better output Each LSTMlayer contains 32 neurons and the output layer contains 1 neuron for parameterprediction

Figure 35 An overview of the LSTM network architecture

The training of the network is run for 1500 epochs but with a forced early stopwhen the validation loss has not seen any improvement for 150 subsequent epochsLimiting the network at training in such a way ensures that the network is notoverfitted to the training data

322 Regression Processing with the CNN Model

As with the LSTM the input data require some additional processing before it canbe fed through the CNN The dataset is fed through a sequence splitting function(SSF) that will extract samples from the dataset to give the data the correct di-mensions Just like the LSTM the dimensions are samples time steps and featuresSpecified in the SSF is the time window of past observations to be used for pre-diction as well as the amount of observations to be predicted The time windowfor past observations encompasses 12 observations and therefore uses observationsfrom the past 60 seconds whereas the time window for future predictions is set to 6

28

32 MODEL GENERATION

observations giving the predicted clogging state and rate for the coming 30 secondsThe dataset is then split into training and validation sets of 80 and 20 respec-tively of the amount of original data Like the data in the LSTM the training setand the validation set contains input data as well as data for what is the correctoutput for that input

The architecture of the network can be seen in Figure 36 The convolutional layertakes an argument to decide the amount of filterskernels to pass over the inputdata In this case 64 different filters are used and passed over the data with a ker-nel size of 4 time steps to generate the feature map The feature map then passesthrough the max pooling layer with a pool size of 2 further reducing the map Themap is then flattened before it is passed through two fully connected layers onewith 50 nodes and the last one with nodes to equally match the desired amount ofpredictions in this case 6

Figure 36 An overview of the CNN architecture

Similarly to the LSTM the CNN is set to be trained for 1500 epochs but witha forced early stop when the validation loss hasnrsquot seen any improvement for 150subsequent epochs

323 Label Classification

With the data from the regression analysis the label classification could be doneFor classification with the LSTM the same network structure was used as for regres-sion as it can be decided in the network directly which variable the network shouldpredict andor evaluate The data were again split into a training set consisting of80 of the data and a validation set consisting of 20 of the data

For the CNN two sets of data were extracted from the CNN networks used in theregression analysis The data consisted of the true observations y and the predictedobservations y for each parameter in the original dataset The true observationswere used for training and validation when creating the network and the predictedobservations were used for evaluating the accuracy of the clogging label classifica-tion Likewise the training and validation data were split into parts of 80 and

29

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

20 respectively The testing set was split into the same fractions but only thefraction of 20 was kept to equally match the size of the validation set

For classification the input data and output data were adjusted so that the inputdata only contained the values of the variables and the output data only containedthe clogging labels The adjustment learns the network that certain values of vari-ables correspond to a specific clogging label The classification CNN was trainedon the training data and validated on the validation set The data from the testingset were then fed through the network and compared to the validation set

33 Model evaluation

During training on both networks the weights in the layers are tweaked and opti-mized according to a loss function The loss function is selected to improve andevaluate the networks capabilities of achieving a high rate of classification or re-gression on certain variables In essence some loss functions are better suited forevaluating the networks than what they would be for a regression problem and viceversa This led to different loss functions being used when training the network forpredicting a future clogging labels than when training the network to predict futurevalues of system variables

For the regression analysis both MSE and MAE were used When using MSElarge errors would be more penalizing as they come from outliers and an overalllow MSE would indicate that the output is normally distributed based on the inputdata MAE would allow outliers to play a smaller role and produce a good MAEscore if the distribution is multimodal For a multimodal distribution a predictionat the mean of two modes would result in a bad score as is generated by the MSEwhile the MAE will allow for predictions at each individual mode To summariseMAE is more robust to outliers while MSE is more sensitive to outliers

For the clogging labels the network used a loss function for minimising the bi-nary cross-entropy (also known as log loss) As can be seen in Figure 37 identicalvalues of the same variable can belong to different clogging labels Therefore theloss function has to be able to deduce what clogging label a particular data pointbelongs to which is something that binary cross-entropy is capable of

30

34 HARDWARE SPECIFICATIONS

Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

34 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 38 An overview of the system and the location of the pressure transducers(PT) flow indicator transmitter (FIT) and flow meter (FM)

31

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the capability of NNs to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better at predicting filter clogging. The upside of using these methods is that the inner workings of older statistical models are better understood than those of ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1_score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

2.2 Predictive Analytics

In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity; the false positive rate equals 1 − specificity. Sensitivity and specificity are given by Equations 2.6 and 2.7, respectively.

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

With the sensitivity on the y-axis and 1 − specificity on the x-axis, the ROC plot is obtained, where every true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the fraction of the samples classified as positive that are actually positive, and recall refers to the fraction of the actual positive samples that are correctly classified [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
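
To make the relationship between the counts and the metrics concrete, the following minimal sketch computes precision, recall and F1 directly from Equations 2.8–2.10; the label vectors are made-up illustrations and not taken from the thesis data.

    import numpy as np

    def precision_recall_f1(y_true, y_pred):
        # Counts from the binary confusion matrix
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        precision = tp / (tp + fp)                            # Equation 2.8
        recall = tp / (tp + fn)                               # Equation 2.9
        f1 = 2 * precision * recall / (precision + recall)    # Equation 2.10
        return precision, recall, f1

    # Illustrative labels only (1 = positive class, 0 = negative class)
    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
    print(precision_recall_f1(y_true, y_pred))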

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
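
As a minimal numerical illustration of Equation 2.11, the sketch below evaluates the log loss for a handful of made-up class probabilities; the probabilities are clipped to avoid taking the logarithm of zero.

    import numpy as np

    def log_loss(y_onehot, p_pred, eps=1e-15):
        # y_onehot: one hot indicator matrix (observations x M classes)
        # p_pred:   predicted class probabilities of the same shape
        p = np.clip(p_pred, eps, 1 - eps)
        # Equation 2.11, averaged over the observations
        return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

    y = np.array([[1, 0], [0, 1], [0, 1]])              # illustrative labels
    p = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])  # illustrative probabilities
    print(log_loss(y, p))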

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the behaviour of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged.

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers given enough samples n.
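
The three error measures above translate directly into a few lines of NumPy, as in the sketch below; the target and prediction values are placeholders and not results from the thesis.

    import numpy as np

    def regression_errors(y_true, y_pred):
        err = y_true - y_pred
        mae = np.mean(np.abs(err))    # Equation 2.12
        mse = np.mean(err ** 2)       # Equation 2.13
        rmse = np.sqrt(mse)           # Equation 2.14
        return mae, mse, rmse

    y_true = np.array([0.32, 0.35, 0.40, 0.47])   # placeholder values
    y_pred = np.array([0.30, 0.36, 0.43, 0.45])
    print(regression_errors(y_true, y_pred))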

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27].

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affects the dependent variable. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
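
A small sketch of how r² in the squared-correlation form of Equation 2.18 and the adjusted r² of Equation 2.19 could be computed; the data points and the number of predictors k are made up for the example.

    import numpy as np

    def r2_score(y_true, y_pred):
        # Squared correlation between actual and predicted values (Equation 2.18)
        num = np.sum((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
        den = np.sqrt(np.sum((y_true - y_true.mean()) ** 2) *
                      np.sum((y_pred - y_pred.mean()) ** 2))
        return (num / den) ** 2

    def adjusted_r2(r2, n, k):
        # Equation 2.19 with n observations and k predictor variables
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    y_true = np.array([0.32, 0.35, 0.40, 0.47, 0.55])   # placeholder values
    y_pred = np.array([0.30, 0.36, 0.43, 0.45, 0.52])
    r2 = r2_score(y_true, y_pred)
    print(r2, adjusted_r2(r2, n=len(y_true), k=2))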

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The particular strength of ARIMA and SARIMA models is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
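
As an indication of how such a model could be applied to a univariate system variable, the sketch below fits a SARIMA model with statsmodels and forecasts six steps ahead; the generated series, the model orders and the forecast horizon are placeholders rather than values used in the thesis.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Placeholder series standing in for e.g. the differential pressure signal
    rng = np.random.default_rng(0)
    series = 10 + np.cumsum(rng.normal(0, 0.1, 200))

    # Seasonal ARIMA(p,d,q)(P,D,Q,s); the orders here are illustrative only
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
    fitted = model.fit(disp=False)

    # Forecast the next 6 observations (30 seconds at a 5-second sampling interval)
    print(fitted.forecast(steps=6))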

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
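
Equation 2.20 maps almost one-to-one to code; in the sketch below the weights and bias are arbitrary illustrative numbers.

    import numpy as np

    def perceptron(x, w, b):
        # Weighted sum of the inputs plus bias, thresholded at zero (Equation 2.20)
        return 1 if np.dot(w, x) + b > 0 else 0

    x = np.array([1, 0, 1])         # binary inputs
    w = np.array([0.6, -0.4, 0.3])  # illustrative weights
    b = -0.5                        # illustrative bias
    print(perceptron(x, w, b))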

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^+ = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it outputs zero and has a zero gradient for negative inputs, which can cause a unit to stop updating altogether, also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot sigmoid(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on the widely used ImageNet dataset by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
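
The three activation functions discussed above can be summarised in a few lines of NumPy; β = 1 is used for Swish here as an assumed default, in which case it reduces to x · σ(x).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))     # Equation 2.21

    def relu(x):
        return np.maximum(0.0, x)           # Equation 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)        # Equation 2.24

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(z), relu(z), swish(z), sep="\n")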

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0\ 0\ 1\ 1\ 0\ 0\ 0]
x_2 = [0\ 0\ 0\ 1\ 1\ 0\ 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informative value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
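
A minimal NumPy sketch of the gate activations in Equation 2.26 for a single time step is given below; the weight matrices and dimensions are made up for illustration, and a complete LSTM cell would additionally update the cell state and the hidden state.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
        # Concatenate previous output and current input, as in Equation 2.26
        z = np.concatenate([h_prev, x_t])
        i_t = sigmoid(W_i @ z + b_i)   # input gate
        o_t = sigmoid(W_o @ z + b_o)   # output gate
        f_t = sigmoid(W_f @ z + b_f)   # forget gate
        return i_t, o_t, f_t

    hidden, features = 4, 3   # illustrative sizes
    rng = np.random.default_rng(1)
    W_i, W_o, W_f = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
    b = np.zeros(hidden)
    print(lstm_gates(np.zeros(hidden), rng.normal(size=features), W_i, W_o, W_f, b, b, b))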

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1 0 0], [0 1 0], [0 0 1]]

or

[red, blue, green] → [[1 0 0], [0 1 0], [0 0 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because we want to predict all the actual classification labels equally well rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
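
A sketch of how the two transforms could be applied with scikit-learn is shown below; the toy feature matrix and label column are placeholders and not the thesis data, and the thesis does not state which library was used for the transforms.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Placeholder sensor readings (columns could be differential pressure, system flow, ...)
    X = np.array([[0.31, 230.0],
                  [0.35, 228.0],
                  [0.52, 215.0]])
    labels = np.array([[1], [1], [2]])   # clogging labels

    # Min-max scaling to the range [0, 1] (Equation 3.1)
    X_scaled = MinMaxScaler().fit_transform(X)

    # One hot encoding of the clogging labels
    labels_onehot = OneHotEncoder().fit_transform(labels).toarray()

    print(X_scaled)
    print(labels_onehot)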

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
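
In Keras terms, the LSTM network described above corresponds roughly to the sketch below. The layer sizes, activation functions, epoch limit and early-stopping patience follow the text, while the optimiser choice, the feature count and the training-data variable names are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_timesteps, n_features = 5, 4   # 5 past observations; feature count is illustrative

    model = Sequential([
        Input(shape=(n_timesteps, n_features)),
        LSTM(32, activation="relu", return_sequences=True),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),   # single-value parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")   # optimiser is an assumption; loss is MAE or MSE as in the text

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=1500, callbacks=[early_stop])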

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
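
The exact sequence splitting function used in the thesis is not reproduced here, but the windowing idea can be sketched as follows, with 12 past and 6 future observations and a placeholder data array.

    import numpy as np

    def split_sequences(data, n_in=12, n_out=6):
        # Turn a (time, features) array into input windows of n_in past
        # observations and output windows of n_out future observations.
        X, y = [], []
        for start in range(len(data) - n_in - n_out + 1):
            X.append(data[start:start + n_in])
            y.append(data[start + n_in:start + n_in + n_out])
        return np.array(X), np.array(y)

    data = np.random.default_rng(2).normal(size=(100, 4))   # placeholder: 100 time steps, 4 variables
    X, y = split_sequences(data)
    print(X.shape, y.shape)   # (83, 12, 4) (83, 6, 4)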

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
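
A rough Keras sketch of the CNN described above is given below. The number of filters, the kernel size, the pool size, the dense-layer sizes and the early-stopping settings follow the text, while the convolutional activation function, the optimiser and the feature count are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_timesteps, n_features = 12, 4   # 60 s of history; feature count is illustrative

    model = Sequential([
        Input(shape=(n_timesteps, n_features)),
        Conv1D(filters=64, kernel_size=4, activation="relu"),   # activation is an assumption
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(6),   # 6 future observations, i.e. a 30-second horizon
    ])
    model.compile(optimizer="adam", loss="mae")   # optimiser is an assumption; loss is MAE or MSE as in the text

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=1500, callbacks=[early_stop])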

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided in the network directly which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks on classification than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not surprising as this regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, between the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate.


The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as its loss function, or that it is due to the data becoming more normally distributed through the convolutional computations.


A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and the F1-score, an imbalanced class distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it isn't overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead.


Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN with data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that the observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
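A short NumPy sketch of Equation 2.11, assuming one-hot encoded labels and predicted class probabilities; the example values are illustrative only.

```python
import numpy as np

def log_loss(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-15) -> float:
    """Multi-class log loss for one-hot labels y_true and predicted probabilities y_prob."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Example: two observations, three classes
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
print(log_loss(y_true, y_prob))  # approx. 0.29
```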

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. Taking the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to its respective squared target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = \frac{100}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measure is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between -\infty and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well the terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \frac{(1 - r^2)(n - 1)}{n - k - 1}    (2.19)

Adjusted r2 can therefore more accurately show the proportion of variation in the dependent variable that is explained by the independent variables that actually affect it. Furthermore, adding independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
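The regression error metrics above can be summarised in a short NumPy sketch. Note that r2 is computed here in the common 1 - SS_res/SS_tot form rather than the squared-correlation form of Equation 2.18, and MAPE assumes no zero values in the targets.

```python
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray, k: int) -> dict:
    """MAE, MSE, RMSE, MAPE and (adjusted) r2 for n observations and k predictors."""
    n = len(y_true)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y_true))      # requires y_true != 0
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "MAPE": mape, "r2": r2, "r2_adj": r2_adj}
```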

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term.


The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
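As a hedged illustration only (no such model is fitted in the thesis), a SARIMA model could be fitted to a univariate series such as the differential pressure using statsmodels; the model orders below are placeholders, not tuned values.

```python
# Illustrative sketch using statsmodels; the (p,d,q)(P,D,Q,s) orders are assumptions.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarima_forecast(series: pd.Series, steps: int = 6) -> pd.Series:
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
    fitted = model.fit(disp=False)
    return fitted.forecast(steps=steps)  # forecast `steps` future observations
```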

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
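The perceptron rule of Equation 2.20 can be written in a few lines of NumPy; the weights and bias below are arbitrary illustrative values.

```python
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float) -> int:
    """Binary output from a weighted sum of the inputs plus an individual bias."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: two binary inputs with hand-picked weights and bias
print(perceptron(np.array([1, 0]), np.array([0.6, 0.4]), b=-0.5))  # 1
print(perceptron(np.array([0, 0]), np.array([0.6, 0.4]), b=-0.5))  # 0
```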

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are either negative or approach zero, which is known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is either a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
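The three activation functions discussed above can be written compactly in NumPy, for example:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))          # Equation 2.21

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)                # Equation 2.23

def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    # beta can be a trainable parameter; here it is a constant (Equation 2.24)
    return x * sigmoid(beta * x)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), swish(z))
```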

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making,


whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that each function does not have to be all-descriptive but instead only has to take certain behaviour into consideration. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. That statement is partially proven true by the differences in utilization of the two SNNs presented above, and is further illustrated by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0, 0, 1, 1, 0, 0, 0]
x_2 = [0, 0, 0, 1, 1, 0, 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t-1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may at first seem odd, but for sequencing it can be of value, for example when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allows the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it further removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
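A small NumPy sketch of non-overlapping 1-D max and average pooling, illustrating the dimensionality reduction described above:

```python
import numpy as np

def pool_1d(x: np.ndarray, pool_size: int = 2, mode: str = "max") -> np.ndarray:
    """Non-overlapping 1-D pooling: reduces the length of x by a factor pool_size."""
    n = len(x) // pool_size * pool_size          # drop a trailing remainder, if any
    windows = x[:n].reshape(-1, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(pool_1d(x, 2, "max"))      # [3. 5. 4.]
print(pool_1d(x, 2, "average"))  # [2.  3.5 2. ]
```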


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush, in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered to be 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
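Since the actual labelling script is not reproduced in the thesis, the following is only a hedged sketch of how such a rule-based pass could look; the column names and slope thresholds are assumptions chosen for illustration.

```python
# Hedged sketch of a rule-based labelling pass; 'diff_pressure', 'system_flow' and
# the threshold values are assumptions, and labels were verified visually in the thesis.
import pandas as pd

def label_clogging(df: pd.DataFrame, window: int = 12,
                   slope_clog: float = 0.001, slope_severe: float = 0.01) -> pd.Series:
    # Rolling mean slope of the differential pressure over `window` samples (5 s sampling)
    dp_slope = df["diff_pressure"].diff().rolling(window).mean()
    flow_drop = df["system_flow"].diff().rolling(window).mean()

    labels = pd.Series(1, index=df.index)                     # 1: no clogging
    labels[dp_slope > slope_clog] = 2                         # 2: beginning to clog
    labels[(dp_slope > slope_severe) & (flow_drop < 0)] = 3   # 3: severe clogging (not observed)
    return labels
```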


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters.


A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When the preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM, to initially test the suitability of the data for time series forecasting, and the CNN, for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables.


The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

or

\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
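A minimal sketch of the two transforms, assuming scikit-learn is used (the thesis does not name the library); the example values and column layout are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

labels = np.array([[1], [2], [2], [1]])               # clogging labels
features = np.array([[0.02, 290.0], [0.15, 250.0],    # e.g. diff. pressure, system flow
                     [0.30, 240.0], [0.05, 285.0]])

# One hot encoding of the clogging labels
onehot = OneHotEncoder().fit_transform(labels).toarray()   # [[1,0],[0,1],[0,1],[1,0]]

# Min-max scaling of every feature to the range [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)
original = scaler.inverse_transform(scaled)  # the transform is easy to invert
```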

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
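The description above could be realised with Keras roughly as follows; the layer sizes, window length and early-stopping patience follow the text, while the optimiser, the random stand-in data and the variable names are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features = 5                                   # assumed number of input variables
X_train = np.random.rand(800, 5, n_features)     # stand-in for the sequenced data
y_train = np.random.rand(800, 1)
X_val = np.random.rand(200, 5, n_features)
y_val = np.random.rand(200, 1)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(5, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),              # one-step parameter prediction
])
model.compile(optimizer="adam", loss="mae")      # "mse" was evaluated as well

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)
```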

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
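A corresponding Keras sketch of the CNN is shown below. The filter count, kernel size, pool size and layer widths follow the text; the one-dimensional convolution over the time axis, the optimiser and the stand-in data are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features = 5                                   # assumed number of input variables
X_train = np.random.rand(800, 12, n_features)    # 12 past observations = 60 seconds
y_train = np.random.rand(800, 6)                 # 6 future observations = 30 seconds

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                                    # one output per predicted time step
])
model.compile(optimizer="adam", loss="mae")      # "mse" was evaluated as well

model.fit(X_train, y_train, validation_split=0.2, epochs=1500,
          callbacks=[EarlyStopping(monitor="val_loss", patience=150)], verbose=0)
```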

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).

The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      109       1
         Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time in coping with a large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted for a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed because of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the positive number and negative number of examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. Although, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, for an LSTM and a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem in optimising the LSTM and the CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, 03 2018, Kuala Lumpur, Malaysia.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 2019, 14:45–79. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 175 60806 and 2408, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


2.2 Predictive Analytics

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad (2.13)$$

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad (2.14)$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged.

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}}\frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers with a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27].

$$MSPE = \frac{100\%}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \qquad (2.16)$$

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the prediction values. Like r2, MAPE is scale-free and is obtained through

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2.17)$$


Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \qquad (2.18)$$

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 will adjust for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \qquad (2.19)$$

Adjusted r2 can therefore accurately show the percentage of variation in the independent variables that affects the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
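For reference, the regression metrics above can be written out in a few lines of NumPy. The sketch below is purely illustrative; in particular, the coefficient of determination is implemented in the common 1 − SS_res/SS_tot form used by most ML libraries, rather than the squared-correlation form of Equation 2.18.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(y, y_hat, k):
    n = len(y)                  # k is the number of predictors in the model
    return 1.0 - (1.0 - r2(y, y_hat)) * (n - 1) / (n - k - 1)
```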

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
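As an illustration of how such a model could be fitted in practice, the sketch below uses the SARIMAX implementation in statsmodels on a synthetic stand-in series; the model orders are placeholders, not values used or recommended in the thesis.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stand-in for a univariate series such as the differential pressure
series = pd.Series(np.random.randn(200).cumsum())

# order = (p, d, q), seasonal_order = (P, D, Q, s); d and D handle the differencing
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
result = model.fit(disp=False)

forecast = result.forecast(steps=6)   # predict the next six observations
```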

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)$$

In the above equation, x is the input vector, w the weight vector and b is the perceptron's individual bias.
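Equation 2.20 translates directly into code; the sketch below is a plain NumPy illustration with made-up weights and bias.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum of the inputs plus the bias is positive."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example with made-up values: three binary inputs, their weights and a bias
output = perceptron(x=np.array([1, 0, 1]), w=np.array([0.4, -0.2, 0.7]), b=-0.5)
```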

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)$$

for

$$z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

$$f(x) = x^+ = \max(0, x) \qquad (2.23)$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are either negative or approaching zero, which is also known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)$$

where β is a trainable parameter or simply a constant. Swish has proved to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
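The three activation functions above can be written out with NumPy as follows (illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # beta can be a trainable parameter or simply a constant
    return x * sigmoid(beta * x)
```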

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, e.g. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be because of well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}$$

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_{t-1}), input at the current time step (x_t) and respective gate bias (b_x), as

$$\begin{aligned} i_t &= \sigma(\omega_i \left[ h_{t-1}, x_t \right] + b_i) \\ o_t &= \sigma(\omega_o \left[ h_{t-1}, x_t \right] + b_o) \\ f_t &= \sigma(\omega_f \left[ h_{t-1}, x_t \right] + b_f) \end{aligned} \qquad (2.26)$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could be seen as odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
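As a concrete illustration, the gate activations in Equation 2.26 can be computed for a single LSTM block as in the NumPy sketch below; the weight matrices and biases are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Compute the input, output and forget gate activations of one LSTM block."""
    z = np.concatenate([h_prev, x_t])     # the concatenation [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ z + b_i)          # input gate
    o_t = sigmoid(w_o @ z + b_o)          # output gate
    f_t = sigmoid(w_f @ z + b_f)          # forget gate
    return i_t, o_t, f_t
```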

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs, but have one gate less and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is being kept from the past state and how much information is being let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value that is contained within the kernel will be the returned value, whereas for an average pooling layer the average value of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, as well as removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
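A plain NumPy sketch of one-dimensional pooling with pool size 2 is shown below for illustration:

```python
import numpy as np

def pool1d(x, size=2, mode="max"):
    """Pool a 1D feature map with non-overlapping windows of the given size."""
    x = x[: len(x) - len(x) % size].reshape(-1, size)   # drop any remainder
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)

feature_map = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])
pooled_max = pool1d(feature_map, mode="max")    # [3.0, 5.0, 4.0]
pooled_avg = pool1d(feature_map, mode="mean")   # [2.0, 3.5, 2.0]
```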


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
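A hypothetical sketch of such a labelling script is given below; the window length and trend thresholds are invented for illustration and would in practice be tuned and verified visually, as described above.

```python
import numpy as np

def label_clogging(dp, flow, dp_start, window=12):
    """Assign clogging labels 1-3 from differential pressure and system flow."""
    labels = np.ones(len(dp), dtype=int)
    steps = np.arange(window)
    for i in range(window, len(dp)):
        dp_trend = np.polyfit(steps, dp[i - window:i], 1)[0]      # pressure slope
        flow_trend = np.polyfit(steps, flow[i - window:i], 1)[0]  # flow slope
        if dp[i] > dp_start and dp_trend > 0.001 and flow_trend > -0.5:
            labels[i] = 2    # steadily increasing pressure, flow roughly constant
        if dp_trend > 0.01 and flow_trend < -0.5:
            labels[i] = 3    # rapidly increasing pressure, drastically falling flow
    return labels
```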


Figure 33 Test data from one test labelled for no clogging (asterisk) and beginningto clog (dot)

Figure 33 shows the clogging labels and the corresponding differential pressureand system flow rate and existing system pressure for one test cycle A completeoverview of all labelled points in the data set can be seen in Figure 34

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples   Points labelled clog-1   Points labelled clog-2
I           685                      685                        0
II          220                       25                      195
III         340                       35                      305
IV          210                       11                      199
V           375                       32                      343
VI          355                        7                      348
VII         360                       78                      282
VIII        345                       19                      326
IX          350                       10                      340
X           335                       67                      268
XI          340                       43                      297

Total      3915                     1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

1 → [1 0 0],  2 → [0 1 0],  3 → [0 0 1]

or

red → [1 0 0],  blue → [0 1 0],  green → [0 0 1]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted impartially, without assuming that one category is more important than another, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] has shown the precision of one-hot encoding to be on par with that of other equally simple encoding techniques, while Potdar et al. [50] show that one-hot encoding achieves noticeably higher accuracy than simpler encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
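As an illustration of the encoding step (using scikit-learn, which the thesis does not explicitly name):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    labels = np.array([[1], [2], [1], [2], [2]])        # clogging labels as a column vector
    encoder = OneHotEncoder(categories=[[1, 2, 3]])     # label 3 reserved even if unseen
    onehot = encoder.fit_transform(labels).toarray()
    # label 1 -> [1, 0, 0], label 2 -> [0, 1, 0], label 3 -> [0, 0, 1]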

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

x'_i = (x_i − min(x)) / (max(x) − min(x))        (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
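A minimal scikit-learn sketch of the scaling step (the library choice is an assumption):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[0.02, 210.0], [0.15, 180.0], [0.40, 120.0]])   # e.g. diff. pressure, system flow
    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)                 # every column now lies in [0, 1]
    X_restored = scaler.inverse_transform(X_scaled)    # easy to map back to the original units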

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should be matched to a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V1(t), V2(t), ..., Vn−1(t), Vn(t)]                (3.2)

X(t) = [V1(t−5), V2(t−5), ..., Vn−1(t), Vn(t)]            (3.3)
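A sketch of what such a sequencing function could look like (the thesis does not list its implementation):

    import numpy as np

    def make_sequences(values, n_past=5):
        """Turn a (samples, features) array into (samples, n_past, features) windows,
        each paired with the observation one step ahead, mirroring Equations 3.2-3.3."""
        X, y = [], []
        for i in range(n_past, len(values)):
            X.append(values[i - n_past:i])   # the previous 5 samples (a 25 s window)
            y.append(values[i])              # the value 5 s ahead
        return np.array(X), np.array(y)

    # X.shape == (n - 5, 5, n_features), y.shape == (n - 5, n_features)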


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way helps ensure that the network is not overfitted to the training data.
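Put together, the described LSTM could be defined roughly as below. This is a sketch assuming a Keras/TensorFlow implementation; the optimizer, number of features and the training arrays (from the sequencing function above) are assumptions the thesis does not state.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_features = 5   # assumed: four sensor channels plus the clogging label

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True, input_shape=(5, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),          # one neuron for the predicted parameter
    ])
    model.compile(optimizer="adam", loss="mae")  # or "mse"; both loss functions were evaluated

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=1500, callbacks=[early_stop])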

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output for that input.
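A sketch of such a sequence splitting function, with the 12-step history and 6-step horizon described above (the choice of which column to predict is an assumption):

    import numpy as np

    def split_sequences(values, n_past=12, n_future=6):
        """Window a multivariate series into 60 s of history (12 samples) as input
        and the following 30 s (6 samples) of one parameter as the multi-step target."""
        X, y = [], []
        for i in range(n_past, len(values) - n_future + 1):
            X.append(values[i - n_past:i])          # shape (12, n_features)
            y.append(values[i:i + n_future, 0])     # e.g. 6 future values of the first parameter
        return np.array(X), np.array(y)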

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
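Assuming a Keras/TensorFlow implementation (not stated in the thesis), the described CNN could be sketched as below; the optimizer and the activation of the hidden layers are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_features = 5   # assumed: four sensor channels plus the clogging label

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(6),                        # one output per predicted future time step
    ])
    model.compile(optimizer="adam", loss="mae")   # or "mse"

    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500,
              callbacks=[EarlyStopping(monitor="val_loss", patience=150)])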

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification task than they are for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes, which is what minimising the MSE tends to produce, gives a poor fit to either mode, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.
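A small NumPy illustration of this difference on a bimodal target (not taken from the thesis data):

    import numpy as np

    y_true = np.array([0.0, 0.0, 10.0, 10.0])     # a bimodal target with modes at 0 and 10

    def mae(pred): return np.mean(np.abs(y_true - pred))
    def mse(pred): return np.mean((y_true - pred) ** 2)

    print(mae(5.0), mse(5.0))   # 5.0, 25.0 - predicting the mean of the two modes
    print(mae(0.0), mse(0.0))   # 5.0, 50.0 - predicting one mode: same MAE, much worse MSE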

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, something that binary cross-entropy is capable of.
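For reference, binary cross-entropy over predicted probabilities can be written as the following sketch:

    import numpy as np

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        """Mean log loss for true labels in {0, 1} and predicted probabilities."""
        y_prob = np.clip(y_prob, eps, 1 - eps)    # avoid log(0)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    # e.g. clogging label 2 encoded as 1 and label 1 encoded as 0
    print(binary_cross_entropy(np.array([0, 1, 1]), np.array([0.1, 0.8, 0.9])))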


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
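The thesis does not state how the metrics were computed; a typical scikit-learn sketch for obtaining them from the validation targets y_val and the network predictions y_pred would be:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    mse  = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    mae  = mean_absolute_error(y_val, y_pred)
    r2   = r2_score(y_val, y_pred)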

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1          109         1
         Label 2            3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1           82        29
         Label 2           38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1           69        41
         Label 2           11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valérie Bourdès, Stéphane Bonnevay, P.J.G. Lisboa, Rémy Defrance, David Pérol, Sylvie Chabaud, Thomas Bachelot, Thérèse Gargi and Sylvie Négrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems – Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Page 20: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 2 FRAME OF REFERENCE

Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluatedmodel the coefficient of determination can be used [18] This allows for comparinghow much better the model is in comparison to a constant baseline r2 is scale-freein comparison to MSE and RMSE and bound between minusinfin and 1 so it does notmatter if the output values are large or small the value will always be within thatrange A low r2 score means that the model is bad at fitting the data

r2 =

sumni=1((yi minus yi)(yi minus yi))2radicsumn

i=1(yi minus yi)2sumni=1(yi minus yi)2

2

(218)

r2 has some drawbacks It does not take into account if the fit coefficient estimatesand predictions are biased and when additional predictors are added to the modelthe r2-score will always increase simply because the new fit will have more termsThese issues are handled by adjusted r2

Adjusted r2

Adjusted r2 just like r2 indicates how well terms fit a curve or a line The differenceis that adjusted r2 will adjust for the number of terms or predictors in the model Ifmore variables are added that prove to be useless the score will decrease while thescore will increase if useful variables are added This leads adjusted r2 to alwaysbe less than or equal to r2 For n observations and k variables the adjusted r2 iscalculated through

r2adj = 1minus

[(1minusr2)(nminus1)

nminuskminus1

](219)

Adjusted r2 can therefore accurately show the percentage of variation of the in-dependent variables that affect the dependent variables Furthermore adding ofadditional independent variables that do not fit the model will penalize the modelaccuracy by lowering the score [29]

223 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions The modelsallow for random variation in one or more input variables at a time to generatedistributions of potential outcomes

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a movingaverage model (MA) An AR model assumes that a future value of a variable canbe predicted as a linear combination of past values of that variable plus a randomerror with a constant term The MA model is like the AR model in that its output

14

23 NEURAL NETWORKS

value depends on current and past values The difference between the two is thatwhile the AR model intends to model and predict the observed variable the MAmodel intends to model the error term as a linear combination of the error termsthat occur simultaneously and at different past times

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA)and can be fitted to time series data in order to obtain a better understanding ofthe data or be used as forecasting methods to predict future data points WhileARMA requires the data to be completely stationary ie the mean and variance donot change over time ARIMA can process non-stationary time series by removingthe non-stationary nature of the data This means that non-stationary time seriesdata must be processed before it can be modelled Removing the trend makes themean value of the data stationary something that is done by simply differencingthe series To make the series stationary on variance one of the best methods is toapply a log transform to the series Combining the two methods by differencing thelog transformed data makes the entire series stationary on both mean and varianceand allows for the dataset to be processed by the model

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA)model The SARIMA model applies a seasonal differencing of necessary order toremove non-stationarity from the time series ARIMAs and SARIMAs strengthis particularly identified as its ability to predict future data points for univariatetime series In a comparison published by Adhikari et al [30] a SARIMA model isseen to outperform both neural networks and support-vector machines in forecastestimation

23 Neural Networks

231 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such ascomputer vision predictive analytics medical diagnosis and more [31] An NN is apowerful tool for data analysis that similar to other ML programmes performs theirtasks based on inference and patterns rather than explicitly set instructions Thecognitive capabilities of NNs have been used for regression analysis and classificationin both supervised and unsupervised learning The NN passes some inputs from aninput layer to one or more hidden layers and then to an output layer The sizes ofthe input and the output layer are dependent on the input and the output dataEach node in the input layer corresponds to the available input data x and eachnode in the output layer corresponds to the desired output y The nodes are oftenreferred to as neurons and while the neurons in the input and output layers alwaysrepresent the supplied data the neurons in the hidden layer may have very different

15

CHAPTER 2 FRAME OF REFERENCE

properties The result of this is a range of different hidden layers with varyingcharacteristics The use and configurations of these hidden layers in turn dependon what the NN should be able to achieve

232 The PerceptronThe simplest neuron is the perceptron The perceptron takes several binary inputsfrom the input layer to create a single binary output What a perceptron outputs isbased on the weighted sum of the perceptrons inputs and respective weight as wellas individual bias There is a corresponding weight to every input that determineshow important the input is for the output of the perceptron Meanwhile the biasdictates how easy it is for the perceptron to output either a 0 or a 1 These conditionsgive the rule of the perceptron as [32]

output =

0 if w middot x+ b le 01 if w middot x+ b gt 0

(220)

In the above equation x is the input vector w the weight vector and b is theperceptronrsquos individual bias

233 Activation functionsThe output of a neuron is determined by an activation function For the perceptronthe activation function is a simple step function as can be seen in Equation 220The step function is the simplest of activation function and is only able to producestrictly binary results The activation function is used for classification of linearlyseparable data in single-layer perceptrons like the one in Equation 220 Its binarynature means that a small change in weight or bias can flip the output of the neuroncompletely resulting in false classification Furthermore as most networks consistof either multiple perceptrons in a layer or multiple layers the data will not belinearly separable thus the step function will not properly separate and classifythe input data The separation problem is solved by training using backpropagationwhich requires a differentiable activation function something that the step functionis also unable of fulfilling

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function Thesigmoid function can take any value between 0 an 1 and is determined by

f(z) = σ(z) = 11 + eminusz

(221)

for

z =sum

j

wj middot xj + b (222)

16

23 NEURAL NETWORKS

Only by using the sigmoid function as activation function outputs can be properlyand accurately estimated for classification of probabilities in deep neural nets [33]however the sigmoid function is not flawless and an issue that arises with the usageof it as an activation function is the vanishing gradient problem that is furtherdiscussed in section 234

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x+ = max(0 x) (223)

for a neuron input x In comparison to the earlier activation functions a unitutilising the rectified activation function (also known as a rectified linear unit or aReLU unit) is more computationally efficient However because of the structure ofthe ReLU unit it cannot process inputs that are either negative or that approachzero also known as the dying ReLU problem [34]

Swish Function

Proposed in Ramachandran et al [35] is a replacement for ReLU called the Swishfunction The Swish function activates a neuron through

f(x) = x middot sigmoid(βx) (224)

where β is a trainable parameter or simply a constant Swish has proved to improvethe classification accuracy on widely used datasets such as ImageNet and MobileNASNet-A by 09 and 06 respectively [35] However results have also shownthat the Swish in comparison to other activation functions has a severely increasedtraining time and that it occasionally performs worse in terms of accuracy [36]

234 Neural Network ArchitecturesThe following section presents an overview of various existing types of neural net-works what they are primarily used for and their individual differences

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer The hiddenlayer is fully connected eg they connect every neuron in one layer (the input layer)to every neuron in another layer (the output layer) Two varieties of SNNs existthe feed forward neural network (FF) and the radial basis network (RBF) An FFuses the sigmoid function as its activation function while the RBF uses the radialbasis function which estimates how far off the estimation is from the actual targetTo further distinguish the two FFs are mainly used for classification and decisionmaking whereas RBFs are typically used for function approximation and machine

17

CHAPTER 2 FRAME OF REFERENCE

tuning [37 38] As for data preparation an SNN input layer is restricted in strictlyprocessing 2D-arrays as input data

Deep Neural Networks (DNN)

DNNs contains more hidden layers in comparison to an SNN Explained in Goodfel-low et al [39] the multiple layers of the network can be thought of as approximatinga main function using multiple functions

f(x) = f (1) + f (2) + + f (n) (225)

where each function represents a layer and they all together describe a processThe purpose of adding additional layers is to break up the main function in manyfunctions so that certain functions do not have to be all descriptive but insteadonly take into consideration certain behaviour The proposed strategy by addingup towards thousands of layers has proved to continuously improve performanceof DNNs as presented by Zagoruyko et al [40] However adding layers does notnecessarily have to mean that the obtained performance is the best in terms of ac-curacy and efficiency as the same paper shows better results for many benchmarkdatasets using only 16 layers The same contradiction is explored and validated byBa et al [41] The reason for these contradictory results is unclear but the resultssuggests that the strength of deep learning (and DNNs) may be because of wellmatched architectures and existing training procedures of the deep networks

Mentioned in section 231 is that configurations of the hidden layers depend onthe objective of the NN The statement is partially proven true for the differencesin utilization of the two SNNs presented in section 234 and is further completedby a number of NN-configurations

Recurring Neural Networks(RNN)

Unlike the neurons in SNNs and DNNs an RNN has neurons which are state-basedThese state-based neurons allow the neurons to feed information from the previouspass of data to themselves The keeping of previous informations allows the networkto process and evaluate the context of how the information sent through the networkis structured An example is the ability to differentiate the two vectors

x1 =[0 0 1 1 0 0 0

]x2 =

[0 0 0 1 1 0 0

]where x1 and x2 are structurally different Were the sequence of the elements inthe vectors to be ignored there would be no difference about them Looking atcontext and sequencing thereby enables the analysis of events and data over timeThe addition of state-based neurons leads to more weights being updated througheach iteration as the adjustment of each state-based neuron individually is also

18

23 NEURAL NETWORKS

weight dependent The updating of the weights depends on the activation functionand an ill chosen activation function could lead to the vanishing gradient problem[42] which occurs when the gradient becomes incredibly small thus preventing theneuron from updating its weights The result of this is a rapid loss in information asthe weights through time become saturated causing previous state information tobe of no informatory value Similar to the vanishing gradient problem there is alsothe exploding gradient problem where the gradient instead becomes so incrediblyhuge that the weights are impossible to adjust

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishingexploding gradient problem LSTM networkswere developed with a dedicated memory cell The memory cell has three gatesinput (it) output (ot) and forget (ft) Inclusion of these gates allows for safeguard-ing the information that passes through the network either by stopping or allowingthe flow of information through the cell The gates can be represented by Equation226 with activation function (σ) weights for each respective gates neurons (wx)previous LSTM-block output at previous time step (htminus1) input at current timestep (xt) and respective gate bias (bx) as

it = σ(ωi

[htminus1 xt

]+ bi)

ot = σ(ωo

[htminus1 xt

]+ bo)

ft = σ(ωf

[htminus1 xt

]+ bf )

(226)

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convo-lution operation The convolution operation employs a kernel which is a functionthat acts as a filter The kernel slides across the input data while continuouslyapplying the filter to the data effectively reducing the input data The input datathat have been processed by a convolution layer are known as a convolved feature

Figure 24 A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer The pooling layer issimilar to the convolution layer in that it further reduces the size of the data byreducing the spatial size of the convolved feature In doing so the computationalpower required to process the data can be reduced as the dimensionality of the datadecreases Pooling can be done in two ways max pooling or average pooling If amax pooling-layer is applied the maximum value that is contained within the kernelwill be the returned value whereas for an average pooling-layer the average value ofall values within the kernel would be returned The nature of the max pooling allowsit to act as noise suppressant as it ignores the noisy activations by only extractingthe maximum value as well as removes the noise by reducing the dimensionality ofthe data For the average pooling the noise remains within the data however itreduces somewhat through the dimensionality reduction The removal of noise inthe data results in the max pooling to perform a lot better than average pooling[45]


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle

During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
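The labelling script itself is not included in the thesis; the following is only a minimal sketch of how such rule-based labelling could look, where the window length and slope thresholds are hypothetical placeholders rather than the values actually used:

import numpy as np

def label_clogging(dp, flow, window=12, slope_lo=0.001, slope_hi=0.01):
    """Assign a clogging label per sample from the trend of the differential
    pressure (dp) and the system flow. All thresholds are illustrative."""
    labels = np.ones(len(dp), dtype=int)          # default: label 1, no clogging
    start_dp = dp[:window].mean()
    start_flow = flow[:window].mean()
    for i in range(window, len(dp)):
        # least-squares slope of dp over the last `window` samples
        slope = np.polyfit(np.arange(window), dp[i - window:i], 1)[0]
        if slope > slope_hi and flow[i] < 0.9 * start_flow:
            labels[i] = 3                         # rapid dp increase, receding flow
        elif slope > slope_lo and dp[i] > start_dp:
            labels[i] = 2                         # dp steadily increasing
    return labels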


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels together with the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

[red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted impartially, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
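A minimal sketch of the label transform, here using scikit-learn's OneHotEncoder (the variable names and toy labels are illustrative, not taken from the thesis code):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [2], [1], [2]])          # clogging labels as a column vector

encoder = OneHotEncoder()                             # one binary column per label value
onehot = encoder.fit_transform(labels).toarray()

print(onehot)                                         # [[1. 0.], [0. 1.], ...]
print(encoder.inverse_transform(onehot))              # recover the original labels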

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x)) (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
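A corresponding sketch of the scaler transform with scikit-learn's MinMaxScaler; the sensor values below are made up purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.12, 250.0],              # e.g. differential pressure, system flow
              [0.15, 260.0],
              [0.40, 180.0]])

scaler = MinMaxScaler()                   # scales every feature to [0, 1]
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)   # easily reverted after prediction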

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)] (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)] (3.3)
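A minimal NumPy sketch of such a sequencing function, assuming the pre-processed data are arranged as one row per 5-second sample and one column per variable (function and variable names are illustrative):

import numpy as np

def sequence(data, n_past=5):
    """Split a (samples, features) array into windows of n_past rows (X)
    and the row one step ahead (y)."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])
        y.append(data[i])
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)          # placeholder for the labelled sensor data
X, y = sequence(data, n_past=5)
print(X.shape, y.shape)                # (95, 5, 5) (95, 5)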


When sequenced, the data are split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in this way ensures that the network is not overfitted to the training data.
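A minimal Keras sketch consistent with the architecture and training scheme described above; the optimizer, the use of restore_best_weights and the placeholder data are assumptions not stated in the text:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder data shaped (samples, time steps, features)
X_train, y_train = np.random.rand(200, 5, 5), np.random.rand(200, 1)
X_val, y_val = np.random.rand(50, 5, 5), np.random.rand(50, 1)

model = Sequential([
    LSTM(32, activation='relu', return_sequences=True, input_shape=(5, 5)),
    LSTM(32, activation='relu'),
    Dense(1, activation='sigmoid'),           # single neuron for parameter prediction
])
model.compile(optimizer='adam', loss='mae')   # MAE or MSE, both evaluated in Chapter 4

early_stop = EarlyStopping(monitor='val_loss', patience=150, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)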

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80 % and 20 %, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
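A minimal Keras sketch matching the CNN description above (64 filters, kernel size 4, pool size 2, a flattening layer, a 50-node dense layer and 6 outputs); the activation functions, optimizer and placeholder data are assumptions:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder data: 12 past observations of 5 variables -> 6 future values
X_train, y_train = np.random.rand(200, 12, 5), np.random.rand(200, 6)
X_val, y_val = np.random.rand(50, 12, 5), np.random.rand(50, 6)

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu', input_shape=(12, 5)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dense(6),                              # one output per predicted future observation
])
model.compile(optimizer='adam', loss='mae')

early_stop = EarlyStopping(monitor='val_loss', patience=150, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)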

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80 % and


20 %, respectively. The testing set was split into the same fractions, but only the fraction of 20 % was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
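For reference, with y_i the true value, ŷ_i the predicted value and n the number of samples, the two regression losses follow their standard definitions:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|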

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
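The binary cross-entropy itself follows the standard definition, with y_i ∈ {0, 1} the true label and p_i the predicted probability of the positive class:

log loss = −(1/n) Σ_{i=1}^{n} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]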


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM

Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function

Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM

Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5 %      0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                        Prediction
                        Label 1    Label 2
Actual    Label 1       109        1
          Label 2       3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN

Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function

Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4 %      0.826    0.907    3.01
MSE                   1195           93.3 %      0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                        Prediction
                        Label 1    Label 2
Actual    Label 1       82         29
          Label 2       38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                        Prediction
                        Label 1    Label 2
Actual    Label 1       69         41
          Label 2       11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed because of the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, for an LSTM and a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 175 60806 and 2408, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification


models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Page 21: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

23 NEURAL NETWORKS

value depends on current and past values The difference between the two is thatwhile the AR model intends to model and predict the observed variable the MAmodel intends to model the error term as a linear combination of the error termsthat occur simultaneously and at different past times

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA)and can be fitted to time series data in order to obtain a better understanding ofthe data or be used as forecasting methods to predict future data points WhileARMA requires the data to be completely stationary ie the mean and variance donot change over time ARIMA can process non-stationary time series by removingthe non-stationary nature of the data This means that non-stationary time seriesdata must be processed before it can be modelled Removing the trend makes themean value of the data stationary something that is done by simply differencingthe series To make the series stationary on variance one of the best methods is toapply a log transform to the series Combining the two methods by differencing thelog transformed data makes the entire series stationary on both mean and varianceand allows for the dataset to be processed by the model

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA)model The SARIMA model applies a seasonal differencing of necessary order toremove non-stationarity from the time series ARIMAs and SARIMAs strengthis particularly identified as its ability to predict future data points for univariatetime series In a comparison published by Adhikari et al [30] a SARIMA model isseen to outperform both neural networks and support-vector machines in forecastestimation

23 Neural Networks

231 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such ascomputer vision predictive analytics medical diagnosis and more [31] An NN is apowerful tool for data analysis that similar to other ML programmes performs theirtasks based on inference and patterns rather than explicitly set instructions Thecognitive capabilities of NNs have been used for regression analysis and classificationin both supervised and unsupervised learning The NN passes some inputs from aninput layer to one or more hidden layers and then to an output layer The sizes ofthe input and the output layer are dependent on the input and the output dataEach node in the input layer corresponds to the available input data x and eachnode in the output layer corresponds to the desired output y The nodes are oftenreferred to as neurons and while the neurons in the input and output layers alwaysrepresent the supplied data the neurons in the hidden layer may have very different

15

CHAPTER 2 FRAME OF REFERENCE

properties The result of this is a range of different hidden layers with varyingcharacteristics The use and configurations of these hidden layers in turn dependon what the NN should be able to achieve

232 The PerceptronThe simplest neuron is the perceptron The perceptron takes several binary inputsfrom the input layer to create a single binary output What a perceptron outputs isbased on the weighted sum of the perceptrons inputs and respective weight as wellas individual bias There is a corresponding weight to every input that determineshow important the input is for the output of the perceptron Meanwhile the biasdictates how easy it is for the perceptron to output either a 0 or a 1 These conditionsgive the rule of the perceptron as [32]

output =

0 if w middot x+ b le 01 if w middot x+ b gt 0

(220)

In the above equation x is the input vector w the weight vector and b is theperceptronrsquos individual bias

233 Activation functionsThe output of a neuron is determined by an activation function For the perceptronthe activation function is a simple step function as can be seen in Equation 220The step function is the simplest of activation function and is only able to producestrictly binary results The activation function is used for classification of linearlyseparable data in single-layer perceptrons like the one in Equation 220 Its binarynature means that a small change in weight or bias can flip the output of the neuroncompletely resulting in false classification Furthermore as most networks consistof either multiple perceptrons in a layer or multiple layers the data will not belinearly separable thus the step function will not properly separate and classifythe input data The separation problem is solved by training using backpropagationwhich requires a differentiable activation function something that the step functionis also unable of fulfilling

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function Thesigmoid function can take any value between 0 an 1 and is determined by

f(z) = σ(z) = 11 + eminusz

(221)

for

z =sum

j

wj middot xj + b (222)

16

23 NEURAL NETWORKS

Only by using the sigmoid function as activation function outputs can be properlyand accurately estimated for classification of probabilities in deep neural nets [33]however the sigmoid function is not flawless and an issue that arises with the usageof it as an activation function is the vanishing gradient problem that is furtherdiscussed in section 234

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x+ = max(0 x) (223)

for a neuron input x In comparison to the earlier activation functions a unitutilising the rectified activation function (also known as a rectified linear unit or aReLU unit) is more computationally efficient However because of the structure ofthe ReLU unit it cannot process inputs that are either negative or that approachzero also known as the dying ReLU problem [34]

Swish Function

Proposed in Ramachandran et al [35] is a replacement for ReLU called the Swishfunction The Swish function activates a neuron through

f(x) = x middot sigmoid(βx) (224)

where β is a trainable parameter or simply a constant Swish has proved to improvethe classification accuracy on widely used datasets such as ImageNet and MobileNASNet-A by 09 and 06 respectively [35] However results have also shownthat the Swish in comparison to other activation functions has a severely increasedtraining time and that it occasionally performs worse in terms of accuracy [36]

234 Neural Network ArchitecturesThe following section presents an overview of various existing types of neural net-works what they are primarily used for and their individual differences

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer The hiddenlayer is fully connected eg they connect every neuron in one layer (the input layer)to every neuron in another layer (the output layer) Two varieties of SNNs existthe feed forward neural network (FF) and the radial basis network (RBF) An FFuses the sigmoid function as its activation function while the RBF uses the radialbasis function which estimates how far off the estimation is from the actual targetTo further distinguish the two FFs are mainly used for classification and decisionmaking whereas RBFs are typically used for function approximation and machine

17

CHAPTER 2 FRAME OF REFERENCE

tuning [37 38] As for data preparation an SNN input layer is restricted in strictlyprocessing 2D-arrays as input data

Deep Neural Networks (DNN)

DNNs contains more hidden layers in comparison to an SNN Explained in Goodfel-low et al [39] the multiple layers of the network can be thought of as approximatinga main function using multiple functions

f(x) = f (1) + f (2) + + f (n) (225)

where each function represents a layer and they all together describe a processThe purpose of adding additional layers is to break up the main function in manyfunctions so that certain functions do not have to be all descriptive but insteadonly take into consideration certain behaviour The proposed strategy by addingup towards thousands of layers has proved to continuously improve performanceof DNNs as presented by Zagoruyko et al [40] However adding layers does notnecessarily have to mean that the obtained performance is the best in terms of ac-curacy and efficiency as the same paper shows better results for many benchmarkdatasets using only 16 layers The same contradiction is explored and validated byBa et al [41] The reason for these contradictory results is unclear but the resultssuggests that the strength of deep learning (and DNNs) may be because of wellmatched architectures and existing training procedures of the deep networks

Mentioned in section 231 is that configurations of the hidden layers depend onthe objective of the NN The statement is partially proven true for the differencesin utilization of the two SNNs presented in section 234 and is further completedby a number of NN-configurations

Recurring Neural Networks(RNN)

Unlike the neurons in SNNs and DNNs an RNN has neurons which are state-basedThese state-based neurons allow the neurons to feed information from the previouspass of data to themselves The keeping of previous informations allows the networkto process and evaluate the context of how the information sent through the networkis structured An example is the ability to differentiate the two vectors

x1 =[0 0 1 1 0 0 0

]x2 =

[0 0 0 1 1 0 0

]where x1 and x2 are structurally different Were the sequence of the elements inthe vectors to be ignored there would be no difference about them Looking atcontext and sequencing thereby enables the analysis of events and data over timeThe addition of state-based neurons leads to more weights being updated througheach iteration as the adjustment of each state-based neuron individually is also

18

23 NEURAL NETWORKS

weight dependent The updating of the weights depends on the activation functionand an ill chosen activation function could lead to the vanishing gradient problem[42] which occurs when the gradient becomes incredibly small thus preventing theneuron from updating its weights The result of this is a rapid loss in information asthe weights through time become saturated causing previous state information tobe of no informatory value Similar to the vanishing gradient problem there is alsothe exploding gradient problem where the gradient instead becomes so incrediblyhuge that the weights are impossible to adjust

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishingexploding gradient problem LSTM networkswere developed with a dedicated memory cell The memory cell has three gatesinput (it) output (ot) and forget (ft) Inclusion of these gates allows for safeguard-ing the information that passes through the network either by stopping or allowingthe flow of information through the cell The gates can be represented by Equation226 with activation function (σ) weights for each respective gates neurons (wx)previous LSTM-block output at previous time step (htminus1) input at current timestep (xt) and respective gate bias (bx) as

it = σ(ωi

[htminus1 xt

]+ bi)

ot = σ(ωo

[htminus1 xt

]+ bo)

ft = σ(ωf

[htminus1 xt

]+ bf )

(226)

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer, the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
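To make the kernel sliding, pooling and flattening of Figures 2.4–2.6 concrete, the toy NumPy sketch below applies a 1-dimensional convolution, max pooling and flattening to a short signal. The kernel weights and input values are arbitrary illustration choices, not values from the thesis.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a 1-D kernel over x (valid padding), as in Figure 2.4."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Return the maximum of each non-overlapping window, as in Figure 2.5."""
    n = len(x) // pool_size
    return np.array([x[i * pool_size:(i + 1) * pool_size].max() for i in range(n)])

signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0])
convolved = conv1d(signal, kernel=np.array([0.5, 1.0, 0.5]))  # convolved feature
pooled = max_pool1d(convolved, pool_size=2)                   # reduced feature map
flattened = pooled.flatten()                                  # input to the dense layers
```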


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered to be 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
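A sketch of how such a labelling script could look is given below, assuming the samples sit in a pandas DataFrame with columns 'dp' (differential pressure) and 'flow' (system flow rate); the column names, the rolling window and the slope threshold are hypothetical choices, not values taken from the thesis.

```python
import pandas as pd

def label_clogging(df, dp_slope_thresh=0.002):
    """Assign clogging labels 1 (no clogging) or 2 (beginning to clog)."""
    dp_start = df["dp"].iloc[0]
    # Smoothed differential-pressure trend over the last 12 samples (one minute).
    slope = df["dp"].diff().rolling(window=12).mean()
    labels = pd.Series(1, index=df.index)             # default: no clogging
    clogging = (slope > dp_slope_thresh) & (df["dp"] > dp_start)
    labels[clogging] = 2                               # steady increase past the start value
    return labels
```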


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them, and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

\[
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{or} \quad
\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted without preference, not assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
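A minimal sketch of the encoding for the two clogging labels present in the data is shown below; in practice a library encoder (for example scikit-learn's OneHotEncoder) would typically be used, but the hand-rolled version makes the binary representation explicit.

```python
import numpy as np

def one_hot(labels, classes):
    """Return a binary matrix with one column per class."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.asarray(classes)[None, :]).astype(float)

one_hot([1, 2, 2, 1], classes=[1, 2])
# array([[1., 0.],
#        [0., 1.],
#        [0., 1.],
#        [1., 0.]])
```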

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\[
x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)
\]

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
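A sketch of the scaling step with scikit-learn's MinMaxScaler, which implements Equation 3.1 per feature and provides the inverse transform mentioned above; the sensor values in the example matrix are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical samples: differential pressure, system flow, system pressure,
# backflush flow for three 5-second measurements.
X = np.array([[0.12, 250.0, 1.9, 0.0],
              [0.15, 248.0, 1.9, 3.2],
              [0.40, 230.0, 2.0, 3.1]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)                 # Equation 3.1 applied per feature
X_restored = scaler.inverse_transform(X_scaled)    # revert to the original units
```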

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

\[
X(t) = [V_1(t),\ V_2(t),\ \ldots,\ V_{n-1}(t),\ V_n(t)] \qquad (3.2)
\]

\[
X(t) = [V_1(t-5),\ V_2(t-5),\ \ldots,\ V_{n-1}(t),\ V_n(t)] \qquad (3.3)
\]
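A minimal sketch of such a sequencing function is shown below, assuming the pre-processed data sit in a NumPy array with one row per 5-second sample and one column per feature; the function and variable names are illustrative.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Turn a (samples, features) array into LSTM input/target pairs.

    Each input holds the previous n_past rows (a 25-second window at the
    5-second sampling rate); the target is the row one time step ahead.
    """
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # shape (n_past, n_features)
        y.append(data[i])              # the measurement 5 seconds ahead
    return np.array(X), np.array(y)
```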


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points
• Time steps – the points of observation of the samples
• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights towards a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
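A sketch of how this network could be assembled is given below, assuming a Keras-style API and the sequenced data from the SF; the dummy arrays, the Adam optimiser and the verbosity setting are assumptions, while the layer sizes, activations, epoch limit and patience follow the description above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy stand-ins for the sequenced sets (5 past samples, 4 sensor features);
# in practice these come from the sequencing function and the 80/20 split.
X_train, y_train = np.random.rand(800, 5, 4), np.random.rand(800, 1)
X_val, y_val = np.random.rand(200, 5, 4), np.random.rand(200, 1)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(5, 4)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron for the predicted parameter
])
model.compile(optimizer="adam", loss="mae")  # MAE and MSE were both evaluated

early_stop = EarlyStopping(monitor="val_loss", patience=150,
                           restore_best_weights=True)
model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
          callbacks=[early_stop], verbose=0)
```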

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the correct output for that input.
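A sketch of the SSF under the assumption that each parameter is predicted separately (suggested by the 6-node output layer described next); the array layout and names are illustrative.

```python
import numpy as np

def split_sequences(data, target_col, n_past=12, n_future=6):
    """Build multi-step samples: 12 past rows (60 s) in, 6 future values (30 s) out.

    data is a (samples, features) array; target_col selects the variable whose
    next 6 values are to be predicted.
    """
    X, y = [], []
    for i in range(n_past, len(data) - n_future + 1):
        X.append(data[i - n_past:i])                # (n_past, n_features)
        y.append(data[i:i + n_future, target_col])  # the next 6 values
    return np.array(X), np.array(y)
```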

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
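The corresponding sketch for the CNN, again assuming a Keras-style API; the ReLU activations, the Adam optimiser and the dummy data are assumptions, while the filter count, kernel size, pool size, layer widths, epoch limit and patience follow the description above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy stand-ins for the output of the sequence splitting function
# (12 past samples, 4 features in; 6 future values of one parameter out).
X_train, y_train = np.random.rand(800, 12, 4), np.random.rand(800, 6)
X_val, y_val = np.random.rand(200, 12, 4), np.random.rand(200, 6)

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, 4)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                                 # 6 future values of the parameter
])
model.compile(optimizer="adam", loss="mae")   # or "mse"

early_stop = EarlyStopping(monitor="val_loss", patience=150,
                           restore_best_weights=True)
model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
          callbacks=[early_stop], verbose=0)
```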

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
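A compact sketch of the classification variant, assuming the same convolutional feature extractor with a two-unit output for the one-hot clogging labels and the binary cross-entropy loss discussed in section 3.3; all hyperparameters not named in the text are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy inputs (12 past samples, 4 features) and one-hot clogging labels (1 or 2).
X_train = np.random.rand(800, 12, 4)
y_train = np.eye(2)[np.random.randint(0, 2, 800)]
X_val, y_val = np.random.rand(200, 12, 4), np.eye(2)[np.random.randint(0, 2, 200)]

clf = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, 4)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(2, activation="sigmoid"),       # one column per clogging label
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500,
        callbacks=[EarlyStopping(monitor="val_loss", patience=150)], verbose=0)
```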

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT), and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                 Prediction
                 Label 1   Label 2
Actual  Label 1  109       1
        Label 2  3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                 Prediction
                 Label 1   Label 2
Actual  Label 1  82        29
        Label 2  38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                 Prediction
                 Label 1   Label 2
Actual  Label 1  69        41
        Label 2  11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would thus prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10, and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed owing to the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target, then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

Page 22: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 2 FRAME OF REFERENCE

properties The result of this is a range of different hidden layers with varyingcharacteristics The use and configurations of these hidden layers in turn dependon what the NN should be able to achieve

232 The PerceptronThe simplest neuron is the perceptron The perceptron takes several binary inputsfrom the input layer to create a single binary output What a perceptron outputs isbased on the weighted sum of the perceptrons inputs and respective weight as wellas individual bias There is a corresponding weight to every input that determineshow important the input is for the output of the perceptron Meanwhile the biasdictates how easy it is for the perceptron to output either a 0 or a 1 These conditionsgive the rule of the perceptron as [32]

output =

0 if w middot x+ b le 01 if w middot x+ b gt 0

(220)

In the above equation x is the input vector w the weight vector and b is theperceptronrsquos individual bias

233 Activation functionsThe output of a neuron is determined by an activation function For the perceptronthe activation function is a simple step function as can be seen in Equation 220The step function is the simplest of activation function and is only able to producestrictly binary results The activation function is used for classification of linearlyseparable data in single-layer perceptrons like the one in Equation 220 Its binarynature means that a small change in weight or bias can flip the output of the neuroncompletely resulting in false classification Furthermore as most networks consistof either multiple perceptrons in a layer or multiple layers the data will not belinearly separable thus the step function will not properly separate and classifythe input data The separation problem is solved by training using backpropagationwhich requires a differentiable activation function something that the step functionis also unable of fulfilling

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function Thesigmoid function can take any value between 0 an 1 and is determined by

f(z) = σ(z) = 11 + eminusz

(221)

for

z =sum

j

wj middot xj + b (222)

16

23 NEURAL NETWORKS

Only by using the sigmoid function as activation function outputs can be properlyand accurately estimated for classification of probabilities in deep neural nets [33]however the sigmoid function is not flawless and an issue that arises with the usageof it as an activation function is the vanishing gradient problem that is furtherdiscussed in section 234

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x+ = max(0 x) (223)

for a neuron input x In comparison to the earlier activation functions a unitutilising the rectified activation function (also known as a rectified linear unit or aReLU unit) is more computationally efficient However because of the structure ofthe ReLU unit it cannot process inputs that are either negative or that approachzero also known as the dying ReLU problem [34]

Swish Function

Proposed in Ramachandran et al [35] is a replacement for ReLU called the Swishfunction The Swish function activates a neuron through

f(x) = x middot sigmoid(βx) (224)

where β is a trainable parameter or simply a constant Swish has proved to improvethe classification accuracy on widely used datasets such as ImageNet and MobileNASNet-A by 09 and 06 respectively [35] However results have also shownthat the Swish in comparison to other activation functions has a severely increasedtraining time and that it occasionally performs worse in terms of accuracy [36]

234 Neural Network ArchitecturesThe following section presents an overview of various existing types of neural net-works what they are primarily used for and their individual differences

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer The hiddenlayer is fully connected eg they connect every neuron in one layer (the input layer)to every neuron in another layer (the output layer) Two varieties of SNNs existthe feed forward neural network (FF) and the radial basis network (RBF) An FFuses the sigmoid function as its activation function while the RBF uses the radialbasis function which estimates how far off the estimation is from the actual targetTo further distinguish the two FFs are mainly used for classification and decisionmaking whereas RBFs are typically used for function approximation and machine

17

CHAPTER 2 FRAME OF REFERENCE

tuning [37 38] As for data preparation an SNN input layer is restricted in strictlyprocessing 2D-arrays as input data

Deep Neural Networks (DNN)

DNNs contains more hidden layers in comparison to an SNN Explained in Goodfel-low et al [39] the multiple layers of the network can be thought of as approximatinga main function using multiple functions

f(x) = f (1) + f (2) + + f (n) (225)

where each function represents a layer and they all together describe a processThe purpose of adding additional layers is to break up the main function in manyfunctions so that certain functions do not have to be all descriptive but insteadonly take into consideration certain behaviour The proposed strategy by addingup towards thousands of layers has proved to continuously improve performanceof DNNs as presented by Zagoruyko et al [40] However adding layers does notnecessarily have to mean that the obtained performance is the best in terms of ac-curacy and efficiency as the same paper shows better results for many benchmarkdatasets using only 16 layers The same contradiction is explored and validated byBa et al [41] The reason for these contradictory results is unclear but the resultssuggests that the strength of deep learning (and DNNs) may be because of wellmatched architectures and existing training procedures of the deep networks

Mentioned in section 231 is that configurations of the hidden layers depend onthe objective of the NN The statement is partially proven true for the differencesin utilization of the two SNNs presented in section 234 and is further completedby a number of NN-configurations

Recurring Neural Networks(RNN)

Unlike the neurons in SNNs and DNNs an RNN has neurons which are state-basedThese state-based neurons allow the neurons to feed information from the previouspass of data to themselves The keeping of previous informations allows the networkto process and evaluate the context of how the information sent through the networkis structured An example is the ability to differentiate the two vectors

x1 =[0 0 1 1 0 0 0

]x2 =

[0 0 0 1 1 0 0

]where x1 and x2 are structurally different Were the sequence of the elements inthe vectors to be ignored there would be no difference about them Looking atcontext and sequencing thereby enables the analysis of events and data over timeThe addition of state-based neurons leads to more weights being updated througheach iteration as the adjustment of each state-based neuron individually is also

18

23 NEURAL NETWORKS

weight dependent The updating of the weights depends on the activation functionand an ill chosen activation function could lead to the vanishing gradient problem[42] which occurs when the gradient becomes incredibly small thus preventing theneuron from updating its weights The result of this is a rapid loss in information asthe weights through time become saturated causing previous state information tobe of no informatory value Similar to the vanishing gradient problem there is alsothe exploding gradient problem where the gradient instead becomes so incrediblyhuge that the weights are impossible to adjust

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishingexploding gradient problem LSTM networkswere developed with a dedicated memory cell The memory cell has three gatesinput (it) output (ot) and forget (ft) Inclusion of these gates allows for safeguard-ing the information that passes through the network either by stopping or allowingthe flow of information through the cell The gates can be represented by Equation226 with activation function (σ) weights for each respective gates neurons (wx)previous LSTM-block output at previous time step (htminus1) input at current timestep (xt) and respective gate bias (bx) as

it = σ(ωi

[htminus1 xt

]+ bi)

ot = σ(ωo

[htminus1 xt

]+ bo)

ft = σ(ωf

[htminus1 xt

]+ bf )

(226)

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convo-lution operation The convolution operation employs a kernel which is a functionthat acts as a filter The kernel slides across the input data while continuouslyapplying the filter to the data effectively reducing the input data The input datathat have been processed by a convolution layer are known as a convolved feature

Figure 24 A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer The pooling layer issimilar to the convolution layer in that it further reduces the size of the data byreducing the spatial size of the convolved feature In doing so the computationalpower required to process the data can be reduced as the dimensionality of the datadecreases Pooling can be done in two ways max pooling or average pooling If amax pooling-layer is applied the maximum value that is contained within the kernelwill be the returned value whereas for an average pooling-layer the average value ofall values within the kernel would be returned The nature of the max pooling allowsit to act as noise suppressant as it ignores the noisy activations by only extractingthe maximum value as well as removes the noise by reducing the dimensionality ofthe data For the average pooling the noise remains within the data however itreduces somewhat through the dimensionality reduction The removal of noise inthe data results in the max pooling to perform a lot better than average pooling[45]


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
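
A small sketch of the two pooling variants applied to a convolved feature, assuming a pool size of 2 with non-overlapping windows as in Figure 2.5 (toy values):

import numpy as np

convolved = np.array([2.25, 3.0, 4.0, 3.5, 1.0, 0.5])
pool = 2

windows = convolved.reshape(-1, pool)   # non-overlapping windows of size 2
max_pooled = windows.max(axis=1)        # [3.0, 4.0, 1.0]: keeps the strongest activation
avg_pooled = windows.mean(axis=1)       # [2.625, 3.75, 0.75]: averages, so noise remains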

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
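
The labelling script itself is not reproduced in the thesis; a hedged pandas sketch of the rule described above could look as follows. The column names and the thresholds dp_slope_lim and flow_drop_lim are invented placeholders that would have to be tuned against the visual inspection.

import pandas as pd

def label_clogging(df, dp_slope_lim=0.01, flow_drop_lim=0.05):
    """Assign clogging labels 1-3 from differential pressure and system flow."""
    dp_slope = df["diff_pressure"].diff().rolling(12).mean()   # pressure trend over ~1 min
    flow_drop = -df["system_flow"].diff().rolling(12).mean()   # positive when flow recedes

    labels = pd.Series(1, index=df.index)                      # 1: no clogging
    labels[dp_slope > dp_slope_lim] = 2                        # 2: steady pressure increase
    labels[(dp_slope > 10 * dp_slope_lim) &
           (flow_drop > flow_drop_lim)] = 3                    # 3: rapid clogging
    return labels

# df = pd.read_csv("test_cycle.csv"); df["clog_label"] = label_clogging(df)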


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

[red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritise a certain category. Seger [49] has shown the precision of one hot encoding to be equal to that of other, equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
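
A minimal sketch of the label transform, here using scikit-learn's OneHotEncoder; the thesis does not state which library was used, so this is only one possible realisation:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [2], [1], [2]])         # clogging labels as a column vector
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()     # label 1 -> [1, 0], label 2 -> [0, 1]
original = encoder.inverse_transform(onehot)         # the transform is reversible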

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

(xi − min(x)) / (max(x) − min(x))    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
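
A corresponding sketch of the scaler transform in Equation 3.1 using scikit-learn's MinMaxScaler (again one possible realisation, with invented example values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.20, 250.0],
              [0.35, 240.0],
              [0.90, 180.0]])                 # e.g. differential pressure, system flow
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)            # every feature now lies in [0, 1]
X_back = scaler.inverse_transform(X_scaled)   # easy to revert after processing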

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V1(t), V2(t), ..., Vn−1(t), Vn(t)]    (3.2)

X(t) = [V1(t−5), V2(t−5), ..., Vn−1(t), Vn(t)]    (3.3)
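
A sketch of such a sequencing function, assuming a NumPy array where each row is one 5-second sample and each column one variable; the window of 5 past samples (25 s) and the single-step target follow the description above, but the function and variable names are illustrative:

import numpy as np

def make_sequences(data, n_past=5):
    """Turn a multivariate series into (samples, time steps, features) plus targets."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t, :])   # the previous 5 measurements of all variables
        y.append(data[t, :])              # the measurement one step (5 s) ahead
    return np.array(X), np.array(y)

# data.shape == (n_samples, n_features); X.shape == (n_samples - 5, 5, n_features)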


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
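
A hedged Keras sketch matching the description above (two LSTM layers with 32 neurons and ReLU, a single sigmoid output neuron, and early stopping with a patience of 150 epochs). Hyperparameters not stated in the thesis, such as the optimiser, the batch size and the number of input features, are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 10        # 5 past samples; the feature count is an assumption

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),            # single output neuron for the predicted parameter
])
model.compile(optimizer="adam", loss="mae")    # MAE or MSE, as compared in Chapter 4

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])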

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just as for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
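
A corresponding Keras sketch of the CNN described above and in Figure 3.6 (64 filters, kernel size 4, pool size 2, a 50-node dense layer and 6 outputs). As before, the optimiser, layer activations and other unstated hyperparameters are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps_in, n_steps_out, n_features = 12, 6, 10   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),                           # one output per future time step
])
model.compile(optimizer="adam", loss="mae")

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])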

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks as classifiers than they are for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict a future clogging label than when training the networks to predict future values of system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
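
For reference, the two regression loss functions are used in their standard forms, with yi the true value, ŷi the prediction and n the number of samples:

MSE = (1/n) Σ (yi − ŷi)^2
MAE = (1/n) Σ |yi − ŷi|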

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
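
The binary cross-entropy minimised here is, in its standard form, with yi the true label encoded as 0/1 and pi the predicted probability:

BCE = −(1/n) Σ [ yi log(pi) + (1 − yi) log(1 − pi) ]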


Figure 3.7: Overview of how identical values can belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                        Prediction
                        Label 1    Label 2
Actual     Label 1      109        1
           Label 2      3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                        Prediction
                        Label 1    Label 2
Actual     Label 1      82         29
           Label 2      38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                        Prediction
                        Label 1    Label 2
Actual     Label 1      69         41
           Label 2      11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA: TRITA-ITM-EX 2019:606

www.kth.se

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convo-lution operation The convolution operation employs a kernel which is a functionthat acts as a filter The kernel slides across the input data while continuouslyapplying the filter to the data effectively reducing the input data The input datathat have been processed by a convolution layer are known as a convolved feature

Figure 24 A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer The pooling layer issimilar to the convolution layer in that it further reduces the size of the data byreducing the spatial size of the convolved feature In doing so the computationalpower required to process the data can be reduced as the dimensionality of the datadecreases Pooling can be done in two ways max pooling or average pooling If amax pooling-layer is applied the maximum value that is contained within the kernelwill be the returned value whereas for an average pooling-layer the average value ofall values within the kernel would be returned The nature of the max pooling allowsit to act as noise suppressant as it ignores the noisy activations by only extractingthe maximum value as well as removes the noise by reducing the dimensionality ofthe data For the average pooling the noise remains within the data however itreduces somewhat through the dimensionality reduction The removal of noise inthe data results in the max pooling to perform a lot better than average pooling[45]

20

23 NEURAL NETWORKS

Figure 25 A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and pooling layercan be done repeatedly to reduce the dimensionality of the data even more Lastlyhowever before feeding the data into the neurons of the neural network the pooledfeature map is flattened in the flattening layer This results in a significantly reducedinput array in comparison to the original data and is the primary reason that CNNsare used for multiple purposes such as image recognition and classification [46]natural language processing [47] and time series analysis [48]

Figure 26 A flattening layer flattening the feature map

21

Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimen-tal platform was set-up

31 Data Gathering and ProcessingThe filtration data were obtained from filter tests done by Alfa Laval at the lakeMalmasjon over a span of 2 weeks A total of 11 test cycles were recorded Datawere gathered for the duration of a complete test cycle of the filter which lastedfor a runtime of at least 40 minutes Each data point was sampled every 5 secondsand contains sensor data for the differential pressure over the filter the fluid flowin the entire system the fluid pressure in the entire system and the fluid flow inthe backflush mechanism All data were then stored in Alfa Lavals cloud serviceConnectivity

Figure 31 A complete test cycle

23

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

During the filter testing and the gathering of the data the backflush was manuallyinterrupted after a runtime of circa 20 minutes causing an increase in both differ-ential pressure and system fluid flow as can be seen in Figure 31 This was doneprimarily to see how long the filter would cope with the dirtiness of the water duringoperation without its self-cleaning capabilities After discussions with Alfa Lavalabout how to label such an external interference it was decided to remove sectionsof the data containing the stopping of the backflush in order to get complete testcycles At this point the data are unlabelled in terms of clogging and the clogginglabelling is done by visual inspection of the differential pressure and the systemfluid flow

Figure 32 A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance to the threeclogging states discussed in section 214 and verified by visual inspection Labellingthe current clogging state as 1 implies that the differential pressure remains linearor that it is yet to pass its initial starting value The clogging label is changed to 2when the differential pressure begins to increase steadily and the system flow eitherremains constant or experiences minor receding effects The label is considered as3 when the change in differential pressure experiences exponential increase and thesystem flow is decreasing drastically No tests were conducted where a clogginglabel of 3 was identified


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples   Points labelled clog-1   Points labelled clog-2
I          685            685                         0
II         220             25                       195
III        340             35                       305
IV         210             11                       199
V          375             32                       343
VI         355              7                       348
VII        360             78                       282
VIII       345             19                       326
IX         350             10                       340
X          335             67                       268
XI         340             43                       297

Total     3915           1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN was used for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables.


The encoding can be done for both integers and tags, such as

\[
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{or} \quad
\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted without assuming that one category is more important than another, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
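The thesis does not state which library performed the encoding; as one possible sketch (the use of scikit-learn is an assumption), OneHotEncoder produces exactly this kind of binary representation:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

labels = np.array([[1], [2], [2], [1]])           # clogging labels as a column vector
encoder = OneHotEncoder()                         # default output is a sparse matrix
onehot = encoder.fit_transform(labels).toarray()  # [[1. 0.] [0. 1.] [0. 1.] [1. 0.]]
original = encoder.inverse_transform(onehot)      # recovers the original label column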

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\[ x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1} \]

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
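A corresponding sketch with scikit-learn's MinMaxScaler (again an assumed library choice, with invented sample values) illustrates both the forward transform of Equation 3.1 and its inversion:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[0.10, 250.0], [0.25, 240.0], [0.80, 180.0]])  # e.g. differential pressure and system flow
scaler = MinMaxScaler()                          # maps every feature to the range [0, 1], as in Equation 3.1
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)  # reverts to the original sensor values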

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that, by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

\[ X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2} \]

\[ X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3} \]
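The sequencing function itself is not listed in the thesis; a generic sliding-window sketch that produces the behaviour described above (function and variable names are assumptions) could look as follows:

import numpy as np

def make_sequences(data, n_in, n_out=1):
    # Slide a window over the time series: n_in past rows as input, n_out future rows as output.
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i : i + n_in])                  # past observations, shape (n_in, n_features)
        y.append(data[i + n_in : i + n_in + n_out])   # the value(s) to be predicted
    return np.array(X), np.array(y)

# LSTM case described above: 5 past samples (25 s) predicting the next sample (5 s ahead)
# X_seq, y_seq = make_sequences(scaled_data, n_in=5, n_out=1)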


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM; a short reshaping sketch is given after the list below. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step
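A minimal sketch of this reshaping is given below (array names are assumptions; if the sequencing step already returns three-dimensional windows, the reshape amounts to a shape check):

n_steps = 5                                             # past samples per window (25 s)
n_features = X_train.size // (len(X_train) * n_steps)   # observed variables per time step
X_train = X_train.reshape((len(X_train), n_steps, n_features))
X_val = X_val.reshape((len(X_val), n_steps, n_features))
print(X_train.shape)                                    # (samples, time steps, features)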

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
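The implementation code is not included in the thesis; a Keras sketch consistent with the description above (the framework, the optimiser and the variable names are assumptions) could be:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True, input_shape=(5, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # one output neuron for the predicted parameter
])
model.compile(optimizer="adam", loss="mae")      # MAE and MSE were both evaluated as loss functions

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val), callbacks=[early_stop])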

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction and the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding true output data.
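Reusing the sliding-window sketch from section 3.2.1 (still an illustration rather than the thesis code), the CNN windows and the 80/20 split could be produced as:

# 12 past samples (60 s) as input, 6 future samples (30 s) as the multi-step target
X_cnn, y_cnn = make_sequences(scaled_data, n_in=12, n_out=6)
y_cnn = y_cnn[:, :, 0]                           # predict one parameter at a time, e.g. differential pressure

split = int(0.8 * len(X_cnn))                    # 80 % for training, 20 % for validation
X_train, X_val = X_cnn[:split], X_cnn[split:]
y_train, y_val = y_cnn[:split], y_cnn[split:]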

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before being passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
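As with the LSTM, a Keras sketch of the described CNN (the framework, the hidden-layer activations and the optimiser are assumptions) could look like this:

from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, n_features)),
    layers.MaxPooling1D(pool_size=2),            # pool size 2, further reducing the feature map
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                             # one output per predicted future time step
])
cnn.compile(optimizer="adam", loss="mse")        # MAE and MSE were both evaluated as loss functions
cnn.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
        callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)])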

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
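For reference, with n predictions \(\hat{y}_i\) of the true values \(y_i\), the two loss functions are the standard definitions:

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]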

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
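For completeness, with \(y_i\) the true class (0 or 1) and \(p_i\) the predicted probability of belonging to that class, the binary cross-entropy minimised here is the standard expression:

\[ L_{\mathrm{BCE}} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] \]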


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                          Prediction
                          Label 1   Label 2
Actual      Label 1       109       1
            Label 2       3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                          Prediction
                          Label 1   Label 2
Actual      Label 1       82        29
            Label 2       38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                          Prediction
                          Label 1   Label 2
Actual      Label 1       69        41
            Label 2       11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, which is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely as this regression model is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good R²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE) – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography
Page 24: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 2 FRAME OF REFERENCE

tuning [37 38] As for data preparation an SNN input layer is restricted in strictlyprocessing 2D-arrays as input data

Deep Neural Networks (DNN)

DNNs contains more hidden layers in comparison to an SNN Explained in Goodfel-low et al [39] the multiple layers of the network can be thought of as approximatinga main function using multiple functions

f(x) = f (1) + f (2) + + f (n) (225)

where each function represents a layer and they all together describe a processThe purpose of adding additional layers is to break up the main function in manyfunctions so that certain functions do not have to be all descriptive but insteadonly take into consideration certain behaviour The proposed strategy by addingup towards thousands of layers has proved to continuously improve performanceof DNNs as presented by Zagoruyko et al [40] However adding layers does notnecessarily have to mean that the obtained performance is the best in terms of ac-curacy and efficiency as the same paper shows better results for many benchmarkdatasets using only 16 layers The same contradiction is explored and validated byBa et al [41] The reason for these contradictory results is unclear but the resultssuggests that the strength of deep learning (and DNNs) may be because of wellmatched architectures and existing training procedures of the deep networks

Mentioned in section 231 is that configurations of the hidden layers depend onthe objective of the NN The statement is partially proven true for the differencesin utilization of the two SNNs presented in section 234 and is further completedby a number of NN-configurations

Recurring Neural Networks(RNN)

Unlike the neurons in SNNs and DNNs an RNN has neurons which are state-basedThese state-based neurons allow the neurons to feed information from the previouspass of data to themselves The keeping of previous informations allows the networkto process and evaluate the context of how the information sent through the networkis structured An example is the ability to differentiate the two vectors

x1 =[0 0 1 1 0 0 0

]x2 =

[0 0 0 1 1 0 0

]where x1 and x2 are structurally different Were the sequence of the elements inthe vectors to be ignored there would be no difference about them Looking atcontext and sequencing thereby enables the analysis of events and data over timeThe addition of state-based neurons leads to more weights being updated througheach iteration as the adjustment of each state-based neuron individually is also

18

23 NEURAL NETWORKS

weight dependent The updating of the weights depends on the activation functionand an ill chosen activation function could lead to the vanishing gradient problem[42] which occurs when the gradient becomes incredibly small thus preventing theneuron from updating its weights The result of this is a rapid loss in information asthe weights through time become saturated causing previous state information tobe of no informatory value Similar to the vanishing gradient problem there is alsothe exploding gradient problem where the gradient instead becomes so incrediblyhuge that the weights are impossible to adjust

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishingexploding gradient problem LSTM networkswere developed with a dedicated memory cell The memory cell has three gatesinput (it) output (ot) and forget (ft) Inclusion of these gates allows for safeguard-ing the information that passes through the network either by stopping or allowingthe flow of information through the cell The gates can be represented by Equation226 with activation function (σ) weights for each respective gates neurons (wx)previous LSTM-block output at previous time step (htminus1) input at current timestep (xt) and respective gate bias (bx) as

it = σ(ωi

[htminus1 xt

]+ bi)

ot = σ(ωo

[htminus1 xt

]+ bo)

ft = σ(ωf

[htminus1 xt

]+ bf )

(226)

The input gate determines how much of the information from the previous layer thatgets stored in the cell The output gate determines how much the next layer getsto know about the state of the memory cell and the forget gate allows for completedismissal of information The need of forgetting information could be presented asodd at first but for sequencing it could be of value when learning something likea book When a new chapter begins it could be necessary to forget some of thecharacters from the previous chapter [43]

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate and the remaining two gates actslightly differently The first gate of the GRU is the update gate which dictateshow much information is being kept from the past state and how much informationis being let in from the previous layer The second gate is a reset gate which does asimilar job as the forget gate in an LSTM Unlike LSTMs GRU cells always outputtheir full state meaning there is no limitation in the information being fed throughthe network Using GRU instead of LSTM has achieved faster convergence in bothparameter updating and generalization as well as convergence in CPU time on somesequencing datasets for polyphonic music data and raw speech [44]

19

CHAPTER 2 FRAME OF REFERENCE

Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivitypattern of the animal visual cortex The visual cortex contains different sets of cellsand neurons that respond differently and individually to stimuli in the visual fieldLikewise the architecture of the CNN is designed to contain a special set of layersto allow the network to process restricted but overlapping parts of the data in sucha way that the entirety of the data is still processed This specifically allows thenetwork to process huge amounts of data by reducing the spatial size of the datasetwithout losing the features in the data

The input data are initially fed through a convolutional layer to perform the convo-lution operation The convolution operation employs a kernel which is a functionthat acts as a filter The kernel slides across the input data while continuouslyapplying the filter to the data effectively reducing the input data The input datathat have been processed by a convolution layer are known as a convolved feature

Figure 24 A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer The pooling layer issimilar to the convolution layer in that it further reduces the size of the data byreducing the spatial size of the convolved feature In doing so the computationalpower required to process the data can be reduced as the dimensionality of the datadecreases Pooling can be done in two ways max pooling or average pooling If amax pooling-layer is applied the maximum value that is contained within the kernelwill be the returned value whereas for an average pooling-layer the average value ofall values within the kernel would be returned The nature of the max pooling allowsit to act as noise suppressant as it ignores the noisy activations by only extractingthe maximum value as well as removes the noise by reducing the dimensionality ofthe data For the average pooling the noise remains within the data however itreduces somewhat through the dimensionality reduction The removal of noise inthe data results in the max pooling to perform a lot better than average pooling[45]

20

23 NEURAL NETWORKS

Figure 25 A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and pooling layercan be done repeatedly to reduce the dimensionality of the data even more Lastlyhowever before feeding the data into the neurons of the neural network the pooledfeature map is flattened in the flattening layer This results in a significantly reducedinput array in comparison to the original data and is the primary reason that CNNsare used for multiple purposes such as image recognition and classification [46]natural language processing [47] and time series analysis [48]

Figure 26 A flattening layer flattening the feature map

21

Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimen-tal platform was set-up

31 Data Gathering and ProcessingThe filtration data were obtained from filter tests done by Alfa Laval at the lakeMalmasjon over a span of 2 weeks A total of 11 test cycles were recorded Datawere gathered for the duration of a complete test cycle of the filter which lastedfor a runtime of at least 40 minutes Each data point was sampled every 5 secondsand contains sensor data for the differential pressure over the filter the fluid flowin the entire system the fluid pressure in the entire system and the fluid flow inthe backflush mechanism All data were then stored in Alfa Lavals cloud serviceConnectivity

Figure 31 A complete test cycle

23

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

During the filter testing and the gathering of the data the backflush was manuallyinterrupted after a runtime of circa 20 minutes causing an increase in both differ-ential pressure and system fluid flow as can be seen in Figure 31 This was doneprimarily to see how long the filter would cope with the dirtiness of the water duringoperation without its self-cleaning capabilities After discussions with Alfa Lavalabout how to label such an external interference it was decided to remove sectionsof the data containing the stopping of the backflush in order to get complete testcycles At this point the data are unlabelled in terms of clogging and the clogginglabelling is done by visual inspection of the differential pressure and the systemfluid flow

Figure 32 A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance to the threeclogging states discussed in section 214 and verified by visual inspection Labellingthe current clogging state as 1 implies that the differential pressure remains linearor that it is yet to pass its initial starting value The clogging label is changed to 2when the differential pressure begins to increase steadily and the system flow eitherremains constant or experiences minor receding effects The label is considered as3 when the change in differential pressure experiences exponential increase and thesystem flow is decreasing drastically No tests were conducted where a clogginglabel of 3 was identified

24

31 DATA GATHERING AND PROCESSING

Figure 33 Test data from one test labelled for no clogging (asterisk) and beginningto clog (dot)

Figure 33 shows the clogging labels and the corresponding differential pressureand system flow rate and existing system pressure for one test cycle A completeoverview of all labelled points in the data set can be seen in Figure 34

Figure 34 Test data from all tests labelled for no clogging (asterisk) and beginningto clog (dot)

As is observable a majority of the unclogged data is clustered around low differentialpressure while beginning clogging is more frequently found at higher differentialpressure Furthermore as the the two groups of labels have overlapping data pointsit can be noted that a linear classifier is not enough to distinguish the true label ofthe data as it cannot be entirely separated into two clusters A summary containing

25

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

the amount of data points and respective clogging labels for each test cycle can befound in Table 31

Table 31 Amount of data available after preprocessing

Test Samples Points labelled clog-1 Points labelled clog-2I 685 685 0II 220 25 195III 340 35 305IV 210 11 199V 375 32 343VI 355 7 348VII 360 78 282VIII 345 19 326IX 350 10 340X 335 67 268XI 340 43 297

Total 3195 1012 2903

When preprocessing was finished the entire dataset contains 3915 samples with 1012samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2

32 Model Generation

In order to obtain the data required for predicting clogging the pre-processed datawere put through two neural networks to be evaluated as a regression problemThe regression analysis allows for gathering and preparing a set of predicted valuesof each parameter as well as the corresponding clogging label From the conceptgeneration phase and the current use of neural networks to evaluate multivariatetime series two network models were used the LSTM and the CNN The LSTM toinitially test the suitability of the data for time series forecasting and the CNN formulti-step time series forecasting

Before the regression analysis could begin the pre-processed data had to be pro-cessed further to increase network accuracy and time efficiency Because of the factthat large values in the input data can result in a model forced to learn large weightsthus resulting in an unstable model a label transform and a scaler transform areapplied to the input data The purpose of the encoder transform is to retain thedifference between the determined clogging labels and the scaler transform ensuresthat the data is within an appropriate scale range

The label transform applied is known as one hot encoding One hot encoding takescategorical variables removes them and generates a binary representation of the

26

32 MODEL GENERATION

variables The encoding can be done for both integers and tags such as123

rarr1 0 0

0 1 00 0 1

or

redbluegreen

rarr1 0 0

0 1 00 0 1

so that each new column corresponds to a different value of the initial variableOne hot encoding ensures that each category is treated and predicted indifferentlywithout assuming that one category is more important because we want to equallypredict all the actual classification labels rather than prioritize a certain categoryThe precision of one hot encoding in comparison to other equally simple encodingtechniques has shown by Seger [49] to be equal Potdar et al [50] show that one hotencoding achieves sufficiently higher accuracy than simple encoding techniques butthat there are also more sophisticated options available that achieve higher accuracy

The scaler transform used is the min-max scaler The min-max scaler shrinksthe range of the dataset so that it is between 0 and 1 by applying the followingtransform to every feature

xi minusmin(x)max(x)minusmin(x) (31)

Using the min-max-scaler to normalize that data is useful because it helps to avoidthe generation of large weights The transform is also easy to inverse which makesit possible to revert back to the original values after processing

321 Regression Processing with the LSTM ModelBefore the data are sent through the LSTM each variable is processed by a sequenc-ing function (SF) The SF decides the amount of past values that should match afuture prediction In this case the function dictates the scale of the time windowof previous measurements to predict the measurement of one time step The LSTMmodel uses 5 previous values per prediction making the time window 25 secondslong and the prediction a 5 second foresight Each categorical variable in the orig-inal dataset is considered a feature in the data That means that by processingthe data through the sequencing function the set of features that correspond toone value is expanded accordingly with the time window The difference from theexpansion of the features can be described by Equation 32 and Equation 33 Itshould be noted that while the set of features per time step increases the size ofthe dataset is decreased proportionally to how many past time steps are used asmore measurements are required per time step

X(t) =[V1(t) V2(t) Vnminus1(t) Vn(t)

](32)

X(t) =[V1(tminus 5) V2(tminus 5) Vnminus1(t) Vn(t)

](33)

27

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced the data is split into a training set consisting of 80 of the dataand a validation set consisting of 20 of the data Both the training set and thevalidation set contains input data and the corresponding true output data Oncesplit the data can finally be reshaped to be put through the LSTM The reshapingensures that the three dimensions of the data are defined by

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step
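Continuing the sketch above, the split and the three-dimensional shape expected by the LSTM could look as follows (an assumed illustration, not the thesis code):

    # X and y come from the sequencing function above.
    split = int(0.8 * len(X))                 # 80% training, 20% validation
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]

    # If the windows were stored flat instead, they can be reshaped explicitly
    # into (samples, time steps, features):
    # X_train = X_train.reshape((X_train.shape[0], 5, n_features))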

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights towards a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way helps ensure that the network is not overfitted to the training data.
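A sketch of the described architecture and early stopping in Keras follows; the layer sizes, activations, epoch limit and patience follow the text, while the optimizer, the use of TensorFlow's Keras API and the name n_features (the number of features per time step) are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    model = Sequential([
        LSTM(32, activation='relu', return_sequences=True,
             input_shape=(5, n_features)),       # 5 past time steps per sample
        LSTM(32, activation='relu'),
        Dense(1, activation='sigmoid'),          # one neuron for parameter prediction
    ])
    model.compile(optimizer='adam', loss='mae')  # MAE or MSE, see Section 3.3

    early_stop = EarlyStopping(monitor='val_loss', patience=150)
    model.fit(X_train, y_train, epochs=1500,
              validation_data=(X_val, y_val), callbacks=[early_stop])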

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for the correct output for that input.
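A minimal sketch of such a sequence splitting function, under the same assumptions as for the LSTM and with the target variable in an assumed column index 0:

    import numpy as np

    def split_sequences(data, n_in=12, n_out=6, target_col=0):
        """Pair 12 past observations (60 s) with the 6 coming observations (30 s)."""
        X, y = [], []
        for i in range(len(data) - n_in - n_out + 1):
            X.append(data[i:i + n_in, :])                          # past 60 s window
            y.append(data[i + n_in:i + n_in + n_out, target_col])  # next 30 s of the target
        return np.array(X), np.array(y)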

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
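A sketch of the CNN described above in Keras; the filter count, kernel size, pool size, dense-layer sizes, epoch limit and patience follow the text, while the dense-layer activation, the optimizer and the training arrays (prepared as described above) are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation='relu',
               input_shape=(12, n_features)),   # 12 past observations per sample
        MaxPooling1D(pool_size=2),              # further reduces the feature map
        Flatten(),
        Dense(50, activation='relu'),
        Dense(6),                               # one node per predicted future step
    ])
    model.compile(optimizer='adam', loss='mae') # MAE or MSE, see Section 3.3

    early_stop = EarlyStopping(monitor='val_loss', patience=150)
    model.fit(X_train, y_train, epochs=1500,
              validation_data=(X_val, y_val), callbacks=[early_stop])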

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
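Schematically, and with purely illustrative placeholder names (sensor_windows, onehot_clogging_labels, classifier and early_stop are not taken from the thesis), the adjustment amounts to something like the following assumed sketch:

    # Inputs: only the sensor variables; outputs: only the one-hot clogging labels.
    X_cls = sensor_windows            # shape (samples, time steps, features)
    y_cls = onehot_clogging_labels    # shape (samples, number of label columns)

    split = int(0.8 * len(X_cls))
    X_train, X_val = X_cls[:split], X_cls[split:]
    y_train, y_val = y_cls[:split], y_cls[split:]

    # The classification networks reuse the regression architectures but are
    # compiled with binary cross-entropy (see Section 3.3).
    classifier.compile(optimizer='adam', loss='binary_crossentropy',
                       metrics=['accuracy'])
    classifier.fit(X_train, y_train, validation_data=(X_val, y_val),
                   epochs=1500, callbacks=[early_stop])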

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed given the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score when generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
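For reference, with y_i the true value, ŷ_i the predicted value and n the number of samples, the two loss functions can be written as

    \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|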

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
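For reference, with y_i ∈ {0, 1} the true label and p_i the predicted probability of the positive class, the binary cross-entropy over n samples is

    \mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]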


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032
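As a side note, metrics such as those in Table 4.1 can be computed with scikit-learn; a minimal sketch, assuming arrays y_true and y_pred from the validation set:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)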

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                       Label 1    Label 2
Actual    Label 1      109        1
          Label 2      3          669
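Similarly, the classification metrics and the confusion matrix can be computed with scikit-learn; a minimal sketch, assuming true labels y_true, predicted labels y_pred and predicted class probabilities y_prob:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 log_loss, roc_auc_score)

    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)
    f1 = f1_score(y_true, y_pred)
    ll = log_loss(y_true, y_prob)
    cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted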

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                       Label 1    Label 2
Actual    Label 1      82         29
          Label 2      38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                       Label 1    Label 2
Actual    Label 1      69         41
          Label 2      11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the R²-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. arXiv:1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. arXiv:1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. arXiv:1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv:1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. arXiv:1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv:1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography
In order to obtain the data required for predicting clogging the pre-processed datawere put through two neural networks to be evaluated as a regression problemThe regression analysis allows for gathering and preparing a set of predicted valuesof each parameter as well as the corresponding clogging label From the conceptgeneration phase and the current use of neural networks to evaluate multivariatetime series two network models were used the LSTM and the CNN The LSTM toinitially test the suitability of the data for time series forecasting and the CNN formulti-step time series forecasting

Before the regression analysis could begin the pre-processed data had to be pro-cessed further to increase network accuracy and time efficiency Because of the factthat large values in the input data can result in a model forced to learn large weightsthus resulting in an unstable model a label transform and a scaler transform areapplied to the input data The purpose of the encoder transform is to retain thedifference between the determined clogging labels and the scaler transform ensuresthat the data is within an appropriate scale range

The label transform applied is known as one hot encoding One hot encoding takescategorical variables removes them and generates a binary representation of the

26

32 MODEL GENERATION

variables The encoding can be done for both integers and tags such as123

rarr1 0 0

0 1 00 0 1

or

redbluegreen

rarr1 0 0

0 1 00 0 1

so that each new column corresponds to a different value of the initial variableOne hot encoding ensures that each category is treated and predicted indifferentlywithout assuming that one category is more important because we want to equallypredict all the actual classification labels rather than prioritize a certain categoryThe precision of one hot encoding in comparison to other equally simple encodingtechniques has shown by Seger [49] to be equal Potdar et al [50] show that one hotencoding achieves sufficiently higher accuracy than simple encoding techniques butthat there are also more sophisticated options available that achieve higher accuracy

The scaler transform used is the min-max scaler The min-max scaler shrinksthe range of the dataset so that it is between 0 and 1 by applying the followingtransform to every feature

xi minusmin(x)max(x)minusmin(x) (31)

Using the min-max-scaler to normalize that data is useful because it helps to avoidthe generation of large weights The transform is also easy to inverse which makesit possible to revert back to the original values after processing

321 Regression Processing with the LSTM ModelBefore the data are sent through the LSTM each variable is processed by a sequenc-ing function (SF) The SF decides the amount of past values that should match afuture prediction In this case the function dictates the scale of the time windowof previous measurements to predict the measurement of one time step The LSTMmodel uses 5 previous values per prediction making the time window 25 secondslong and the prediction a 5 second foresight Each categorical variable in the orig-inal dataset is considered a feature in the data That means that by processingthe data through the sequencing function the set of features that correspond toone value is expanded accordingly with the time window The difference from theexpansion of the features can be described by Equation 32 and Equation 33 Itshould be noted that while the set of features per time step increases the size ofthe dataset is decreased proportionally to how many past time steps are used asmore measurements are required per time step

X(t) =[V1(t) V2(t) Vnminus1(t) Vn(t)

](32)

X(t) =[V1(tminus 5) V2(tminus 5) Vnminus1(t) Vn(t)

](33)

27

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced the data is split into a training set consisting of 80 of the dataand a validation set consisting of 20 of the data Both the training set and thevalidation set contains input data and the corresponding true output data Oncesplit the data can finally be reshaped to be put through the LSTM The reshapingensures that the three dimensions of the data are defined by

bull Samples - The amount of data points

bull Time steps - The points of observation of the samples

bull Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function thatinitially processes the input data before they are passed to the output layer with thesigmoid activation function There the data output by the network is compared tothe true output data to adjust the weights for achieving a better output Each LSTMlayer contains 32 neurons and the output layer contains 1 neuron for parameterprediction

Figure 35 An overview of the LSTM network architecture

The training of the network is run for 1500 epochs but with a forced early stopwhen the validation loss has not seen any improvement for 150 subsequent epochsLimiting the network at training in such a way ensures that the network is notoverfitted to the training data

322 Regression Processing with the CNN Model

As with the LSTM the input data require some additional processing before it canbe fed through the CNN The dataset is fed through a sequence splitting function(SSF) that will extract samples from the dataset to give the data the correct di-mensions Just like the LSTM the dimensions are samples time steps and featuresSpecified in the SSF is the time window of past observations to be used for pre-diction as well as the amount of observations to be predicted The time windowfor past observations encompasses 12 observations and therefore uses observationsfrom the past 60 seconds whereas the time window for future predictions is set to 6

28

32 MODEL GENERATION

observations giving the predicted clogging state and rate for the coming 30 secondsThe dataset is then split into training and validation sets of 80 and 20 respec-tively of the amount of original data Like the data in the LSTM the training setand the validation set contains input data as well as data for what is the correctoutput for that input

The architecture of the network can be seen in Figure 36 The convolutional layertakes an argument to decide the amount of filterskernels to pass over the inputdata In this case 64 different filters are used and passed over the data with a ker-nel size of 4 time steps to generate the feature map The feature map then passesthrough the max pooling layer with a pool size of 2 further reducing the map Themap is then flattened before it is passed through two fully connected layers onewith 50 nodes and the last one with nodes to equally match the desired amount ofpredictions in this case 6

Figure 36 An overview of the CNN architecture

Similarly to the LSTM the CNN is set to be trained for 1500 epochs but witha forced early stop when the validation loss hasnrsquot seen any improvement for 150subsequent epochs

323 Label Classification

With the data from the regression analysis the label classification could be doneFor classification with the LSTM the same network structure was used as for regres-sion as it can be decided in the network directly which variable the network shouldpredict andor evaluate The data were again split into a training set consisting of80 of the data and a validation set consisting of 20 of the data

For the CNN two sets of data were extracted from the CNN networks used in theregression analysis The data consisted of the true observations y and the predictedobservations y for each parameter in the original dataset The true observationswere used for training and validation when creating the network and the predictedobservations were used for evaluating the accuracy of the clogging label classifica-tion Likewise the training and validation data were split into parts of 80 and

29

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

20 respectively The testing set was split into the same fractions but only thefraction of 20 was kept to equally match the size of the validation set

For classification the input data and output data were adjusted so that the inputdata only contained the values of the variables and the output data only containedthe clogging labels The adjustment learns the network that certain values of vari-ables correspond to a specific clogging label The classification CNN was trainedon the training data and validated on the validation set The data from the testingset were then fed through the network and compared to the validation set

33 Model evaluation

During training on both networks the weights in the layers are tweaked and opti-mized according to a loss function The loss function is selected to improve andevaluate the networks capabilities of achieving a high rate of classification or re-gression on certain variables In essence some loss functions are better suited forevaluating the networks than what they would be for a regression problem and viceversa This led to different loss functions being used when training the network forpredicting a future clogging labels than when training the network to predict futurevalues of system variables

For the regression analysis both MSE and MAE were used When using MSElarge errors would be more penalizing as they come from outliers and an overalllow MSE would indicate that the output is normally distributed based on the inputdata MAE would allow outliers to play a smaller role and produce a good MAEscore if the distribution is multimodal For a multimodal distribution a predictionat the mean of two modes would result in a bad score as is generated by the MSEwhile the MAE will allow for predictions at each individual mode To summariseMAE is more robust to outliers while MSE is more sensitive to outliers

For the clogging labels the network used a loss function for minimising the bi-nary cross-entropy (also known as log loss) As can be seen in Figure 37 identicalvalues of the same variable can belong to different clogging labels Therefore theloss function has to be able to deduce what clogging label a particular data pointbelongs to which is something that binary cross-entropy is capable of

30

34 HARDWARE SPECIFICATIONS

Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

34 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 38 An overview of the system and the location of the pressure transducers(PT) flow indicator transmitter (FIT) and flow meter (FM)

31

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
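The regression metrics in Table 4.1 can be computed from a vector of actual values and a vector of predicted values; the short sketch below shows one way of doing so with scikit-learn, using hypothetical values rather than the thesis data:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# y_true: actual sensor values, y_pred: network predictions (hypothetical numbers)
y_true = np.array([0.20, 0.35, 0.50, 0.65])
y_pred = np.array([0.22, 0.33, 0.52, 0.70])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, r2)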

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                    Prediction
                    Label 1   Label 2
Actual   Label 1    109       1
         Label 2    3         669
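The classification metrics in Tables 4.2 and 4.3 can be computed in a similar way; the sketch below uses scikit-learn with a handful of hypothetical labels and predicted probabilities, and is not the code used in the thesis:

import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             log_loss, confusion_matrix)

# y_true: actual clogging labels, y_prob: predicted probability of label 2 (hypothetical)
y_true = np.array([1, 1, 2, 2, 2, 2])
y_prob = np.array([0.2, 0.6, 0.8, 0.9, 0.7, 0.95])
y_pred = np.where(y_prob >= 0.5, 2, 1)

print(accuracy_score(y_true, y_pred))
print(roc_auc_score(y_true == 2, y_prob))
print(f1_score(y_true, y_pred, pos_label=2))
print(log_loss(y_true == 2, y_prob))
print(confusion_matrix(y_true, y_pred, labels=[1, 2]))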

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    82        29
         Label 2    38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    69        41
         Label 2    11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not surprising, as that regression model is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as of the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the R2-scores were 0.876 and 0.843 respectively, with overall lower scores on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distribution.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it isn't overfitting to one certain class, a more balanced dataset would be required.
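As a simple illustration of this point, a majority-class baseline can be computed directly from the class counts in Table 4.6; the snippet is only a sanity check and not part of the thesis code:

# Actual class counts taken from the confusion matrix in Table 4.6
n_label1 = 82 + 29    # 111 samples with actual label 1
n_label2 = 38 + 631   # 669 samples with actual label 2

# Accuracy of a classifier that always predicts label 2
baseline_accuracy = n_label2 / (n_label1 + n_label2)
print(f"{baseline_accuracy:.1%}")  # roughly 85.8%

A model therefore has to clearly exceed this baseline before its accuracy says anything about how well the minority class is learned.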

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS have to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
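As a rough sketch of what such pre-processing can look like, the function below builds sliding windows of past observations and matching future targets. The window lengths follow the set-up described in Chapter 3 (5 past steps with a one-step target for the LSTM, 12 past steps with 6 future steps for the CNN), while the function and variable names are chosen here purely for illustration:

import numpy as np

def make_windows(data, n_past, n_future):
    # Split a (samples, features) array into past windows X and future targets y
    X, y = [], []
    for i in range(n_past, len(data) - n_future + 1):
        X.append(data[i - n_past:i])      # the past n_past observations
        y.append(data[i:i + n_future])    # the next n_future observations
    return np.array(X), np.array(y)

# Dummy data: 100 time steps of 4 sensor variables
series = np.random.rand(100, 4)
X_lstm, y_lstm = make_windows(series, n_past=5, n_future=1)   # LSTM-style split
X_cnn, y_cnn = make_windows(series, n_past=12, n_future=6)    # CNN-style split
print(X_lstm.shape, y_lstm.shape)  # (95, 5, 4) (95, 1, 4)
print(X_cnn.shape, y_cnn.shape)    # (83, 12, 4) (83, 6, 4)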


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and the CNN given data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.
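A minimal sketch of such a baseline, assuming the statsmodels package and a one-dimensional series of differential pressure readings (generated as dummy data below), could look as follows; the model order is a placeholder rather than a tuned choice:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Dummy stand-in for a differential pressure series sampled every 5 seconds
pressure = np.cumsum(np.random.normal(0.01, 0.005, size=400))

# Fit a simple ARIMA(p, d, q) model
model = ARIMA(pressure, order=(2, 1, 1))
fitted = model.fit()

# Forecast the next 6 steps, i.e. 30 seconds at the 5-second sampling interval
forecast = fitted.forecast(steps=6)
print(forecast)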

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, April 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, March 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, July 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, June 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, April 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, October 2017.


TRITA-ITM-EX 2019:606

www.kth.se


On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 27: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
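As a minimal illustration of the dimensionality reduction described above (not taken from the thesis implementation), the following Keras sketch shows how a max pooling layer with pool size 2 and a flattening layer change the shape of a hypothetical feature map:

    import numpy as np
    from tensorflow.keras import layers

    # Hypothetical feature map: 1 sample, 8 time steps, 4 filter channels.
    feature_map = np.random.rand(1, 8, 4).astype("float32")

    pooled = layers.MaxPooling1D(pool_size=2)(feature_map)  # halves the time dimension
    flat = layers.Flatten()(pooled)                         # one vector per sample for the dense layers

    print(pooled.shape)  # (1, 4, 4)
    print(flat.shape)    # (1, 16)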


Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
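To make the labelling rule concrete, the sketch below shows one way such a labelling script could look. It is only an illustration under assumed column names and threshold values (dp_slope_mild, dp_slope_steep, flow_drop); the thesis does not state the exact rules or thresholds used in its script.

    import numpy as np
    import pandas as pd

    def label_clogging(df, dp_col="diff_pressure", flow_col="system_flow",
                       window=12, dp_slope_mild=0.001, dp_slope_steep=0.01,
                       flow_drop=0.10):
        """Assign clogging labels 1-3 from the trends of the differential
        pressure and the system flow rate (illustrative thresholds only)."""
        labels = np.ones(len(df), dtype=int)                 # label 1: no clogging
        dp_slope = df[dp_col].diff().rolling(window).mean()  # rolling slope of diff. pressure
        rel_flow = df[flow_col] / df[flow_col].iloc[0]       # flow relative to its starting value

        steady_rise = dp_slope > dp_slope_mild               # steady pressure increase
        sharp_rise = dp_slope > dp_slope_steep                # exponential-like increase
        flow_falling = rel_flow < 1.0 - flow_drop             # drastically decreasing flow

        labels[steady_rise.to_numpy()] = 2                   # label 2: beginning to clog
        labels[(sharp_rise & flow_falling).to_numpy()] = 3   # label 3: clogged
        return labels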


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data is within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to equally predict all the actual classification labels rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
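A minimal sketch of the label transform using scikit-learn's OneHotEncoder on the clogging labels (the exact implementation used in the thesis is not specified):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    labels = np.array([[1], [2], [2], [1], [2]])        # clogging labels as a column vector
    encoder = OneHotEncoder()                           # binary representation of each category
    onehot = encoder.fit_transform(labels).toarray()
    # [[1. 0.]
    #  [0. 1.]
    #  [0. 1.]
    #  [1. 0.]
    #  [0. 1.]]
    labels_back = encoder.inverse_transform(onehot)     # recover the original labels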

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)$$

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
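A minimal sketch of the scaler transform with scikit-learn's MinMaxScaler, which implements Equation 3.1; the example feature values are made up:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[0.20, 250.0],        # made-up values: differential pressure, system flow
                  [0.35, 240.0],
                  [0.90, 180.0]])

    scaler = MinMaxScaler()             # applies (x - min(x)) / (max(x) - min(x)) per feature
    X_scaled = scaler.fit_transform(X)  # every feature now lies in the range [0, 1]
    X_restored = scaler.inverse_transform(X_scaled)  # easy to revert after processing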

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.2)$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \qquad (3.3)$$
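A sketch of what such a sequencing function could look like, assuming the pre-processed data is a NumPy array with one row per 5-second sample; this is an illustration rather than the exact function used:

    import numpy as np

    def make_sequences(data, n_past=5):
        """Pair each measurement with the n_past preceding time steps.

        data is a (samples, features) array sampled every 5 seconds, so n_past=5
        gives a 25 second window and the target is the measurement 5 seconds ahead."""
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i, :])  # window of past observations
            y.append(data[i, :])             # the value one time step ahead
        return np.array(X), np.array(y)

    # X gets the shape (samples, time steps, features) expected by the LSTM.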


When sequenced, the data is split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network is compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
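A minimal Keras sketch consistent with the architecture and training scheme described above (two 32-neuron LSTM layers with ReLU, a single sigmoid output neuron, 1500 epochs with an early stop after 150 epochs without validation improvement). The optimiser, batch size and the variable names X_train, y_train, X_val, y_val are assumptions, as they are not stated here:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    # X_train, y_train, X_val, y_val come from the 80/20 split described above.
    n_steps, n_features = X_train.shape[1], X_train.shape[2]   # 5 past steps per prediction

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True,
             input_shape=(n_steps, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),        # one output neuron for the predicted parameter
    ])
    model.compile(optimizer="adam", loss="mae")   # or loss="mse"; both variants are compared

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, epochs=1500,
              validation_data=(X_val, y_val), callbacks=[early_stop])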

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that will extract samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss hasn't seen any improvement for 150 subsequent epochs.
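A corresponding Keras sketch of the CNN described above (64 filters with kernel size 4, max pooling with pool size 2, flattening, a 50-node dense layer and a 6-node output layer). The ReLU activations, optimiser and variable names are assumptions not stated in the text:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    # X_train, y_train, X_val, y_val come from the 80/20 split produced by the SSF.
    n_steps_in, n_steps_out = 12, 6            # 60 s of history, predictions 30 s ahead
    n_features = X_train.shape[2]

    cnn = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu",
               input_shape=(n_steps_in, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(n_steps_out),                    # one output per predicted future time step
    ])
    cnn.compile(optimizer="adam", loss="mae")  # loss="mse" for the second variant

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    cnn.fit(X_train, y_train, epochs=1500,
            validation_data=(X_val, y_val), callbacks=[early_stop])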

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
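For reference, the error metrics reported in Chapter 4 can be computed with scikit-learn as sketched below; y_true, y_pred, true_labels and probs are placeholders for the actual and predicted parameter values and the predicted clogging-label probabilities, not names from the thesis:

    import numpy as np
    from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                                 accuracy_score, roc_auc_score, f1_score, log_loss,
                                 confusion_matrix)

    # Regression metrics as reported in Tables 4.1 and 4.4.
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    # Classification metrics as reported in Tables 4.2 and 4.5.
    pred_labels = (probs > 0.5).astype(int)          # threshold the sigmoid output
    accuracy = accuracy_score(true_labels, pred_labels)
    auc = roc_auc_score(true_labels, probs)
    f1 = f1_score(true_labels, pred_labels)
    bce = log_loss(true_labels, probs)               # binary cross-entropy
    cm = confusion_matrix(true_labels, pred_labels)  # Tables 4.3, 4.6 and 4.7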


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud to a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                   Label 1    Label 2
Actual   Label 1   109        1
         Label 2   3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                   Label 1    Label 2
Actual   Label 1   82         29
         Label 2   38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                   Label 1    Label 2
Actual   Label 1   69         41
         Label 2   11         659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which isn't unlikely as this regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result provides a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted for a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. Although, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture, and amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Muller Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 28: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

Chapter 3

Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
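As an illustration, such a labelling rule could look like the Python sketch below. The column names, thresholds and rolling window are hypothetical; the actual script used in the thesis is not published.

import pandas as pd

def label_clogging(df, dp_col="diff_pressure", flow_col="system_flow",
                   dp_slope_limit=0.001, flow_drop_limit=0.5, window=12):
    # Rolling mean of the first differences approximates the local trend of each signal.
    dp_trend = df[dp_col].diff().rolling(window).mean()
    flow_trend = df[flow_col].diff().rolling(window).mean()
    start_dp = df[dp_col].iloc[0]

    labels = pd.Series(1, index=df.index)                    # default: no clogging (label 1)
    rising = (dp_trend > dp_slope_limit) & (df[dp_col] > start_dp)
    labels[rising] = 2                                        # steady pressure increase (label 2)
    labels[rising & (flow_trend < -flow_drop_limit)] = 3      # sharp rise with falling flow (label 3)
    return labels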


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after pre-processing.

Test     Samples   Points labelled clog-1   Points labelled clog-2
I          685              685                        0
II         220               25                      195
III        340               35                      305
IV         210               11                      199
V          375               32                      343
VI         355                7                      348
VII        360               78                      282
VIII       345               19                      326
IX         350               10                      340
X          335               67                      268
XI         340               43                      297

Total     3915             1012                     2903

When pre-processing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

\[ \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
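A minimal sketch of the label transform, assuming scikit-learn's OneHotEncoder (the thesis does not name the library used), is shown below. The transform is invertible, which matters later when the clogging labels need to be recovered from the network output.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [1], [2]])               # clogging labels as a column vector
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()      # e.g. 1 -> [1, 0], 2 -> [0, 1]
restored = encoder.inverse_transform(onehot)          # the encoding is easy to invert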

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\[ \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1} \]

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
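A minimal sketch of this scaling step, assuming scikit-learn's MinMaxScaler (which applies Equation 3.1 per feature), with illustrative sensor values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.12, 230.0],
              [0.34, 228.0],
              [0.80, 195.0]])                    # e.g. differential pressure and system flow
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)               # every column now lies in [0, 1]
X_restored = scaler.inverse_transform(X_scaled)  # revert to original units after prediction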

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

\[ X(t) = \left[ V_1(t),\ V_2(t),\ \ldots,\ V_{n-1}(t),\ V_n(t) \right] \tag{3.2} \]

\[ X(t) = \left[ V_1(t-5),\ V_2(t-5),\ \ldots,\ V_{n-1}(t),\ V_n(t) \right] \tag{3.3} \]
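A minimal sketch of such a sequencing function, consistent with the 5-step window described above (illustrative only, not the thesis' actual implementation):

import numpy as np

def make_sequences(data, n_past=5):
    # data: array of shape (n_samples, n_features), sampled every 5 seconds.
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # window of the 5 previous time steps (25 s)
        y.append(data[t])              # measurement one 5-second step ahead
    return np.array(X), np.array(y)    # X has shape (samples, n_past, n_features)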


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.
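A sketch of the described architecture, assuming Keras as the framework (the thesis does not state the framework; the optimiser and the feature count below are also assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 5, 5      # 5 past time steps; the feature count is an assumption
model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),     # first LSTM layer, 32 neurons
    LSTM(32, activation="relu"),                 # second LSTM layer, 32 neurons
    Dense(1, activation="sigmoid"),              # single output neuron for the predicted parameter
])
model.compile(optimizer="adam", loss="mae")      # MAE or MSE, as compared in Chapter 4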

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
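Continuing the same sketch, the training setup could look as follows, with X and y coming from the sequencing function and the 80/20 split described above (splitting without shuffling and restoring the best weights are assumptions):

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)
stopper = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=1500, callbacks=[stopper], verbose=0)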

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.
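A sketch of this architecture under the same Keras assumption (the activation functions and optimiser below are assumptions, as the thesis does not state them):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_steps_in, n_features, n_steps_out = 12, 5, 6    # 60 s of history -> 30 s of forecast
cnn = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),  # 64 filters over 4-step windows
    MaxPooling1D(pool_size=2),                     # halves the feature map
    Flatten(),
    Dense(50, activation="relu"),                  # first fully connected layer
    Dense(n_steps_out),                            # one node per predicted future observation
])
cnn.compile(optimizer="adam", loss="mse")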

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss hasn't seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors would be more penalising, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
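For reference, the two regression loss functions have the standard definitions below, where y_i denotes a true value and ŷ_i a prediction over n samples:

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2 \]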

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
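The standard form of the binary cross-entropy over n samples, with y_i ∈ {0, 1} the true class and p̂_i the predicted probability of class 1, is:

\[ L = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i) \,\right] \]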


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                  Prediction
                  Label 1   Label 2
Actual  Label 1       109         1
        Label 2         3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                  Prediction
                  Label 1   Label 2
Actual  Label 1        82        29
        Label 2        38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                  Prediction
                  Label 1   Label 2
Actual  Label 1        69        41
        Label 2        11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which isn't unexpected as that model is particularly sensitive to outliers.

The high R² score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good R² score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R² scores were 0.876 and 0.843 respectively, with overall lower scores on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1 score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC score and the F1 score do not differ drastically in magnitude, getting a significantly lower AUC score than F1 score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC score and F1 score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandala. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jaroslaw Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 29: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

During the filter testing and the gathering of the data the backflush was manuallyinterrupted after a runtime of circa 20 minutes causing an increase in both differ-ential pressure and system fluid flow as can be seen in Figure 31 This was doneprimarily to see how long the filter would cope with the dirtiness of the water duringoperation without its self-cleaning capabilities After discussions with Alfa Lavalabout how to label such an external interference it was decided to remove sectionsof the data containing the stopping of the backflush in order to get complete testcycles At this point the data are unlabelled in terms of clogging and the clogginglabelling is done by visual inspection of the differential pressure and the systemfluid flow

Figure 32 A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance to the threeclogging states discussed in section 214 and verified by visual inspection Labellingthe current clogging state as 1 implies that the differential pressure remains linearor that it is yet to pass its initial starting value The clogging label is changed to 2when the differential pressure begins to increase steadily and the system flow eitherremains constant or experiences minor receding effects The label is considered as3 when the change in differential pressure experiences exponential increase and thesystem flow is decreasing drastically No tests were conducted where a clogginglabel of 3 was identified

24

31 DATA GATHERING AND PROCESSING

Figure 33 Test data from one test labelled for no clogging (asterisk) and beginningto clog (dot)

Figure 33 shows the clogging labels and the corresponding differential pressureand system flow rate and existing system pressure for one test cycle A completeoverview of all labelled points in the data set can be seen in Figure 34

Figure 34 Test data from all tests labelled for no clogging (asterisk) and beginningto clog (dot)

As is observable a majority of the unclogged data is clustered around low differentialpressure while beginning clogging is more frequently found at higher differentialpressure Furthermore as the the two groups of labels have overlapping data pointsit can be noted that a linear classifier is not enough to distinguish the true label ofthe data as it cannot be entirely separated into two clusters A summary containing

25

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

the amount of data points and respective clogging labels for each test cycle can befound in Table 31

Table 31 Amount of data available after preprocessing

Test Samples Points labelled clog-1 Points labelled clog-2I 685 685 0II 220 25 195III 340 35 305IV 210 11 199V 375 32 343VI 355 7 348VII 360 78 282VIII 345 19 326IX 350 10 340X 335 67 268XI 340 43 297

Total 3195 1012 2903

When preprocessing was finished the entire dataset contains 3915 samples with 1012samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2

32 Model Generation

In order to obtain the data required for predicting clogging the pre-processed datawere put through two neural networks to be evaluated as a regression problemThe regression analysis allows for gathering and preparing a set of predicted valuesof each parameter as well as the corresponding clogging label From the conceptgeneration phase and the current use of neural networks to evaluate multivariatetime series two network models were used the LSTM and the CNN The LSTM toinitially test the suitability of the data for time series forecasting and the CNN formulti-step time series forecasting

Before the regression analysis could begin the pre-processed data had to be pro-cessed further to increase network accuracy and time efficiency Because of the factthat large values in the input data can result in a model forced to learn large weightsthus resulting in an unstable model a label transform and a scaler transform areapplied to the input data The purpose of the encoder transform is to retain thedifference between the determined clogging labels and the scaler transform ensuresthat the data is within an appropriate scale range

The label transform applied is known as one hot encoding One hot encoding takescategorical variables removes them and generates a binary representation of the

26

32 MODEL GENERATION

variables The encoding can be done for both integers and tags such as123

rarr1 0 0

0 1 00 0 1

or

redbluegreen

rarr1 0 0

0 1 00 0 1

so that each new column corresponds to a different value of the initial variableOne hot encoding ensures that each category is treated and predicted indifferentlywithout assuming that one category is more important because we want to equallypredict all the actual classification labels rather than prioritize a certain categoryThe precision of one hot encoding in comparison to other equally simple encodingtechniques has shown by Seger [49] to be equal Potdar et al [50] show that one hotencoding achieves sufficiently higher accuracy than simple encoding techniques butthat there are also more sophisticated options available that achieve higher accuracy

The scaler transform used is the min-max scaler The min-max scaler shrinksthe range of the dataset so that it is between 0 and 1 by applying the followingtransform to every feature

xi minusmin(x)max(x)minusmin(x) (31)

Using the min-max-scaler to normalize that data is useful because it helps to avoidthe generation of large weights The transform is also easy to inverse which makesit possible to revert back to the original values after processing

321 Regression Processing with the LSTM ModelBefore the data are sent through the LSTM each variable is processed by a sequenc-ing function (SF) The SF decides the amount of past values that should match afuture prediction In this case the function dictates the scale of the time windowof previous measurements to predict the measurement of one time step The LSTMmodel uses 5 previous values per prediction making the time window 25 secondslong and the prediction a 5 second foresight Each categorical variable in the orig-inal dataset is considered a feature in the data That means that by processingthe data through the sequencing function the set of features that correspond toone value is expanded accordingly with the time window The difference from theexpansion of the features can be described by Equation 32 and Equation 33 Itshould be noted that while the set of features per time step increases the size ofthe dataset is decreased proportionally to how many past time steps are used asmore measurements are required per time step

X(t) =[V1(t) V2(t) Vnminus1(t) Vn(t)

](32)

X(t) =[V1(tminus 5) V2(tminus 5) Vnminus1(t) Vn(t)

](33)

27

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced the data is split into a training set consisting of 80 of the dataand a validation set consisting of 20 of the data Both the training set and thevalidation set contains input data and the corresponding true output data Oncesplit the data can finally be reshaped to be put through the LSTM The reshapingensures that the three dimensions of the data are defined by

bull Samples - The amount of data points

bull Time steps - The points of observation of the samples

bull Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function thatinitially processes the input data before they are passed to the output layer with thesigmoid activation function There the data output by the network is compared tothe true output data to adjust the weights for achieving a better output Each LSTMlayer contains 32 neurons and the output layer contains 1 neuron for parameterprediction

Figure 35 An overview of the LSTM network architecture

The training of the network is run for 1500 epochs but with a forced early stopwhen the validation loss has not seen any improvement for 150 subsequent epochsLimiting the network at training in such a way ensures that the network is notoverfitted to the training data

322 Regression Processing with the CNN Model

As with the LSTM the input data require some additional processing before it canbe fed through the CNN The dataset is fed through a sequence splitting function(SSF) that will extract samples from the dataset to give the data the correct di-mensions Just like the LSTM the dimensions are samples time steps and featuresSpecified in the SSF is the time window of past observations to be used for pre-diction as well as the amount of observations to be predicted The time windowfor past observations encompasses 12 observations and therefore uses observationsfrom the past 60 seconds whereas the time window for future predictions is set to 6

28

32 MODEL GENERATION

observations giving the predicted clogging state and rate for the coming 30 secondsThe dataset is then split into training and validation sets of 80 and 20 respec-tively of the amount of original data Like the data in the LSTM the training setand the validation set contains input data as well as data for what is the correctoutput for that input

The architecture of the network can be seen in Figure 36 The convolutional layertakes an argument to decide the amount of filterskernels to pass over the inputdata In this case 64 different filters are used and passed over the data with a ker-nel size of 4 time steps to generate the feature map The feature map then passesthrough the max pooling layer with a pool size of 2 further reducing the map Themap is then flattened before it is passed through two fully connected layers onewith 50 nodes and the last one with nodes to equally match the desired amount ofpredictions in this case 6

Figure 36 An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
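
A Keras-style sketch of the CNN in Figure 3.6, under the same framework assumption as the LSTM sketch (activation choices not stated in the text are assumptions):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    cnn = Sequential([
        Conv1D(filters=64, kernel_size=4, activation='relu',
               input_shape=(12, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation='relu'),
        Dense(6),                       # one output per predicted future observation
    ])
    cnn.compile(optimizer='adam', loss='mae')   # or 'mse' for the MSE variant
    cnn.fit(X_train, y_train, validation_data=(X_val, y_val),
            epochs=1500, callbacks=[EarlyStopping(monitor='val_loss', patience=150)])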

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
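
A sketch of the classifier variant and its evaluation on the regression output, keeping the earlier assumptions (the dataset names X_true_*, X_pred_test and the two-column one hot labels are illustrative):

    # Same CNN front end, but the output layer now gives the one hot encoded
    # clogging label and the loss is binary cross-entropy.
    clf = Sequential([
        Conv1D(filters=64, kernel_size=4, activation='relu',
               input_shape=(12, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation='relu'),
        Dense(2, activation='sigmoid'),
    ])
    clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    clf.fit(X_true_train, y_label_train,
            validation_data=(X_true_val, y_label_val),
            epochs=1500, callbacks=[EarlyStopping(monitor='val_loss', patience=150)])

    # Accuracy of the clogging label classification on the values predicted by
    # the regression network.
    clf.evaluate(X_pred_test, y_label_test)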

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks on a classification problem than on a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors, such as those coming from outliers, are penalized more heavily, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to them.
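
For reference, the two regression losses are the standard definitions (with y_i the true value, ŷ_i the predicted value and n the number of samples):

    \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right|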

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
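
The binary cross-entropy over n samples, with y_i ∈ {0, 1} the true label and p̂_i the predicted probability of the positive class, is the standard definition:

    \mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i\log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\,\right]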

Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).

The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.

Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.

Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.

Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                       Label 1   Label 2
Actual   Label 1         109         1
         Label 2           3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.

Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.

Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                       Label 1   Label 2
Actual   Label 1          82        29
         Label 2          38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                       Label 1   Label 2
Actual   Label 1          69        41
         Label 2          11       659

Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.

Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems – Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 2017.

TRITA TRITA-ITM-EX 2019:606

www.kth.se


On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 31: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685          685                        0
II        220           25                      195
III       340           35                      305
IV        210           11                      199
V         375           32                      343
VI        355            7                      348
VII       360           78                      282
VIII      345           19                      326
IX        350           10                      340
X         335           67                      268
XI        340           43                      297

Total    3915         1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN was used for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data is within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done both for integers and for tags, such as

    1      →   1 0 0
    2      →   0 1 0
    3      →   0 0 1

or

    red    →   1 0 0
    blue   →   0 1 0
    green  →   0 0 1

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other, equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve higher accuracy.
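As a small illustration of the label transform, a minimal sketch in plain NumPy (the thesis does not state which library was used for the encoding, so the implementation details here are assumptions):

    import numpy as np

    # Clogging labels from the dataset, e.g. label 1 and label 2.
    labels = np.array([1, 2, 2, 1, 2])

    # One column per distinct label value; a 1 marks the label of each sample.
    classes = np.unique(labels)                        # [1 2]
    one_hot = (labels[:, None] == classes).astype(float)
    # [[1. 0.]
    #  [0. 1.]
    #  [0. 1.]
    #  [1. 0.]
    #  [0. 1.]]

    # The encoding is easy to invert back to the original labels.
    recovered = classes[one_hot.argmax(axis=1)]        # [1 2 2 1 2]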

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

    (x_i - min(x)) / (max(x) - min(x))    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
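A minimal sketch of the scaler transform, here using scikit-learn's MinMaxScaler, which implements Equation 3.1 per feature (the choice of library and the example values are assumptions made for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Example sensor matrix: rows are time steps, columns are features
    # (e.g. differential pressure, system flow rate); values are made up.
    X = np.array([[0.35, 190.0],
                  [0.40, 200.0],
                  [0.55, 185.0]])

    scaler = MinMaxScaler()                  # scales every feature to [0, 1]
    X_scaled = scaler.fit_transform(X)

    # The transform is easy to invert after processing.
    X_restored = scaler.inverse_transform(X_scaled)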

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

    X(t) = [V1(t), V2(t), ..., Vn-1(t), Vn(t)]            (3.2)

    X(t) = [V1(t-5), V2(t-5), ..., Vn-1(t), Vn(t)]        (3.3)
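The thesis does not list the implementation of the sequencing function, but a sketch of the idea (a sliding window of 5 past time steps used to predict the observation one step ahead, returning directly the (samples, time steps, features) shape discussed below) could look as follows; the function and variable names are chosen here for illustration only:

    import numpy as np

    def make_sequences(data, n_past=5):
        """Turn a (time steps, features) array into windows of n_past steps
        paired with the observation one step ahead of each window."""
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i])   # the n_past previous observations
            y.append(data[i])              # the observation to predict
        return np.array(X), np.array(y)

    # Example: 100 time steps of 4 variables -> 95 samples of shape (5, 4).
    series = np.random.rand(100, 4)
    X, y = make_sequences(series)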


When sequenced, the data is split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points
• Time steps - the points of observation of the samples
• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network is compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
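A sketch of how such a network could be defined, assuming a Keras implementation (the library, the optimizer and the dummy data shapes are assumptions; the layer sizes, activations, epoch limit and early-stopping patience follow the description above):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    # Dummy stand-in data shaped (samples, 5 time steps, 4 observed variables).
    X = np.random.rand(1000, 5, 4)
    y = np.random.rand(1000, 1)

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True, input_shape=(5, 4)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),        # one neuron for parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")   # MAE or MSE, cf. Section 3.3

    # Stop training when the validation loss has not improved for 150 epochs.
    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X, y, validation_split=0.2, epochs=1500,
              callbacks=[early_stop], verbose=0)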

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
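Again, the exact implementation is not given in the thesis, but a sketch of a sequence splitting function with a 12-step input window and a 6-step output window for a single target variable (names chosen for illustration) might look like this:

    import numpy as np

    def split_sequences(data, target_col, n_in=12, n_out=6):
        """Windows of n_in past observations of all features, paired with the
        next n_out values of the chosen target variable."""
        X, y = [], []
        for i in range(n_in, len(data) - n_out + 1):
            X.append(data[i - n_in:i])                # past 12 observations
            y.append(data[i:i + n_out, target_col])   # coming 6 target values
        return np.array(X), np.array(y)

    # Example: predict the coming 6 values of feature 0 (e.g. differential pressure).
    series = np.random.rand(300, 4)
    X, y = split_sequences(series, target_col=0)      # X: (283, 12, 4), y: (283, 6)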

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the amount of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
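A sketch of the corresponding Keras definition (the hidden-layer activation functions, the optimizer and the number of input features are not specified in the text and are assumptions here):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    n_features = 4   # number of observed variables (illustrative)

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu",
               input_shape=(12, n_features)),      # 12 past observations
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(6),                                  # 6 future values are predicted
    ])
    model.compile(optimizer="adam", loss="mae")    # MAE or MSE, cf. Section 3.3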

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification the input data and output data were adjusted so that the inputdata only contained the values of the variables and the output data only containedthe clogging labels The adjustment learns the network that certain values of vari-ables correspond to a specific clogging label The classification CNN was trainedon the training data and validated on the validation set The data from the testingset were then fed through the network and compared to the validation set
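One possible way the classification variant could be set up, reusing the convolutional architecture sketched above but with the one-hot encoded clogging labels as output and the binary cross-entropy loss described in Section 3.3 (layer choices, activations and the optimizer are assumptions):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    n_features = 4   # number of observed variables (illustrative)

    clf = Sequential([
        Conv1D(64, kernel_size=4, activation="relu", input_shape=(12, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(2, activation="sigmoid"),   # one output per clogging label (one-hot)
    ])
    clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])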

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than they are for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
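As a small numerical illustration of this difference (the residual values below are made up): a single outlier dominates the MSE but has a much smaller effect on the MAE.

    import numpy as np

    errors = np.array([1.0, 1.0, 1.0, 10.0])   # one outlier among small residuals
    mae = np.abs(errors).mean()                # (1 + 1 + 1 + 10) / 4  = 3.25
    mse = (errors ** 2).mean()                 # (1 + 1 + 1 + 100) / 4 = 25.75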

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
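The regression metrics in Table 4.1 (and later Table 4.4) can be computed from the true and predicted values, for example with scikit-learn; the snippet below is a sketch with placeholder arrays, not the evaluation code used in the thesis:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([0.30, 0.32, 0.35, 0.40])   # placeholder actual values
    y_pred = np.array([0.31, 0.33, 0.34, 0.41])   # placeholder predicted values

    mse  = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)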

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                        Prediction
                        Label 1   Label 2
Actual    Label 1       109       1
          Label 2       3         669
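Similarly, the classification metrics and the confusion matrix in Tables 4.2 and 4.3 can be obtained along these lines (a sketch with placeholder labels; predicted probabilities are needed for the ROC-AUC and log-loss scores, and the thesis does not specify the tooling used):

    import numpy as np
    from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                                 log_loss, confusion_matrix)

    y_true = np.array([1, 2, 2, 1, 2, 2])               # actual clogging labels
    y_prob = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.6])   # predicted P(label 2)
    y_pred = np.where(y_prob >= 0.5, 2, 1)

    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true == 2, y_prob)
    f1  = f1_score(y_true, y_pred, pos_label=2)
    ll  = log_loss(y_true == 2, y_prob)
    cm  = confusion_matrix(y_true, y_pred)              # rows: actual, columns: predicted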

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                        Prediction
                        Label 1   Label 2
Actual    Label 1       82        29
          Label 2       38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                        Prediction
                        Label 1   Label 2
Actual    Label 1       69        41
          Label 2       11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be seen from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly on data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed because of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. It is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly (predicting label 2 for every sample would already give roughly 86% accuracy given the class distributions in Tables 4.6 and 4.7). To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 2019, 14, 45–79. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography
Page 32: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

32 MODEL GENERATION

One hot encoding transforms a categorical variable into a set of binary variables. The encoding can be done for both integers and tags, for example

$$\{1,\, 2,\, 3\} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \{\text{red},\, \text{blue},\, \text{green}\} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted without assuming that one category is more important than another, which is desirable here because all the actual classification labels should be predicted equally well rather than a certain category being prioritised. Seger [49] has shown the precision of one hot encoding to be on par with other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simpler encoding techniques, but also that there are more sophisticated options available that achieve higher accuracy still.
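As a concrete illustration, the mapping above can be produced with a few lines of Python; the tag values below are placeholders rather than the actual clogging labels used in the thesis.

```python
# Minimal one hot encoding sketch using pandas; each distinct tag becomes
# its own 0/1 column, matching the matrices shown above.
import pandas as pd

tags = pd.Series(["red", "blue", "green", "blue"])
one_hot = pd.get_dummies(tags)   # columns: blue, green, red
print(one_hot)
```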

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
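A hedged sketch of how this scaling and its inverse can be applied with scikit-learn is shown below; the array contents are placeholders, not data from the test rig.

```python
# Min-max scaling of each feature to [0, 1] and the inverse transform used to
# recover the original units after the network has produced its predictions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.2, 10.0], [0.5, 40.0], [0.9, 25.0]])  # (samples, features)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)          # every column now spans 0..1
restored = scaler.inverse_transform(scaled)  # back to the original values
```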

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides how many past values should be matched to a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \left[\, V_1(t),\; V_2(t),\; \ldots,\; V_{n-1}(t),\; V_n(t) \,\right] \tag{3.2}$$

$$X(t) = \left[\, V_1(t-5),\; V_2(t-5),\; \ldots,\; V_{n-1}(t),\; V_n(t) \,\right] \tag{3.3}$$
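A minimal sketch of such a sequencing function is given below; the function name and the placeholder data are illustrative assumptions, not the exact implementation used in the thesis.

```python
# Slide a window of n_past observations over the data so that each sample
# holds the previous measurements used to predict the value one step ahead.
import numpy as np

def sequence(data: np.ndarray, n_past: int = 5):
    # data: (time steps, features)
    # returns X: (samples, n_past, features), y: (samples, features)
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the n_past previous observations
        y.append(data[i])              # the observation one step ahead
    return np.array(X), np.array(y)

X, y = sequence(np.random.rand(100, 4))   # placeholder: 100 steps, 4 features
```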


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in this way ensures that the network is not overfitted to the training data.
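The description above translates roughly into the following sketch; the thesis does not specify the framework, so Keras is used here only for illustration, and the optimiser, batch size and random placeholder data are assumptions made to keep the example self-contained.

```python
# Two 32-neuron LSTM layers with ReLU, a single sigmoid output neuron, and
# early stopping after 150 epochs without improvement in validation loss.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 4   # 5 past observations; 4 system variables (assumed)
X_train, y_train = np.random.rand(800, n_steps, n_features), np.random.rand(800, 1)
X_val, y_val = np.random.rand(200, n_steps, n_features), np.random.rand(200, 1)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mae")   # or "mse" for the MSE variant

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500,
          callbacks=[EarlyStopping(monitor="val_loss", patience=150)])
```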

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
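A corresponding hedged Keras sketch of the CNN is given below; as before, the framework choice, optimiser and placeholder data shapes are assumptions, while the layer sizes follow the description above.

```python
# Conv1D with 64 filters and kernel size 4, max pooling of size 2, flattening,
# a 50-node dense layer and a 6-node output (one value per future time step).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps_in, n_steps_out, n_features = 12, 6, 4   # 60 s of history, 30 s ahead
X_train = np.random.rand(800, n_steps_in, n_features)   # placeholder data
y_train = np.random.rand(800, n_steps_out)
X_val = np.random.rand(200, n_steps_in, n_features)
y_val = np.random.rand(200, n_steps_out)

model = Sequential([
    Conv1D(64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),
])
model.compile(optimizer="adam", loss="mae")   # or "mse" for the MSE variant
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500,
          callbacks=[EarlyStopping(monitor="val_loss", patience=150)])
```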

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
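A minimal sketch of this adjustment is shown below, assuming the processed data sit in a pandas DataFrame; the column names and the mapping of labels to 0/1 are hypothetical.

```python
# Separate the sensor variables (network input) from the clogging label (target).
import pandas as pd

df = pd.DataFrame({
    "diff_pressure":  [0.12, 0.15, 0.18],
    "system_flow":    [0.80, 0.79, 0.77],
    "clogging_label": [1, 1, 2],
})
X = df.drop(columns=["clogging_label"]).to_numpy()       # values of the variables
y = (df["clogging_label"] == 2).astype(int).to_numpy()   # binary target for log loss
```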

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting a future clogging label than when training the network to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
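For reference, the standard definitions of the two loss functions, with $y_i$ the true value, $\hat{y}_i$ the predicted value and $n$ the number of samples, are

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|.$$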

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
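The standard form of the binary cross-entropy over $n$ samples, with $y_i \in \{0, 1\}$ the true label and $\hat{y}_i$ the predicted probability, is

$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\Big].$$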


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. Together, the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected as that regression model is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each increment is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as its loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.
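To make this point concrete, the sketch below uses made-up labels in roughly the same proportions as Tables 4.6 and 4.7; a classifier that only ever predicts the majority class still obtains high accuracy and F1, while the AUC reveals that nothing has been learned.

```python
# With a heavily imbalanced label set, always predicting the majority class
# yields high accuracy and F1 but an AUC of 0.5 (no discriminative power).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1] * 670 + [0] * 110)   # roughly the Label 2 / Label 1 ratio
y_pred = np.ones_like(y_true)              # always predict the majority class

print(accuracy_score(y_true, y_pred))      # about 0.86 despite learning nothing
print(f1_score(y_true, y_pred))            # about 0.92 for the majority class
print(roc_auc_score(y_true, y_pred))       # 0.5, exposing the imbalance issue
```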

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

It would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced the data is split into a training set consisting of 80 of the dataand a validation set consisting of 20 of the data Both the training set and thevalidation set contains input data and the corresponding true output data Oncesplit the data can finally be reshaped to be put through the LSTM The reshapingensures that the three dimensions of the data are defined by

bull Samples - The amount of data points

bull Time steps - The points of observation of the samples

bull Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function thatinitially processes the input data before they are passed to the output layer with thesigmoid activation function There the data output by the network is compared tothe true output data to adjust the weights for achieving a better output Each LSTMlayer contains 32 neurons and the output layer contains 1 neuron for parameterprediction

Figure 35 An overview of the LSTM network architecture

The training of the network is run for 1500 epochs but with a forced early stopwhen the validation loss has not seen any improvement for 150 subsequent epochsLimiting the network at training in such a way ensures that the network is notoverfitted to the training data

322 Regression Processing with the CNN Model

As with the LSTM the input data require some additional processing before it canbe fed through the CNN The dataset is fed through a sequence splitting function(SSF) that will extract samples from the dataset to give the data the correct di-mensions Just like the LSTM the dimensions are samples time steps and featuresSpecified in the SSF is the time window of past observations to be used for pre-diction as well as the amount of observations to be predicted The time windowfor past observations encompasses 12 observations and therefore uses observationsfrom the past 60 seconds whereas the time window for future predictions is set to 6

28

32 MODEL GENERATION

observations giving the predicted clogging state and rate for the coming 30 secondsThe dataset is then split into training and validation sets of 80 and 20 respec-tively of the amount of original data Like the data in the LSTM the training setand the validation set contains input data as well as data for what is the correctoutput for that input

The architecture of the network can be seen in Figure 36 The convolutional layertakes an argument to decide the amount of filterskernels to pass over the inputdata In this case 64 different filters are used and passed over the data with a ker-nel size of 4 time steps to generate the feature map The feature map then passesthrough the max pooling layer with a pool size of 2 further reducing the map Themap is then flattened before it is passed through two fully connected layers onewith 50 nodes and the last one with nodes to equally match the desired amount ofpredictions in this case 6

Figure 36 An overview of the CNN architecture

Similarly to the LSTM the CNN is set to be trained for 1500 epochs but witha forced early stop when the validation loss hasnrsquot seen any improvement for 150subsequent epochs

323 Label Classification

With the data from the regression analysis the label classification could be doneFor classification with the LSTM the same network structure was used as for regres-sion as it can be decided in the network directly which variable the network shouldpredict andor evaluate The data were again split into a training set consisting of80 of the data and a validation set consisting of 20 of the data

For the CNN two sets of data were extracted from the CNN networks used in theregression analysis The data consisted of the true observations y and the predictedobservations y for each parameter in the original dataset The true observationswere used for training and validation when creating the network and the predictedobservations were used for evaluating the accuracy of the clogging label classifica-tion Likewise the training and validation data were split into parts of 80 and

29

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

20 respectively The testing set was split into the same fractions but only thefraction of 20 was kept to equally match the size of the validation set

For classification the input data and output data were adjusted so that the inputdata only contained the values of the variables and the output data only containedthe clogging labels The adjustment learns the network that certain values of vari-ables correspond to a specific clogging label The classification CNN was trainedon the training data and validated on the validation set The data from the testingset were then fed through the network and compared to the validation set

33 Model evaluation

During training on both networks the weights in the layers are tweaked and opti-mized according to a loss function The loss function is selected to improve andevaluate the networks capabilities of achieving a high rate of classification or re-gression on certain variables In essence some loss functions are better suited forevaluating the networks than what they would be for a regression problem and viceversa This led to different loss functions being used when training the network forpredicting a future clogging labels than when training the network to predict futurevalues of system variables

For the regression analysis both MSE and MAE were used When using MSElarge errors would be more penalizing as they come from outliers and an overalllow MSE would indicate that the output is normally distributed based on the inputdata MAE would allow outliers to play a smaller role and produce a good MAEscore if the distribution is multimodal For a multimodal distribution a predictionat the mean of two modes would result in a bad score as is generated by the MSEwhile the MAE will allow for predictions at each individual mode To summariseMAE is more robust to outliers while MSE is more sensitive to outliers

For the clogging labels the network used a loss function for minimising the bi-nary cross-entropy (also known as log loss) As can be seen in Figure 37 identicalvalues of the same variable can belong to different clogging labels Therefore theloss function has to be able to deduce what clogging label a particular data pointbelongs to which is something that binary cross-entropy is capable of

30

34 HARDWARE SPECIFICATIONS

Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

34 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 38 An overview of the system and the location of the pressure transducers(PT) flow indicator transmitter (FIT) and flow meter (FM)

31

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be reduced by having specific datasets dedicated to Label 1 and Label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be affected by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
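To make the idea concrete, a minimal sketch of such windowing pre-processing is given below. It assumes a 1 Hz multivariate sensor matrix and uses illustrative window lengths (five past steps as input, with targets one or thirty steps ahead); it is not the exact pipeline used in the thesis.

# Sliding-window pre-processing: `lookback` past steps form the input window
# and `horizon` selects how far ahead the target values are taken.
import numpy as np

def make_windows(data: np.ndarray, lookback: int, horizon: int):
    X, y = [], []
    for t in range(lookback, len(data) - horizon):
        X.append(data[t - lookback:t])    # values from t-lookback .. t-1
        y.append(data[t + horizon - 1])   # values `horizon` steps ahead
    return np.array(X), np.array(y)

data = np.random.rand(2000, 5)            # stand-in sensor matrix (pressures, flows, label)
X_lstm, y_lstm = make_windows(data, lookback=5, horizon=1)    # next-step targets
X_cnn,  y_cnn  = make_windows(data, lookback=5, horizon=30)   # 30-step-ahead targets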

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation of the LSTM and the CNN once data containing all clogging labels are available.
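One way the LSTM could be extended to a longer horizon, sketched below under the assumption of a one-step model with a Keras-style predict method, is to feed each prediction back in as the newest input row; as noted in Section 5.3, errors then compound with every step.

# Recursive multi-step forecasting with a one-step model. `model` is assumed
# to map a (1, lookback, n_variables) window to the next (n_variables,) values.
import numpy as np

def recursive_forecast(model, window: np.ndarray, steps: int = 30) -> np.ndarray:
    window = window.copy()                            # shape: (lookback, n_variables)
    preds = []
    for _ in range(steps):
        next_vals = model.predict(window[np.newaxis, ...])[0]
        preds.append(next_vals)
        window = np.vstack([window[1:], next_vals])   # slide the window forward
    return np.array(preds)                            # shape: (steps, n_variables)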

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
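A hedged sketch of such a baseline is given below, using statsmodels on a univariate differential-pressure series; the series is a stand-in and the order (p, d, q) = (2, 1, 2) is a placeholder that would have to be identified from the actual data.

# ARIMA baseline sketch on a stand-in differential-pressure series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

diff_pressure = np.cumsum(np.random.normal(size=1000))   # stand-in 1 Hz series
model = ARIMA(diff_pressure, order=(2, 1, 2))             # placeholder order
fitted = model.fit()
forecast = fitted.forecast(steps=30)                      # 30-second-ahead forecast
print(forecast[:5])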

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if these methods are to be used in the BWTS.

Page 35: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

20 respectively The testing set was split into the same fractions but only thefraction of 20 was kept to equally match the size of the validation set

For classification the input data and output data were adjusted so that the inputdata only contained the values of the variables and the output data only containedthe clogging labels The adjustment learns the network that certain values of vari-ables correspond to a specific clogging label The classification CNN was trainedon the training data and validated on the validation set The data from the testingset were then fed through the network and compared to the validation set

33 Model evaluation

During training on both networks the weights in the layers are tweaked and opti-mized according to a loss function The loss function is selected to improve andevaluate the networks capabilities of achieving a high rate of classification or re-gression on certain variables In essence some loss functions are better suited forevaluating the networks than what they would be for a regression problem and viceversa This led to different loss functions being used when training the network forpredicting a future clogging labels than when training the network to predict futurevalues of system variables

For the regression analysis both MSE and MAE were used When using MSElarge errors would be more penalizing as they come from outliers and an overalllow MSE would indicate that the output is normally distributed based on the inputdata MAE would allow outliers to play a smaller role and produce a good MAEscore if the distribution is multimodal For a multimodal distribution a predictionat the mean of two modes would result in a bad score as is generated by the MSEwhile the MAE will allow for predictions at each individual mode To summariseMAE is more robust to outliers while MSE is more sensitive to outliers

For the clogging labels the network used a loss function for minimising the bi-nary cross-entropy (also known as log loss) As can be seen in Figure 37 identicalvalues of the same variable can belong to different clogging labels Therefore theloss function has to be able to deduce what clogging label a particular data pointbelongs to which is something that binary cross-entropy is capable of

30

34 HARDWARE SPECIFICATIONS

Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

34 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 38 An overview of the system and the location of the pressure transducers(PT) flow indicator transmitter (FIT) and flow meter (FM)

31

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual    Label 1     82        29
          Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual    Label 1     69        41
          Label 2     11        659

Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

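The training code is not shown in this chapter; purely as an illustration of how the two loss functions are swapped while the rest of the model is kept fixed, a Keras-style sketch could look as follows. The layer sizes, optimiser and the commented-out training call are assumptions, not the configuration used in the thesis.

    # Illustrative sketch only: the same LSTM regressor compiled once with MAE
    # and once with MSE as loss function. Layer sizes and optimiser are
    # assumptions and not taken from the thesis.
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_lstm(loss: str) -> keras.Model:
        model = keras.Sequential([
            keras.Input(shape=(5, 4)),   # 5 past time steps, 4 system variables
            layers.LSTM(64),
            layers.Dense(4),             # one-step-ahead prediction of the variables
        ])
        model.compile(optimizer="adam", loss=loss)
        return model

    model_mae = build_lstm("mae")  # network trained on the MAE loss function
    model_mse = build_lstm("mse")  # network trained on the MSE loss function
    # model_mae.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=750)
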
Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected as a model trained on the MSE loss is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

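To make the windowing explicit, the sketch below shows how such samples can be constructed: each input holds the variable values from the five previous time steps and is paired with the clogging label at the current step. The array names and sizes are placeholders, not the thesis pre-processing code.

    # Illustrative sketch: building (t-5 ... t-1) -> label(t) samples for the
    # LSTM classifier. 'values' and 'labels' stand in for the sensor variables
    # and clogging labels of one test cycle.
    import numpy as np

    def make_windows(values, labels, window=5):
        """Return X of shape (n, window, n_vars) and y with the label at each t."""
        X, y = [], []
        for t in range(window, len(values)):
            X.append(values[t - window:t])   # variables from t-5 to t-1
            y.append(labels[t])              # clogging label at time t
        return np.array(X), np.array(y)

    values = np.random.rand(100, 4)          # 100 time steps, 4 system variables
    labels = np.random.randint(1, 3, 100)    # clogging labels 1 or 2
    X, y = make_windows(values, labels)
    print(X.shape, y.shape)                  # (95, 5, 4) (95,)
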
5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with overall lower values on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

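As an illustration of the convolution and pooling feature mapping discussed above, a 1-D CNN that maps a window of past sensor values to a block of future values could be sketched as below in a Keras-style API; the window length, layer sizes and output horizon are assumptions rather than the architecture used in the thesis.

    # Illustrative sketch only: a small 1-D CNN predicting 30 future steps of
    # 4 system variables from 60 past steps. Not the thesis architecture.
    from tensorflow import keras
    from tensorflow.keras import layers

    horizon, n_vars = 30, 4
    model = keras.Sequential([
        keras.Input(shape=(60, n_vars)),           # 60 past time steps (assumed)
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),          # pooling smooths the extracted features
        layers.Flatten(),
        layers.Dense(horizon * n_vars),
        layers.Reshape((horizon, n_vars)),         # values up to 30 seconds ahead
    ])
    model.compile(optimizer="adam", loss="mae")
    model.summary()
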
5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting Label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

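To put a number on that observation, the short calculation below uses the class counts from Table 4.6 to estimate the accuracy of a trivial classifier that always predicts Label 2:

    # Majority-class baseline implied by the row sums of Table 4.6: a model
    # that always predicts Label 2 is correct for every actual Label 2 sample.
    label_1 = 82 + 29          # actual Label 1 samples
    label_2 = 38 + 631         # actual Label 2 samples
    baseline = label_2 / (label_1 + label_2)
    print(f"Always predicting Label 2: {baseline:.1%} accuracy")   # about 85.8%
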
However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to Label 1 and Label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN with data containing all clogging labels.

On the other hand, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

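As a starting point for such a comparison, an ARIMA baseline can be fitted to a single sensor series with statsmodels; the series and the (p, d, q) order below are placeholders, not values suggested by the thesis.

    # Illustrative sketch: an ARIMA baseline for a differential-pressure series.
    # The synthetic series and the (p, d, q) order are placeholders.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    dp = np.cumsum(np.random.normal(0.001, 0.01, 500))   # stand-in for measured differential pressure
    fitted = ARIMA(dp, order=(2, 1, 1)).fit()
    forecast = fitted.forecast(steps=30)                  # 30 steps ahead, cf. the CNN horizon
    print(forecast[:5])
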
Lastly, if these models are to be used in the BWTS, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time.

TRITA-ITM-EX 2019:606

www.kth.se

Page 36: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

34 HARDWARE SPECIFICATIONS

Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

34 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 38 An overview of the system and the location of the pressure transducers(PT) flow indicator transmitter (FIT) and flow meter (FM)

31

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However the CNN is still doing a good job at predicting future clogging even

43

CHAPTER 5 DISCUSSION amp CONCLUSION

up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

44

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 37: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 3 EXPERIMENTAL DEVELOPMENT

The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

32

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                 Label 1   Label 2
Actual  Label 1      109         1
        Label 2        3       669
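
The classification metrics in Tables 4.2 and 4.3 can likewise be computed from the true labels and the predicted label probabilities. The sketch below, using scikit-learn, is illustrative only; the label and probability vectors are placeholders.

    import numpy as np
    from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                                 log_loss, confusion_matrix)

    # Placeholder vectors: 1 and 2 are the clogging labels, probs are the
    # predicted probabilities of label 2 from the classification network.
    y_true = np.array([2, 2, 1, 2, 1, 2, 2, 2])
    probs = np.array([0.9, 0.8, 0.2, 0.7, 0.4, 0.95, 0.6, 0.85])
    y_pred = np.where(probs >= 0.5, 2, 1)

    print(accuracy_score(y_true, y_pred))
    print(roc_auc_score(y_true == 2, probs))                # ROC AUC on the binary indicator
    print(f1_score(y_true, y_pred, pos_label=2))
    print(log_loss(y_true == 2, probs))                     # binary cross-entropy / log-loss
    print(confusion_matrix(y_true, y_pred, labels=[1, 2]))  # rows: actual, columns: predicted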

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                 Label 1   Label 2
Actual  Label 1       82        29
        Label 2       38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                 Label 1   Label 2
Actual  Label 1       69        41
        Label 2       11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE network, while that of the MAE network appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although its validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which isn't surprising, as a regression model trained on MSE is particularly sensitive to outliers: because the error is squared, a few large deviations dominate the loss.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification accuracy of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be observed more clearly in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer step corresponds to one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
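
To make that caveat concrete, a one-step training sample can be thought of as the window of sensor variables from the previous five time steps paired with the clogging label at the current step. The sketch below shows one way such samples could be constructed; the function name, dummy data and array shapes are illustrative assumptions, not the thesis implementation.

    import numpy as np

    def make_one_step_samples(values, labels, lookback=5):
        # values: (n_timesteps, n_sensors) array of sensor readings
        # labels: (n_timesteps,) array of clogging labels (1 or 2)
        X, y = [], []
        for t in range(lookback, len(values)):
            X.append(values[t - lookback:t])   # variables from t-5 .. t-1
            y.append(labels[t])                # clogging label at time t
        return np.array(X), np.array(y)

    values = np.random.rand(100, 4)            # dummy data: 4 sensor channels
    labels = np.random.randint(1, 3, size=100)
    X, y = make_one_step_samples(values, labels)
    print(X.shape, y.shape)                    # (95, 5, 4) (95,)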

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the R2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as its validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions is impairing for a network using MAE as its loss function, or that the data become more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distribution.
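
The effect can be reproduced on a synthetic example: with a heavily imbalanced label distribution, a classifier leaning on the majority class can obtain a high F1-score for that class while its ROC AUC stays near chance. The numbers below are made up purely to illustrate this point.

    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score

    # Synthetic, heavily imbalanced ground truth: 90% of samples in class 1.
    y_true = np.array([1] * 90 + [0] * 10)

    # A degenerate classifier that outputs the majority class with constant confidence.
    scores = np.full(100, 0.9)
    y_pred = (scores >= 0.5).astype(int)

    print(f1_score(y_true, y_pred))       # roughly 0.95, driven by the majority class
    print(roc_auc_score(y_true, scores))  # 0.5, i.e. no ranking ability at all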

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly; with roughly 670 of the 780 validation samples carrying label 2, such a constant prediction would already reach about 86% accuracy. To ensure that the classification network is actually learning from the data, and that it isn't overfitting on one particular class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having dedicated datasets for label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, however, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
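
As an illustration of what such an integration could look like in operation, the sketch below keeps a rolling buffer of the most recent sensor samples, sized to the chosen look-back window, and asks a trained network for a clogging prediction once the buffer is full. The window length, feature count and `model.predict` call are assumptions made for this example, not a prescribed design.

    import numpy as np
    from collections import deque

    LOOKBACK = 5     # number of past time steps fed to the network (assumed)
    N_SENSORS = 4    # differential pressure, system pressure, system flow, backflush flow

    class CloggingPredictor:
        """Feed one scaled sensor sample per time step; returns the network's
        prediction once enough history has accumulated."""

        def __init__(self, model, lookback=LOOKBACK):
            self.model = model
            self.buffer = deque(maxlen=lookback)

        def update(self, sample):
            self.buffer.append(np.asarray(sample, dtype=float))
            if len(self.buffer) < self.buffer.maxlen:
                return None                                  # not enough history yet
            window = np.stack(self.buffer)[np.newaxis, ...]  # shape (1, lookback, n_sensors)
            return self.model.predict(window)                # e.g. clogging label probability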


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN with data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
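
A minimal baseline along those lines could be fitted per sensor channel with statsmodels; the series below is a placeholder for one channel (e.g. differential pressure) and the (p, d, q) order is an arbitrary assumption rather than a tuned choice.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Placeholder univariate series standing in for one sensor channel,
    # sampled once per second.
    series = np.cumsum(np.random.randn(500)) * 0.01

    # The order would in practice be chosen from ACF/PACF plots or by an
    # information-criterion search; (2, 1, 1) is only an example.
    fitted = ARIMA(series, order=(2, 1, 1)).fit()
    forecast = fitted.forecast(steps=30)   # 30-step-ahead forecast, cf. the CNN horizon
    print(forecast[:5])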

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.



TRITA-ITM-EX 2019:606

www.kth.se

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 38: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter

41 LSTM Performance

Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

Figure 41 MAE and MSE loss for the LSTM

33

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However the CNN is still doing a good job at predicting future clogging even

43

CHAPTER 5 DISCUSSION amp CONCLUSION

up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

44

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 39: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 4 RESULTS

Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

34

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Prediction
                    Label 1   Label 2
Actual   Label 1         82        29
         Label 2         38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Prediction
                    Label 1   Label 2
Actual   Label 1         69        41
         Label 2         11       659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE network while it appears to continue to decrease for the MAE network, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely as that regression model is particularly sensitive to outliers.
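For reference, the two loss functions compared here follow the standard definitions, with $y_i$ the actual value, $\hat{y}_i$ the predicted value and $n$ the number of samples:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$

Because MSE squares each residual, a single large deviation, such as the jump between two test cycles, contributes disproportionately to the loss, which is consistent with the outlier sensitivity noted above.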

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite the notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, between the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification accuracy of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be observed more clearly in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted far ahead of time, but it could prove useful for on-the-spot clogging labelling.
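To make the time-step bookkeeping concrete, the windowing described here can be sketched as below. This is only an illustration of the idea, not the thesis' actual pre-processing code; the array names and the one-sample-per-second assumption are placeholders. With horizon=1 the target is the very next step, as in the LSTM set-up, while a larger horizon (e.g. 30, as for the CNN discussed later) shifts the target further into the future:

```python
import numpy as np

def make_windows(data, labels, window=5, horizon=1):
    """Use the `window` previous time steps as input and take the variable
    values and clogging label `horizon` steps ahead as the targets."""
    X, y_vars, y_lab = [], [], []
    for t in range(window, len(data) - horizon + 1):
        X.append(data[t - window:t])            # variables from t-window .. t-1
        y_vars.append(data[t + horizon - 1])    # variable values horizon steps ahead
        y_lab.append(labels[t + horizon - 1])   # clogging label horizon steps ahead
    return np.array(X), np.array(y_vars), np.array(y_lab)

# Placeholder data: one row per second, one column per sensor variable
data = np.random.rand(200, 4)
labels = np.random.randint(1, 3, size=200)      # clogging labels 1 or 2
X_next, _, y_next = make_windows(data, labels, window=5, horizon=1)   # next-step targets
X_far, _, y_far = make_windows(data, labels, window=5, horizon=30)    # 30 s ahead
print(X_next.shape, X_far.shape)
```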

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with the MAE network achieving an overall lower score on all of the other metrics. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as its loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
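A quick baseline calculation using the counts in Table 4.6 illustrates the point: a trivial classifier that always outputs the majority class, Label 2, would already be correct for roughly 86% of the samples, so accuracy alone says little about what the network has learned. A minimal check, with the counts taken directly from the confusion matrix:

```python
# Actual class counts from the CNN confusion matrix (Table 4.6, MAE regression data)
actual_label_1 = 82 + 29     # samples whose true class is Label 1
actual_label_2 = 38 + 631    # samples whose true class is Label 2
total = actual_label_1 + actual_label_2

# Accuracy obtained by always predicting the majority class (Label 2)
baseline = actual_label_2 / total
print(f"majority-class baseline accuracy: {baseline:.1%}")   # about 85.8%
```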

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data that include all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.
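As a sketch of what such an approach could look like, assuming the statsmodels library and a synthetic stand-in series for the differential pressure (both the data and the ARIMA order below are placeholders, not results from the thesis):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for a differential pressure series sampled once per second
rng = np.random.default_rng(0)
dp = np.cumsum(rng.normal(0.01, 0.005, size=600))   # slowly rising trend plus noise

# Fit a simple ARIMA model and forecast the next 30 seconds; the (p, d, q)
# order is a placeholder and would have to be identified from the real data
model = ARIMA(dp, order=(2, 1, 1))
fitted = model.fit()
forecast = fitted.forecast(steps=30)
print(forecast[:5])
```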

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.




Page 40: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

41 LSTM PERFORMANCE

Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 41 Evaluation metrics for the LSTM during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

35

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However the CNN is still doing a good job at predicting future clogging even

43

CHAPTER 5 DISCUSSION amp CONCLUSION

up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

44

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 41: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 4 RESULTS

Table 42 Evaluation metrics for the LSTM during classification analysis

of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

Table 43 LSTM confusion matrix

PredictionLabel 1 Label 2

Act

ual Label 1 109 1

Label 2 3 669

42 CNN Performance

Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

Figure 47 MAE and MSE loss for the CNN

36

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
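
The pairing described above, a label at time t predicted from the variables at t-5 to t-1, amounts to a sliding-window transformation of the raw sensor log. The sketch below illustrates one way such windows could be built; the array names, the four-variable assumption and the dummy data are illustrative only and do not reproduce the exact pre-processing code of the thesis.

import numpy as np

def make_windows(variables, labels, window=5):
    """variables: (T, n_features) sensor samples; labels: (T,) clogging labels.
    Returns inputs of shape (T - window, window, n_features) and the label at time t."""
    X, y = [], []
    for t in range(window, len(variables)):
        X.append(variables[t - window:t])   # values from t-5 to t-1
        y.append(labels[t])                 # clogging label at time t
    return np.array(X), np.array(y)

# Example: 100 time steps of 4 sensor variables with labels 1 or 2.
vars_demo = np.random.rand(100, 4)
labels_demo = np.random.randint(1, 3, size=100)
X, y = make_windows(vars_demo, labels_demo)
print(X.shape, y.shape)   # (95, 5, 4) (95,)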

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
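
The imbalance argument can be made concrete with the counts in Table 4.6: 669 of the 780 samples belong to Label 2, so a classifier that always predicts Label 2 would already reach roughly 85.8% accuracy, which is the baseline against which the reported accuracies should be read. The snippet below computes that baseline and, as one possible mitigation (assumed here, not applied in the thesis), inverse-frequency class weights that could be passed to the loss function.

# Counts of actual samples per class, taken from Table 4.6 (MAE regression data).
counts = {1: 82 + 29, 2: 38 + 631}
total = sum(counts.values())

majority_baseline = max(counts.values()) / total
print(f"Always-predict-Label-2 accuracy: {majority_baseline:.3f}")   # ~0.858

# Inverse-frequency class weights (a balanced dataset would give 1.0 for each class).
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(class_weights)   # Label 1 is up-weighted, Label 2 down-weighted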

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. Still, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
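
As an illustration of how the prediction horizon drives the pre-processing, the routine below builds supervised (input window, target) pairs for an arbitrary horizon: one step ahead for an LSTM-style setup and 30 steps (roughly 30 seconds, assuming a 1 Hz sampling rate) ahead for a CNN-style setup. The window lengths and names are placeholders and do not reproduce the exact strategies proposed in the thesis.

import numpy as np

def make_supervised(series, window, horizon):
    """series: (T, n_features). Input = `window` past steps ending at t-1;
    target = the values `horizon` steps after the end of the window."""
    X, y = [], []
    for t in range(window, len(series) - horizon + 1):
        X.append(series[t - window:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

data = np.random.rand(500, 4)                                   # dummy sensor log
X_lstm, y_lstm = make_supervised(data, window=5, horizon=1)     # next-step target
X_cnn, y_cnn = make_supervised(data, window=30, horizon=30)     # ~30 s ahead target
print(X_lstm.shape, y_lstm.shape, X_cnn.shape, y_cnn.shape)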


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN once data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
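
If such a baseline were attempted, statsmodels provides a standard SARIMA implementation; the sketch below fits one to a dummy drifting signal and forecasts 30 steps ahead. The model order and the synthetic series are placeholders, since suitable orders would have to be identified from the autocorrelation structure of the real pressure data.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.01, 0.05, size=300))   # dummy slowly drifting signal

model = SARIMAX(series, order=(2, 1, 1))                # (p, d, q) chosen arbitrarily
fitted = model.fit(disp=False)
print(fitted.forecast(steps=30)[:5])                    # e.g. ~30 seconds ahead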

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.



TRITA-ITM-EX 2019:606

www.kth.se

Page 42: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

42 CNN PERFORMANCE

Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

37

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However the CNN is still doing a good job at predicting future clogging even

43

CHAPTER 5 DISCUSSION amp CONCLUSION

up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

44

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

45

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 43: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

CHAPTER 4 RESULTS

Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

Table 44 Evaluation metrics for the CNN during regression analysis

Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

38

42 CNN PERFORMANCE

Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

Table 45 Evaluation metrics for the CNN during classification analysis

Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 46 CNN confusion matrix for data from the MAE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 82 29

Label 2 38 631

Table 47 CNN confusion matrix for data from the MSE regression network

PredictionLabel 1 Label 2

Act

ual Label 1 69 41

Label 2 11 659

39

Chapter 5

Discussion amp Conclusion

This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

51 The LSTM Network

511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

41

CHAPTER 5 DISCUSSION amp CONCLUSION

while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

512 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

52 The CNN

521 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

42

52 THE CNN

is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

522 Classification Analysis

Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

However the CNN is still doing a good job at predicting future clogging even

43

CHAPTER 5 DISCUSSION amp CONCLUSION

up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data in which a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
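
As a concrete illustration of such a pre-processing strategy, the sliding-window split below turns a multivariate sensor log into past-window inputs and future-horizon targets. It is a generic sketch under assumed window lengths, not the code used in the thesis; make_windows, n_past and n_future are names introduced here, and the sampling rate and window sizes are placeholders.

    import numpy as np

    def make_windows(series, n_past, n_future):
        # Split a (time, features) array into inputs of the last n_past samples
        # and targets covering the next n_future samples.
        X, y = [], []
        for t in range(n_past, len(series) - n_future + 1):
            X.append(series[t - n_past:t])      # the n_past most recent samples
            y.append(series[t:t + n_future])    # the horizon to be predicted
        return np.array(X), np.array(y)

    # Example with synthetic stand-in data (1000 samples, 4 sensor channels):
    data = np.random.rand(1000, 4)
    X_short, y_short = make_windows(data, n_past=5, n_future=1)    # LSTM-style one-step target
    X_long, y_long = make_windows(data, n_past=30, n_future=30)    # CNN-style 30-step horizon

The same idea extends to the clogging labels, which simply become one of the columns, or a separate target array, aligned with the chosen horizon.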


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better at predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
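
As a sketch of what such a baseline could look like, the snippet below fits an ARIMA model to a differential-pressure series with statsmodels and forecasts a 30-step horizon, comparable to the CNN's. The series is synthetic stand-in data, and the (p, d, q) order is a placeholder that would need to be selected, e.g. by AIC or ACF/PACF inspection, on the real BWTS data.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic stand-in for a sampled differential-pressure signal.
    rng = np.random.default_rng(0)
    dp = pd.Series(np.cumsum(rng.normal(0.001, 0.01, 2000)))

    model = ARIMA(dp, order=(2, 1, 1))       # placeholder order
    fitted = model.fit()
    forecast = fitted.forecast(steps=30)     # 30-step-ahead prediction
    print(forecast.tail())

A SARIMA variant would follow the same pattern with a seasonal order added, should any periodic behaviour, for example from back-flushing cycles, be present in the data.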

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if these models are to be used in the BWTS.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography

The LSTM model trained with the MSE loss function, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.
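
To make the loss-function argument concrete, the following minimal sketch (with made-up residuals, not values from the thesis data) shows why a model fitted under MSE is pulled much harder by values outside the ordinary range than one fitted under MAE: squaring the residuals lets a single extreme sample dominate the average.

import numpy as np

# Hypothetical prediction errors: nine ordinary residuals and one extreme outlier.
residuals = np.array([0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1, -0.05, 3.0])

mae = np.mean(np.abs(residuals))   # grows linearly with the outlier
mse = np.mean(residuals ** 2)      # grows quadratically with the outlier
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")

# Dropping the outlier shows how much of each loss it alone accounts for.
print(f"MAE without outlier = {np.mean(np.abs(residuals[:-1])):.3f}")
print(f"MSE without outlier = {np.mean(residuals[:-1] ** 2):.3f}")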

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be observed more clearly in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each unit is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
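
As a rough illustration of that input-output relationship, the sketch below shows one way such windows can be constructed; the array names, the four-variable feature set and the five-step look-back are assumptions made for the example, not the exact pre-processing code used in this work.

import numpy as np

def make_classifier_windows(X, labels, lookback=5):
    """Pair each window of the previous `lookback` samples with the label at time t."""
    windows, targets = [], []
    for t in range(lookback, len(X)):
        windows.append(X[t - lookback:t])  # variable values from t-5 to t-1
        targets.append(labels[t])          # clogging label at time t
    return np.array(windows), np.array(targets)

# Stand-in data: 1000 time steps of 4 sensor variables and a clogging label per step.
X = np.random.rand(1000, 4)
labels = np.random.randint(0, 2, size=1000)
windows, targets = make_classifier_windows(X, labels)
print(windows.shape, targets.shape)  # (995, 5, 4) (995,)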

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with the MAE network also scoring lower on all of the other error metrics. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as its validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions is more impairing for a network using MAE as loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.
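
The role of the convolution and max pooling layers, and the 30-second output horizon, can be pictured with a small Keras-style sketch of such a network; the layer sizes, window length and feature count below are illustrative assumptions, not the architecture evaluated in the thesis.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

lookback, horizon, n_features = 30, 30, 4  # assumed input window and prediction horizon

# A 1D CNN mapping a window of past sensor values to the next `horizon` steps of one target.
model = keras.Sequential([
    layers.Conv1D(32, kernel_size=3, activation="relu", input_shape=(lookback, n_features)),
    layers.MaxPooling1D(pool_size=2),  # discards fine-grained oscillations in the feature maps
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(horizon),             # one output per future time step
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the other variant

# Dummy data, only to show the expected tensor shapes.
X = np.random.rand(256, lookback, n_features)
y = np.random.rand(256, horizon)
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
print(model.predict(X[:1]).shape)  # (1, 30); the last element is the 30-seconds-ahead value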

Looking at the predicted differential pressure of both models, the CNN trained with the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it is not overfitting for one certain class, a more balanced dataset would be required.
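
As a sanity check of the imbalance argument, a small sketch with made-up labels (not the thesis data) shows how a degenerate classifier that only ever predicts the majority label already scores well on accuracy and F1 while having no discriminative power at all, which is why a more balanced dataset is needed to judge what the network has actually learnt.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

rng = np.random.default_rng(0)
# Made-up imbalanced labels: roughly 90% "Label 2" (encoded 1) and 10% "Label 1" (encoded 0).
y_true = (rng.random(1000) < 0.9).astype(int)

# A classifier that always predicts the majority class with 90% confidence.
y_prob = np.full(y_true.shape, 0.9)
y_pred = np.ones_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))   # about 0.9 without learning anything
print("F1      :", f1_score(y_true, y_pred))         # also high, measured on the majority class
print("ROC AUC :", roc_auc_score(y_true, y_prob))    # 0.5, i.e. no separation between classes
print("log loss:", log_loss(y_true, y_prob))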

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
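
A compressed sketch of what such interval-dependent pre-processing can look like is given below; the function and the five- and thirty-step settings are illustrative stand-ins for the two strategies described in the thesis rather than a copy of them.

import numpy as np

def make_windows(X, lookback, horizon):
    """Turn a multivariate time series into (past window, future horizon) training pairs."""
    inputs, targets = [], []
    for t in range(lookback, len(X) - horizon + 1):
        inputs.append(X[t - lookback:t])   # the `lookback` past steps fed to the network
        targets.append(X[t:t + horizon])   # the `horizon` future steps to be predicted
    return np.array(inputs), np.array(targets)

series = np.random.rand(2000, 4)  # stand-in for the sampled sensor variables
X_short, y_short = make_windows(series, lookback=5, horizon=1)   # short-horizon set-up
X_long, y_long = make_windows(series, lookback=30, horizon=30)   # long-horizon set-up
print(X_short.shape, y_short.shape, X_long.shape, y_long.shape)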


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN with data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
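
As an indication of what such a baseline could look like, the sketch below fits a seasonal ARIMA model to a single simulated sensor channel with statsmodels; the order parameters and the series itself are placeholder assumptions, not values suggested by this work.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
# Stand-in for one sensor variable, e.g. differential pressure sampled once per second.
series = 1.0 + np.cumsum(rng.normal(0.0, 0.01, size=600))

# (p, d, q) and seasonal (P, D, Q, s) chosen arbitrarily for the illustration.
model = SARIMAX(series, order=(2, 1, 2), seasonal_order=(1, 0, 1, 60))
result = model.fit(disp=False)

forecast = result.forecast(steps=30)  # a 30-step-ahead forecast, comparable to the CNN horizon
print(forecast[:5])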

Lastly, the time criticality of the filter clogging classification, i.e. avoiding complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 49: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all tests were performed around the same system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different flow rates. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would show how model performance changes with water quality. As the presence of TSS strongly affects how quickly the filter clogs, an added sensor that measures TSS could provide the model with an independent parameter indicating that clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to examine how they perform with all clogging states present in the data, as well as with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it can predict multiple time steps ahead, as the CNN does. It would also be worthwhile to revisit the optimisation of the LSTM and the CNN once data containing all clogging labels is available.
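
As a rough illustration of what such multi-step prediction could look like, the sketch below recursively feeds a one-step LSTM regressor its own predictions. It is only a minimal example in Python/Keras; the window length, feature count and layer sizes are illustrative assumptions and not the configuration used in this work.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    WINDOW, N_FEATURES, HORIZON = 30, 4, 5   # illustrative sizes only

    # One-step regressor: maps a window of sensor samples to the next sample.
    model = Sequential([
        LSTM(64, input_shape=(WINDOW, N_FEATURES)),
        Dense(N_FEATURES),
    ])
    model.compile(optimizer="adam", loss="mse")

    def forecast(model, window, horizon=HORIZON):
        """Recursively feed each one-step prediction back into the input window."""
        window = window.copy()
        predictions = []
        for _ in range(horizon):
            next_sample = model.predict(window[np.newaxis], verbose=0)[0]
            predictions.append(next_sample)
            window = np.vstack([window[1:], next_sample])   # slide the window forward
        return np.array(predictions)

    # Example call with random data standing in for scaled sensor readings:
    dummy_window = np.random.rand(WINDOW, N_FEATURES).astype("float32")
    print(forecast(model, dummy_window).shape)   # -> (5, 4)

An alternative design would be to let the network output the whole horizon at once (a vector of horizon times features), which avoids accumulating feedback error at the cost of a larger output layer.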

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The advantage of these methods is that their behaviour is far better understood than that of ML models.
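
As a point of reference, a seasonal ARIMA forecast of a single clogging indicator could be set up as in the sketch below, using SARIMAX from statsmodels. The synthetic differential-pressure series and the chosen (p, d, q)(P, D, Q, s) orders are placeholder assumptions, not values validated in this thesis.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Synthetic stand-in for a sampled differential-pressure signal (1 Hz, 10 minutes).
    rng = np.random.default_rng(0)
    t = np.arange(600)
    dp = 0.01 * t + 0.5 * np.sin(2 * np.pi * t / 60) + rng.normal(0, 0.1, t.size)
    series = pd.Series(dp)

    # Seasonal ARIMA with a 60-sample seasonal period; the orders are placeholders.
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 60))
    result = model.fit(disp=False)

    # Forecast the next 30 samples; the trend can be compared against a clogging threshold.
    forecast = result.forecast(steps=30)
    print(forecast.tail())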

Lastly, if these models are to be used in the BWTS, the time criticality of classifying filter clogging before complete clogging occurs will have to be taken into consideration, both when choosing the type of statistical model, the type of network and its architecture, and the amount of data to be processed at a time.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 image set. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA TRITA-ITM-EX 2019:606

www.kth.se

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 50: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

Bibliography

[1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

[2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

[3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

[4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

[5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

[6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

[7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

[8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

[10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

47

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 51: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

BIBLIOGRAPHY

[11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

[12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

[13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

[14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

[15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

[16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

[17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

[18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

[19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

[21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

48

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 52: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

BIBLIOGRAPHY

[22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

[23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

[24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

[25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

[26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

[27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

[28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

[29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

[30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

[31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

[32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

[33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

49

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

[48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

[49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

[50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

51

TRITA TRITA-ITM-EX 2019606

wwwkthse

  • Introduction
    • Background
    • Problem Description
    • Purpose Definitions amp Research Questions
    • Scope and Delimitations
    • Method Description
      • Frame of Reference
        • Filtration amp Clogging Indicators
          • Basket Filter
          • Self-Cleaning Basket Filters
          • Manometer
          • The Clogging Phenomena
          • Physics-based Modelling
            • Predictive Analytics
              • Classification Error Metrics
              • Regression Error Metrics
              • Stochastic Time Series Models
                • Neural Networks
                  • Overview
                  • The Perceptron
                  • Activation functions
                  • Neural Network Architectures
                      • Experimental Development
                        • Data Gathering and Processing
                        • Model Generation
                          • Regression Processing with the LSTM Model
                          • Regression Processing with the CNN Model
                          • Label Classification
                            • Model evaluation
                            • Hardware Specifications
                              • Results
                                • LSTM Performance
                                • CNN Performance
                                  • Discussion amp Conclusion
                                    • The LSTM Network
                                      • Regression Analysis
                                      • Classification Analysis
                                        • The CNN
                                          • Regression Analysis
                                          • Classification Analysis
                                            • Comparison Between Both Networks
                                            • Conclusion
                                              • Future Work
                                              • Bibliography
Page 53: A Machine Learning Approach to Predictively Determine ...kth.diva-portal.org/smash/get/diva2:1371211/FULLTEXT01.pdfatt unders¨oka om maskininl¨arning genom neurala n¨atv¨ark kan

BIBLIOGRAPHY

models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

[34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

[35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

[36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

[37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

[38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

[40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

[41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

[42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

[43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

50

BIBLIOGRAPHY

[44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

[45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

[46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se
