
DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

A Machine Learning Approach to Predictively Determine Filter Clogging in a Ballast Water Treatment System

KRISTOFFER SLIWINSKI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT

Abstract

Since the introduction of the Ballast Water Management Convention, ballast water treatment systems are required on ships for processing the ballast water to avoid spreading bacteria or other microbes which can destroy foreign ecosystems. One way of pre-processing the water for treatment is by straining it through a filtration unit. When the filter mesh retains particles it begins to clog, and it can potentially clog rapidly if the concentration of particles in the water is high. Clogging jeopardises the safety of the system. This thesis aims at investigating whether machine learning through neural networks can be implemented with the system to predictively determine filter clogging, by investigating the suitability of two popular network structures for time series analysis.

The problem initially came down to determining different grades of clogging for the filter element, based on sampled sensor data from the ballast water treatment system. The data were then put through regression analysis in two neural networks for parameter prediction, one LSTM and one CNN. The LSTM predicted values of the variables and clogging labels for the next 5 seconds, and the CNN predicted values of the variables and clogging labels for the next 30 seconds. The predicted data were then verified through classification analysis by an LSTM network and a CNN.

The LSTM regression network achieved an r²-score of 0.981 and the LSTM classification network achieved a classification accuracy of 99.5%. The CNN regression network achieved an r²-score of 0.876 and the CNN classification network achieved a classification accuracy of 93.3%. The results show that ML can be used for identifying different grades of clogging, but that further research is required to determine whether all clogging states can be classified.

Sammanfattning

Since the Ballast Water Management Convention was introduced, ships have been required to use ballast water treatment systems to treat the ballast water in an effort to curb the spread of bacteria and other microbes that can be harmful to foreign ecosystems. One way of pre-treating the water before treatment is to let it pass through a filter. As the filter collects particles it begins to clog, and it can potentially clog rapidly when the concentration of particles in the water is high. Clogging can jeopardise the safety of the system. This degree project aims to investigate whether machine learning through neural networks can be implemented in the system to predictively determine the clogging grade of the filter, by investigating the suitability of two popular network structures for time series analysis.

The problem initially consisted of assessing different clogging grades for the filter element based on sampled sensor data from the ballast water treatment system. The data were then run through regression analysis in two neural networks, one LSTM and one CNN, in order to predict the parameters. The LSTM network estimated variable values and clogging grade for the coming 5 seconds, while the CNN estimated variable values and clogging grade for the coming 30 seconds. The estimated data were then verified through classification by an LSTM network and two CNNs.

The LSTM regression network achieved an r²-score of 0.981 and the LSTM classification network achieved a classification accuracy of 99.5%. The CNN regression network achieved an r²-score of 0.876 and the CNN classification network achieved a classification accuracy of 93.3%. The results show that ML can be used to identify different grades of clogging, but further research is required to determine whether all clogging stages can be classified.

Nomenclature

ARIMA Autoregressive Integrated Moving Average

AUC Area Under Curve

BWTS Ballast Water Treatment System

CNN Convolutional Neural Network

FOR Frame of Reference

LSTM Long Short Term Memory

ML Machine Learning

MAE Mean Absolute Error

MSE Mean Squared Error

NN Neural Network

ReLU Rectified Linear Unit

RMSE Root Mean Squared Error

TSS Total Suspended Solids

Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Purpose, Definitions & Research Questions
  1.4 Scope and Delimitations
  1.5 Method Description

2 Frame of Reference
  2.1 Filtration & Clogging Indicators
    2.1.1 Basket Filter
    2.1.2 Self-Cleaning Basket Filters
    2.1.3 Manometer
    2.1.4 The Clogging Phenomenon
    2.1.5 Physics-based Modelling
  2.2 Predictive Analytics
    2.2.1 Classification Error Metrics
    2.2.2 Regression Error Metrics
    2.2.3 Stochastic Time Series Models
  2.3 Neural Networks
    2.3.1 Overview
    2.3.2 The Perceptron
    2.3.3 Activation Functions
    2.3.4 Neural Network Architectures

3 Experimental Development
  3.1 Data Gathering and Processing
  3.2 Model Generation
    3.2.1 Regression Processing with the LSTM Model
    3.2.2 Regression Processing with the CNN Model
    3.2.3 Label Classification
  3.3 Model Evaluation
  3.4 Hardware Specifications

4 Results
  4.1 LSTM Performance
  4.2 CNN Performance

5 Discussion & Conclusion
  5.1 The LSTM Network
    5.1.1 Regression Analysis
    5.1.2 Classification Analysis
  5.2 The CNN
    5.2.1 Regression Analysis
    5.2.2 Classification Analysis
  5.3 Comparison Between Both Networks
  5.4 Conclusion

6 Future Work

Bibliography

Chapter 1

Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship under different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS) and a UV reactor for the main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting whether the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating the actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method that can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) that estimates the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further as:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two tasks. On one hand, it is an engineering task: investigating whether an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and a clear foundation for future research is provided. The methodology is visualized in Figure 1.1.

The basis of the methodology is the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade, or rate of clogging, of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


Chapter 2

Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on basket-type filters used for water filtration.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, as shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for the incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

As briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold pset.

When the measured differential pressure is greater than pset, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomenon

To predict the clogging phenomenon, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares many similarities with a basket filter in the sense that both remove particles from the supplied liquid in order to obtain a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in the incoming pressure pin

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With this classification logic established, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                            Unit
∆p         Pressure drop                          Pa
L          Total height of filter cake            m
Vs         Superficial (empty-tower) velocity     m/s
µ          Viscosity of the fluid                 kg/(m·s)
ε          Porosity of the filter cake            -
Dp         Diameter of the spherical particle     m
ρ          Density of the liquid                  kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations to the variables affect the final differential pressure.
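To make the effect of the variables in Table 2.1 tangible, the short Python sketch below evaluates Ergun's Equation 2.4; the numerical values are arbitrary placeholders, not measurements from the BWTS.

```python
def ergun_pressure_drop(V_s, mu, rho, L, D_p, eps):
    """Differential pressure over a filter cake according to Ergun (Eq. 2.4).

    V_s : superficial velocity [m/s]      mu  : fluid viscosity [kg/(m*s)]
    rho : fluid density [kg/m^3]          L   : cake height [m]
    D_p : particle diameter [m]           eps : cake porosity [-]
    """
    viscous = 150 * V_s * mu * (1 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial  # [Pa]

# Example with arbitrary values: a thicker cake (larger L) or a lower porosity
# (smaller eps) both increase the predicted pressure drop.
print(ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, L=2e-3,
                          D_p=50e-6, eps=0.4))
```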

2.2 Predictive Analytics

Using historical data to make predictions about future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outcomes, which can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                        Prediction
                        Positive               Negative
Actual    Positive      True Positive (TP)     False Negative (FN)
          Negative      False Positive (FP)    True Negative (TN)

The accuracy is defined as the percentage of instances where a sample is classified correctly and can be obtained, as done by Konig [18], through

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems in order to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. In ML literature, the true positive rate is commonly referred to as sensitivity, while specificity denotes the true negative rate; the false positive rate thus equals 1 − specificity. The two rates are given by Equations 2.6 and 2.7, respectively.

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

With the sensitivity on the y-axis and the false positive rate (1 − specificity) on the x-axis, the ROC plot is obtained: every correctly classified positive generates a step in the y-direction and every falsely classified positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a better performing model.

F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the average of the squared difference. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like the MSE, the RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a sufficient number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that the MSE works with squared errors while the MSPE considers the relative error [27]:

MSPE = \frac{100}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free, in contrast to MSE and RMSE, and bounded between −∞ and 1, so it does not matter whether the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms, or predictors, in the model. If added variables prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
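The regression metrics of this section can be collected in a short NumPy helper. Note that the sketch below computes r² in the common 1 − SSres/SStot form, which differs slightly from the squared-correlation form of Equation 2.18, and the example numbers are arbitrary.

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """Return the error metrics of section 2.2.2 for a model with k predictors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    err = y_true - y_pred

    mae  = np.mean(np.abs(err))                        # Eq. 2.12
    mse  = np.mean(err ** 2)                           # Eq. 2.13
    rmse = np.sqrt(mse)                                # Eq. 2.14
    mspe = 100.0 / n * np.sum((err / y_true) ** 2)     # Eq. 2.16
    mape = 100.0 / n * np.sum(np.abs(err / y_true))    # Eq. 2.17

    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                         # 1 - SSres/SStot form of r^2
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # Eq. 2.19

    return dict(MAE=mae, MSE=mse, RMSE=rmse, MSPE=mspe, MAPE=mape,
                r2=r2, r2_adj=r2_adj)

print(regression_metrics([3.0, 5.0, 2.5, 7.0], [2.8, 5.3, 2.9, 6.6], k=1))
```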

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, which is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The particular strength of ARIMA and SARIMA is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
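A minimal example of fitting an ARIMA model to a univariate series with statsmodels might look as follows; the model order (1, 1, 1) and the synthetic series are illustrative choices only, not recommendations for the BWTS data.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic univariate series standing in for e.g. the differential pressure.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.1, 1.0, size=200))  # non-stationary trend

# d = 1 differences the series once to remove the trend before fitting.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next 5 data points of the series.
print(fitted.forecast(steps=5))
```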

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w is the weight vector and b is the perceptron's individual bias.
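The perceptron rule of Equation 2.20 translates directly into code; the weights and bias below are arbitrary example values.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron output according to Eq. 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])         # binary inputs
w = np.array([0.7, -0.4, 0.2])  # weights: importance of each input
b = -0.5                        # bias: how easily the perceptron fires

print(perceptron(x, w, b))      # -> 1, since 0.7 + 0.2 - 0.5 > 0
```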

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function also fails to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j x_j + b    (2.22)


Only by using the sigmoid function as the activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its use as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot meaningfully process inputs that are negative or that approach zero, which is also known as the dying ReLU problem [34].

Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
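For reference, the activation functions discussed in this section can be written as one-liners and evaluated for a few sample inputs; β = 1 is used for Swish in this illustrative sketch.

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)       # perceptron step, Eq. 2.20

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # Eq. 2.21

def relu(x):
    return np.maximum(0.0, x)          # Eq. 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)       # Eq. 2.24

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, relu, swish):
    print(f.__name__, f(z))
```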

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only need to capture certain behaviour. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented above, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons allow the network to feed information from the previous pass of data back into the neurons themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information, as the weights become saturated over time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or by allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the LSTM block output at the previous time step h_{t−1}, the input at the current time step x_t, and the respective gate bias b_x:

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
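To make the gate equations concrete, the following NumPy sketch performs a single LSTM time step. Note that it follows the standard LSTM formulation and therefore also includes the candidate cell state and the cell update, which Equation 2.26 does not show explicitly; all weights are random placeholders, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the gate pre-activations."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W["i"] @ z + b["i"])   # input gate  (Eq. 2.26)
    o = sigmoid(W["o"] @ z + b["o"])   # output gate
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate
    g = np.tanh(W["g"] @ z + b["g"])   # candidate cell state (standard LSTM)
    c_t = f * c_prev + i * g           # updated memory cell
    h_t = o * np.tanh(c_t)             # block output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                  # e.g. 4 sensor values, 3 hidden units
W = {k: rng.normal(size=(n_hidden, n_hidden + n_in)) for k in "iofg"}
b = {k: np.zeros(n_hidden) for k in "iofg"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
x_t = rng.normal(size=n_in)
h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```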

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data is reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and additionally removes noise by reducing the dimensionality of the data. With average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
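A one-dimensional illustration of the two pooling variants, assuming a pool size of 2 and non-overlapping windows (a simplification for illustration, not the exact layer implementation used later):

```python
import numpy as np

def pool1d(x, size=2, mode="max"):
    """Non-overlapping 1-D pooling with the given pool size."""
    trimmed = x[: len(x) // size * size].reshape(-1, size)
    return trimmed.max(axis=1) if mode == "max" else trimmed.mean(axis=1)

feature_map = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(pool1d(feature_map, mode="max"))      # -> [0.9  0.3  0.8]
print(pool1d(feature_map, mode="average"))  # -> [0.5  0.25 0.6]
```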


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before the data are fed into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


Chapter 3

Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
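As an illustration of how such a labelling rule could be expressed in code, the sketch below assigns one of the three clogging labels to a window of samples; the threshold values and the way the trends are estimated are hypothetical placeholders, not the actual rules of the thesis script.

```python
import numpy as np

def clogging_label(dp, flow, dp_lin_slope=0.02, dp_exp_slope=0.2, flow_drop=0.3):
    """Assign a clogging label (1, 2 or 3) to one window of samples.

    dp, flow : arrays of differential pressure and system flow for the window.
    The three threshold arguments are illustrative, uncalibrated placeholders.
    """
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]        # linear trend of dp
    flow_rel_drop = (flow[0] - flow[-1]) / max(flow[0], 1e-9)  # relative flow loss

    if dp_slope > dp_exp_slope and flow_rel_drop > flow_drop:
        return 3   # rapid dp increase, drastic flow decrease -> fully clogged
    if dp_slope > dp_lin_slope:
        return 2   # steady dp increase, roughly constant flow -> moderate clogging
    return 1       # steady dp and flow -> no/little clogging
```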


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks and evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model that is forced to learn large weights, and thus in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1]          [1 0 0]
[2]    →     [0 1 0]
[3]          [0 0 1]

or

[red]           [1 0 0]
[blue]    →     [0 1 0]
[green]         [0 0 1]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
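Both transforms are readily available in scikit-learn; a minimal sketch of applying them to illustrative sensor data and clogging labels might look as follows (the thesis does not state which library was used, and the example values are placeholders).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative sensor matrix: [diff. pressure, system flow, system pressure]
X = np.array([[0.12, 230.0, 1.8],
              [0.35, 215.0, 1.9],
              [0.80, 180.0, 2.1]])
labels = np.array([[1], [2], [2]])               # clogging labels

scaler = MinMaxScaler()                          # Eq. 3.1
X_scaled = scaler.fit_transform(X)               # every feature now in [0, 1]

encoder = OneHotEncoder()
y_onehot = encoder.fit_transform(labels).toarray()  # 1 -> [1, 0], 2 -> [0, 1]

X_restored = scaler.inverse_transform(X_scaled)  # transform is easy to invert
print(X_scaled, y_onehot, X_restored, sep="\n")
```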

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
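A sequencing function of this kind can be implemented as a simple sliding window; the sketch below is an illustrative stand-in for the SF described above, using a window of 5 past time steps so that each sample covers 25 seconds and the target is the measurement 5 seconds ahead.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Turn a (samples, features) array into (samples, n_past, features) windows,
    with the value at the following time step as the prediction target."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the 25 s window of past measurements
        y.append(data[i])              # the measurement 5 s ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)          # 100 samples of 4 sensor variables
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)                # (95, 5, 4) (95, 4)
```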


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
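Assuming the models were built with Keras (the thesis does not list its code), the described LSTM regression network could be sketched roughly as follows; the optimizer, the loss and the choice to restore the best weights are assumptions rather than documented settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_timesteps, n_features = 5, 4   # 5 past time steps, 4 sensor variables (assumed)

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_timesteps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # one predicted parameter
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```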

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
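Under the same Keras assumption, the described CNN regression network could be sketched as follows; the ReLU activations in the convolutional and dense layers as well as the optimizer and loss are assumptions rather than documented settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_timesteps, n_features = 12, 4   # 60 s of past observations, 4 sensor variables (assumed)

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_timesteps, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                       # 6 predicted future observations (30 s)
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```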

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
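Illustratively, the adjustment amounts to using only the variable values as input and only the clogging label as output; in the sketch below, df is an assumed pandas DataFrame holding the sampled dataset and the column names are hypothetical placeholders.

# Hypothetical column layout of the dataset.
feature_cols = ["diff_pressure", "system_pressure", "system_flow", "backflush_flow"]

X = df[feature_cols].to_numpy()        # input: values of the variables
y = df["clogging_label"].to_numpy()    # output: clogging label (1 or 2)
y = (y == 2).astype(int)               # encode the two labels as 0/1 for binary cross-entropy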

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they often come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes results in a bad score with MSE, while MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.
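For reference, the regression metrics reported in Chapter 4 (MSE, RMSE, MAE and the r2-score) can be computed from true and predicted values with, for example, scikit-learn; this is a sketch assuming arrays y_true and y_pred are available.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)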

For the clogging labels, the networks used a loss function that minimises the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
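In a Keras setting, this corresponds to compiling the classification networks with the binary cross-entropy loss, for example as below (clf_model stands for either classifier; the optimizer and metric choice are assumptions).

# Labels are encoded as 0/1 and the classifier outputs one probability per sample.
clf_model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])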


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network for regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                  Prediction
                  Label 1   Label 2
Actual   Label 1  109       1
         Label 2  3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network for regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                  Prediction
                  Label 1   Label 2
Actual   Label 1  82        29
         Label 2  38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                  Prediction
                  Label 1   Label 2
Actual   Label 1  69        41
         Label 2  11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for MSE, while it appears to continue to decrease for MAE, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower (better) score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS have to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


    AbstractSince the introduction of the Ballast Water Management Convention ballast watertreatment systems are required to be used on ships for processing the ballast waterto avoid spreading bacteria or other microbes which can destroy foreign ecosystemsOne way of pre-processing the water for treatment is by straining the water througha filtration unit When the filter mesh retains particles it begins to clog and couldpotentially clog rapidly if the concentration of particles in the water is high Theclog jeopardises the system The thesis aims at investigating if machine learningthrough neural networks can be implemented with the system to predictively deter-mine filter clogging by investigating two popular network structures for time seriesanalysis

    The problem came down to initially determine different grades of clogging for the fil-ter element based on sampled sensor data from the ballast water treatment systemThe data were then put through regression analysis through two neural networksfor parameter prediction one LSTM and one CNN The LSTM predicted values ofvariable and clogging labels for the next 5 seconds and the CNN predicted values ofvariable and clogging labels for the next 30 seconds The predicted data were thenverified through classification analysis by an LSTM network and a CNN

    The LSTM regression network achieved an r2-score of 0981 and the LSTM classi-fication network achieved a classification accuracy of 995 The CNN regressionnetwork achieved an r2-score of 0876 and the CNN classification network achieved aclassification accuracy of 933 The results conclude that ML can be used for iden-tifying different grades of clogging but that further research is required to determineif all clogging states can be classified

    SammanfattningSedan Ballast Water Management-konventionen introducerades har fartyg behovtanvanda barlastvattensystem for att behandla barlastvattnet i ett forsok att hammaspridningen av bakterier och andra mikrober som kan vara farliga for frammandeecosystem Ett satt att forbehandla vattnet innan behandling ar genom att lata detpassera genom ett filter Nar filtret samlar pa sig partiklar borjar det att kloggaoch kan potentiellt klogga igen snabbt nar koncentrationen av partiklar i vattnet arhog Kloggning kan aventyra systemets sakerhet Det har examensarbetet amnaratt undersoka om maskininlarning genom neurala natvark kan implementeras i sys-temet for att prediktivt bestamma filtrets kloggningsgrad genom att undersokalampligheten hos tva populara natverksstrukturer for tidsserieanalys

    Problemet handlade initialt om att bedomma olika kloggningsgrader for filterele-mentet baserat pa samplade sensordata fran barlastvattensystemet Datan kordessedan for regressionsanalys genom tva neurala natverk ett av typen LSTM ochett av typen CNN for att prediktivt bestamma paramterarna LSTM-natvarketuppskattade variabelvarden och kloggningsgrad for de kommande 5 sekundrarnamedan CNNet uppskattade variabelvarden och kloggningsgrad for de kommande30 sekunderna Den uppskattade datan verifierades sedan genom klassificering avett LSTM natverk och tva CNN

    LSTM natverket for regression uppnadde ett r2-resultat pa 0981 och LSTM natver-ket for klassificering uppnadde en klassificeringsgrad pa 995 CNNet for regres-sion uppnadde ett r2-resultat pa 0876 och CNNet for klassificering uppnadde enklassificeringsgrad pa 933 Resultatet visar att ML kan anvandas for att identi-fiera olika kloggningsgrad men ytterligare forskning kravs for att bestamma om allakloggningsstadier kan klassificeras

    Nomenclature

    ARIMA Autoregressive Integrated Moving Average

    AUC Area Under Curve

    BWTS Ballast Water Treatment System

    CNN Convolutional Neural Network

    FOR Frame of Reference

    LSTM Long Short Term Memory

    ML Machine Learning

    MAE Mean Absolute Error

    MSE Mean Squared Error

    NN Neural Network

    ReLU Rectified Linear Unit

    RMSE Root Mean Squared Error

    TSS Total Suspended Solids

    Contents

    1 Introduction 111 Background 112 Problem Description 113 Purpose Definitions amp Research Questions 214 Scope and Delimitations 215 Method Description 3

    2 Frame of Reference 521 Filtration amp Clogging Indicators 5

    211 Basket Filter 5212 Self-Cleaning Basket Filters 6213 Manometer 7214 The Clogging Phenomena 8215 Physics-based Modelling 9

    22 Predictive Analytics 10221 Classification Error Metrics 11222 Regression Error Metrics 12223 Stochastic Time Series Models 14

    23 Neural Networks 15231 Overview 15232 The Perceptron 16233 Activation functions 16234 Neural Network Architectures 17

    3 Experimental Development 2331 Data Gathering and Processing 2332 Model Generation 26

    321 Regression Processing with the LSTM Model 27322 Regression Processing with the CNN Model 28323 Label Classification 29

    33 Model evaluation 3034 Hardware Specifications 31

    4 Results 3341 LSTM Performance 3342 CNN Performance 36

    5 Discussion amp Conclusion 4151 The LSTM Network 41

    511 Regression Analysis 41512 Classification Analysis 42

    52 The CNN 42521 Regression Analysis 42522 Classification Analysis 43

    53 Comparison Between Both Networks 4454 Conclusion 44

    6 Future Work 45

    Bibliography 47

    Chapter 1

    Introduction

    11 Background

    Ballast water tanks are used on ships to stabilize the ship for different shippingloads When a ship isnrsquot fully loaded or when a ship must run at a sufficient depthwater is pumped into the ballast water tanks through a water pumping system Topreserve existing ecosystems as well as preventing the spread of bacteria larvaeor other microbes ballast water management is regulated world-wide by the Inter-national Convention for the Control and Management of Shipsrsquo Ballast Water andSediments (BWM convention)

    PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by AlfaLaval that works as an extension to the shipsrsquo water pumping system The BWTSuses a filter for physical separation of organisms and total suspended solids (TSS)and a UV reactor for main treatment of the ballast water As PB3 can be installedon a variety of ships the BWTS must be able to process different waters underdifferent conditions to fulfill the requirements of the BWM convention

    12 Problem Description

    In the existing system there is currently no way of detecting if the filter in use isabout to clog A clogged filter forces a stop for the entire process and the onlyway to get the BWTS functional again is to physically remove the filter disas-semble it clean it reassemble it and put it back This cleaning process involvesrisks of damaging the filter is expensive to carry out and takes up unnecessary time

    Furthermore due to the different concentrations of the TSS in waters around theworld as well as different supply flows and supply pressures from various shippumping system the load on the BWTS may vary greatly This imposes problemsin estimating actual system health when using a static model as there are discrep-ancies introduced in the measured parameters

    1

    CHAPTER 1 INTRODUCTION

    These problems make it impossible to achieve optimal operability while ensuringthat the BWTS runs safely in every environment A desired solution is to developa method which can analyze and react to changes imposed to the system on thefly One such solution could be to use the large amounts of data generated by thesystem to create a neural network (NN) to estimate the state of the BWTS

    13 Purpose Definitions amp Research Questions

    The use of machine learning (ML) in this type of system is a new application of theotherwise popular statistical tool As there is no existing public information on howsuch an implementation can be done the focus of the thesis and the main researchquestion is

    bull To investigate and evaluate the possibility of using ML for predictively esti-mating filter clogging with focus on maritime systems

    An NN model will be developed and evaluated in terms of how accurately it canpredict the clogging of the filter in the BWTS using the systems sensor data Theimplementation of an NN into a system like this is the first of its kind and willrequire analysis and understanding of the system to ensure a working realizationTherefore with the BWTS in mind the original proposed research question can bespecified further to be

    bull How can an NN be integrated with a BWTS to estimate clogging of a basketfilter

    14 Scope and Delimitations

    In comparison to the initial thesis scope provided in Appendix A some delimita-tions have been made First and foremost the only part of interest of the BWTSis the filter When designing the NN the data used will be from in-house testingas well as from the cloud service Connectivity so no alternations to the existinghardware and software can be made For the purpose of focusing on the ML modeland the system sensors all data are assumed to be constantly available

    It is also not possible to test different kinds of ML methods or a plethora of NNswithin the time frame of the thesis For that reason the decision of developing anNN is based on its success in predictive analytics in other fields[1 2] and the frameof reference for deciding the most appropriate NN will depend on the characteristicsof the data and the future requirements of the NN

    2

    15 METHOD DESCRIPTION

    15 Method Description

    The task at hand is a combination of two On one hand it is an engineering taskinvestigating if an NN can be developed for the BWTS On the other hand it is ascientific task that focuses on applying and evaluating current research in the areasof NNs water filtration techniques time series analysis and predictive analytics tothe problem To complete both tasks within the time frame a methodology has tobe developed to ensure that the engineering task is done the research questions areanswered and that there is a clear foundation for future research The methodologyis visualized in Figure 11

    The basis of the methodology starts with the problem description The problemdescription makes way for establishing the first frame of reference (FOR) and rais-ing the initial potential research questions Following that a better understandingabout the research field in question and continuous discussions with Alfa Laval willhelp adapt the FOR further Working through this process iteratively a final framecan be established where the research area is clear allowing for finalization of theresearch questions

    With the frame of reference established the focus shifts to gathering informationthrough appropriate resources such as scientific articles and papers Interviews ex-isting documentation of the BWTS and future use of the NN also helps in narrowingdown the solution space to only contain relevant solutions that are optimal for thethesis With a smaller set of NNs to chose from the best suited network structurewill be developed tested and evaluated

    In preparation of constructing and implementing the NN the data have to beprocessed according to the needs of the selected network Pre-processing strate-gies on how datasets are best prepared for NN processing will be investigated andexecuted to ensure that a clear methodology for future processing is executed anddocumented For correctly classifying the current clogging grade or rate of cloggingof the filter state of the art research will be used as reference when commencingwith the labelling of the data

    When the classificationlabelling is done implementation through training and test-ing of the NN can begin Sequentially improvement to the NN structure will bemade by comparing the results of different initial weight conditions layer configu-rations and data partitions The experimental results will be graded in terms ofpredictive accuracy and the estimated errors of each individual parameter will betaken into consideration

    Lastly the validation process can begin to ensure that the initial requirementsfrom the problem description are either met or that they have been investigatedWith the results at hand a conclusion can be presented describing how the system

    3

    CHAPTER 1 INTRODUCTION

    can be adapted to detect clogging Suggestions on how the system can be furtherimproved upon and other future work will also be mentioned

    Figure 11 Proposed methodology for the thesis

    4

    Chapter 2

    Frame of Reference

    This chapter contains a state of the art review of existing technology and introducesthe reader to the science and terminology used throughout this thesis The systemand its components thatrsquos being used today is analysed and evaluated

    21 Filtration amp Clogging Indicators

    Filtration is the technique of separating particles from a mixture to obtain a filtrateIn water filtration the water is typically passed through a fine mesh strainer or aporous medium for the removal of total suspended solids TSS Removal of particlesin this fashion leads to the formation of a filter cake that diminishes the permeablecapability of the filter As the cake grows larger the water can eventually no longerpass and the filter ends up being clogged

    To better understand how the choice of filter impacts the filtration process andhow filter clogging can be modelled the following section explores research and lit-erature relevant to the BWTS Focus is on filters of the basket type and where thefiltration is done with regards to water

    211 Basket Filter

    A basket filter uses a cylindrical metal strainer located inside a pressure vessel forfiltering and is shown in Figure 21 The strainer is either composed of reinforcedwire mesh or perforated sheet metal which the liquid flows through Sometimes acombination of the two is used During filtration organisms and TSS accumulatein the basket strainer and can only be removed by physically removing the strainerand scraping off the particles using a scraper or a brush [3] An estimate of howmany particles that have accumulated in the filter can typically be obtained fromthe readings of a manometer which measures the differential pressure over the filter(see 213)

    5

    CHAPTER 2 FRAME OF REFERENCE

    Figure 21 An overview of a basket filter1

    The pressure vessel has one inlet for incoming water and one outlet for the filtrateThe pressure difference between the incoming and the outgoing water measures thedifferential pressure ∆p over the filter through two pressure transducers

    212 Self-Cleaning Basket Filters

    Basket filters also exist in the category of being self-cleaning A self-cleaning bas-ket filter features a backwashing (also referred to as backflush) mechanism whichautomatically cleans the filter avoiding the need of having to physically remove thefilter in order to clean it and is shown in Figure 22 The backwashing mechanismcomes with the inclusion of a rotary shaft through the center axis of the basket filterthat is connected to a motor for rotation of the shaft [3] The rotary shaft holdsa mouthpiece that is connected to a second outlet which allows for the removal ofparticles caught by the filter

    1Source httpwwwfilter-technicsbe

    6

    21 FILTRATION amp CLOGGING INDICATORS

    Figure 22 An overview of a basket filter with self-cleaning2

    The backwashing flow can either be controlled by a pump or be completely depen-dent on the existing overpressure in the pressure vessel which in turn depends onhow clogged the filter is For that latter case backwashing of the filter may only bedone when there is enough particles in the water so that the filter begins to clog

    213 Manometer

    Briefly mentioned in 211 the manometer is an analogue display pressure gaugethat shows the differential pressure over the filter The displayed value is the differ-ence of the pressure obtained by the transducers before and after the filter Eachfilter comes with an individually set threshold pset

    When the measured differential pressure is greater than pset the filter has to becleaned For a regular basket filter the operator or the service engineer has to keepan eye on the manometer during operation However for a self-cleaning basket filterthe pressure transducers are connected to an electric control system that switchesthe backwash on and off

    2Source httpwwwdirectindustrycom

    7

    CHAPTER 2 FRAME OF REFERENCE

    214 The Clogging PhenomenaTo predict the clogging phenomena some indicators of clogging have to be identifiedIndicators of clogging in fuel filters have been investigated and discussed in a seriesof papers by Eker et al [4ndash6] A fuel filter shares a lot of similarities with a basketfilter in the sense that they both remove particles in the supplied liquid in order toget a filtrate Two indicators were especially taken into consideration namely thedifferential pressure over the filter ∆p and the flow rate after the filter Q Theresults from the papers show that clogging of a filter occurs due to the following

    1 a steady increase in ∆p over time due to an increase over time in incomingpressure pin

    2 a decrease in Q as a result of an increase in ∆p

    These conclusions suggest that a modelling approach to identify clogging is possibleBy observing the two variables from the start of a pumping process the followingclogging states can be identified

    1 steady state ∆p and Qrarr Nolittle clogging

    2 linear increase in ∆p and steady Qrarr Moderate clogging

    3 exponential increase in ∆p and drastic decrease in Qrarr Fully clogged

    With the established logic of classification in place each individual pumping se-quence can be classified to begin generating a dataset containing the necessaryinformation

    Figure 23 Visualization of the clogging states3

    3Source Eker et al [6]

    8

    21 FILTRATION amp CLOGGING INDICATORS

    215 Physics-based Modelling

    The pressure drop over the filter has been identified as a key indicator of cloggingTo better understand what effect certain parameters have on the filter a model hasto be created Roussel et al [7] identify the filter clogging as a probability of thepresence of particles Furthermore they identify the clogging process as a functionof a set of variables the ratio of particle to mesh hole size the solid fraction andthe number of grains arriving at each mesh hole during one test

    Filtration of liquids and filter clogging have been tested for various types of flowsLaminar flow through a permeable medium has been investigated by Wakeman [8]and it can be described by Darcyrsquos equation [9] as

    QL = KA

    microL∆p (21)

    rewritten as

    ∆p = microL

    KAQL (22)

    A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation The equation was derived by Kozeny and Carman [10] and reads

    ∆p = kVsmicro

    Φ2D2p

    (1minus ε)2L

    ε3(23)

    Equation 23 is flawed in the sense that it does not take into account the inertialeffect in the flow This is considered by the later Ergun equation [11]

    ∆p = 150Vsmicro(1minus ε)2L

    D2pε

    3 + 175(1minus ε)ρV 2s L

    ε3Dp(24)

    where the first term in Equation 24 represents the viscous effects and the secondterm represents the inertial effect An explanation for the variables can be found inTable 21

    Table 21 Variable explanation for Ergunrsquos equation

    Variable Description Unit∆p Pressure drop PaL Total height of filter cake mVs Superficial (empty-tower) velocity msmicro Viscosity of the fluid kgmsε Porosity of the filter cake m2

    Dp Diameter of the spherical particle mρ Density of the liquid kgm3

    9

    CHAPTER 2 FRAME OF REFERENCE

    Comparing Darcyrsquos Equation 22 to Ergunrsquos Equation 24 the latter offers a deeperinsight in how alterations to variables affect the final differential pressure

    22 Predictive AnalyticsUsing historical data to make predictions of future events is a field known as pre-dictive analytics Predictive analytics research covers statistical methods and tech-niques from areas such as predictive modelling data mining and ML in order toanalyse current and past information to make predictions on future events Havingbeen applied to other areas such as credit scoring [12] healthcare [13] and retailing[14] a similar approach of prediction has also been investigated in predictive main-tenance [15ndash17]

    Predictive maintenance PdM includes methods and techniques that estimate anddetermine the condition of equipment or components to predict when maintenanceis required as opposed to traditional preventive maintenance which is based on theidea of performing routinely scheduled checks to avoid failures or breakdowns Thequality of predictive methods and algorithms is ensured by measuring the accuracyof the model in terms of correctly labelling the input data to its respective outputalso known as classification Every prediction comes with four possible outputs thatcan be visualised in a table also known as a confusion matrix as shown in Table22

    Table 22 Outputs of a confusion matrix

    PredictionPositive Negative

    Act

    ual Positive True Positive (TP) False Positive (FP)

    Negative False Negative (FN) True Negative (TN)

    The definition of the accuracy is the percentage of instances where a sample isclassified correctly and can be obtained as done by Konig [18]

    ACC =sumn

    i=1 jin

    where ji =

    1 if yi = yi

    0 if yi 6= yi

    (25)

    by comparing the actual value yi and the predicted value yi for a group of sam-ples n However by using the overall accuracy as an error metric two flaws mayarise Provost et al [19] argue that accuracy as an error metric and classificationtool assumes that the supplied data are the true class distribution data and thatthe penalty of misclassification is equal for all classes Same claims are backed bySpiegel et al [20] which presents that ignoring the severity of individual problemsto achieve higher accuracy of failure classification may have a direct negative impacton the economic cost of maintenance due to ignored FPs and FNs

    10

    22 PREDICTIVE ANALYTICS

    In order to better evaluate all data various error metrics have been developedThe various metrics can be placed in two different categories classification errormetrics and regression error metrics

    221 Classification Error Metrics

    Classification error metrics assume that the used input data are not optimised andthat basic classification assumptions are rarely true for real world problems

    Area Under Curve (AUC)

    AUC observes the area under a receiver operating characteristic curve also knownas a ROC curve A ROC curve measures the relationship between the true positiverate and the false positive rate and plots them against each other [18] True positiverate is in ML literature commonly referred to as sensitivity and the false positiverate is referred to as specificity Both rates are represented by Equations 26 and27 respectively

    sensitivity = TP

    TP + FN(26)

    specificity = TN

    TN + FP(27)

    The sensitivity on the y-axis and the specificity on the x-axis then give the AUCplot where every correctly classified true positive generates a step in the y-directionand every correctly classified false positive generates a step in the x-direction TheAUC curve area is limited by the range 0 to 1 where a higher value means a wellperforming model

    F1 Score

    The F1 score is a measurement to evaluate how many samples the classifier classifiescorrectly and how robust it is to not misclassify a number of samples [21] For F1score precision is referred to as the percentage of correctly classified samples andrecall is referred to as the percentage of actual correct classification [22] Precisionrecall and F1 score are obtained through

    precision = TP

    TP + FP(28)

    recall = TP

    TP + FN(29)

    F1 = 2times precisiontimes recallprecision+ recall

    (210)

    11

    CHAPTER 2 FRAME OF REFERENCE

    Higher precision but lower recall means a very accurate prediction but the classifierwould miss hard to instances that are difficult to classify F1 score attempts tobalance the precision and the recall and a higher F1 score means that the model isperforming very well The F1 score itself is limited by the range 0 to 1

    Logarithmic Loss (Log Loss)

    For multi-class classification Log Loss is especially useful as it penalises false clas-sification A lower value of Log Loss means an increase of classification accuracyfor the multi-class dataset The Log Loss is determined through a binary indicatory of whether the class label c is the correct classification for an observation o andthe probability p which is the modelrsquos predicted probability that an observation obelongs to the class c [23] The log loss can be calculated through

    LogLoss = minusMsum

    c=1yoclog(poc) (211)

    222 Regression Error Metrics

    Regression error metrics evaluate how well the predicted value matches the actualvalue based on the idea that there is a relationship or a pattern between the a setof inputs and an outcome

    Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While it gives no insight into whether the data are being over- or under-predicted, the MAE is still a good tool for overall model estimation. Mathematically the MAE is expressed as

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (2.12)$$

    Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than averaging the absolute difference between the predicted and the actual results, it averages the square of the difference. By using squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (2.13)$$

    Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. Taking the square root scales the error back to the same scale as the targets.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (2.14)$$

The major difference between MSE and RMSE lies in their gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score, as in Equation 2.15. This means that when using gradient-based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged.

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)$$

Just like the MSE, the RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all model errors, the RMSE is still valid and well protected against outliers given a large enough number of samples n.
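The three regression metrics discussed so far can be computed in a few lines; this sketch is illustrative only and the sample values are invented.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))            # Equation 2.12

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)             # Equation 2.13

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))                # Equation 2.14

y     = np.array([0.10, 0.12, 0.15, 0.40])       # hypothetical actual values
y_hat = np.array([0.11, 0.12, 0.14, 0.30])       # hypothetical predictions
print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat))
```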

    Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its target. The difference between MSE and MSPE is that the MSE works with squared errors while the MSPE considers the relative error [27].

$$MSPE = \frac{100\%}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \qquad (2.16)$$

    Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2.17)$$


    Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. It allows comparing how much better the model is than a constant baseline. In contrast to MSE and RMSE, r² is scale-free and bound between −∞ and 1, so regardless of whether the output values are large or small the score always falls within that range. A low r² score means that the model fits the data poorly.

$$r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^{2} \qquad (2.18)$$

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by the adjusted r².

    Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables that prove to be useless are added, the score decreases, while the score increases if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1-r^2)(n-1)}{n-k-1}\right] \qquad (2.19)$$

Adjusted r² can therefore accurately show the percentage of variation in the dependent variable that is explained by the independent variables. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
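A small sketch (not from the thesis) of r² and adjusted r²; it uses the common 1 − SS_res/SS_tot form of r², which is bounded by −∞ and 1 as described above, and invented sample values.

```python
import numpy as np

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(y, y_hat, k):
    """Equation 2.19 for n observations and k predictors."""
    n = len(y)
    return 1.0 - (1.0 - r2(y, y_hat)) * (n - 1) / (n - k - 1)

y     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
print(r2(y, y_hat), adjusted_r2(y, y_hat, k=2))
```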

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time in order to generate distributions of potential outcomes.

    Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive (AR) model and a moving average (MA) model. An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

    Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. that the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, which is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods, by differencing the log-transformed data, makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
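As a hedged sketch of the pre-processing and fitting steps described above (the thesis itself does not fit ARIMA models), the example below assumes the statsmodels package and a synthetic series; the chosen model orders are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

t = np.arange(200)
series = pd.Series(np.exp(0.01 * t) * (1.0 + 0.05 * np.random.randn(200)))

# Log transform stabilises the variance, differencing removes the trend
log_series = np.log(series)
stationary = log_series.diff().dropna()
print(stationary.head())

# Seasonal ARIMA on the log-transformed series; (p,d,q)(P,D,Q,s) chosen arbitrarily here
model = SARIMAX(log_series, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
result = model.fit(disp=False)
print(result.forecast(steps=5))      # five-step-ahead forecast of the log series
```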

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its task based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important that input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)$$

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
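A minimal sketch of the perceptron rule in Equation 2.20; the weights and bias below are arbitrary.

```python
import numpy as np

def perceptron(x, w, b):
    """Output 1 if w . x + b > 0, otherwise 0 (Equation 2.20)."""
    return int(np.dot(w, x) + b > 0)

w, b = np.array([0.7, -0.4]), -0.1
print(perceptron(np.array([1, 0]), w, b))   # 1, since 0.7 - 0.1 > 0
print(perceptron(np.array([0, 1]), w, b))   # 0, since -0.4 - 0.1 <= 0
```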

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something the step function is also unable to fulfil.

    Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)$$

for

$$z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)$$


Using the sigmoid function as activation function, outputs can be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

    Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

$$f(x) = x^{+} = \max(0, x) \qquad (2.23)$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it outputs zero for inputs that are negative or approach zero, which can lead to the dying ReLU problem [34].

    Swish Function

Ramachandran et al. [35] propose a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)$$

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
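The activation functions discussed in this section can be summarised in a few lines of code; this is an illustrative sketch rather than anything used in the thesis experiments.

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)            # perceptron step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)               # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)            # Equation 2.24, beta trainable or constant

z = np.linspace(-3.0, 3.0, 7)
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")
```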

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

    Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

    Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

$$f(x) = f^{(1)} + f^{(2)} + \cdots + f^{(n)} \qquad (2.25)$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. Adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in Section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons can feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix}0 & 0 & 1 & 1 & 0 & 0 & 0\end{bmatrix} \qquad x_2 = \begin{bmatrix}0 & 0 & 0 & 1 & 1 & 0 & 0\end{bmatrix}$$

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated in each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes extremely small, preventing the neuron from updating its weights. The result is a rapid loss of information as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

    Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). These gates allow the network to safeguard the information that passes through it, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t−1}), the input at the current time step (x_t) and the respective gate bias (b_x):

$$\begin{aligned}
i_t &= \sigma\left(\omega_i\left[h_{t-1}, x_t\right] + b_i\right) \\
o_t &= \sigma\left(\omega_o\left[h_{t-1}, x_t\right] + b_o\right) \\
f_t &= \sigma\left(\omega_f\left[h_{t-1}, x_t\right] + b_f\right)
\end{aligned} \qquad (2.26)$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it may be necessary to forget some of the characters from the previous chapter [43].
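A numerical sketch of the gate activations in Equation 2.26 (only the gates, not the full cell-state update); all weights, biases and dimensions below are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Input, output and forget gate activations for one time step."""
    hx = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ hx + b_i)
    o_t = sigmoid(w_o @ hx + b_o)
    f_t = sigmoid(w_f @ hx + b_f)
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=2), rng.normal(size=3)     # 2 hidden units, 3 features
weights = [rng.normal(size=(2, 5)) for _ in range(3)]
biases = [np.zeros(2) for _ in range(3)]
print(lstm_gates(h_prev, x_t, *weights, *biases))
```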

    Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music and raw speech [44].


    Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map
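The convolution, pooling and flattening steps sketched in the figures above can be mimicked with plain NumPy; this example is purely illustrative and uses an invented one-dimensional input.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a kernel over a 1-D input (no padding, stride 1)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window."""
    x = x[: len(x) - len(x) % pool_size]
    return x.reshape(-1, pool_size).max(axis=1)

x = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.8, 0.6])
feature_map = conv1d(x, np.array([1.0, 0.0, -1.0]))   # convolved feature
pooled = max_pool1d(feature_map)                       # reduced feature map
flat = pooled.ravel()                                  # flattening (trivial for 1-D data)
print(feature_map, pooled, flat, sep="\n")
```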


    Chapter 3

    Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
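The labelling rule could be expressed roughly as below; the thresholds are hypothetical, since the thesis assigns labels by visual inspection rather than fixed limits, and the input traces are invented.

```python
import numpy as np

def label_clogging(dp, flow, dp_slope_limit=0.001, flow_drop_limit=0.05):
    """Assign clogging labels 1-3 from differential pressure and system flow."""
    dp_slope = np.gradient(dp)                    # change in differential pressure
    flow_drop = (flow[0] - flow) / flow[0]        # relative loss of system flow
    labels = np.ones(len(dp), dtype=int)          # 1 = no clogging
    labels[dp_slope > dp_slope_limit] = 2         # 2 = steadily increasing pressure
    labels[(dp_slope > 10 * dp_slope_limit) & (flow_drop > flow_drop_limit)] = 3
    return labels

dp = np.linspace(0.1, 0.4, 50) ** 2               # dummy differential pressure trace
flow = np.linspace(250.0, 240.0, 50)              # dummy system flow trace
print(label_clogging(dp, flow))
```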


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as they cannot be entirely separated into two clusters. A summary of the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

$$\begin{bmatrix}1\\2\\3\end{bmatrix} \rightarrow \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix} \quad \text{or} \quad \begin{bmatrix}\text{red}\\ \text{blue}\\ \text{green}\end{bmatrix} \rightarrow \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the goal is to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] shows that the precision of one hot encoding is comparable to other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simple encoding techniques, but also that more sophisticated options are available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1, by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)$$

Using the min-max scaler to normalise the data is useful because it helps avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
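A brief sketch of both transforms using scikit-learn (assumed here; the thesis does not state which library was used), with invented labels and sensor values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

labels = np.array([[1], [2], [2], [1]])           # hypothetical clogging labels
features = np.array([[0.10, 250.0],               # hypothetical differential pressure, flow
                     [0.45, 240.0],
                     [0.60, 230.0],
                     [0.12, 252.0]])

onehot = OneHotEncoder().fit_transform(labels).toarray()   # one column per label value

scaler = MinMaxScaler()                            # Equation 3.1 applied per feature
scaled = scaler.fit_transform(features)
restored = scaler.inverse_transform(scaled)        # the transform is easy to invert

print(onehot, scaled, restored, sep="\n")
```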

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides how many past values should be matched to a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement at one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. Processing the data through the sequencing function therefore expands the set of features that correspond to one value according to the time window. The effect of the expansion of the features can be described by Equations 3.2 and 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \left[\,V_1(t),\ V_2(t),\ \ldots,\ V_{n-1}(t),\ V_n(t)\,\right] \qquad (3.2)$$

$$X(t) = \left[\,V_1(t-5),\ V_2(t-5),\ \ldots,\ V_{n-1}(t),\ V_n(t)\,\right] \qquad (3.3)$$
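The sequencing function itself is not listed in the thesis; a hypothetical equivalent that produces the (samples, time steps, features) shape described below could look as follows, using a dummy dataset.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Use the n_past previous rows of every variable to predict the next row."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i, :])     # window of past observations
        y.append(data[i, :])                # value one time step (5 s) ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)               # 100 time steps, 4 sensor variables
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)                     # (95, 5, 4) (95, 4): samples, time steps, features
```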


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network's training in such a way ensures that the network is not overfitted to the training data.
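A Keras sketch of the LSTM regression network described above; the optimizer, the number of input features and the commented-out training call are assumptions, since only the layer sizes, activations and early-stopping patience are given in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 5, 8        # 5 past time steps; feature count assumed for the sketch

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_steps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")         # MAE or MSE, as in the regression analysis

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```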

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window of past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what the correct output is for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
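A corresponding Keras sketch of the CNN described above; the convolutional activation, optimizer and feature count are assumptions not stated in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps_in, n_features, n_steps_out = 12, 8, 6      # 12 past observations in, 6 predictions out

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_steps_in, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(n_steps_out),                      # one output per future time step
])
model.compile(optimizer="adam", loss="mae")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```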

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating classification than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes results in a bad score with the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


    Chapter 4

    Results

    This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM

Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function

Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM

Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                   Prediction
                   Label 1   Label 2
Actual   Label 1   109       1
         Label 2   3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN

Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function

Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from the MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                   Prediction
                   Label 1   Label 2
Actual   Label 1   82        29
         Label 2   38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                   Prediction
                   Label 1   Label 2
Actual   Label 1   69        41
         Label 2   11        659


    Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, which is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r²-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and the F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


    Chapter 6

    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data covering all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


    Bibliography

    [1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

    [2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

    [3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

    [4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

    [5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

    [6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

    [7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

    [8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

[9] Wikipedia. Darcy's law. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

    [12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

    [13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

    [14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

    [15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

    [16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

    [17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

    [18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

    [19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

    [20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

    [21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1_score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint: abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby, Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valérie Bourdès, Stéphane Bonnevay, P.J.G. Lisboa, Rémy Defrance, David Pérol, Sylvie Chabaud, Thomas Bachelot, Thérèse Gargi, and Sylvie Négrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, vol. 2010, Article ID 309841, 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 2017.


TRITA-ITM-EX 2019:606

www.kth.se



      Nomenclature

      ARIMA Autoregressive Integrated Moving Average

      AUC Area Under Curve

      BWTS Ballast Water Treatment System

      CNN Convolutional Neural Network

      FOR Frame of Reference

      LSTM Long Short Term Memory

      ML Machine Learning

      MAE Mean Absolute Error

      MSE Mean Squared Error

      NN Neural Network

      ReLU Rectified Linear Unit

      RMSE Root Mean Squared Error

      TSS Total Suspended Solids

Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Purpose, Definitions & Research Questions
  1.4 Scope and Delimitations
  1.5 Method Description

2 Frame of Reference
  2.1 Filtration & Clogging Indicators
    2.1.1 Basket Filter
    2.1.2 Self-Cleaning Basket Filters
    2.1.3 Manometer
    2.1.4 The Clogging Phenomena
    2.1.5 Physics-based Modelling
  2.2 Predictive Analytics
    2.2.1 Classification Error Metrics
    2.2.2 Regression Error Metrics
    2.2.3 Stochastic Time Series Models
  2.3 Neural Networks
    2.3.1 Overview
    2.3.2 The Perceptron
    2.3.3 Activation functions
    2.3.4 Neural Network Architectures

3 Experimental Development
  3.1 Data Gathering and Processing
  3.2 Model Generation
    3.2.1 Regression Processing with the LSTM Model
    3.2.2 Regression Processing with the CNN Model
    3.2.3 Label Classification
  3.3 Model evaluation
  3.4 Hardware Specifications

4 Results
  4.1 LSTM Performance
  4.2 CNN Performance

5 Discussion & Conclusion
  5.1 The LSTM Network
    5.1.1 Regression Analysis
    5.1.2 Classification Analysis
  5.2 The CNN
    5.2.1 Regression Analysis
    5.2.2 Classification Analysis
  5.3 Comparison Between Both Networks
  5.4 Conclusion

6 Future Work

Bibliography

      Chapter 1

      Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ship's water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS) and a UV reactor for the main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as there are discrepancies introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


      Chapter 2

      Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on cases where the filtration is done with regard to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, as shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure Δp over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, Δp, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in Δp over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in Δp

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state Δp and Q → no/little clogging

2. linear increase in Δp and steady Q → moderate clogging

3. exponential increase in Δp and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

    Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

    \Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

    \Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]

    \Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

    Variable   Description                           Unit
    Δp         Pressure drop                         Pa
    L          Total height of filter cake           m
    V_s        Superficial (empty-tower) velocity    m/s
    μ          Viscosity of the fluid                kg/(m·s)
    ε          Porosity of the filter cake           -
    D_p        Diameter of the spherical particle    m
    ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
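As an illustration of how Equation 2.4 can be evaluated numerically, the short sketch below computes the two terms of Ergun's equation. It is not taken from the thesis, and all numerical values are assumptions chosen only to show the viscous and inertial contributions.

    # Illustrative sketch (not from the thesis): evaluating Ergun's equation (2.4)
    # for a hypothetical filter cake. All numerical values are assumptions.
    def ergun_pressure_drop(V_s, L, D_p, eps, mu=1.0e-3, rho=1000.0):
        """Return the pressure drop (Pa) over a filter cake according to Ergun."""
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    if __name__ == "__main__":
        # Hypothetical values: 1 mm cake, 0.1 mm particles, porosity 0.4, 0.05 m/s flow
        dp = ergun_pressure_drop(V_s=0.05, L=1e-3, D_p=1e-4, eps=0.4)
        print(f"Estimated pressure drop: {dp:.1f} Pa")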

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                            Prediction
                            Positive               Negative
    Actual    Positive      True Positive (TP)     False Negative (FN)
              Negative      False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]

    ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric gives rise to two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.
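To make the counting in Table 2.2 and Equation 2.5 concrete, the following minimal sketch computes the confusion-matrix entries and the accuracy for a binary label vector. It is illustrative only; the example arrays are made up.

    import numpy as np

    # Illustrative sketch: confusion-matrix counts and accuracy (Equation 2.5)
    # for binary labels. The example arrays are made up for demonstration only.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    accuracy = (tp + tn) / len(y_true)   # fraction of correctly classified samples
    print(tp, fp, fn, tn, accuracy)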


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

      Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

    \text{sensitivity} = \frac{TP}{TP + FN}    (2.6)

    \text{specificity} = \frac{TN}{TN + FP}    (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the ROC plot, where every correctly classified true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

      F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of predicted positives that are classified correctly, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

    \text{precision} = \frac{TP}{TP + FP}    (2.8)

    \text{recall} = \frac{TP}{TP + FN}    (2.9)

    F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

      Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that the observation o belongs to class c [23]. The Log Loss can be calculated through

    LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
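The classification metrics above are available in scikit-learn; the minimal sketch below is illustrative only, with made-up labels and predicted probabilities, and is not the evaluation code used in the thesis.

    from sklearn.metrics import precision_score, recall_score, f1_score, log_loss

    # Illustrative sketch: classification error metrics from section 2.2.1,
    # computed with scikit-learn on made-up labels and predicted probabilities.
    y_true = [0, 1, 1, 0, 1, 1, 0]
    y_pred = [0, 1, 0, 0, 1, 1, 1]
    # Predicted probabilities for the two classes; each row sums to 1
    y_prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3],
              [0.1, 0.9], [0.3, 0.7], [0.4, 0.6]]

    print("precision:", precision_score(y_true, y_pred))   # Equation 2.8
    print("recall   :", recall_score(y_true, y_pred))      # Equation 2.9
    print("F1       :", f1_score(y_true, y_pred))          # Equation 2.10
    print("log loss :", log_loss(y_true, y_prob))          # Equation 2.11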

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

    MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

      Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through

    MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

      Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

    RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

    \frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers given a sufficient number of samples n.

      Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective squared target. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

    MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

      Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

    MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free, in comparison to MSE and RMSE, and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

    r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r²-score will always increase simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

    r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
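As a concrete illustration of the regression metrics above, the sketch below computes MAE, MSE, RMSE, MAPE and r² for made-up values. Note that scikit-learn's r2_score uses the 1 − SS_res/SS_tot formulation, which may differ from the correlation-based form in Equation 2.18; the example is illustrative only.

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    # Illustrative sketch: regression error metrics from section 2.2.2 on made-up data.
    y_true = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
    y_pred = np.array([2.2, 3.1, 4.3, 5.0, 7.4])

    mae = mean_absolute_error(y_true, y_pred)                    # Equation 2.12
    mse = mean_squared_error(y_true, y_pred)                     # Equation 2.13
    rmse = np.sqrt(mse)                                          # Equation 2.14
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))   # Equation 2.17
    r2 = r2_score(y_true, y_pred)                                # coefficient of determination
    print(mae, mse, rmse, mape, r2)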

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

      Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

      Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
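For orientation only, the following minimal sketch shows how a SARIMA model could be fitted to a univariate series with the statsmodels library. The series and the (p, d, q) and seasonal (P, D, Q, s) orders are arbitrary assumptions, not values from the thesis.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Illustrative sketch: fitting a SARIMA model to a made-up univariate series.
    rng = np.random.default_rng(0)
    series = np.cumsum(rng.normal(size=200))  # a simple non-stationary series

    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
    result = model.fit(disp=False)
    forecast = result.forecast(steps=10)  # predict the next 10 data points
    print(forecast)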

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

    \text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
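A minimal numpy sketch of the perceptron rule in Equation 2.20 follows; the weights, bias and inputs are made up for illustration and are not part of the thesis.

    import numpy as np

    # Illustrative sketch: a single perceptron implementing Equation 2.20.
    def perceptron(x, w, b):
        """Return 1 if the weighted sum plus bias is positive, otherwise 0."""
        return 1 if np.dot(w, x) + b > 0 else 0

    # Made-up weights, bias and binary inputs for demonstration.
    w = np.array([0.7, -0.4, 0.2])
    b = -0.1
    print(perceptron(np.array([1, 0, 1]), w, b))  # -> 1
    print(perceptron(np.array([0, 1, 0]), w, b))  # -> 0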

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable; thus, the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

      Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

    f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

    z = \sum_{j} w_j x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

      Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

    f(x) = x^+ = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, which is also known as the dying ReLU problem [34].

      Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

    f(x) = x \cdot \text{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on the widely used ImageNet dataset by 0.9% for the Mobile NASNet-A model and by 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
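The three activation functions above can be written in a few lines of numpy; the sketch below is illustrative only and not part of the thesis.

    import numpy as np

    # Illustrative sketch: the activation functions from section 2.3.3.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))        # Equation 2.21

    def relu(x):
        return np.maximum(0.0, x)              # Equation 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)           # Equation 2.24, beta as a constant here

    z = np.linspace(-3, 3, 7)
    print(sigmoid(z))
    print(relu(z))
    print(swish(z))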

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

      Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

      Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

    f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and they all together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

    x_1 = [0\; 0\; 1\; 1\; 0\; 0\; 0] \qquad x_2 = [0\; 0\; 0\; 1\; 1\; 0\; 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

      Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

    i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
    o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
    f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
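To make Equation 2.26 concrete, the sketch below computes the three gate activations for a single time step. It is illustrative only; the dimensions, weights and inputs are random and are not taken from the thesis.

    import numpy as np

    # Illustrative sketch: gate activations of an LSTM cell for one time step
    # (Equation 2.26). Weights and inputs are random and purely for demonstration.
    rng = np.random.default_rng(1)
    n_hidden, n_input = 4, 3

    h_prev = rng.normal(size=n_hidden)           # previous block output h_{t-1}
    x_t = rng.normal(size=n_input)               # current input x_t
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    W_i, W_o, W_f = (rng.normal(size=(n_hidden, n_hidden + n_input)) for _ in range(3))
    b_i = b_o = b_f = np.zeros(n_hidden)

    i_t = sigmoid(W_i @ z + b_i)   # input gate
    o_t = sigmoid(W_o @ z + b_o)   # output gate
    f_t = sigmoid(W_f @ z + b_f)   # forget gate
    print(i_t, o_t, f_t)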

      Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is being kept from the past state and how much information is being let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


      Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
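As an illustration of the layer sequence described above (convolution, pooling, flattening, dense neurons), the following minimal Keras sketch builds a small 1D CNN. The layer sizes and the input shape (30 time steps, 4 sensor channels) are arbitrary assumptions and not the network used in the thesis.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    # Illustrative sketch: a small 1D CNN with the layers described in this section.
    model = Sequential([
        Conv1D(filters=16, kernel_size=3, activation="relu", input_shape=(30, 4)),
        MaxPooling1D(pool_size=2),   # max pooling reduces the convolved feature
        Flatten(),                   # flattening layer before the dense neurons
        Dense(32, activation="relu"),
        Dense(1),                    # e.g. a single regression output
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()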


      Chapter 3

      Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
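The thesis does not list the labelling script itself, but a rule of this kind could look roughly like the sketch below. The column names and the slope and flow thresholds are hypothetical assumptions used only to illustrate the idea; the actual labelling was verified by visual inspection.

    import numpy as np
    import pandas as pd

    # Hypothetical sketch of a labelling rule for the clogging states in section 2.1.4.
    # Column names ("diff_pressure", "system_flow") and thresholds are assumptions.
    def label_clogging(df, window=12, slope_threshold=0.002, flow_drop=0.2):
        dp_slope = df["diff_pressure"].diff().rolling(window).mean()
        flow_change = df["system_flow"].pct_change().rolling(window).mean()

        labels = np.ones(len(df), dtype=int)                 # 1: no/little clogging
        labels[dp_slope > slope_threshold] = 2               # 2: moderate clogging
        labels[(dp_slope > 5 * slope_threshold) & (flow_change < -flow_drop)] = 3
        return labels                                        # 3: fully clogged

    # df = pd.read_csv("test_cycle.csv")                     # hypothetical input file
    # df["clogging_label"] = label_clogging(df)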


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data are clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

      Table 31 Amount of data available after preprocessing

Test     Samples   Points labelled clog-1   Points labelled clog-2
I          685            685                        0
II         220             25                      195
III        340             35                      305
IV         210             11                      199
V          375             32                      343
VI         355              7                      348
VII        360             78                      282
VIII       345             19                      326
IX         350             10                      340
X          335             67                      268
XI         340             43                      297

Total     3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

    \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] has shown the precision of one-hot encoding to be equal to that of other equally simple encoding techniques. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

    \frac{x_i - \min(x)}{\max(x) - \min(x)} \quad (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
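As a concrete illustration of the two transforms, the sketch below applies one-hot encoding and min-max scaling with scikit-learn. The thesis does not state which library was used, so the API choice and the small example matrix are assumptions for illustration only.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Hypothetical sensor rows: [diff_pressure, system_flow, system_pressure, backflush_flow]
    X = np.array([[0.12, 270.0, 1.8, 0.9],
                  [0.35, 265.0, 1.9, 1.1],
                  [0.60, 255.0, 2.0, 1.2]])
    clog_labels = np.array([[1], [2], [2]])        # clogging labels as a column

    scaler = MinMaxScaler(feature_range=(0, 1))    # applies Equation 3.1 per feature
    X_scaled = scaler.fit_transform(X)

    encoder = OneHotEncoder()                      # 1 -> [1, 0], 2 -> [0, 1]
    y_onehot = encoder.fit_transform(clog_labels).toarray()

    # The min-max transform is invertible, so predictions can be mapped back
    X_restored = scaler.inverse_transform(X_scaled)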

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

    X(t) = \left[ V_1(t), \; V_2(t), \; \dots, \; V_{n-1}(t), \; V_n(t) \right] \quad (3.2)

    X(t) = \left[ V_1(t-5), \; V_2(t-5), \; \dots, \; V_{n-1}(t), \; V_n(t) \right] \quad (3.3)
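The sequencing function is not listed in the thesis; a minimal sketch of the windowing it describes (5 past time steps, i.e. 25 seconds, used to predict the next 5-second step) could look as follows. The function name and the assumption that the clogging label is simply another column in the data are illustrative.

    import numpy as np

    def make_sequences(data, n_past=5):
        # data: 2-D array of shape (samples, features), one row per 5 s sample
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i, :])   # the n_past previous time steps
            y.append(data[i, :])              # the values at the predicted time step
        return np.array(X), np.array(y)       # X: (samples, n_past, features)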


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM (a minimal example follows the list below). The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step
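Continuing the sketch above, the 80/20 split and the three-dimensional shape expected by the LSTM could be obtained as follows (again an illustration, not the thesis code):

    # X, y produced by make_sequences(); X already has shape (samples, time_steps, features)
    split = int(0.8 * len(X))
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]

    # A flat array would instead be reshaped explicitly:
    # X = X.reshape((n_samples, n_time_steps, n_features))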

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way helps ensure that the network is not overfitted to the training data.
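The thesis does not state which deep-learning framework was used; the sketch below shows how the described architecture (two LSTM layers of 32 neurons with ReLU, a single sigmoid output neuron, 1500 epochs with an early stop after 150 epochs without improvement) could be expressed, assuming Keras. The Adam optimiser is an assumption, since no optimiser is named, and X_train/y_train are taken from the earlier sequencing sketch.

    from tensorflow import keras
    from tensorflow.keras import layers

    n_past, n_features = X_train.shape[1], X_train.shape[2]

    model = keras.Sequential([
        layers.LSTM(32, activation="relu", return_sequences=True,
                    input_shape=(n_past, n_features)),
        layers.LSTM(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # one neuron for parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")     # "mse" is the alternative loss used

    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
    # One parameter is predicted at a time; column 0 is used here for illustration
    model.fit(X_train, y_train[:, 0], validation_data=(X_val, y_val[:, 0]),
              epochs=1500, callbacks=[early_stop])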

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
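A sketch of the sequence splitting function described here (12 past observations in, 6 future observations out) is given below; the implementation details are assumptions, since the thesis only describes the function's behaviour.

    import numpy as np

    def split_sequences(data, n_in=12, n_out=6, target_col=0):
        # data: (samples, features); target_col: index of the variable to predict
        X, y = [], []
        for i in range(len(data) - n_in - n_out + 1):
            X.append(data[i:i + n_in, :])                          # past 60 s of all variables
            y.append(data[i + n_in:i + n_in + n_out, target_col])  # next 30 s of one variable
        return np.array(X), np.array(y)   # X: (samples, 12, features), y: (samples, 6)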

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
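As with the LSTM, the framework is not stated; assuming Keras, the described CNN (64 filters of kernel size 4, max pooling with pool size 2, a flattening layer, a 50-node dense layer and a 6-node output layer, trained with the same early-stopping rule) could be sketched as below. The ReLU activations in the hidden layers and the Adam optimiser are assumptions, and X_train/y_train are assumed to come from split_sequences() and the 80/20 split.

    from tensorflow import keras
    from tensorflow.keras import layers

    n_in, n_features, n_out = 12, X_train.shape[2], 6

    model = keras.Sequential([
        layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                      input_shape=(n_in, n_features)),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(50, activation="relu"),
        layers.Dense(n_out),                       # 6 predicted values, 30 s ahead
    ])
    model.compile(optimizer="adam", loss="mae")    # or "mse"

    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=1500, callbacks=[early_stop])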

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided in the network directly which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. This adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
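A minimal sketch of this input/output adjustment, with an assumed column layout and a hypothetical file name, could look as follows:

    import numpy as np

    # Assumed column order: [diff_pressure, system_flow, system_pressure,
    #                        backflush_flow, clogging_label]
    dataset = np.loadtxt("labelled_cycles.csv", delimiter=",")   # hypothetical file

    X_cls = dataset[:, :-1]                  # inputs: variable values only
    labels = dataset[:, -1].astype(int)      # outputs: clogging labels (1 or 2)
    y_cls = np.eye(2)[labels - 1]            # one-hot: 1 -> [1, 0], 2 -> [0, 1]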

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.
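For reference, the two loss functions have the standard definitions (with y_i the true value and ŷ_i the predicted value):

    \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|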

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
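The binary cross-entropy loss minimised here has the standard form (with y_i the true label in {0, 1} and ŷ_i the predicted probability):

    \mathcal{L}_{\mathrm{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big]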


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT), and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


      Chapter 4

      Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                    Prediction
                    Label 1   Label 2
Actual   Label 1      109        1
         Label 2        3      669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Prediction
                    Label 1   Label 2
Actual   Label 1       82       29
         Label 2       38      631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Prediction
                    Label 1   Label 2
Actual   Label 1       69       41
         Label 2       11      659


      Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted for a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.
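The effect can be reproduced with a small sketch: on a class distribution similar to Table 4.6 (the numbers below are illustrative, and scikit-learn is an assumed tool), the F1-score for the majority class stays high while the AUC is pulled down by the errors on the minority class.

    import numpy as np
    from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

    # Toy example with roughly the same imbalance as the CNN validation data
    y_true = np.array([0] * 110 + [1] * 670)                      # 0 = label 1, 1 = label 2
    y_pred = np.array([0] * 82 + [1] * 28 + [0] * 38 + [1] * 632)

    print(confusion_matrix(y_true, y_pred))
    print("F1 :", f1_score(y_true, y_pred))        # high, dominated by the majority class
    print("AUC:", roc_auc_score(y_true, y_pred))   # lower, penalised by minority-class errors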

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. Although, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


      Chapter 6

      Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture, and amount of data to be processed at a time, if they are to be used in the BWTS.


      Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality. 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, 10 2017.

TRITA: TRITA-ITM-EX 2019:606

www.kth.se


        Nomenclature

        ARIMA Autoregressive Integrated Moving Average

        AUC Area Under Curve

        BWTS Ballast Water Treatment System

        CNN Convolutional Neural Network

        FOR Frame of Reference

        LSTM Long Short Term Memory

        ML Machine Learning

        MAE Mean Absolute Error

        MSE Mean Squared Error

        NN Neural Network

        ReLU Rectified Linear Unit

        RMSE Root Mean Squared Error

        TSS Total Suspended Solids

        Contents

        1 Introduction 111 Background 112 Problem Description 113 Purpose Definitions amp Research Questions 214 Scope and Delimitations 215 Method Description 3

        2 Frame of Reference 521 Filtration amp Clogging Indicators 5

        211 Basket Filter 5212 Self-Cleaning Basket Filters 6213 Manometer 7214 The Clogging Phenomena 8215 Physics-based Modelling 9

        22 Predictive Analytics 10221 Classification Error Metrics 11222 Regression Error Metrics 12223 Stochastic Time Series Models 14

        23 Neural Networks 15231 Overview 15232 The Perceptron 16233 Activation functions 16234 Neural Network Architectures 17

        3 Experimental Development 2331 Data Gathering and Processing 2332 Model Generation 26

        321 Regression Processing with the LSTM Model 27322 Regression Processing with the CNN Model 28323 Label Classification 29

        33 Model evaluation 3034 Hardware Specifications 31

        4 Results 3341 LSTM Performance 3342 CNN Performance 36

        5 Discussion amp Conclusion 4151 The LSTM Network 41

        511 Regression Analysis 41512 Classification Analysis 42

        52 The CNN 42521 Regression Analysis 42522 Classification Analysis 43

        53 Comparison Between Both Networks 4454 Conclusion 44

        6 Future Work 45

        Bibliography 47

        Chapter 1

        Introduction

        11 Background

        Ballast water tanks are used on ships to stabilize the ship for different shippingloads When a ship isnrsquot fully loaded or when a ship must run at a sufficient depthwater is pumped into the ballast water tanks through a water pumping system Topreserve existing ecosystems as well as preventing the spread of bacteria larvaeor other microbes ballast water management is regulated world-wide by the Inter-national Convention for the Control and Management of Shipsrsquo Ballast Water andSediments (BWM convention)

        PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by AlfaLaval that works as an extension to the shipsrsquo water pumping system The BWTSuses a filter for physical separation of organisms and total suspended solids (TSS)and a UV reactor for main treatment of the ballast water As PB3 can be installedon a variety of ships the BWTS must be able to process different waters underdifferent conditions to fulfill the requirements of the BWM convention

        12 Problem Description

        In the existing system there is currently no way of detecting if the filter in use isabout to clog A clogged filter forces a stop for the entire process and the onlyway to get the BWTS functional again is to physically remove the filter disas-semble it clean it reassemble it and put it back This cleaning process involvesrisks of damaging the filter is expensive to carry out and takes up unnecessary time

        Furthermore due to the different concentrations of the TSS in waters around theworld as well as different supply flows and supply pressures from various shippumping system the load on the BWTS may vary greatly This imposes problemsin estimating actual system health when using a static model as there are discrep-ancies introduced in the measured parameters

        1

        CHAPTER 1 INTRODUCTION

        These problems make it impossible to achieve optimal operability while ensuringthat the BWTS runs safely in every environment A desired solution is to developa method which can analyze and react to changes imposed to the system on thefly One such solution could be to use the large amounts of data generated by thesystem to create a neural network (NN) to estimate the state of the BWTS

        13 Purpose Definitions amp Research Questions

        The use of machine learning (ML) in this type of system is a new application of theotherwise popular statistical tool As there is no existing public information on howsuch an implementation can be done the focus of the thesis and the main researchquestion is

        bull To investigate and evaluate the possibility of using ML for predictively esti-mating filter clogging with focus on maritime systems

        An NN model will be developed and evaluated in terms of how accurately it canpredict the clogging of the filter in the BWTS using the systems sensor data Theimplementation of an NN into a system like this is the first of its kind and willrequire analysis and understanding of the system to ensure a working realizationTherefore with the BWTS in mind the original proposed research question can bespecified further to be

        bull How can an NN be integrated with a BWTS to estimate clogging of a basketfilter

        14 Scope and Delimitations

        In comparison to the initial thesis scope provided in Appendix A some delimita-tions have been made First and foremost the only part of interest of the BWTSis the filter When designing the NN the data used will be from in-house testingas well as from the cloud service Connectivity so no alternations to the existinghardware and software can be made For the purpose of focusing on the ML modeland the system sensors all data are assumed to be constantly available

        It is also not possible to test different kinds of ML methods or a plethora of NNswithin the time frame of the thesis For that reason the decision of developing anNN is based on its success in predictive analytics in other fields[1 2] and the frameof reference for deciding the most appropriate NN will depend on the characteristicsof the data and the future requirements of the NN

        2

        15 METHOD DESCRIPTION

        15 Method Description

        The task at hand is a combination of two On one hand it is an engineering taskinvestigating if an NN can be developed for the BWTS On the other hand it is ascientific task that focuses on applying and evaluating current research in the areasof NNs water filtration techniques time series analysis and predictive analytics tothe problem To complete both tasks within the time frame a methodology has tobe developed to ensure that the engineering task is done the research questions areanswered and that there is a clear foundation for future research The methodologyis visualized in Figure 11

        The basis of the methodology starts with the problem description The problemdescription makes way for establishing the first frame of reference (FOR) and rais-ing the initial potential research questions Following that a better understandingabout the research field in question and continuous discussions with Alfa Laval willhelp adapt the FOR further Working through this process iteratively a final framecan be established where the research area is clear allowing for finalization of theresearch questions

        With the frame of reference established the focus shifts to gathering informationthrough appropriate resources such as scientific articles and papers Interviews ex-isting documentation of the BWTS and future use of the NN also helps in narrowingdown the solution space to only contain relevant solutions that are optimal for thethesis With a smaller set of NNs to chose from the best suited network structurewill be developed tested and evaluated

        In preparation of constructing and implementing the NN the data have to beprocessed according to the needs of the selected network Pre-processing strate-gies on how datasets are best prepared for NN processing will be investigated andexecuted to ensure that a clear methodology for future processing is executed anddocumented For correctly classifying the current clogging grade or rate of cloggingof the filter state of the art research will be used as reference when commencingwith the labelling of the data

        When the classificationlabelling is done implementation through training and test-ing of the NN can begin Sequentially improvement to the NN structure will bemade by comparing the results of different initial weight conditions layer configu-rations and data partitions The experimental results will be graded in terms ofpredictive accuracy and the estimated errors of each individual parameter will betaken into consideration

        Lastly the validation process can begin to ensure that the initial requirementsfrom the problem description are either met or that they have been investigatedWith the results at hand a conclusion can be presented describing how the system

        3

        CHAPTER 1 INTRODUCTION

        can be adapted to detect clogging Suggestions on how the system can be furtherimproved upon and other future work will also be mentioned

        Figure 11 Proposed methodology for the thesis

        4

        Chapter 2

        Frame of Reference

        This chapter contains a state of the art review of existing technology and introducesthe reader to the science and terminology used throughout this thesis The systemand its components thatrsquos being used today is analysed and evaluated

        21 Filtration amp Clogging Indicators

        Filtration is the technique of separating particles from a mixture to obtain a filtrateIn water filtration the water is typically passed through a fine mesh strainer or aporous medium for the removal of total suspended solids TSS Removal of particlesin this fashion leads to the formation of a filter cake that diminishes the permeablecapability of the filter As the cake grows larger the water can eventually no longerpass and the filter ends up being clogged

        To better understand how the choice of filter impacts the filtration process andhow filter clogging can be modelled the following section explores research and lit-erature relevant to the BWTS Focus is on filters of the basket type and where thefiltration is done with regards to water

        211 Basket Filter

        A basket filter uses a cylindrical metal strainer located inside a pressure vessel forfiltering and is shown in Figure 21 The strainer is either composed of reinforcedwire mesh or perforated sheet metal which the liquid flows through Sometimes acombination of the two is used During filtration organisms and TSS accumulatein the basket strainer and can only be removed by physically removing the strainerand scraping off the particles using a scraper or a brush [3] An estimate of howmany particles that have accumulated in the filter can typically be obtained fromthe readings of a manometer which measures the differential pressure over the filter(see 213)

        5

        CHAPTER 2 FRAME OF REFERENCE

        Figure 21 An overview of a basket filter1

        The pressure vessel has one inlet for incoming water and one outlet for the filtrateThe pressure difference between the incoming and the outgoing water measures thedifferential pressure ∆p over the filter through two pressure transducers

        212 Self-Cleaning Basket Filters

        Basket filters also exist in the category of being self-cleaning A self-cleaning bas-ket filter features a backwashing (also referred to as backflush) mechanism whichautomatically cleans the filter avoiding the need of having to physically remove thefilter in order to clean it and is shown in Figure 22 The backwashing mechanismcomes with the inclusion of a rotary shaft through the center axis of the basket filterthat is connected to a motor for rotation of the shaft [3] The rotary shaft holdsa mouthpiece that is connected to a second outlet which allows for the removal ofparticles caught by the filter

        1Source httpwwwfilter-technicsbe

        6

        21 FILTRATION amp CLOGGING INDICATORS

        Figure 22 An overview of a basket filter with self-cleaning2

        The backwashing flow can either be controlled by a pump or be completely depen-dent on the existing overpressure in the pressure vessel which in turn depends onhow clogged the filter is For that latter case backwashing of the filter may only bedone when there is enough particles in the water so that the filter begins to clog

        213 Manometer

        Briefly mentioned in 211 the manometer is an analogue display pressure gaugethat shows the differential pressure over the filter The displayed value is the differ-ence of the pressure obtained by the transducers before and after the filter Eachfilter comes with an individually set threshold pset

        When the measured differential pressure is greater than pset the filter has to becleaned For a regular basket filter the operator or the service engineer has to keepan eye on the manometer during operation However for a self-cleaning basket filterthe pressure transducers are connected to an electric control system that switchesthe backwash on and off

        2Source httpwwwdirectindustrycom

        7

        CHAPTER 2 FRAME OF REFERENCE

        214 The Clogging PhenomenaTo predict the clogging phenomena some indicators of clogging have to be identifiedIndicators of clogging in fuel filters have been investigated and discussed in a seriesof papers by Eker et al [4ndash6] A fuel filter shares a lot of similarities with a basketfilter in the sense that they both remove particles in the supplied liquid in order toget a filtrate Two indicators were especially taken into consideration namely thedifferential pressure over the filter ∆p and the flow rate after the filter Q Theresults from the papers show that clogging of a filter occurs due to the following

        1 a steady increase in ∆p over time due to an increase over time in incomingpressure pin

        2 a decrease in Q as a result of an increase in ∆p

        These conclusions suggest that a modelling approach to identify clogging is possibleBy observing the two variables from the start of a pumping process the followingclogging states can be identified

        1 steady state ∆p and Qrarr Nolittle clogging

        2 linear increase in ∆p and steady Qrarr Moderate clogging

        3 exponential increase in ∆p and drastic decrease in Qrarr Fully clogged

        With the established logic of classification in place each individual pumping se-quence can be classified to begin generating a dataset containing the necessaryinformation

        Figure 23 Visualization of the clogging states3

        3Source Eker et al [6]

        8

        21 FILTRATION amp CLOGGING INDICATORS

        215 Physics-based Modelling

        The pressure drop over the filter has been identified as a key indicator of cloggingTo better understand what effect certain parameters have on the filter a model hasto be created Roussel et al [7] identify the filter clogging as a probability of thepresence of particles Furthermore they identify the clogging process as a functionof a set of variables the ratio of particle to mesh hole size the solid fraction andthe number of grains arriving at each mesh hole during one test

        Filtration of liquids and filter clogging have been tested for various types of flowsLaminar flow through a permeable medium has been investigated by Wakeman [8]and it can be described by Darcyrsquos equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1 - \varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take the inertial effects in the flow into account. These are considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
Vs         Superficial (empty-tower) velocity    m/s
µ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           -
Dp         Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
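To make the effect of the individual variables in Equation 2.4 more concrete, the short sketch below evaluates Ergun's equation in Python for a single set of values. The numerical values are hypothetical placeholders chosen only to illustrate the calculation; they are not measurements from the test rig.

def ergun_pressure_drop(V_s, mu, eps, L, D_p, rho):
    """Pressure drop (Pa) over a filter cake according to Ergun's equation (2.4)."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Hypothetical example values: water flowing through a thin cake of small particles
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, eps=0.4, L=0.01, D_p=1.0e-4, rho=1000.0)
print(round(dp), "Pa")

Doubling the superficial velocity V_s in this sketch roughly doubles the viscous term but quadruples the inertial term, which is the kind of insight the Ergun formulation offers over Darcy's equation.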

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outcomes, which can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outcomes of a confusion matrix.

                       Prediction
                       Positive               Negative
Actual   Positive      True Positive (TP)     False Negative (FN)
         Negative      False Positive (FP)    True Negative (TN)

The accuracy is defined as the percentage of instances where a sample is classified correctly and can be obtained as done by Konig [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems in order to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC measures the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as the sensitivity, while the false positive rate corresponds to 1 − specificity. The sensitivity and the specificity are given by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN} \qquad (2.6)

specificity = \frac{TN}{TN + FP} \qquad (2.7)

With the sensitivity on the y-axis and 1 − specificity on the x-axis, the ROC plot is obtained, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well-performing model.

        F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassification [21]. For the F1 score, precision refers to the percentage of samples classified as positive that are actually positive, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP} \qquad (2.8)

recall = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score is limited to the range 0 to 1.
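As a minimal illustration of how these classification metrics can be computed in practice, the snippet below uses scikit-learn on a small set of made-up labels; the arrays are placeholders and do not correspond to any data in this thesis.

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)

# Made-up ground truth, hard predictions and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3, 0.9, 0.8]

print("precision:", precision_score(y_true, y_pred))  # Equation 2.8
print("recall:   ", recall_score(y_true, y_pred))     # Equation 2.9
print("F1:       ", f1_score(y_true, y_pred))         # Equation 2.10
print("AUC:      ", roc_auc_score(y_true, y_prob))    # area under the ROC curve
print(confusion_matrix(y_true, y_pred))               # rows: actual, columns: predicted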

        Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower Log Loss value means a higher classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well a predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

        Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

        Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is how the gradients behave. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where the weight of every sample is inversely proportional to the square of its target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

        Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. In contrast to MSE and RMSE, r² is scale-free and bounded between −∞ and 1, so regardless of whether the output values are large or small the score will always be within that range. A low r² score means that the model fits the data poorly.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r² score will always increase, simply because the new fit has more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This means that adjusted r² is always less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r² can therefore more accurately show the proportion of variation in the dependent variable that is explained by the independent variables that actually affect it. Furthermore, adding independent variables that do not fit the model penalises the model accuracy by lowering the score [29].
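A minimal sketch of how the regression error metrics in this section can be computed with NumPy is given below; the actual and predicted values are placeholders chosen only for illustration.

import numpy as np

y_true = np.array([0.50, 0.55, 0.62, 0.70, 0.81])  # placeholder actual values
y_pred = np.array([0.52, 0.54, 0.65, 0.68, 0.84])  # placeholder predicted values

mae  = np.mean(np.abs(y_true - y_pred))                     # Equation 2.12
mse  = np.mean((y_true - y_pred) ** 2)                      # Equation 2.13
rmse = np.sqrt(mse)                                         # Equation 2.14
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # Equation 2.17
r2   = np.corrcoef(y_true, y_pred)[0, 1] ** 2               # Equation 2.18

print(mae, mse, rmse, mape, r2)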

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time in order to generate distributions of potential outcomes.

        Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error and a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

        Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further development of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
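A minimal sketch of fitting a SARIMA model to a univariate series with the statsmodels library is shown below; the generated series, the chosen (p, d, q) and seasonal orders and the forecast horizon are all illustrative assumptions, not choices made in this thesis.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Made-up univariate series standing in for e.g. sampled differential pressure
series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200)

# SARIMA(p, d, q)(P, D, Q, s) with assumed orders
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=10)  # predict the next 10 time steps
print(forecast)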

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similarly to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and output layers depend on the input and output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight for every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
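A minimal sketch of the perceptron rule in Equation 2.20 is given below; the weights and the bias are arbitrary illustrative values.

import numpy as np

def perceptron(x, w, b):
    """Binary perceptron output according to Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.6, 0.4])  # arbitrary weights
b = -0.5                  # arbitrary bias
print(perceptron(np.array([1, 0]), w, b))  # 0.6 - 0.5 > 0, so the output is 1
print(perceptron(np.array([0, 0]), w, b))  # -0.5 <= 0, so the output is 0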

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of the activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function also fails to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises from using it as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because the ReLU outputs zero for all negative inputs, neurons can become permanently inactive and stop updating, which is known as the dying ReLU problem [34].

        Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
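A small NumPy sketch of the three activation functions discussed above is shown below; β is set to 1.0 as an assumed constant.

import numpy as np

def sigmoid(z):
    """Equation 2.21."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Equation 2.23."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Equation 2.24 with beta assumed constant."""
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
print(swish(z))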

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions:

f(x) = f^{(n)}(\dots f^{(2)}(f^{(1)}(x)) \dots) \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that each individual function does not have to describe the whole behaviour but can instead capture only a certain part of it. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures for deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. This statement is partially proven true by the differences in utilisation of the two SNNs presented above and is further illustrated by the following NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons that are state-based. These states allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0\;0\;1\;1\;0\;0\;0] \qquad x_2 = [0\;0\;0\;1\;1\;0\;0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information, as the weights become saturated over time and previous state information loses its informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons (ω_x), the LSTM block output at the previous time step (h_{t−1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)
\qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it may be necessary to forget some of the characters from the previous chapter [43].

Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


        Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


        Chapter 3

        Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at lake Malmasjon over a span of two weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
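A minimal sketch of what such a labelling rule can look like is given below. The function name, the thresholds and the windowing are illustrative assumptions and do not reproduce the actual script used in the thesis.

import numpy as np

def label_clogging(dp, flow, dp_slope_thr=0.001, dp_accel_thr=0.0005, flow_drop_thr=0.2):
    """Assign a clogging label (1, 2 or 3) to one window of samples.

    dp and flow are arrays of differential pressure and system flow for the window;
    the threshold values are hypothetical tuning parameters.
    """
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]       # linear trend of delta-p
    dp_accel = np.diff(dp, 2).mean() if len(dp) > 2 else 0.0  # curvature of delta-p
    flow_drop = (flow[0] - flow[-1]) / max(flow[0], 1e-9)     # relative decrease in flow

    if dp_accel > dp_accel_thr and flow_drop > flow_drop_thr:
        return 3  # fully clogged
    if dp_slope > dp_slope_thr:
        return 2  # beginning to clog
    return 1      # no/little clogging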


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing


the number of data points and their respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

After preprocessing, the entire dataset thus contains 3915 samples, of which 1012 samples are labelled with clogging label 1 and 2903 samples with clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks and evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model that is forced to learn large weights, and thus in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

[red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted on equal terms, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally well rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.
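A minimal sketch of one-hot encoding the clogging labels with Keras is shown below; the label array is a placeholder.

import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([1, 2, 2, 1, 2])                   # placeholder clogging labels
onehot = to_categorical(labels - 1, num_classes=2)   # shift to 0-based class indices
print(onehot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]
#  [0. 1.]]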

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
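A minimal sketch of the scaling step with scikit-learn's MinMaxScaler, including the inverse transform back to the original values, is shown below; the feature matrix is a placeholder.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder feature matrix: each column could be e.g. differential pressure or flow
X = np.array([[0.10, 55.0],
              [0.25, 54.0],
              [0.60, 50.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)               # Equation 3.1 applied per feature
X_restored = scaler.inverse_transform(X_scaled)  # back to the original values
print(X_scaled)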

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should be matched to a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded according to the time window. The effect of this expansion of the features can be described by Equation 3.2 and Equation 3.3, and a minimal sketch of such a sequencing function is given after the equations. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)] \qquad (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)] \qquad (3.3)
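The sketch below shows one way such a sequencing function can be written with NumPy; the window length and the column layout are assumptions for illustration, not the exact implementation used in this work.

import numpy as np

def make_sequences(data, n_past=5):
    """Turn a 2D array (time steps x features) into input/target pairs for the LSTM.

    X gets the shape (samples, n_past, n_features) and y is the feature vector at
    the time step directly after each window, i.e. a one-step (5-second) prediction.
    """
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i, :])  # the past n_past measurements
        y.append(data[i, :])             # the value one time step ahead
    return np.array(X), np.array(y)

# Placeholder data: 100 time steps of 5 sensor channels
data = np.random.rand(100, 5)
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)  # (95, 5, 5) (95, 5)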


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before passing them to the output layer, which uses the sigmoid activation function. There, the data output by the network are compared to the true output data in order to adjust the weights and achieve a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
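A minimal Keras sketch of a model matching this description is given below. Only the details stated in the text (two 32-neuron LSTM layers with ReLU, a single sigmoid output neuron, MAE or MSE loss, and early stopping with a patience of 150 epochs) are taken from the thesis; the optimizer, the assumed number of input features and the remaining arguments are illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 5, 5  # 5 past time steps; the feature count is an assumption

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),  # one-step parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse"

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])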

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original amount of data. As for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output data.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6. A minimal code sketch of this architecture is given below.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
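A minimal Keras sketch matching the described architecture (a Conv1D layer with 64 filters and kernel size 4, max pooling with pool size 2, flattening, a 50-node dense layer and 6 output nodes, trained with early stopping) is given below; the optimizer and the assumed number of input features are illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 12, 5  # 12 past observations; the feature count is an assumption

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),  # the 6 future observations, i.e. 30 seconds ahead
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse"

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])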

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the 20% fraction was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. This adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad MSE score, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the networks used a loss function that minimises the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


        Chapter 4

        Results

        This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                       Prediction
                       Label 1   Label 2
Actual   Label 1       109       1
         Label 2       3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                       Prediction
                       Label 1   Label 2
Actual   Label 1       82        29
         Label 2       38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                       Prediction
                       Label 1   Label 2
Actual   Label 1       69        41
         Label 2       11        659


        Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusions of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function achieved better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, which is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r²-scores were 0.876 and 0.843, respectively, with overall lower errors on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted


is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


        up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, regarding how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


        Chapter 6

        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN when data containing all clogging labels are available.

It would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


        Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


          Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Purpose, Definitions & Research Questions
  1.4 Scope and Delimitations
  1.5 Method Description

2 Frame of Reference
  2.1 Filtration & Clogging Indicators
    2.1.1 Basket Filter
    2.1.2 Self-Cleaning Basket Filters
    2.1.3 Manometer
    2.1.4 The Clogging Phenomena
    2.1.5 Physics-based Modelling
  2.2 Predictive Analytics
    2.2.1 Classification Error Metrics
    2.2.2 Regression Error Metrics
    2.2.3 Stochastic Time Series Models
  2.3 Neural Networks
    2.3.1 Overview
    2.3.2 The Perceptron
    2.3.3 Activation functions
    2.3.4 Neural Network Architectures

3 Experimental Development
  3.1 Data Gathering and Processing
  3.2 Model Generation
    3.2.1 Regression Processing with the LSTM Model
    3.2.2 Regression Processing with the CNN Model
    3.2.3 Label Classification
  3.3 Model evaluation
  3.4 Hardware Specifications

4 Results
  4.1 LSTM Performance
  4.2 CNN Performance

5 Discussion & Conclusion
  5.1 The LSTM Network
    5.1.1 Regression Analysis
    5.1.2 Classification Analysis
  5.2 The CNN
    5.2.1 Regression Analysis
    5.2.2 Classification Analysis
  5.3 Comparison Between Both Networks
  5.4 Conclusion

6 Future Work

Bibliography

          Chapter 1

          Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship is not fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as to prevent the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as discrepancies are introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) that estimates the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the original proposed research question can be specified further as:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, that the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis


          Chapter 2

          Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is composed of reinforced wire mesh, perforated sheet metal, or sometimes a combination of the two, which the liquid flows through. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The differential pressure ∆p over the filter is the difference between the pressures of the incoming and the outgoing water, measured by two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter, connected to a motor that rotates the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue pressure gauge that displays the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com
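As an illustration of this switching logic, the sketch below shows a minimal threshold-based backwash controller in Python. The threshold value, the hysteresis fraction and the pressure samples are assumptions made for the example, not values taken from the BWTS.

def backwash_control(delta_p, p_set, backwash_on, hysteresis=0.7):
    """Minimal sketch of threshold-based backwash switching.

    delta_p: measured differential pressure over the filter [bar]
    p_set: individually set cleaning threshold for the filter [bar]
    backwash_on: current state of the backwash mechanism
    hysteresis: assumed fraction of p_set at which backwashing stops
    """
    if delta_p > p_set:
        return True            # pressure drop too high, start backwashing
    if backwash_on and delta_p < hysteresis * p_set:
        return False           # filter sufficiently clean again, stop
    return backwash_on         # otherwise keep the current state


# Example: p_set = 0.5 bar, rising and then falling differential pressure
state = False
for dp in [0.30, 0.45, 0.55, 0.40, 0.30]:
    state = backwash_control(dp, p_set=0.5, backwash_on=state)
    print(dp, state)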


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares many similarities with a basket filter in the sense that both remove particles from the supplied liquid in order to obtain a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter is characterised by:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of the increase in ∆p

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified, to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states³

³Source: Eker et al. [6]
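A minimal sketch of how a sampled pumping sequence could be labelled according to the three states above is given below. The trend test and its thresholds are illustrative assumptions, not the labelling procedure used in the thesis.

import numpy as np

def label_clogging(delta_p, q, lin_slope=0.01, exp_curvature=0.001, q_drop=0.2):
    """Assign a clogging label to one pumping sequence.

    delta_p, q: arrays of differential pressure and flow rate samples
    lin_slope, exp_curvature, q_drop: illustrative thresholds (assumptions)
    Returns 0 (no/little), 1 (moderate) or 2 (fully clogged).
    """
    t = np.arange(len(delta_p))
    slope = np.polyfit(t, delta_p, 1)[0]        # linear trend in delta_p
    curvature = np.polyfit(t, delta_p, 2)[0]    # quadratic term as a proxy for exponential growth
    rel_q_drop = (q[0] - q[-1]) / q[0]          # relative decrease in flow rate

    if curvature > exp_curvature and rel_q_drop > q_drop:
        return 2   # exponential increase in delta_p, drastic decrease in Q
    if slope > lin_slope:
        return 1   # linear increase in delta_p, steady Q
    return 0       # steady delta_p and Q


# Example with synthetic data: linearly rising delta_p and steady flow rate
t = np.arange(100)
print(label_clogging(0.3 + 0.02 * t, np.full(100, 250.0)))  # -> 1 (moderate clogging)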


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle size to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1 - \varepsilon)^2}{\varepsilon^3} \, L \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. These are considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable | Description                        | Unit
∆p       | Pressure drop                      | Pa
L        | Total height of filter cake        | m
V_s      | Superficial (empty-tower) velocity | m/s
µ        | Viscosity of the fluid             | kg/(m·s)
ε        | Porosity of the filter cake        | –
D_p      | Diameter of the spherical particle | m
ρ        | Density of the liquid              | kg/m³


Comparing Darcy's Equation 2.2 with Ergun's Equation 2.4, the latter offers a deeper insight into how alterations of the variables affect the final differential pressure.
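To make the role of the individual variables concrete, the sketch below evaluates Equation 2.4 for an assumed set of filter-cake properties. The numerical values are placeholders chosen for illustration, not measurements from the BWTS.

def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
    """Pressure drop over a filter cake according to the Ergun equation (2.4).

    V_s: superficial velocity [m/s], mu: fluid viscosity [kg/(m*s)],
    rho: fluid density [kg/m^3], eps: cake porosity [-],
    D_p: particle diameter [m], L: cake height [m]
    """
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial


# Illustrative values: water at roughly 20 C flowing through a thin cake of fine particles
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=998.0,
                         eps=0.4, D_p=1.0e-4, L=2.0e-3)
print(f"Pressure drop: {dp:.0f} Pa")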

2.2 Predictive Analytics

Using historical data to make predictions about future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML, in order to analyse current and past information and make predictions about future events. Having been applied to areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach has also been investigated for predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with their respective output, also known as classification. Every prediction comes with four possible outcomes that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                     Prediction
                     Positive              Negative
Actual  Positive     True Positive (TP)    False Negative (FN)
        Negative     False Positive (FP)   True Negative (TN)

The accuracy is defined as the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \qquad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems in order to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

          Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate as specificity. The two rates are given by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN} \qquad (2.6)

specificity = \frac{TN}{TN + FP} \qquad (2.7)

With sensitivity on the y-axis and specificity on the x-axis, the AUC plot is obtained, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC is limited to the range 0 to 1, where a higher value means a well performing model.

          F1 Score

The F1 score is a measurement of how many samples the classifier classifies correctly and of how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP} \qquad (2.8)

recall = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)

Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

          Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The Log Loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)
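The classification metrics above are available in common Python libraries; the sketch below computes them with scikit-learn for a small, made-up binary example (the labels and probabilities are placeholders).

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score,
                             roc_auc_score)

# Made-up ground truth and model output for a binary problem
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.55, 0.85, 0.1])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the probabilities at 0.5

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:  ", accuracy_score(y_true, y_pred))      # Eq. 2.5
print("Precision: ", precision_score(y_true, y_pred))     # Eq. 2.8
print("Recall:    ", recall_score(y_true, y_pred))        # Eq. 2.9
print("F1 score:  ", f1_score(y_true, y_pred))            # Eq. 2.10
print("ROC AUC:   ", roc_auc_score(y_true, y_prob))
print("Log loss:  ", log_loss(y_true, y_prob))            # Eq. 2.11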

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

          Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

          Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of large errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

          Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE lies in the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like the MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all model errors, RMSE is still valid and well protected against outliers given a sufficient number of samples n.

          Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with the squared errors while MSPE considers the relative errors [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

          Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter whether the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms, or predictors, in the model. If variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
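The regression metrics in this section can be implemented directly from their definitions; the sketch below does so with NumPy for an arbitrary pair of target and prediction vectors. The data and the assumed number of predictors k are placeholders for illustration.

import numpy as np

def regression_metrics(y, y_hat, k=1):
    """Compute the regression error metrics of section 2.2.2 (k = number of predictors)."""
    n = len(y)
    err = y - y_hat
    mae = np.mean(np.abs(err))                              # Eq. 2.12
    mse = np.mean(err ** 2)                                  # Eq. 2.13
    rmse = np.sqrt(mse)                                      # Eq. 2.14
    mspe = 100.0 * np.mean((err / y) ** 2)                   # Eq. 2.16
    mape = 100.0 * np.mean(np.abs(err / y))                  # Eq. 2.17
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2                    # Eq. 2.18 (squared correlation)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)        # Eq. 2.19
    return dict(MAE=mae, MSE=mse, RMSE=rmse, MSPE=mspe, MAPE=mape, r2=r2, r2_adj=r2_adj)


# Example with synthetic data
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.3])
print(regression_metrics(y, y_hat))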

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

          Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

          Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods, by differencing the log-transformed data, makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The particular strength of ARIMA and SARIMA is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
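A minimal sketch of the procedure described above — log-transforming the series and letting a seasonal ARIMA handle the differencing — is shown below using statsmodels. The synthetic series and the chosen model orders are assumptions for illustration only and are not tuned to any BWTS data.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic non-stationary series with a trend and a seasonal component
rng = np.random.default_rng(0)
t = np.arange(200)
series = pd.Series(50 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, 200))

# Log transform stabilises the variance; the differencing is specified by the model orders below
log_series = np.log(series)

# SARIMA(1,1,1)(1,1,1,12): illustrative orders only
model = SARIMAX(log_series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)

# Forecast the next 12 steps and transform back to the original scale
forecast = np.exp(fit.forecast(steps=12))
print(forecast.head())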

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight for every input that determines how important that input is for the output of the perceptron, while the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
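A direct implementation of the perceptron rule in Equation 2.20 is shown below; the weights and bias are arbitrary example values.

import numpy as np

def perceptron(x, w, b):
    """Perceptron rule from Equation 2.20: binary output from weighted inputs plus bias."""
    return 1 if np.dot(w, x) + b > 0 else 0


# Example: two binary inputs with arbitrary weights and bias
w = np.array([0.6, 0.4])
b = -0.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x), w, b))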

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j x_j + b \qquad (2.22)

Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

          Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, which is known as the dying ReLU problem [34].

          Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
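For reference, the three activation functions discussed above can be written directly from Equations 2.21, 2.23 and 2.24, as in the sketch below.

import numpy as np

def sigmoid(z):
    """Sigmoid activation, Equation 2.21."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Rectified linear unit, Equation 2.23."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Swish activation, Equation 2.24; beta is trainable or a constant."""
    return x * sigmoid(beta * x)


z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
print(swish(z))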

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

          Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

          Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that individual functions do not have to describe everything but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results on many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, the neurons in an RNN are state-based. State-based neurons can feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix}
x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

          Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or by allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons (ω_x), LSTM block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)
\qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
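A single-step sketch of the gate computations in Equation 2.26 is given below in NumPy. Note that a complete LSTM cell also computes a candidate cell state that is combined with these gates; only the three gates from the equation are shown here, and all weights are randomly initialised purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    """Input, output and forget gate activations from Equation 2.26."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)
    o_t = sigmoid(W_o @ concat + b_o)
    f_t = sigmoid(W_f @ concat + b_f)
    return i_t, o_t, f_t


# Illustrative sizes: 3 hidden units, 2 input features
rng = np.random.default_rng(0)
hidden, features = 3, 2
h_prev, x_t = np.zeros(hidden), rng.normal(size=features)
W_i, W_o, W_f = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
b_i, b_o, b_f = (np.zeros(hidden) for _ in range(3))
print(lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f))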

          Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music and raw speech [44].


          Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

A convolutional layer is typically followed by a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data is reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].
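
To illustrate the two operations described above, the following is a small NumPy sketch of a 1-dimensional convolution with a kernel of size 3 (as in Figure 2.4) followed by max pooling with pool size 2 (as in Figure 2.5); the signal and kernel values are made up.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide the kernel over x (valid padding) and sum the products."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum value of each non-overlapping window."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

signal = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 3.0, 1.0, 0.0])  # toy input
kernel = np.array([0.25, 0.5, 0.25])                          # kernel of size 3

convolved = conv1d(signal, kernel)   # the convolved feature
pooled = max_pool1d(convolved, 2)    # the reduced feature map
print(convolved, pooled)
```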


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Finally, before the data are fed into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data, and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map


          Chapter 3

          Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Data points were sampled every 5 seconds and contain sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.
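
As a rough illustration of the resulting dataset, the sketch below loads one test cycle with pandas; the file name and column names are assumptions made for the example and do not reflect the actual Connectivity export format.

```python
import pandas as pd

# Assumed column names for the four sensor signals; one row every 5 seconds.
columns = ["timestamp", "diff_pressure", "system_flow",
           "system_pressure", "backflush_flow"]

# Hypothetical export of one of the 11 recorded test cycles.
cycle = pd.read_csv("test_cycle_01.csv", names=columns, header=0,
                    parse_dates=["timestamp"])

# A complete cycle of at least 40 minutes sampled every 5 seconds
# gives roughly 480 or more data points.
print(len(cycle))
print(cycle.head())
```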

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the backflush stop in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
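
A sketch of how such a labelling script might look is given below. The thresholds and the use of gradients are illustrative assumptions; the thesis relies on a script combined with visual inspection rather than on these exact values.

```python
import numpy as np

def label_clogging(diff_pressure, system_flow, start_value,
                   slope_threshold=0.002, flow_drop_threshold=0.2):
    """Assign clogging labels 1-3 per sample from simple trend rules (illustrative)."""
    labels = np.ones(len(diff_pressure), dtype=int)
    slope = np.gradient(diff_pressure)                   # local trend of diff. pressure
    flow_drop = (system_flow[0] - system_flow) / system_flow[0]

    # Label 2: differential pressure rising steadily above its starting value,
    # system flow constant or only slightly receding.
    rising = (diff_pressure > start_value) & (slope > slope_threshold)
    labels[rising] = 2

    # Label 3: rapidly accelerating differential pressure and a drastic flow decrease.
    accelerating = np.gradient(slope) > slope_threshold
    labels[rising & accelerating & (flow_drop > flow_drop_threshold)] = 3
    return labels
```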


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples    Points labelled clog-1    Points labelled clog-2
I          685            685                        0
II         220             25                      195
III        340             35                      305
IV         210             11                      199
V          375             32                      343
VI         355              7                      348
VII        360             78                      282
VIII       345             19                      326
IX         350             10                      340
X          335             67                      268
XI         340             43                      297

Total     3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks and evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder (label) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1 0 0], [0 1 0], [0 0 1]]

or

[red, blue, green] → [[1 0 0], [0 1 0], [0 0 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritize a certain category. The precision of one hot encoding has been shown by Seger [49] to be on par with other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve higher accuracy.
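
As a brief sketch of the label transform, assuming scikit-learn is used (the thesis does not name a specific library), the clogging labels could be encoded as follows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [2], [1], [2]])     # clogging labels as a column vector

encoder = OneHotEncoder()                         # one binary column per category
onehot = encoder.fit_transform(labels).toarray()
print(onehot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]
#  [0. 1.]]
```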

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))          (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
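
A minimal sketch of the scaler transform, assuming scikit-learn's MinMaxScaler (again, the exact implementation is not stated in the thesis):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.10, 250.0],
              [0.45, 240.0],
              [0.90, 180.0]])                    # e.g. differential pressure and flow

scaler = MinMaxScaler()                          # applies Equation 3.1 per feature
X_scaled = scaler.fit_transform(X)               # every value now lies in [0, 1]
X_restored = scaler.inverse_transform(X_scaled)  # easy to revert after processing
print(X_scaled)
```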

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n-1}(t), V_n(t)]          (3.2)

X(t) = [V_1(t-5), V_2(t-5), ..., V_{n-1}(t), V_n(t)]          (3.3)
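
A sketch of such a sequencing function is shown below, assuming the data are held in a NumPy array with one row per 5-second sample and one column per variable; the function name and return shapes are illustrative.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Turn a (samples, features) series into LSTM input windows and one-step targets."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # the previous 5 values (25 seconds) of every feature
        y.append(data[t])              # the value one time step (5 seconds) ahead
    return np.array(X), np.array(y)

series = np.random.rand(200, 5)        # toy data: 5 variables incl. the clogging label
X, y = make_sequences(series, n_past=5)
print(X.shape, y.shape)                # (195, 5, 5) -> [samples, time steps, features]
```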


Once sequenced, the data are split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in this way helps ensure that the network is not overfitted to the training data.
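
Assuming a Keras implementation (the thesis does not state the framework), the described network and early stopping could be sketched as follows; the optimizer is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 5, 5   # 5 past time steps, 5 features per step (as described above)

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_steps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")      # or loss="mse"; optimizer assumed

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```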

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80 % and 20 %, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
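
Mirroring the LSTM sketch above, a hedged Keras version of the described CNN could look as follows; the activation functions of the convolutional and first dense layer, as well as the optimizer, are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps_in, n_features, n_steps_out = 12, 5, 6   # 60 s of history, 30 s of predictions

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_steps_in, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(n_steps_out),                   # 6 future values of one parameter
])
model.compile(optimizer="adam", loss="mae")      # or "mse"

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```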

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, since it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80 % and


20 %, respectively. The testing set was split into the same fractions, but only the 20 % fraction was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. This adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed around the input data. MAE lets outliers play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad MSE score, while MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to them.
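
A small numeric illustration of this difference in outlier sensitivity (the values are made up):

```python
import numpy as np

y_true = np.array([1.0, 1.0, 1.0, 1.0])
y_pred = np.array([1.1, 0.9, 1.0, 3.0])   # one outlier prediction

mae = np.mean(np.abs(y_true - y_pred))    # (0.1 + 0.1 + 0.0 + 2.0) / 4 = 0.55
mse = np.mean((y_true - y_pred) ** 2)     # (0.01 + 0.01 + 0.0 + 4.0) / 4 ≈ 1.005
print(mae, mse)                           # the single outlier dominates the MSE
```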

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


          Chapter 4

          Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5 %     0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                 Label 1   Label 2
Actual  Label 1      109         1
        Label 2        3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from the MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4 %     0.826   0.907   3.01
MSE                  1195          93.3 %     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                 Label 1   Label 2
Actual  Label 1       82        29
        Label 2       38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                 Label 1   Label 2
Actual  Label 1       69        41
        Label 2       11       659


          Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, which is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE


network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


          Chapter 6

          Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data that include all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when data containing all clogging labels are available.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


          Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA: TRITA-ITM-EX 2019:606

www.kth.se


            Chapter 1

            Introduction

            11 Background

            Ballast water tanks are used on ships to stabilize the ship for different shippingloads When a ship isnrsquot fully loaded or when a ship must run at a sufficient depthwater is pumped into the ballast water tanks through a water pumping system Topreserve existing ecosystems as well as preventing the spread of bacteria larvaeor other microbes ballast water management is regulated world-wide by the Inter-national Convention for the Control and Management of Shipsrsquo Ballast Water andSediments (BWM convention)

            PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by AlfaLaval that works as an extension to the shipsrsquo water pumping system The BWTSuses a filter for physical separation of organisms and total suspended solids (TSS)and a UV reactor for main treatment of the ballast water As PB3 can be installedon a variety of ships the BWTS must be able to process different waters underdifferent conditions to fulfill the requirements of the BWM convention

            12 Problem Description

            In the existing system there is currently no way of detecting if the filter in use isabout to clog A clogged filter forces a stop for the entire process and the onlyway to get the BWTS functional again is to physically remove the filter disas-semble it clean it reassemble it and put it back This cleaning process involvesrisks of damaging the filter is expensive to carry out and takes up unnecessary time

            Furthermore due to the different concentrations of the TSS in waters around theworld as well as different supply flows and supply pressures from various shippumping system the load on the BWTS may vary greatly This imposes problemsin estimating actual system health when using a static model as there are discrep-ancies introduced in the measured parameters

            1

            CHAPTER 1 INTRODUCTION

            These problems make it impossible to achieve optimal operability while ensuringthat the BWTS runs safely in every environment A desired solution is to developa method which can analyze and react to changes imposed to the system on thefly One such solution could be to use the large amounts of data generated by thesystem to create a neural network (NN) to estimate the state of the BWTS

            13 Purpose Definitions amp Research Questions

            The use of machine learning (ML) in this type of system is a new application of theotherwise popular statistical tool As there is no existing public information on howsuch an implementation can be done the focus of the thesis and the main researchquestion is

            bull To investigate and evaluate the possibility of using ML for predictively esti-mating filter clogging with focus on maritime systems

            An NN model will be developed and evaluated in terms of how accurately it canpredict the clogging of the filter in the BWTS using the systems sensor data Theimplementation of an NN into a system like this is the first of its kind and willrequire analysis and understanding of the system to ensure a working realizationTherefore with the BWTS in mind the original proposed research question can bespecified further to be

            bull How can an NN be integrated with a BWTS to estimate clogging of a basketfilter

            14 Scope and Delimitations

            In comparison to the initial thesis scope provided in Appendix A some delimita-tions have been made First and foremost the only part of interest of the BWTSis the filter When designing the NN the data used will be from in-house testingas well as from the cloud service Connectivity so no alternations to the existinghardware and software can be made For the purpose of focusing on the ML modeland the system sensors all data are assumed to be constantly available

            It is also not possible to test different kinds of ML methods or a plethora of NNswithin the time frame of the thesis For that reason the decision of developing anNN is based on its success in predictive analytics in other fields[1 2] and the frameof reference for deciding the most appropriate NN will depend on the characteristicsof the data and the future requirements of the NN

            2

            15 METHOD DESCRIPTION

            15 Method Description

            The task at hand is a combination of two On one hand it is an engineering taskinvestigating if an NN can be developed for the BWTS On the other hand it is ascientific task that focuses on applying and evaluating current research in the areasof NNs water filtration techniques time series analysis and predictive analytics tothe problem To complete both tasks within the time frame a methodology has tobe developed to ensure that the engineering task is done the research questions areanswered and that there is a clear foundation for future research The methodologyis visualized in Figure 11

            The basis of the methodology starts with the problem description The problemdescription makes way for establishing the first frame of reference (FOR) and rais-ing the initial potential research questions Following that a better understandingabout the research field in question and continuous discussions with Alfa Laval willhelp adapt the FOR further Working through this process iteratively a final framecan be established where the research area is clear allowing for finalization of theresearch questions

            With the frame of reference established the focus shifts to gathering informationthrough appropriate resources such as scientific articles and papers Interviews ex-isting documentation of the BWTS and future use of the NN also helps in narrowingdown the solution space to only contain relevant solutions that are optimal for thethesis With a smaller set of NNs to chose from the best suited network structurewill be developed tested and evaluated

            In preparation of constructing and implementing the NN the data have to beprocessed according to the needs of the selected network Pre-processing strate-gies on how datasets are best prepared for NN processing will be investigated andexecuted to ensure that a clear methodology for future processing is executed anddocumented For correctly classifying the current clogging grade or rate of cloggingof the filter state of the art research will be used as reference when commencingwith the labelling of the data

            When the classificationlabelling is done implementation through training and test-ing of the NN can begin Sequentially improvement to the NN structure will bemade by comparing the results of different initial weight conditions layer configu-rations and data partitions The experimental results will be graded in terms ofpredictive accuracy and the estimated errors of each individual parameter will betaken into consideration

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


            Chapter 2

            Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. The focus is on filters of the basket type, where the filtration is done with regard to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. For the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold, p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomenon

To predict the clogging phenomenon, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → No/little clogging

2. linear increase in ∆p and steady Q → Moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → Fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

    Variable   Description                            Unit
    ∆p         Pressure drop                          Pa
    L          Total height of filter cake            m
    V_s        Superficial (empty-tower) velocity     m/s
    µ          Viscosity of the fluid                 kg/(m s)
    ε          Porosity of the filter cake            -
    D_p        Diameter of the spherical particle     m
    ρ          Density of the liquid                  kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations to the variables affect the final differential pressure.
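To make the role of each variable concrete, Equation 2.4 can be expressed as a short Python function. This is only an illustrative sketch; the function name and the numerical values in the example call are assumptions, not measurements from the BWTS.

def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
    """Differential pressure over a filter cake according to Equation 2.4 (SI units, see Table 2.1)."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)   # viscous term
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)       # inertial term
    return viscous + inertial

# Illustrative values: water at 20 C passing a 2 mm cake of 100 um particles with porosity 0.4.
print(ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=998.0, eps=0.4, D_p=100e-6, L=2e-3))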

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                              Prediction
                      Positive               Negative
    Actual  Positive  True Positive (TP)     False Negative (FN)
            Negative  False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

            Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

            F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

            Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means a higher classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that the observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
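As a minimal illustration of how these classification metrics relate to each other, the sketch below computes accuracy, AUC, F1 score and log loss with scikit-learn for a small set of made-up labels and predicted probabilities; the numbers are purely hypothetical.

from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

# Hypothetical true clogging labels (0/1) and predicted probabilities for label 1.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]
y_pred = [1 if p > 0.5 else 0 for p in y_prob]

print("accuracy:", accuracy_score(y_true, y_pred))   # Equation 2.5
print("AUC:     ", roc_auc_score(y_true, y_prob))    # area under the ROC curve
print("F1:      ", f1_score(y_true, y_pred))         # Equation 2.10
print("log loss:", log_loss(y_true, y_prob))         # Equation 2.11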

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

            Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

            Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

            Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

            Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

            Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale free and is obtained through

MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination, r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
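The regression error metrics above can be summarised in a few lines of Python. The sketch below uses scikit-learn and NumPy on a small made-up set of actual and predicted values; the number of predictors k is an arbitrary example value, and note that scikit-learn's r2_score uses the standard 1 − SS_res/SS_tot definition rather than the squared-correlation form of Equation 2.18.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 1.2, 1.5, 1.9, 2.4])   # hypothetical actual values
y_pred = np.array([1.1, 1.2, 1.4, 2.0, 2.3])   # hypothetical predicted values

mae = mean_absolute_error(y_true, y_pred)                    # Equation 2.12
mse = mean_squared_error(y_true, y_pred)                     # Equation 2.13
rmse = np.sqrt(mse)                                          # Equation 2.14
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))     # Equation 2.17
r2 = r2_score(y_true, y_pred)                                # coefficient of determination

n, k = len(y_true), 2                                        # k predictors (example value)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)                # Equation 2.19
print(mae, mse, rmse, mape, r2, r2_adj)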

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

            Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

            Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
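As a brief illustration of how such a model could be applied to a univariate series like the differential pressure, the sketch below fits a SARIMA model with statsmodels. The series is randomly generated and the model orders are arbitrary examples, not values used in this thesis.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical differential pressure series, one reading every 5 seconds.
dp_series = 0.2 + np.cumsum(np.abs(np.random.normal(0.01, 0.005, 200)))

# Example orders: (p, d, q) for the non-seasonal part and (P, D, Q, s) for the seasonal part.
model = SARIMAX(dp_series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
fitted = model.fit(disp=False)

print(fitted.forecast(steps=6))   # forecast the next 6 samples, i.e. 30 seconds ahead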

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configurations of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
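A perceptron is easily expressed in a few lines of NumPy, as sketched below; the weights and bias are arbitrary illustration values.

import numpy as np

def perceptron(x, w, b):
    """Single perceptron implementing the rule in Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.7, 0.2])   # the first input matters more than the second
b = -0.5

print(perceptron(np.array([1, 0]), w, b))   # 1, since 0.7 - 0.5 > 0
print(perceptron(np.array([0, 1]), w, b))   # 0, since 0.2 - 0.5 <= 0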

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

            Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

            Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are negative or that approach zero, also known as the dying ReLU problem [34].

            Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot sigmoid(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
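For reference, the three activation functions of Equations 2.21, 2.23 and 2.24 can be written compactly as below; β defaulting to 1 is an assumption for the example, not a recommendation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)            # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)         # Equation 2.24

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), relu(z), swish(z), sep="\n")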

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

            Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

            Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions:

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented above and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

            Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_{t-1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)    (2.26)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
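A minimal sketch of Equation 2.26 is given below, with the previous output and the current input concatenated before each gate's weight matrix is applied; the sizes and the random weights are placeholders, not trained values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Input, output and forget gate activations from Equation 2.26."""
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ z + b_i)
    o_t = sigmoid(w_o @ z + b_o)
    f_t = sigmoid(w_f @ z + b_f)
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
h_prev, x_t = np.zeros(2), rng.normal(size=3)          # 2 hidden units, 3 input features
w_i, w_o, w_f = (rng.normal(size=(2, 5)) for _ in range(3))
print(lstm_gates(h_prev, x_t, w_i, w_o, w_f, np.zeros(2), np.zeros(2), np.zeros(2)))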

            Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


            Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
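The three steps illustrated in Figures 2.4-2.6 can be mimicked on a small 1-dimensional example. The input signal and the kernel below are made-up numbers used only to show how the data shrink at each stage.

import numpy as np

signal = np.array([0.2, 0.5, 0.7, 1.0, 0.9, 0.4, 0.3, 0.1])   # hypothetical input
kernel = np.array([0.25, 0.5, 0.25])                          # filter of size 3

# Convolution: slide the kernel over the input (compare Figure 2.4).
convolved = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 2)])

# Max pooling with pool size 2 (compare Figure 2.5).
pooled = convolved[: len(convolved) // 2 * 2].reshape(-1, 2).max(axis=1)

# Flattening into a 1-D feature vector (compare Figure 2.6).
flattened = pooled.flatten()
print(convolved, pooled, flattened, sep="\n")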


            Chapter 3

            Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
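The labelling script itself is not reproduced in the thesis; the sketch below is a minimal pandas illustration of the rule described above. The column names and the thresholds on the differential pressure trend and the flow decrease are assumptions and would have to be tuned and verified by visual inspection, as was done here.

import pandas as pd

def label_clogging(df, dp_col="diff_pressure", flow_col="system_flow",
                   dp_slope_limit=0.001, flow_drop_limit=0.2):
    """Assign clogging labels 1-3 following the states in section 2.1.4 (illustrative thresholds)."""
    dp_slope = df[dp_col].diff().rolling(12).mean()          # trend of the differential pressure
    flow_drop = 1.0 - df[flow_col] / df[flow_col].iloc[0]    # relative decrease in system flow

    labels = pd.Series(1, index=df.index)                                            # 1: no/little clogging
    labels[dp_slope > dp_slope_limit] = 2                                            # 2: steadily increasing dp
    labels[(dp_slope > 5 * dp_slope_limit) & (flow_drop > flow_drop_limit)] = 3      # 3: fully clogged
    return labels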


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

    Test    Samples   Points labelled clog-1   Points labelled clog-2
    I       685       685                      0
    II      220       25                       195
    III     340       35                       305
    IV      210       11                       199
    V       375       32                       343
    VI      355       7                        348
    VII     360       78                       282
    VIII    345       19                       326
    IX      350       10                       340
    X       335       67                       268
    XI      340       43                       297
    Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM, to initially test the suitability of the data for time series forecasting, and the CNN, for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} red \\ blue \\ green \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
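A minimal way to produce such an encoding in Python is sketched below using pandas; the label values are just an example.

import pandas as pd

labels = pd.Series([1, 2, 1, 2, 2])              # hypothetical clogging labels
onehot = pd.get_dummies(labels, prefix="clog")   # one binary column per label value
print(onehot)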

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
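The transform in Equation 3.1 corresponds directly to scikit-learn's MinMaxScaler, sketched below on made-up readings; the inverse transform recovers the original values.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.3, 120.0], [0.5, 110.0], [0.9, 80.0]])   # hypothetical dp and flow readings

scaler = MinMaxScaler()                            # applies Equation 3.1 per feature
X_scaled = scaler.fit_transform(X)                 # every column now lies in [0, 1]
X_restored = scaler.inverse_transform(X_scaled)    # easy to revert after processing
print(X_scaled, X_restored, sep="\n")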

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix}    (3.2)

X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix}    (3.3)
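A sequencing function of this kind can be sketched as below; the function name and the random stand-in data are assumptions, but the window of 5 past values matches the description above.

import numpy as np

def make_sequences(data, n_past=5):
    """Pair each target row with the n_past preceding rows (compare Equations 3.2 and 3.3)."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # window of past observations
        y.append(data[i])              # value to predict
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)          # stand-in for 100 time steps of 4 sensor variables plus the clogging label
X, y = make_sequences(data)            # X: (95, 5, 5), y: (95, 5)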


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
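A Keras sketch consistent with this description is given below. The optimizer, the exact input dimensions and any other details not stated in the text are assumptions rather than the thesis' actual configuration.

from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 5, 5   # 5 past time steps; feature count assumed

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True, input_shape=(n_steps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")      # or loss="mse"

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500, callbacks=[early_stop])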

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
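A corresponding Keras sketch of the described CNN is shown below; as before, the optimizer and any details not given in the text are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

n_steps_in, n_features, n_steps_out = 12, 5, 6   # 60 s of history, 30 s of predictions; feature count assumed

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_steps_in, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(n_steps_out),                   # 6 predicted future values
])
model.compile(optimizer="adam", loss="mse")      # or loss="mae"

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)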

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training on both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors would be more penalizing as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud to a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


            Chapter 4

            Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

    Loss function   # of epochs   MSE     RMSE    R²      MAE
    MAE             738           0.001   0.029   0.981   0.016
    MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

    # of epochs   Accuracy   ROC     F1      log-loss
    190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                         Prediction
                     Label 1   Label 2
    Actual  Label 1     109         1
            Label 2       3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

    Loss function   # of epochs   MSE     RMSE    R²      MAE
    MAE             756           0.007   0.086   0.876   0.025
    MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

    Regression network   # of epochs   Accuracy   AUC     F1      log-loss
    MAE                  1203          91.4%      0.826   0.907   3.01
    MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


            Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected as MSE-based regression is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the positive number and negative number of examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
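As a rough illustration of this point, the label counts in Table 4.6 alone already show how far a trivial majority-class predictor would get. The sketch below is only a back-of-the-envelope check on those counts, not part of the evaluated models.

```python
# Back-of-the-envelope check: accuracy of a model that always predicts label 2,
# using the label counts from Table 4.6 (111 samples of label 1, 669 of label 2).
n_label1 = 82 + 29    # actual label 1 samples in the validation set
n_label2 = 38 + 631   # actual label 2 samples in the validation set

majority_accuracy = n_label2 / (n_label1 + n_label2)
print(f"Always-predict-label-2 accuracy: {majority_accuracy:.1%}")  # about 85.8%
```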

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is overall better for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
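As a minimal sketch of what such interval-dependent pre-processing can look like, the snippet below slices a logged sensor array into input windows and prediction targets for two different horizons. The array shape, sampling rate and horizon values are illustrative assumptions, not the exact pipeline used in the thesis.

```python
# Minimal sketch of window-based pre-processing for two prediction horizons,
# assuming a (time, features) array sampled once per second. Shapes and
# horizons below are illustrative only.
import numpy as np

def make_windows(data, history, horizon):
    """Pair `history` past steps with the value `horizon` steps ahead."""
    X, y = [], []
    for t in range(history, len(data) - horizon + 1):
        X.append(data[t - history:t])     # e.g. samples t-5 ... t-1
        y.append(data[t + horizon - 1])   # target one step or 30 steps ahead
    return np.array(X), np.array(y)

data = np.random.rand(1000, 5)                                 # placeholder sensor log
X_short, y_short = make_windows(data, history=5, horizon=1)    # short-horizon setup
X_long, y_long = make_windows(data, history=30, horizon=30)    # long-horizon setup
```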


            Chapter 6

            Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would make it possible to see how model performance changes. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration both when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


            Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, 03 2018, Kuala Lumpur, Malaysia.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 175 60806 and 2408, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

              Chapter 1

              Introduction

1.1 Background

Ballast water tanks are used on ships to stabilize the ship for different shipping loads. When a ship isn't fully loaded, or when a ship must run at a sufficient depth, water is pumped into the ballast water tanks through a water pumping system. To preserve existing ecosystems, as well as preventing the spread of bacteria, larvae or other microbes, ballast water management is regulated world-wide by the International Convention for the Control and Management of Ships' Ballast Water and Sediments (BWM convention).

PureBallast 3 (PB3) is a ballast water treatment system (BWTS) designed by Alfa Laval that works as an extension to the ships' water pumping system. The BWTS uses a filter for physical separation of organisms and total suspended solids (TSS), and a UV reactor for main treatment of the ballast water. As PB3 can be installed on a variety of ships, the BWTS must be able to process different waters under different conditions to fulfill the requirements of the BWM convention.

1.2 Problem Description

In the existing system there is currently no way of detecting if the filter in use is about to clog. A clogged filter forces a stop of the entire process, and the only way to get the BWTS functional again is to physically remove the filter, disassemble it, clean it, reassemble it and put it back. This cleaning process involves risks of damaging the filter, is expensive to carry out and takes up unnecessary time.

Furthermore, due to the different concentrations of TSS in waters around the world, as well as different supply flows and supply pressures from various ship pumping systems, the load on the BWTS may vary greatly. This imposes problems in estimating actual system health when using a static model, as there are discrepancies introduced in the measured parameters.


These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the originally proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of interest of the BWTS is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand it is an engineering task: investigating if an NN can be developed for the BWTS. On the other hand it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation of constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is executed and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis


              Chapter 2

              Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and where the filtration is done with regards to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water measures the differential pressure ∆p over the filter through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in the category of being self-cleaning. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need of having to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com
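A minimal sketch of that switching logic is shown below; the pressure variables, units and threshold value are illustrative assumptions and not taken from the PB3 control system.

```python
# Minimal sketch of threshold-based backwash switching, assuming the two
# transducer pressures are available in bar and p_set is the filter-specific
# threshold. All names and values are illustrative.
def backwash_needed(p_in: float, p_out: float, p_set: float) -> bool:
    """Return True when the differential pressure over the filter exceeds p_set."""
    delta_p = p_in - p_out
    return delta_p > p_set

print(backwash_needed(p_in=2.4, p_out=1.9, p_set=0.4))  # True -> start backwashing
```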


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles in the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information, as sketched below.
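The following sketch illustrates one way such per-sequence labelling could be written, using the trend in ∆p and the relative drop in Q; the threshold values are illustrative assumptions, not the criteria used in the thesis.

```python
# Minimal sketch of labelling one pumping sequence with the three clogging
# states above, based on the trend in differential pressure (dp) and the
# relative drop in flow rate (q). Thresholds are illustrative only.
import numpy as np

def label_clogging(dp, q, dp_slope_lim=0.001, q_drop_lim=0.10):
    dp, q = np.asarray(dp, float), np.asarray(q, float)
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]  # linear trend of dp
    q_drop = (q[0] - q[-1]) / q[0]                       # relative drop in flow

    if dp_slope <= dp_slope_lim and q_drop < q_drop_lim:
        return 1   # steady dp and Q -> no/little clogging
    if q_drop < q_drop_lim:
        return 2   # rising dp, still steady Q -> moderate clogging
    return 3       # rising dp and dropping Q -> fully clogged
```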

Figure 2.3: Visualization of the clogging states³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and it can be described by Darcy's equation [9] as

$$Q_L = \frac{KA}{\mu L}\,\Delta p \tag{2.1}$$

rewritten as

$$\Delta p = \frac{\mu L}{KA}\,Q_L \tag{2.2}$$

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

$$\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2}\,\frac{(1-\varepsilon)^2}{\varepsilon^3}\,L \tag{2.3}$$

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

$$\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75(1-\varepsilon)\rho V_s^2 L}{\varepsilon^3 D_p} \tag{2.4}$$

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable | Description                         | Unit
∆p       | Pressure drop                       | Pa
L        | Total height of filter cake         | m
V_s      | Superficial (empty-tower) velocity  | m/s
µ        | Viscosity of the fluid              | kg/(m·s)
ε        | Porosity of the filter cake         | m²
D_p      | Diameter of the spherical particle  | m
ρ        | Density of the liquid               | kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to variables affect the final differential pressure.
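To make the role of the individual variables concrete, the sketch below evaluates Equation 2.4 numerically; the input values are arbitrary placeholders, not measured properties of the BWTS filter cake.

```python
# Minimal sketch evaluating Ergun's Equation 2.4 with the variables of Table 2.1.
# The numerical inputs are placeholders, not measurements from the BWTS.
def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
    """Pressure drop [Pa] over the filter cake: viscous term + inertial term."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0,
                         eps=0.4, D_p=1.0e-4, L=0.002)
print(f"Estimated pressure drop: {dp:.0f} Pa")
```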

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                      Prediction
                      Positive               Negative
Actual   Positive     True Positive (TP)     False Positive (FP)
         Negative     False Negative (FN)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

$$ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \tag{2.5}$$

by comparing the actual value y_i and the predicted value ŷ_i for a group of samples n. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the false positive rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7 respectively:

$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{2.6}$$

$$\text{specificity} = \frac{TN}{TN + FP} \tag{2.7}$$

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC curve area is limited by the range 0 to 1, where a higher value means a well performing model.

              F1 Score

The F1 score is a measurement to evaluate how many samples the classifier classifies correctly and how robust it is to not misclassify a number of samples [21]. For the F1 score, precision is referred to as the percentage of correctly classified samples and recall is referred to as the percentage of actual correct classification [22]. Precision, recall and F1 score are obtained through

$$\text{precision} = \frac{TP}{TP + FP} \tag{2.8}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{2.9}$$

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{2.10}$$


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited by the range 0 to 1.

              Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

$$\text{LogLoss} = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \tag{2.11}$$
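As a minimal sketch of how the classification metrics above relate to raw confusion-matrix counts, the snippet below computes them in plain Python; the counts passed in at the end come from Table 4.3 and are only used as a worked example, treating label 1 as the positive class.

```python
# Minimal sketch of the classification metrics in Equations 2.5-2.11, computed
# from raw confusion-matrix counts. Example counts are those of Table 4.3.
import math

def classification_metrics(TP, FP, FN, TN):
    sensitivity = TP / (TP + FN)                                  # Eq. 2.6 (= recall, Eq. 2.9)
    specificity = TN / (TN + FP)                                  # Eq. 2.7
    precision = TP / (TP + FP)                                    # Eq. 2.8
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. 2.10
    accuracy = (TP + TN) / (TP + FP + FN + TN)                    # Eq. 2.5
    return sensitivity, specificity, precision, f1, accuracy

def log_loss_single(y_onehot, p_pred):
    """Log loss for a single observation, Eq. 2.11."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, p_pred))

print(classification_metrics(TP=109, FP=3, FN=1, TN=669))  # accuracy ~ 0.995
print(log_loss_single([0, 1], [0.1, 0.9]))                 # ~ 0.105
```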

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

              Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over predicted or under predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{2.12}$$

              Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.13}$$

              Root Mean Squared Error (RMSE)

RMSE is simply the root of MSE. The introduction of the square root scales the error to be on the same scale as the targets.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14}$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged.

$$\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \tag{2.15}$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers with enough samples n.

              Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27].

$$MSPE = \frac{100\%}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \tag{2.16}$$

              Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2.17}$$


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n}\left((y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right)}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \tag{2.18}$$

r² has some drawbacks. It does not take into account if the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² will adjust for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1-r^2)(n-1)}{n-k-1}\right] \tag{2.19}$$

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
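For reference, a minimal sketch of the regression metrics above is given below; it uses NumPy for brevity, the squared-correlation form of Equation 2.18, and an assumed predictor count k for the adjusted score.

```python
# Minimal sketch of the regression error metrics in Equations 2.12-2.19.
# k is the number of predictor variables used for the adjusted r2.
import numpy as np

def regression_metrics(y, y_hat, k=1):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    mae = np.mean(np.abs(y - y_hat))                      # Eq. 2.12
    mse = np.mean((y - y_hat) ** 2)                       # Eq. 2.13
    rmse = np.sqrt(mse)                                   # Eq. 2.14
    mape = 100.0 * np.mean(np.abs((y - y_hat) / y))       # Eq. 2.17
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2                 # Eq. 2.18 (squared correlation)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)         # Eq. 2.19
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "MAPE": mape, "r2": r2, "r2_adj": r2_adj}

print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9], k=1))
```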

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

              Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

              Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary on variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log transformed data makes the entire series stationary on both mean and variance, and allows for the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
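A minimal sketch of fitting such a model to a univariate series is shown below, assuming the statsmodels library is available; the series, the log transform and the model orders are illustrative placeholders rather than choices made in the thesis.

```python
# Minimal sketch of a SARIMA fit on a univariate series with statsmodels.
# The data, log transform and (p, d, q)(P, D, Q, s) orders are placeholders.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

series = np.random.rand(300) + 1.0             # placeholder for a logged dp series
log_series = np.log(series)                    # stabilise the variance

model = SARIMAX(log_series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
fitted = model.fit(disp=False)
forecast = np.exp(fitted.forecast(steps=30))   # back-transformed 30-step forecast
```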

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configurations of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
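A minimal sketch of the perceptron rule in Equation 2.20, written in numpy with purely illustrative weights and bias:

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the binary inputs plus the bias, passed through the step function
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])           # binary inputs
w = np.array([0.6, -0.4, 0.3])    # weights: how important each input is
b = -0.5                          # bias: how easily the perceptron outputs a 1

print(perceptron(x, w, b))        # 0.6 + 0.3 - 0.5 = 0.4 > 0, so the output is 1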

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

              Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless; an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

              Rectified Function

              The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because the ReLU outputs zero for all negative inputs and inputs approaching zero, neurons can get stuck in a state where they no longer activate or update their weights, which is known as the dying ReLU problem [34].

              Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is either a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on ImageNet by 0.9% and 0.6% for the widely used Mobile NASNet-A and Inception-ResNet-v2 models respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
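The three activation functions discussed above can be summarised in a few lines of numpy (a sketch; β is treated here as a constant rather than a trainable parameter):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)             # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)          # Equation 2.24

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), swish(z))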

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

              Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, the input layer of an SNN is restricted to strictly processing 2D arrays as input data.

              Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that no single function has to be all-descriptive but instead only takes certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in Section 2.3.4, and is further supported by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons are able to feed information from the previous pass of data back to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

              Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM-block output at the previous time step (h_{t-1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
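A sketch of the gate computations in Equation 2.26, with randomly initialised weights purely for illustration. A complete LSTM additionally maintains a cell state that the gates act upon, which is omitted here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 4, 3
rng = np.random.default_rng(0)

# One weight matrix and one bias vector per gate, acting on [h_(t-1), x_t]
w_i, w_o, w_f = (rng.standard_normal((n_hidden, n_hidden + n_input)) for _ in range(3))
b_i = b_o = b_f = np.zeros(n_hidden)

h_prev = np.zeros(n_hidden)              # LSTM-block output at the previous time step
x_t = np.array([0.2, 0.5, 0.1])          # input at the current time step
concat = np.concatenate([h_prev, x_t])

i_t = sigmoid(w_i @ concat + b_i)        # input gate
o_t = sigmoid(w_o @ concat + b_o)        # output gate
f_t = sigmoid(w_f @ concat + b_f)        # forget gate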

              Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music and raw speech [44].


              Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. With average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
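To make the three operations concrete, the sketch below applies a kernel of size 3, max pooling with pool size 2 and a flattening step to a small 1-dimensional input. The numbers are illustrative only.

import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0])   # 1-D input signal
kernel = np.array([0.5, 1.0, 0.5])                        # filter of size 3

# Convolution: slide the kernel over the input and apply it at every position
conv = np.array([np.dot(x[i:i + 3], kernel) for i in range(len(x) - 2)])

# Max pooling with pool size 2: keep the largest value in each pair
pooled = conv[:len(conv) // 2 * 2].reshape(-1, 2).max(axis=1)

# Flattening turns the pooled feature map into a plain vector for the dense layers
flat = pooled.flatten()
print(conv, pooled, flat)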


              Chapter 3

              Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of two weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service, Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels together with the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the goal is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve higher accuracy.
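A sketch of the encoding in numpy, equivalent to what scikit-learn's OneHotEncoder or Keras' to_categorical produce (the labels shown are illustrative):

import numpy as np

labels = np.array([1, 2, 3, 2])                  # categorical clogging labels
classes = np.unique(labels)                      # [1, 2, 3]

# Each label is compared against every class, producing one binary column per class
one_hot = (labels[:, None] == classes[None, :]).astype(int)
# one_hot -> [[1 0 0], [0 1 0], [0 0 1], [0 1 0]]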

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
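A sketch of the min-max transform in Equation 3.1 using scikit-learn, which also provides the inverse transform mentioned above (the data are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.4, 210.0],
                 [0.9, 180.0],
                 [1.6, 150.0]])                 # e.g. differential pressure and flow

scaler = MinMaxScaler()                         # scales every feature to [0, 1]
scaled = scaler.fit_transform(data)
restored = scaler.inverse_transform(scaled)     # reverts to the original values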

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix}    (3.2)

X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix}    (3.3)
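A sketch of a sequencing function of the kind described above, assuming a 2-D array of shape (samples, features) and a window of 5 past time steps. The function name and data are illustrative, not taken from the thesis implementation.

import numpy as np

def make_sequences(data, n_past=5):
    # Pair every window of n_past observations with the observation one step ahead
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])    # shape (n_past, n_features)
        y.append(data[i])               # the value one time step (5 seconds) ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)           # 100 samples of 4 sensor variables
X, y = make_sequences(data)             # X: (95, 5, 4), y: (95, 4)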


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer, which uses the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
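A sketch of the described network in Keras, assuming 5 time steps and 4 input features; the layer sizes, activation functions and early-stopping criterion follow the text, while everything else (optimiser, data) is an illustrative assumption.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    LSTM(32, activation='relu', return_sequences=True, input_shape=(5, 4)),
    LSTM(32, activation='relu'),
    Dense(1, activation='sigmoid'),   # one-step prediction of the scaled target
])
model.compile(optimizer='adam', loss='mae')

early_stop = EarlyStopping(monitor='val_loss', patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])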

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
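A sketch of the described CNN in Keras, assuming 12 past time steps and 4 input features. The filter count, kernel size, pool size, dense-layer sizes and early stopping follow the text; the activation functions and optimiser are illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu', input_shape=(12, 4)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dense(6),                          # one output per predicted future observation
])
model.compile(optimizer='adam', loss='mae')

early_stop = EarlyStopping(monitor='val_loss', patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])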

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function that minimises the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
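The three loss functions can be summarised as below (a numpy sketch, where y denotes the true values and y_hat the predictions):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)        # penalises large errors (outliers) heavily

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))       # more robust to outliers

def binary_cross_entropy(y, p, eps=1e-12):
    # y holds the true labels (0 or 1), p the predicted probabilities of label 1
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))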


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


              Chapter 4

              Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                       Label 1    Label 2
Actual    Label 1      109        1
          Label 2      3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                       Label 1    Label 2
Actual    Label 1      82         29
          Label 2      38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                       Label 1    Label 2
Actual    Label 1      69         41
          Label 2      11         659


              Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained with the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained with the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unexpected as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each step is one time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted for a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


              Chapter 6

              Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


              Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] OF Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] OF Eker, Faith Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the f-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (mae) and the root mean square error (rmse) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (rmse) or mean absolute error (mae)? – arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 11 pages, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (relu). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandala. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jaroslaw Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

                CHAPTER 1 INTRODUCTION

These problems make it impossible to achieve optimal operability while ensuring that the BWTS runs safely in every environment. A desired solution is to develop a method which can analyze and react to changes imposed on the system on the fly. One such solution could be to use the large amounts of data generated by the system to create a neural network (NN) to estimate the state of the BWTS.

1.3 Purpose, Definitions & Research Questions

The use of machine learning (ML) in this type of system is a new application of the otherwise popular statistical tool. As there is no existing public information on how such an implementation can be done, the focus of the thesis and the main research question is:

• To investigate and evaluate the possibility of using ML for predictively estimating filter clogging, with focus on maritime systems.

An NN model will be developed and evaluated in terms of how accurately it can predict the clogging of the filter in the BWTS using the system's sensor data. The implementation of an NN into a system like this is the first of its kind and will require analysis and understanding of the system to ensure a working realization. Therefore, with the BWTS in mind, the original proposed research question can be specified further to be:

• How can an NN be integrated with a BWTS to estimate clogging of a basket filter?

1.4 Scope and Delimitations

In comparison to the initial thesis scope provided in Appendix A, some delimitations have been made. First and foremost, the only part of the BWTS of interest is the filter. When designing the NN, the data used will be from in-house testing as well as from the cloud service Connectivity, so no alterations to the existing hardware and software can be made. For the purpose of focusing on the ML model and the system sensors, all data are assumed to be constantly available.

It is also not possible to test different kinds of ML methods or a plethora of NNs within the time frame of the thesis. For that reason, the decision to develop an NN is based on its success in predictive analytics in other fields [1, 2], and the frame of reference for deciding the most appropriate NN will depend on the characteristics of the data and the future requirements of the NN.


1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task: investigating whether an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, that the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies on how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade or rate of clogging of the filter, state-of-the-art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or that they have been investigated. With the results at hand, a conclusion can be presented describing how the system can be adapted to detect clogging. Suggestions on how the system can be further improved upon and other future work will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


                Chapter 2

                Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured by two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomenon, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady-state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With this classification logic in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information. A minimal sketch of such a labelling rule is given below.
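A minimal Python sketch of how this labelling logic could be expressed is given here; the trend estimates and the threshold values are illustrative assumptions, not values taken from the thesis.

    import numpy as np

    def clogging_state(dp, q, dp_rise_tol=0.05, dp_exp_tol=0.5, q_drop_tol=0.3):
        """Assign a clogging state (1-3) to one pumping sequence.

        dp, q : 1-D arrays of differential pressure and flow rate over the sequence.
        The threshold values are illustrative placeholders, not from the thesis.
        """
        t = np.arange(len(dp))
        dp_slope = np.polyfit(t, dp, 1)[0]            # overall trend of delta-p
        q_drop = (q[0] - q[-1]) / max(q[0], 1e-9)     # relative loss of flow

        if dp_slope <= dp_rise_tol:
            return 1   # steady delta-p and Q -> no/little clogging
        if dp_slope > dp_exp_tol and q_drop > q_drop_tol:
            return 3   # sharp delta-p rise and collapsing Q -> fully clogged
        return 2       # rising delta-p with roughly steady Q -> moderate clogging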

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify the filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \, \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} \, Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \cdot \frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

    Variable   Description                           Unit
    ∆p         Pressure drop                         Pa
    L          Total height of filter cake           m
    V_s        Superficial (empty-tower) velocity    m/s
    µ          Viscosity of the fluid                kg/(m·s)
    ε          Porosity of the filter cake           –
    D_p        Diameter of the spherical particle    m
    ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations to the variables affect the final differential pressure.
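As a small illustration of Equation 2.4, the pressure drop can be computed directly from the listed variables; the function below is a direct transcription of the two Ergun terms, with the variable names taken from Table 2.1.

    def ergun_pressure_drop(V_s, mu, rho, eps, D_p, L):
        """Pressure drop over a filter cake according to Ergun's equation (2.4).

        V_s: superficial velocity [m/s], mu: fluid viscosity [kg/(m*s)],
        rho: liquid density [kg/m^3], eps: cake porosity [-],
        D_p: particle diameter [m], L: cake height [m].
        """
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial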

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach has also been investigated for predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                         Prediction: Positive    Prediction: Negative
    Actual: Positive     True Positive (TP)      False Negative (FN)
    Actual: Negative     False Positive (FP)     True Negative (TN)

The definition of accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

\mathrm{ACC} = \frac{\sum_{i=1}^{n} j_i}{n}, \quad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. These metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity and the false positive rate as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively:

\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)

\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC area is limited to the range 0 to 1, where a higher value means a well-performing model.

F1 Score

The F1 score is a measurement used to evaluate how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through:

\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)

\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)

F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through:

\mathrm{LogLoss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)
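A compact sketch of how these classification metrics can be computed from confusion-matrix counts and predicted class probabilities is given below (plain NumPy; the log loss here is averaged over observations and assumes one-hot encoded true labels).

    import numpy as np

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall and F1 score from confusion-matrix counts (Eqs. 2.8-2.10)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    def log_loss(y_true, p_pred, eps=1e-15):
        """Log loss (Eq. 2.11), averaged over observations; y_true is one-hot encoded."""
        p = np.clip(p_pred, eps, 1.0)
        return float(-np.sum(y_true * np.log(p), axis=1).mean())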

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through:


\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial \mathrm{RMSE}}{\partial \hat{y}_i} = \frac{1}{\sqrt{\mathrm{MSE}}} \cdot \frac{\partial \mathrm{MSE}}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with the squared errors while MSPE considers the relative error [27]:

\mathrm{MSPE} = \frac{100}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through:

\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination, r²

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free, in contrast to MSE and RMSE, and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data:

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r² has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r²-score will always increase simply because the new fit has more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If variables that prove to be useless are added, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through:

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r² can therefore more accurately show how much of the variation in the dependent variable is explained by the independent variables. Furthermore, adding independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
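The regression error metrics above are straightforward to compute; the sketch below gathers Equations 2.12 to 2.19 in one helper (note that r² is computed here in the common 1 − SS_res/SS_tot form rather than as the squared correlation of Equation 2.18).

    import numpy as np

    def regression_metrics(y_true, y_pred, k):
        """MAE, MSE, RMSE, MAPE, r2 and adjusted r2 for k predictors (Eqs. 2.12-2.19)."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        n = len(y_true)
        err = y_true - y_pred
        mae = np.abs(err).mean()
        mse = (err ** 2).mean()
        rmse = np.sqrt(mse)
        mape = 100.0 * np.abs(err / y_true).mean()
        r2 = 1.0 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
        r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
        return {"MAE": mae, "MSE": mse, "RMSE": rmse,
                "MAPE": mape, "r2": r2, "r2_adj": r2_adj}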

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
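For reference, a SARIMA model of the kind described above could be fitted with statsmodels as sketched below; the file name, column name and (p, d, q)(P, D, Q, s) orders are illustrative assumptions only, since the thesis does not implement a SARIMA model.

    # Hypothetical univariate forecast of the differential pressure with SARIMA.
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    df = pd.read_csv("filter_test_cycle.csv")   # hypothetical export of one test cycle
    series = df["differential_pressure"]        # hypothetical column name

    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
    fitted = model.fit(disp=False)
    forecast = fitted.get_forecast(steps=6).predicted_mean  # next 6 samples (30 s at 5 s sampling)
    print(forecast)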

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer and creates a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]:

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
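A perceptron following Equation 2.20 is only a few lines of code; the AND-gate weights below are a classic illustrative choice, not something taken from the thesis.

    import numpy as np

    def perceptron(x, w, b):
        """Binary perceptron output according to Equation 2.20."""
        return 1 if np.dot(w, x) + b > 0 else 0

    # Example: weights and bias chosen so the perceptron acts as an AND gate.
    w, b = np.array([0.6, 0.6]), -1.0
    print(perceptron(np.array([1, 1]), w, b))  # -> 1
    print(perceptron(np.array([1, 0]), w, b))  # -> 0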

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are negative or that approach zero, which is known as the dying ReLU problem [34].

                Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x) \qquad (2.24)

where β is either a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
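The three activation functions discussed above follow directly from Equations 2.21, 2.23 and 2.24; a minimal NumPy sketch:

    import numpy as np

    def sigmoid(z):
        """Sigmoid activation, Equation 2.21."""
        return 1.0 / (1.0 + np.exp(-z))

    def relu(x):
        """Rectified linear unit, Equation 2.23."""
        return np.maximum(0.0, x)

    def swish(x, beta=1.0):
        """Swish activation, Equation 2.24; beta is a constant or trainable parameter."""
        return x * sigmoid(beta * x)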

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

                Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

                Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions:

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and the existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data back to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short-Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights ω_x for each respective gate's neurons, the LSTM block's output h_{t-1} at the previous time step, input x_t at the current time step, and respective gate bias b_x, as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o) \qquad (2.26)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
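The three gate activations of Equation 2.26 can be sketched in NumPy as below; the weight matrices and biases are assumed to be given, and the full cell-state and hidden-state updates of an LSTM are left out.

    import numpy as np

    def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
        """Input, output and forget gate activations of one LSTM cell (Eq. 2.26)."""
        z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
        sigma = lambda v: 1.0 / (1.0 + np.exp(-v))     # sigmoid activation
        i_t = sigma(w_i @ z + b_i)
        o_t = sigma(w_o @ z + b_o)
        f_t = sigma(w_f @ z + b_f)
        return i_t, o_t, f_t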

                Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer, the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. With average pooling, the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
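To make the convolution, pooling and flattening steps of Figures 2.4 to 2.6 concrete, a toy 1-D example in NumPy is given below; the input values and kernel are arbitrary and chosen only for illustration.

    import numpy as np

    def conv1d(x, kernel):
        """Slide a kernel over a 1-D sequence (valid padding), as in Figure 2.4."""
        k = len(kernel)
        return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

    def max_pool1d(x, pool_size=2):
        """Keep the maximum of each non-overlapping window, as in Figure 2.5."""
        trimmed = x[: len(x) // pool_size * pool_size]
        return trimmed.reshape(-1, pool_size).max(axis=1)

    x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0])
    feature_map = conv1d(x, kernel=np.array([0.5, 1.0, 0.5]))
    pooled = max_pool1d(feature_map)
    flattened = pooled.ravel()   # flattening step (trivial for one 1-D feature map), Figure 2.6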


                Chapter 3

                Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered to be 3 when the change in differential pressure experiences an exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data are clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, since the data cannot be entirely separated into two clusters. A summary containing the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

    Test     Samples   Points labelled clog-1   Points labelled clog-2
    I        685       685                      0
    II       220       25                       195
    III      340       35                       305
    IV       210       11                       199
    V        375       32                       343
    VI       355       7                        348
    VII      360       78                       282
    VIII     345       19                       326
    IX       350       10                       340
    X        335       67                       268
    XI       340       43                       297
    Total    3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all actual classification labels equally rather than prioritize a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.
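With scikit-learn, the one-hot encoding of the clogging labels can be sketched as follows; the encoder class is an assumption, since the thesis does not name the implementation used.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    labels = np.array([[1], [2], [1], [2]])          # clogging labels as a column vector
    encoder = OneHotEncoder(sparse_output=False)     # use sparse=False on older scikit-learn
    one_hot = encoder.fit_transform(labels)          # label 1 -> [1, 0], label 2 -> [0, 1]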

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1, by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
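The min-max transform of Equation 3.1 and its inverse can be sketched with scikit-learn's MinMaxScaler; the example data below are placeholders, not measurements from the thesis.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    sensor_data = np.array([[0.2, 110.0], [0.5, 95.0], [1.4, 60.0]])  # placeholder (samples, features)

    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(sensor_data)       # every feature mapped into [0, 1]
    restored = scaler.inverse_transform(scaled)      # the transform is easy to invert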

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t),\ V_2(t),\ \dots,\ V_{n-1}(t),\ V_n(t)] \qquad (3.2)

X(t) = [V_1(t-5),\ V_2(t-5),\ \dots,\ V_{n-1}(t),\ V_n(t)] \qquad (3.3)
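A minimal sketch of such a sequencing function is given below, assuming the data are already scaled and arranged as a (samples, features) array; the exact implementation used in the thesis may differ.

    import numpy as np

    def make_sequences(data, n_past=5):
        """Pair each sample with its n_past previous samples (cf. Equation 3.3).

        data: array of shape (samples, features). Returns X with shape
        (samples - n_past, n_past, features) and y with shape (samples - n_past, features).
        """
        X, y = [], []
        for t in range(n_past, len(data)):
            X.append(data[t - n_past:t])   # the 25 s window of past measurements
            y.append(data[t])              # the value 5 s ahead
        return np.array(X), np.array(y)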


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before passing them to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data. A sketch of how such a network can be set up is given below.
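A Keras sketch matching this description is shown below; the optimizer, the MAE loss (cf. section 3.3) and the placeholder names n_features, X_train, y_train, X_val and y_val are assumptions, not details stated in the thesis.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True, input_shape=(5, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),      # one output neuron for parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=1500, callbacks=[early_stop])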

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss hasn't seen any improvement for 150 subsequent epochs.
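A corresponding Keras sketch of the CNN described above (again assuming TensorFlow/Keras; unspecified details such as the optimiser and the hidden activations are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features = 4

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(12, n_features)),   # 12 past observations per sample
    MaxPooling1D(pool_size=2),               # halves the feature map
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                                 # 6 future steps = 30 s ahead
])
model.compile(optimizer="adam", loss="mae")   # or "mse"

# X_train, y_train, X_val, y_val as produced by split_sequences() and an 80/20 split
early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop])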

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
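A hedged sketch of this input/output adjustment using pandas; the file name and column names are purely illustrative and do not reproduce the thesis data layout:

import pandas as pd

df = pd.read_csv("bwts_samples.csv")                       # hypothetical file name

feature_cols = ["diff_pressure", "system_pressure",
                "system_flow", "backflush_flow"]           # illustrative variable names
X = df[feature_cols].to_numpy()                            # inputs: variable values only
y = df["clogging_label"].to_numpy()                        # outputs: clogging labels only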

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they often come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes, such as the MSE tends to produce, is a poor prediction, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                Chapter 4

                Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       109        1
          Label 2       3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       82         29
          Label 2       38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                        Prediction
                        Label 1    Label 2
Actual    Label 1       69         41
          Label 2       11         659


                Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely as this regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE,


while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE, and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. It is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                Chapter 6

                Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

It would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint: abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification


models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se



1.5 Method Description

The task at hand is a combination of two. On one hand, it is an engineering task investigating if an NN can be developed for the BWTS. On the other hand, it is a scientific task that focuses on applying and evaluating current research in the areas of NNs, water filtration techniques, time series analysis and predictive analytics to the problem. To complete both tasks within the time frame, a methodology has to be developed to ensure that the engineering task is done, the research questions are answered and that there is a clear foundation for future research. The methodology is visualized in Figure 1.1.

The basis of the methodology starts with the problem description. The problem description makes way for establishing the first frame of reference (FOR) and raising the initial potential research questions. Following that, a better understanding of the research field in question and continuous discussions with Alfa Laval will help adapt the FOR further. Working through this process iteratively, a final frame can be established where the research area is clear, allowing for finalization of the research questions.

With the frame of reference established, the focus shifts to gathering information through appropriate resources such as scientific articles and papers. Interviews, existing documentation of the BWTS and the intended future use of the NN also help in narrowing down the solution space to only contain relevant solutions that are optimal for the thesis. With a smaller set of NNs to choose from, the best suited network structure will be developed, tested and evaluated.

In preparation for constructing and implementing the NN, the data have to be processed according to the needs of the selected network. Pre-processing strategies for how datasets are best prepared for NN processing will be investigated and executed, to ensure that a clear methodology for future processing is established and documented. For correctly classifying the current clogging grade, or rate of clogging, of the filter, state of the art research will be used as reference when commencing with the labelling of the data.

When the classification/labelling is done, implementation through training and testing of the NN can begin. Sequentially, improvements to the NN structure will be made by comparing the results of different initial weight conditions, layer configurations and data partitions. The experimental results will be graded in terms of predictive accuracy, and the estimated errors of each individual parameter will be taken into consideration.

Lastly, the validation process can begin, to ensure that the initial requirements from the problem description are either met or have been investigated. With the results at hand, a conclusion can be presented describing how the system


can be adapted to detect clogging. Suggestions on how the system can be further improved upon, and other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.


                  Chapter 2

                  Frame of Reference

This chapter contains a state of the art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are being used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. Focus is on filters of the basket type, where the filtration is done with regard to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is composed of either reinforced wire mesh or perforated sheet metal, which the liquid flows through. Sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the center axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to obtain a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in
2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging
2. linear increase in ∆p and steady Q → moderate clogging
3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information. A minimal sketch of such a labelling rule is given after Figure 2.3.

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]
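The sketch below assumes the trends in ∆p and Q over a pumping sequence have already been estimated; the thresholds and function names are illustrative assumptions, not values from the cited papers.

def clogging_state(dp_trend, q_trend, dp_curvature):
    """Label one pumping sequence from simple trend features.

    dp_trend      - slope of the differential pressure over the sequence
    q_trend       - slope of the flow rate over the sequence
    dp_curvature  - second-order term of the pressure fit (exponential growth proxy)
    """
    if dp_curvature > 0.05 and q_trend < -0.5:     # illustrative thresholds
        return "fully clogged"
    if dp_trend > 0.01 and abs(q_trend) < 0.1:
        return "moderate clogging"
    return "no/little clogging"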


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = (K A / (µ L)) ∆p    (2.1)

rewritten as

∆p = (µ L / (K A)) Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

∆p = (k V_s µ / (Φ² D_p²)) · ((1 − ε)² / ε³) · L    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

∆p = (150 V_s µ (1 − ε)² L) / (D_p² ε³) + (1.75 (1 − ε) ρ V_s² L) / (ε³ D_p)    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable    Description                              Unit
∆p          Pressure drop                            Pa
L           Total height of filter cake              m
V_s         Superficial (empty-tower) velocity       m/s
µ           Viscosity of the fluid                   kg/(m·s)
ε           Porosity of the filter cake              m²
D_p         Diameter of the spherical particle       m
ρ           Density of the liquid                    kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
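As a small worked illustration, Equation 2.4 can be evaluated directly; the numerical values below are arbitrary assumptions chosen only to show the computation, not measurements from the system.

def ergun_pressure_drop(V_s, L, eps, D_p, mu, rho):
    """Pressure drop over a filter cake according to Ergun's equation (2.4), in Pa."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Illustrative (made-up) values: 1 mm cake, 0.1 mm particles, water at room temperature
print(ergun_pressure_drop(V_s=0.05, L=1e-3, eps=0.4,
                          D_p=1e-4, mu=1e-3, rho=998.0))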

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML, in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

                        Prediction
                        Positive                Negative
Actual    Positive      True Positive (TP)      False Negative (FN)
          Negative      False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = (1/n) Σ_{i=1}^{n} j_i,   where j_i = 1 if ŷ_i = y_i and j_i = 0 if ŷ_i ≠ y_i    (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

                  Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate as specificity; the false positive rate equals 1 - specificity. Sensitivity and specificity are given by Equations 2.6 and 2.7, respectively.

sensitivity = TP / (TP + FN)    (2.6)

specificity = TN / (TN + FP)    (2.7)

The sensitivity on the y-axis and the false positive rate (1 - specificity) on the x-axis then give the ROC plot, where every correctly classified positive generates a step in the y-direction and every misclassified negative (a false positive) generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well performing model.

                  F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of predicted positives that are correctly classified, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through:

precision = TP / (TP + FP)    (2.8)

recall = TP / (TP + FN)    (2.9)

F1 = 2 × (precision × recall) / (precision + recall)    (2.10)


Higher precision but lower recall means very accurate predictions, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through:

LogLoss = − Σ_{c=1}^{M} y_{o,c} log(p_{o,c})    (2.11)
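A hedged example of computing the classification metrics above with scikit-learn (the labels and probabilities are toy values, not results from the thesis):

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             f1_score, log_loss, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 1]                      # toy ground-truth labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.1, 0.3]      # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(accuracy_score(y_true, y_pred))    # Equation 2.5
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve
print(f1_score(y_true, y_pred))          # Equation 2.10
print(log_loss(y_true, y_prob))          # Equation 2.11
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted, cf. Table 2.2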

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|    (2.12)

                  Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the average of the squared difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through:


MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (2.13)

                  Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to be on the same scale as the targets:

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

∂RMSE/∂ŷ_i = (1 / (2√MSE)) · ∂MSE/∂ŷ_i    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = (100%/n) Σ_{i=1}^{n} ((y_i − ŷ_i) / y_i)²    (2.16)

                  Mean Absolute Percentage Error (MAPE)

                  The mean absolute percentage error is one of the most commonly and widely usedmeasures for forecast and prediction accuracy [28] The measurement is an aver-age of the absolute percentage errors between the actual values and the predictionvalues Like r2 MAPE is scale free and is obtaind through

\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


                  Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between −∞ and 1, so regardless of whether the output values are large or small the score will always be within that range. A low r2 score means that the model is bad at fitting the data:

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r2.

                  Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms, or predictors, in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r2 can therefore accurately show the percentage of variation in the independent variables that affects the dependent variables. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
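A minimal NumPy sketch of Equations 2.16 to 2.19 follows; the data and the assumed number of predictors k are hypothetical, and r2 is computed here as the squared correlation of Equation 2.18 (note that library functions such as scikit-learn's r2_score use the 1 − SS_res/SS_tot definition instead).

```python
import numpy as np

# Hypothetical data, only to illustrate Equations 2.16-2.19
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])
n, k = len(y_true), 3   # k: assumed number of predictors in the model

mspe = 100.0 / n * np.sum(((y_true - y_pred) / y_true) ** 2)   # Eq. 2.16
mape = 100.0 / n * np.sum(np.abs((y_true - y_pred) / y_true))  # Eq. 2.17

# r2 as the squared correlation between actual and predicted values (Eq. 2.18)
num = np.sum((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
den = np.sqrt(np.sum((y_true - y_true.mean()) ** 2) * np.sum((y_pred - y_pred.mean()) ** 2))
r2 = (num / den) ** 2

r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)                  # Eq. 2.19
print(mspe, mape, r2, r2_adj)
```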

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                  Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive (AR) model and a moving average (MA) model. An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                  Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average (ARMA) model and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can handle non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
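A minimal sketch of the procedure described above, assuming the statsmodels library; the generated series and the (p, d, q) and seasonal (P, D, Q, s) orders are placeholders that would normally be chosen from the data, so this only illustrates the log transform, differencing (via d = 1) and forecasting steps.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical univariate series; the log transform stabilises the variance
# and the differencing order d=1 removes the trend, as described above.
series = pd.Series(np.random.rand(200) + np.linspace(0.1, 2.0, 200))
log_series = np.log(series)

# Placeholder orders; in practice chosen from ACF/PACF plots or an information criterion
model = SARIMAX(log_series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
result = model.fit(disp=False)

forecast = np.exp(result.forecast(steps=6))  # invert the log transform
print(forecast)
```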

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} =
\begin{cases}
0 & \text{if } w \cdot x + b \le 0 \\
1 & \text{if } w \cdot x + b > 0
\end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
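A minimal sketch of the perceptron rule in Equation 2.20; the weights and bias are hypothetical values chosen only to show the two possible outputs.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron rule from Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias, only to illustrate the rule
w = np.array([0.7, -0.4, 0.2])
b = -0.1
print(perceptron(np.array([1, 0, 1]), w, b))  # w.x + b = 0.8 > 0  -> outputs 1
print(perceptron(np.array([0, 1, 0]), w, b))  # w.x + b = -0.5 <= 0 -> outputs 0
```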

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                  Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

                  Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are negative or that approach zero, also known as the dying ReLU problem [34].

                  Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
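The three activation functions above can be sketched in a few lines of NumPy; the input vector is arbitrary and β is left at 1.0, so this only illustrates Equations 2.21, 2.23 and 2.24.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)         # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)      # Equation 2.24; beta constant or trainable

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z), sep="\n")
```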

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                  Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                  Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that no single function has to be all-descriptive but instead only takes certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented above, and is further illustrated by the following NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is itself also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information as the weights become saturated over time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                  Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the LSTM block's output at the previous time step h_{t−1}, the input at the current time step x_t and the respective gate bias b_x:

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f) \qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it may be necessary to forget some of the characters from the previous chapter [43].
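A minimal NumPy sketch of one LSTM time step: the three gates follow Equation 2.26, while the cell-state and hidden-state updates use the standard LSTM formulation, which is an assumption beyond what the equation above states; the weight matrices and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_o, W_f, W_c, b_i, b_o, b_f, b_c):
    """One LSTM time step: gates as in Equation 2.26 plus the standard
    cell-state and hidden-state update."""
    z = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ z + b_i)                        # input gate
    o_t = sigmoid(W_o @ z + b_o)                        # output gate
    f_t = sigmoid(W_f @ z + b_f)                        # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)   # memory cell update
    h_t = o_t * np.tanh(c_t)                            # block output
    return h_t, c_t

# Hypothetical sizes: 4 input features, hidden state of size 3
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 7)) for _ in range(4)]
bs = [np.zeros(3) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), *Ws, *bs)
print(h, c)
```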

                  Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                  Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data, and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
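To make the convolution and pooling steps in Figures 2.4 and 2.5 concrete, a minimal NumPy sketch is shown below; the signal and filter values are hypothetical and the filter would normally be learned during training.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide a kernel over a 1D input (Figure 2.4), producing the convolved feature."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window (Figure 2.5)."""
    n = len(x) // pool_size * pool_size
    return x[:n].reshape(-1, pool_size).max(axis=1)

signal = np.array([0.1, 0.5, 0.9, 0.4, 0.3, 0.8, 0.2, 0.6])
feature = conv1d(signal, kernel=np.array([0.25, 0.5, 0.25]))  # hypothetical filter
print(max_pool1d(feature))   # the flattened result would then feed the dense layers
```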


                  Chapter 3

                  Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter could cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{Bmatrix} 1 \\ 2 \\ 3 \end{Bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{Bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{Bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because all the actual classification labels should be predicted equally rather than a certain category being prioritised. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
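A minimal sketch of the two transforms, assuming scikit-learn; the sensor matrix and label vector below are hypothetical placeholders for the pre-processed dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical sensor matrix (rows: samples, columns: the four measured variables)
X = np.array([[0.12, 300.0, 1.4, 0.0],
              [0.35, 290.0, 1.5, 2.1],
              [0.80, 250.0, 1.6, 2.3]])
labels = np.array([[1], [2], [2]])          # clogging labels

scaler = MinMaxScaler()                     # Equation 3.1
X_scaled = scaler.fit_transform(X)

encoder = OneHotEncoder()                   # one hot encoding of the labels
y_onehot = encoder.fit_transform(labels).toarray()

# The scaler transform is easily inverted after prediction
X_back = scaler.inverse_transform(X_scaled)
print(X_scaled, y_onehot, X_back, sep="\n")
```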

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The difference resulting from the expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t),\ V_2(t),\ \dots,\ V_{n-1}(t),\ V_n(t) \right] \qquad (3.2)

X(t) = \left[ V_1(t-5),\ V_2(t-5),\ \dots,\ V_{n-1}(t),\ V_n(t) \right] \qquad (3.3)


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
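The layer sizes, activations, loss functions and early stopping described above can be sketched in Keras; the framework itself, the Adam optimiser and the feature count are assumptions not stated in the text, so this is an illustrative sketch rather than the exact implementation used.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

n_steps, n_features = 5, 5   # 25 s window; feature count is a placeholder

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),   # one neuron for the predicted parameter
])
model.compile(optimizer="adam", loss="mae")   # or loss="mse"

stop = EarlyStopping(monitor="val_loss", patience=150)  # forced early stop
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[stop])
```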

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20% respectively of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to match the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
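A corresponding Keras sketch of the CNN described above; again the framework, the Adam optimiser, the ReLU activation on the 50-node dense layer and the feature count are assumptions, so the snippet only mirrors the stated architecture.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

n_steps_in, n_features, n_steps_out = 12, 5, 6   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),   # six future values of the predicted parameter
])
model.compile(optimizer="adam", loss="mae")   # or loss="mse"

stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[stop])
```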

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20% respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the network as a classifier than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad MSE score, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).
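As a rough numerical illustration of the log loss used for the clogging labels, the binary cross-entropy can be computed directly in NumPy; the labels and predicted probabilities below are hypothetical and only demonstrate the formula.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for binary clogging labels (0 = label 1, 1 = label 2)."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical true labels and predicted probabilities of belonging to label 2
y_true = np.array([1, 1, 0, 1, 0])
p_pred = np.array([0.92, 0.85, 0.10, 0.60, 0.35])
print(binary_cross_entropy(y_true, p_pred))
```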

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                  Chapter 4

                  Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1    Label 2
Actual    Label 1     109        1
          Label 2     3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1    Label 2
Actual    Label 1     82         29
          Label 2     38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1    Label 2
Actual    Label 1     69         41
          Label 2     11         659


                  Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed because of the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                  Chapter 6

                  Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see whether more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                  Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


                  [11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

                  [12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

                  [13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

                  [14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

                  [15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

                  [16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

                  [17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

                  [18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

                  [19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. arXiv preprint abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. arXiv preprint abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].

[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. arXiv preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valérie Bourdès, Stéphane Bonnevay, P. J. G. Lisboa, Rémy Defrance, David Pérol, Sylvie Chabaud, Thomas Bachelot, Thérèse Gargi, and Sylvie Négrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, vol. 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se


                    CHAPTER 1 INTRODUCTION

can be adapted to detect clogging. Suggestions on how the system can be further improved upon, as well as other future work, will also be mentioned.

Figure 1.1: Proposed methodology for the thesis.

                    Chapter 2

                    Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and its components that are in use today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids (TSS). Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass, and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process, and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. The focus is on filters of the basket type where the filtration is done with regard to water.

2.1.1 Basket Filter

A basket filter uses a cylindrical metal strainer located inside a pressure vessel for filtering, and is shown in Figure 2.1. The strainer is either composed of reinforced wire mesh or perforated sheet metal, which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).

Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for the incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure Δp over the filter, measured by two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it, and is shown in Figure 2.2. The backwashing mechanism comes with the inclusion of a rotary shaft through the centre axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be

Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water that the filter begins to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com

2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, Δp, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in Δp over time, due to an increase over time in the incoming pressure p_in

2. a decrease in Q as a result of an increase in Δp

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state Δp and Q → no/little clogging

2. linear increase in Δp and steady Q → moderate clogging

3. exponential increase in Δp and drastic decrease in Q → fully clogged

With the established classification logic in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]

2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle size to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \frac{(1 - \varepsilon)^2}{\varepsilon^3} L    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
Δp         Pressure drop                         Pa
L          Total height of the filter cake       m
V_s        Superficial (empty-tower) velocity    m/s
μ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           -
D_p        Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³

Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
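To make the effect of the variables concrete, the short sketch below evaluates Ergun's Equation 2.4 for one set of illustrative values; the function name and all numerical inputs are assumptions chosen for demonstration and are not measurements from the BWTS.

def ergun_pressure_drop(V_s, mu, rho, D_p, eps, L):
    # Viscous term and inertial term of Equation 2.4, returning the pressure drop in Pa
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Assumed example: a water-like fluid passing a thin cake of fine particles
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, D_p=50e-6, eps=0.4, L=1e-3)
print(round(dp))  # roughly 1.8e4 Pa, dominated by the viscous term

In this sketch, doubling V_s roughly doubles the viscous term but quadruples the inertial term, which is the kind of insight the Ergun form offers over Darcy's equation.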

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining, and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13], and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                        Prediction
                        Positive               Negative
Actual   Positive       True Positive (TP)     False Negative (FN)
         Negative       False Positive (FP)    True Negative (TN)

The definition of accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value \hat{y}_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.

In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

                    Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while the true negative rate is referred to as specificity; the false positive rate equals 1 - specificity. The two rates are given by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

The true positive rate on the y-axis and the false positive rate on the x-axis then give the ROC plot, where every true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well-performing model.

                    F1 Score

The F1 score is a measurement of how many samples the classifier classifies correctly and of how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples, and recall refers to the percentage of actual correct classifications [22]. Precision, recall, and F1 score are obtained through

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)

Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
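As a minimal illustration of Equations 2.5-2.10, the snippet below computes the classification metrics directly from the four confusion-matrix counts; the counts used in the example call are made-up numbers.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # Eq. 2.5
    sensitivity = tp / (tp + fn)                          # Eq. 2.6, true positive rate
    specificity = tn / (tn + fp)                          # Eq. 2.7, true negative rate
    precision = tp / (tp + fp)                            # Eq. 2.8
    recall = tp / (tp + fn)                               # Eq. 2.9
    f1 = 2 * precision * recall / (precision + recall)    # Eq. 2.10
    return accuracy, sensitivity, specificity, precision, recall, f1

# Assumed example counts
print(classification_metrics(tp=80, fp=10, fn=5, tn=105))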

                    Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

                    Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of large errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

                    Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the RMSE is equal to travelling along the gradient of the MSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.

                    Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

                    Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)

Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free, in comparison to MSE and RMSE, and bounded between -∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r²-score will always increase simply because the new fit has more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms, or predictors, in the model. If variables that prove to be useless are added, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
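The regression metrics above can be collected in a few lines of NumPy; y holds the actual values, y_hat the predictions and k the number of predictors, all of which are assumed names for this sketch. r² is computed in the squared-correlation form of Equation 2.18.

import numpy as np

def regression_metrics(y, y_hat, k):
    n = len(y)
    err = y - y_hat
    mae = np.mean(np.abs(err))                                   # Eq. 2.12
    mse = np.mean(err ** 2)                                       # Eq. 2.13
    rmse = np.sqrt(mse)                                           # Eq. 2.14
    mape = 100.0 * np.mean(np.abs(err / y))                       # Eq. 2.17
    num = np.sum((y - y.mean()) * (y_hat - y_hat.mean()))
    den = np.sqrt(np.sum((y - y.mean()) ** 2) * np.sum((y_hat - y_hat.mean()) ** 2))
    r2 = (num / den) ** 2                                          # Eq. 2.18
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)              # Eq. 2.19
    return mae, mse, rmse, mape, r2, r2_adj

# Assumed example values
print(regression_metrics(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2]), k=1))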

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive (AR) model and a moving average (MA) model. An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                    Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average (ARMA) model and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMAs and SARIMAs is particularly their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
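As a hedged illustration of how such a forecast could be set up, the sketch below fits a SARIMA model to a univariate series (here assumed to be the differential pressure) with statsmodels; the (p, d, q) and seasonal orders are placeholders rather than tuned values.

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarima_forecast(dp_series: pd.Series, steps: int = 6):
    # Orders are illustrative placeholders; a real model would be identified from the data
    model = SARIMAX(dp_series, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
    fitted = model.fit(disp=False)
    return fitted.forecast(steps=steps)   # predicted values for the next `steps` samples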

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields, such as computer vision, predictive analytics, medical diagnosis, and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight for every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
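Read directly as code, Equation 2.20 becomes a one-line decision on the weighted sum; the weights, input and bias below are arbitrary example values.

import numpy as np

def perceptron_output(w, x, b):
    # Binary output decided by the weighted sum plus bias, as in Eq. 2.20
    return 1 if np.dot(w, x) + b > 0 else 0

print(perceptron_output(w=np.array([0.5, -0.2]), x=np.array([1.0, 1.0]), b=-0.1))  # prints 1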

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable; thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)

Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^+ = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it outputs zero for all negative inputs, and neurons stuck in that region stop updating their weights, which is known as the dying ReLU problem [34].

                    Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot sigmoid(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used benchmarks such as ImageNet and for models such as Mobile NASNet-A, by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
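For reference, the activation functions of Equations 2.21, 2.23 and 2.24 translate into a few lines of NumPy; β is kept as a plain constant here, although it can also be trained as noted above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # Eq. 2.21

def relu(x):
    return np.maximum(0.0, x)             # Eq. 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)           # Eq. 2.24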

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

                    Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function by composing multiple functions,

f(x) = f^{(n)}(\dots f^{(2)}(f^{(1)}(x)))    (2.25)

where each function represents a layer and together they describe the overall mapping. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but can instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

                    Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t), and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or by allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, previous LSTM-block output h_{t-1} at the previous time step, input x_t at the current time step, and respective gate bias b_x, as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
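A minimal NumPy sketch of Equation 2.26, evaluating the three gates for a single time step; the weight matrices and bias vectors are random stand-ins rather than trained parameters, and the dimensions are chosen arbitrarily.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    concat = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)         # input gate
    o_t = sigmoid(W_o @ concat + b_o)         # output gate
    f_t = sigmoid(W_f @ concat + b_f)         # forget gate
    return i_t, o_t, f_t

# Stand-in sizes: hidden state of 4 units, 3 input features
rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=4), rng.normal(size=3)
weights = [rng.normal(size=(4, 7)) for _ in range(3)]
biases = [np.zeros(4) for _ in range(3)]
print(lstm_gates(h_prev, x_t, *weights, *biases))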

                    Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].

                    Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolutional layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolutional layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer, the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, though it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes, such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.

                    Chapter 3

                    Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.

During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
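A sketch of what such a rule-based labelling script could look like, using simple thresholds on the change in differential pressure and on the relative system flow; the function name and the threshold values are illustrative assumptions and not the criteria used for the actual dataset, which were set by visual inspection.

import numpy as np

def label_clogging(dp, flow, lin_slope=0.001, exp_slope=0.01, flow_drop=0.2):
    # Assign clogging labels 1-3 to one test cycle from differential pressure and flow
    labels = np.ones(len(dp), dtype=int)            # label 1: no/little clogging
    dp_rate = np.gradient(dp)                        # change in differential pressure per sample
    flow_rel = flow / max(flow[0], 1e-9)             # flow relative to the start of the cycle
    for i in range(len(dp)):
        if dp_rate[i] > exp_slope and flow_rel[i] < 1.0 - flow_drop:
            labels[i] = 3                            # rapid dp increase and drastic flow decrease
        elif dp_rate[i] > lin_slope and dp[i] > dp[0]:
            labels[i] = 2                            # steady dp increase, flow roughly constant
    return labels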

Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

[red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

x_{i,scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
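A sketch of the two transforms with scikit-learn; the example arrays, their shapes and the assumption that the sensor readings and the integer clogging labels are held in X and y are illustrative only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Assumed example data: three samples of sensor readings and their clogging labels
X = np.array([[0.12, 35.0, 1.8], [0.30, 33.0, 1.7], [0.55, 28.0, 1.5]])
y = np.array([[1], [2], [2]])

scaler = MinMaxScaler()                              # applies Eq. 3.1 to every feature
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)      # the transform is easy to invert

encoder = OneHotEncoder()                            # binary representation of the labels
y_onehot = encoder.fit_transform(y).toarray()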

                    321 Regression Processing with the LSTM ModelBefore the data are sent through the LSTM each variable is processed by a sequenc-ing function (SF) The SF decides the amount of past values that should match afuture prediction In this case the function dictates the scale of the time windowof previous measurements to predict the measurement of one time step The LSTMmodel uses 5 previous values per prediction making the time window 25 secondslong and the prediction a 5 second foresight Each categorical variable in the orig-inal dataset is considered a feature in the data That means that by processingthe data through the sequencing function the set of features that correspond toone value is expanded accordingly with the time window The difference from theexpansion of the features can be described by Equation 32 and Equation 33 Itshould be noted that while the set of features per time step increases the size ofthe dataset is decreased proportionally to how many past time steps are used asmore measurements are required per time step

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]    (3.3)
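The sequencing function itself is not listed in the thesis; a minimal sketch of such a sliding-window split, assuming a NumPy array with one row per time step and one column per variable, could look as follows.

    import numpy as np

    def sequence_data(data: np.ndarray, n_past: int = 5):
        """Build (samples, time steps, features) windows and one-step-ahead targets."""
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i, :])  # the previous n_past observations
            y.append(data[i, :])             # the value one time step ahead
        return np.array(X), np.array(y)

    # Example: 100 time steps of 5 variables -> X: (95, 5, 5), y: (95, 5)
    series = np.random.rand(100, 5)
    X, y = sequence_data(series, n_past=5)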


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data in order to adjust the weights and achieve a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.
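The thesis does not reproduce its network code; a hedged Keras sketch matching the described architecture (two 32-neuron LSTM layers with ReLU and a single sigmoid output neuron) might look as follows. The optimizer and input shape are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    n_steps, n_features = 5, 5  # assumed window length and number of variables

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True,
             input_shape=(n_steps, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),  # one output neuron for parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")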

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in this way helps ensure that the network is not overfitted to the training data.
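A sketch of that training setup using Keras' EarlyStopping callback, continuing from the model sketched above; the placeholder arrays stand in for the sequenced and split dataset.

    import numpy as np
    from tensorflow.keras.callbacks import EarlyStopping

    # Placeholder arrays in the shape produced by the sequencing and 80/20 split
    X_train, y_train = np.random.rand(800, 5, 5), np.random.rand(800, 1)
    X_val, y_val = np.random.rand(200, 5, 5), np.random.rand(200, 1)

    early_stop = EarlyStopping(monitor="val_loss",   # stop on stagnating validation loss
                               patience=150,
                               restore_best_weights=True)

    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=1500,
                        callbacks=[early_stop])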

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions; just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. As for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output.
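A minimal sketch of such a multi-step sequence splitting function under the stated settings (12 observations in, 6 observations out); the choice of target column is an assumption for illustration only.

    import numpy as np

    def split_sequences(data: np.ndarray, n_in: int = 12, n_out: int = 6):
        """Extract (samples, time steps, features) windows and multi-step targets."""
        X, y = [], []
        for i in range(len(data) - n_in - n_out + 1):
            X.append(data[i:i + n_in, :])                 # 12 past observations
            y.append(data[i + n_in:i + n_in + n_out, 0])  # next 6 values of one variable
        return np.array(X), np.array(y)

    series = np.random.rand(200, 5)   # 200 time steps, 5 variables
    X, y = split_sequences(series)    # X: (183, 12, 5), y: (183, 6)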

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters with a kernel size of 4 time steps are passed over the data to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.
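As before, a hedged Keras sketch of the described 1D CNN (64 filters of kernel size 4, max pooling of 2, dense layers of 50 and 6 nodes); the optimizer and input shape are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    n_steps_in, n_features, n_steps_out = 12, 5, 6  # assumed dimensions

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu",
               input_shape=(n_steps_in, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(n_steps_out),  # one node per predicted future observation
    ])
    model.compile(optimizer="adam", loss="mae")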

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, since it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the 20% fraction was kept, in order to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. This adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
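A small sketch of that input/output preparation, assuming a pandas DataFrame; the column names and values are illustrative only.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical pre-processed samples
    df = pd.DataFrame({
        "diff_pressure":  [0.12, 0.14, 0.35, 0.47, 0.78],
        "flow_rate":      [180, 179, 176, 171, 165],
        "clogging_label": [1, 1, 2, 2, 2],
    })

    X = df[["diff_pressure", "flow_rate"]].values  # inputs: only the variable values
    y = df["clogging_label"].values                # outputs: only the clogging labels

    # 80/20 split into training and validation data, keeping the time order
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)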

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating classification than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they tend to come from outliers, and an overall low MSE indicates that the output is normally distributed given the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score when using MSE, while MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to them.
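A quick numerical illustration of this difference; the values are made up and only show how a single outlier inflates MSE far more than MAE.

    import numpy as np

    y_true = np.array([1.0, 1.1, 0.9, 1.0, 1.0])
    y_close = np.array([1.0, 1.0, 1.0, 1.0, 1.0])    # small errors everywhere
    y_outlier = np.array([1.0, 1.0, 1.0, 1.0, 4.0])  # one large miss

    mae = lambda y, p: np.mean(np.abs(y - p))
    mse = lambda y, p: np.mean((y - p) ** 2)

    print(mae(y_true, y_close), mse(y_true, y_close))      # both small
    print(mae(y_true, y_outlier), mse(y_true, y_outlier))  # MSE grows much faster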

For the clogging labels, the network used a loss function that minimises the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. The loss function therefore has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values can belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing, and the backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                    Chapter 4

                    Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   No. of epochs   MSE     RMSE    R²      MAE
MAE             738             0.001   0.029   0.981   0.016
MSE             665             0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

No. of epochs   Accuracy   ROC     F1      log-loss
190             99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                       Prediction
                       Label 1   Label 2
Actual   Label 1         109         1
         Label 2           3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   No. of epochs   MSE     RMSE    R²      MAE
MAE             756             0.007   0.086   0.876   0.025
MSE             458             0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression respectively.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   No. of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203            91.4%      0.826   0.907   3.01
MSE                  1195            93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                       Prediction
                       Label 1   Label 2
Actual   Label 1          82        29
         Label 2          38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                       Prediction
                       Label 1   Label 2
Actual   Label 1          69        41
         Label 2          11       659


                    Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected as that model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r²-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would thus prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted


is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. Overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy: as the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one particular class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target, then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                    Chapter 6

                    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would have to be performed with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform when predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                    Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the f-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 2019, 14, 45-79, abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification


models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                      Chapter 2

                      Frame of Reference

This chapter contains a state-of-the-art review of existing technology and introduces the reader to the science and terminology used throughout this thesis. The system and the components that are used today are analysed and evaluated.

2.1 Filtration & Clogging Indicators

Filtration is the technique of separating particles from a mixture to obtain a filtrate. In water filtration, the water is typically passed through a fine mesh strainer or a porous medium for the removal of total suspended solids, TSS. Removal of particles in this fashion leads to the formation of a filter cake that diminishes the permeable capability of the filter. As the cake grows larger, the water can eventually no longer pass and the filter ends up being clogged.

To better understand how the choice of filter impacts the filtration process and how filter clogging can be modelled, the following section explores research and literature relevant to the BWTS. The focus is on filters of the basket type and on filtration of water.

2.1.1 Basket Filter

A basket filter, shown in Figure 2.1, uses a cylindrical metal strainer located inside a pressure vessel for filtering. The strainer is composed of either reinforced wire mesh or perforated sheet metal which the liquid flows through; sometimes a combination of the two is used. During filtration, organisms and TSS accumulate in the basket strainer and can only be removed by physically removing the strainer and scraping off the particles using a scraper or a brush [3]. An estimate of how many particles have accumulated in the filter can typically be obtained from the readings of a manometer, which measures the differential pressure over the filter (see 2.1.3).


Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The pressure difference between the incoming and the outgoing water gives the differential pressure ∆p over the filter, measured by two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the centre axis of the basket filter, connected to a motor that rotates the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference of the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

²Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that both remove particles from the supplied liquid in order to obtain a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter manifests as follows:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With this classification logic in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.
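The thesis does not give explicit thresholds for these states; the following is a hypothetical rule-based labeller over a window of ∆p and Q samples, where the slope and drop limits are assumptions for illustration only.

    import numpy as np

    def label_clogging(dp: np.ndarray, q: np.ndarray,
                       dp_slope_lim: float = 0.001,  # assumed dp trend limit per sample
                       q_drop_lim: float = 0.05):    # assumed relative drop in flow
        """Assign a clogging state to a window of differential pressure and flow samples."""
        dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]  # linear trend of dp
        q_drop = (q[0] - q[-1]) / max(q[0], 1e-9)            # relative decrease in flow

        if dp_slope < dp_slope_lim:
            return 1  # steady dp and Q -> no/little clogging
        if q_drop < q_drop_lim:
            return 2  # rising dp, steady Q -> moderate clogging
        return 3      # rising dp and dropping Q -> fully clogged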

Figure 2.3: Visualization of the clogging states.³

³Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] describe filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle size to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = (K A / (μ L)) ∆p    (2.1)

rewritten as

∆p = (μ L / (K A)) Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

∆p = (k V_s μ / (Φ² D_p²)) · (1 − ε)² L / ε³    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. These are considered by the later Ergun equation [11]:

∆p = 150 V_s μ (1 − ε)² L / (D_p² ε³) + 1.75 (1 − ε) ρ V_s² L / (ε³ D_p)    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation.

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
V_s        Superficial (empty-tower) velocity    m/s
μ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           m²
D_p        Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations of the variables affect the final differential pressure.
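As a quick numerical illustration of Equation 2.4, a small helper function can compute the pressure drop for given cake and flow properties; all numeric values below are made-up placeholders, not measurements from the thesis.

    def ergun_pressure_drop(V_s, mu, eps, L, D_p, rho):
        """Pressure drop over a filter cake according to the Ergun equation (Eq. 2.4)."""
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    # Placeholder values: 1 mm cake, porosity 0.4, superficial velocity 0.1 m/s
    dp = ergun_pressure_drop(V_s=0.1, mu=1.0e-3, eps=0.4, L=1.0e-3, D_p=1.0e-4, rho=1000.0)
    print(f"Pressure drop: {dp:.1f} Pa")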

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information and make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance, PdM, includes methods and techniques that estimate and determine the condition of equipment or components in order to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data with its respective output, also known as classification. Every prediction comes with four possible outcomes that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix.

                      Prediction
                      Positive               Negative
Actual   Positive     True Positive (TP)     False Positive (FP)
         Negative     False Negative (FN)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

ACC = (Σ_{i=1}^{n} j_i) / n,   where j_i = 1 if ŷ_i = y_i and j_i = 0 if ŷ_i ≠ y_i    (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, when using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

                      Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in the ML literature commonly referred to as sensitivity, while the true negative rate is referred to as specificity (the false positive rate is 1 − specificity). The two rates are represented by Equations 2.6 and 2.7 respectively.

sensitivity = TP / (TP + FN)    (2.6)

specificity = TN / (TN + FP)    (2.7)

With the sensitivity on the y-axis and 1 − specificity on the x-axis, the ROC plot is obtained, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well performing model.

                      F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of predicted positives that are correctly classified, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

precision = TP / (TP + FP)    (2.8)

recall = TP / (TP + FN)    (2.9)

F1 = 2 × (precision × recall) / (precision + recall)    (2.10)


Higher precision but lower recall means very accurate predictions, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
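A small sketch computing Equations 2.5 to 2.10 from raw confusion-matrix counts; the counts are made up for illustration.

    def classification_metrics(tp: int, fp: int, fn: int, tn: int):
        """Accuracy, sensitivity, specificity, precision and F1 (Eqs. 2.5-2.10)."""
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        sensitivity = tp / (tp + fn)      # recall
        specificity = tn / (tn + fp)
        precision = tp / (tp + fp)
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
        return accuracy, sensitivity, specificity, precision, f1

    # Made-up counts resembling an imbalanced two-class dataset
    print(classification_metrics(tp=669, fp=3, fn=1, tn=109))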

                      Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means better classification accuracy on the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The log loss can be calculated through

LogLoss = −Σ_{c=1}^{M} y_{o,c} log(p_{o,c})    (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                      Mean Absolute Error (MAE)

The average absolute difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away the predictions are from the actual values on average [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|    (2.12)

                      Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual results, it takes the average of the squared difference. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (2.13)

                      Root Mean Squared Error (RMSE)

RMSE is simply the square root of MSE. The introduction of the square root scales the error to the same scale as the targets:

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

                      Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to its respective squared target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

                      Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r^2, MAPE is scale free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination (r^2)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r^2 is scale-free, in contrast to MSE and RMSE, and bound between -\infty and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r^2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r^2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r^2-score will always increase, simply because the new fit has more terms. These issues are handled by the adjusted r^2.

Adjusted r^2

Adjusted r^2, just like r^2, indicates how well terms fit a curve or a line. The difference is that adjusted r^2 adjusts for the number of terms, or predictors, in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r^2 to always be less than or equal to r^2. For n observations and k variables the adjusted r^2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r^2 can therefore accurately show the percentage of variation in the independent variables that affects the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
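To make the regression metrics of Equations 2.12-2.19 concrete, the sketch below computes them with NumPy for a pair of actual and predicted vectors. It is an illustration only, and note that r^2 is computed here in the common 1 - SS_res/SS_tot form rather than the correlation form of Equation 2.18.

import numpy as np

def regression_metrics(y_true, y_pred, k=1):
    # k is the number of predictors, needed for the adjusted r^2 of Equation 2.19
    n = len(y_true)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                          # Eq. 2.12
    mse = np.mean(err ** 2)                             # Eq. 2.13
    rmse = np.sqrt(mse)                                 # Eq. 2.14
    mspe = 100.0 / n * np.sum((err / y_true) ** 2)      # Eq. 2.16
    mape = 100.0 / n * np.sum(np.abs(err / y_true))     # Eq. 2.17
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)   # Eq. 2.19
    return mae, mse, rmse, mspe, mape, r2, r2_adj

y_true = np.array([0.30, 0.35, 0.42, 0.50, 0.61])   # illustrative values only
y_pred = np.array([0.28, 0.36, 0.45, 0.49, 0.58])
print(regression_metrics(y_true, y_pred, k=1))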

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                      Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                      Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
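As an illustration of how such models are typically fitted in practice, the sketch below uses the statsmodels library on a placeholder series; the model orders, the seasonal period and the series itself are arbitrary assumptions, and this is not code used in the thesis.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder for a univariate series, e.g. the differential pressure over the filter
series = pd.Series(np.linspace(1.0, 2.0, 200) + 0.05 * np.random.rand(200))

log_series = np.log(series)                      # log transform stabilises the variance
arima = ARIMA(log_series, order=(2, 1, 1)).fit() # d=1 differences away the trend
sarima = SARIMAX(log_series, order=(2, 1, 1),
                 seasonal_order=(1, 1, 1, 12)).fit(disp=False)  # assumed seasonal period

forecast = np.exp(arima.forecast(steps=6))       # back-transform to the original scale
print(forecast)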

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation x is the input vector, w the weight vector and b the perceptron's individual bias.
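A direct translation of Equation 2.20 into code can make the roles of the weights and the bias explicit; the following minimal NumPy sketch, with made-up numbers, is included purely as an illustration.

import numpy as np

def perceptron(x, w, b):
    # Equation 2.20: output 1 only if the weighted sum plus the bias is positive
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])          # binary inputs
w = np.array([0.6, 0.4, 0.5])    # one weight per input, i.e. its importance
b = -1.0                         # the bias decides how easily the perceptron outputs a 1
print(perceptron(x, w, b))       # 0.6 + 0.5 - 1.0 > 0, so the output is 1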

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                      Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises from its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                      Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because the ReLU outputs zero for negative inputs and inputs approaching zero, such a unit can end up never activating and stop updating its weights, which is known as the dying ReLU problem [34].

                      Swish Function

Proposed by Ramachandran et al. [35] as a replacement for ReLU is the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x)    (2.24)

where \beta is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
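The activation functions discussed in this section are simple enough to be written out directly; the NumPy sketch below, included only for illustration, implements the step, sigmoid, rectified and Swish functions of Equations 2.20, 2.21, 2.23 and 2.24.

import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)            # binary output of the perceptron

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # Eq. 2.21, smooth output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)               # Eq. 2.23, positive part of the argument

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)            # Eq. 2.24, beta trainable or constant

z = np.linspace(-3.0, 3.0, 7)
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")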

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

                      Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                      Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that a single function does not have to be all-descriptive but instead only captures certain behaviour. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. That statement is partially proven true by the differences in utilization of the two SNN varieties presented above, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons are able to feed information from the previous pass of data back to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                      Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function \sigma, weights for each respective gate's neurons \omega_x, the LSTM block output at the previous time step h_{t-1}, the input at the current time step x_t and the respective gate bias b_x, as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be valuable when learning something like a book: when a new chapter begins, it may be necessary to forget some of the characters from the previous chapter [43].
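For clarity, the gate activations of Equation 2.26 can be written out directly; the NumPy sketch below uses random weights of arbitrary size and is only meant to show how the concatenation [h_{t-1}, x_t] enters each gate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    # Equation 2.26: each gate applies its own weights to [h_{t-1}, x_t] plus a bias
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(w_i @ hx + b_i)   # input gate
    o_t = sigmoid(w_o @ hx + b_o)   # output gate
    f_t = sigmoid(w_f @ hx + b_f)   # forget gate
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
hidden, features = 4, 3
h_prev, x_t = rng.normal(size=hidden), rng.normal(size=features)
w_i, w_o, w_f = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
b = np.zeros(hidden)
print(lstm_gates(h_prev, x_t, w_i, w_o, w_f, b, b, b))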

                      Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music and raw speech [44].


                      Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allows the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
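To make the convolution-pooling-flattening pipeline concrete, the sketch below builds such a stack with the Keras API; the use of Keras and all layer sizes are illustrative assumptions and do not correspond to the models used later in this thesis.

from tensorflow.keras import layers, models

# 60 time steps with 4 sensor channels; all sizes here are illustrative only
model = models.Sequential([
    layers.Conv1D(filters=16, kernel_size=3, activation="relu",
                  input_shape=(60, 4)),      # produces the convolved feature map
    layers.MaxPooling1D(pool_size=2),        # keeps only the maximum in each window
    layers.Flatten(),                        # flattening layer: feature map -> 1D vector
    layers.Dense(8, activation="relu"),
    layers.Dense(1),
])
model.summary()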


                      Chapter 3

                      Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. A data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
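The labelling script itself is not reproduced in the thesis; a hypothetical sketch of threshold-based labelling of the kind described could look as follows, where the function name and all threshold values are placeholders and not those actually used.

import numpy as np

def label_clogging(dp, flow, slope_lo=0.001, slope_hi=0.01, flow_drop=0.2):
    # dp: differential pressure, flow: system flow rate (NumPy arrays of equal length)
    labels = np.ones(len(dp), dtype=int)           # label 1: differential pressure still flat
    slope = np.gradient(dp)                        # local rate of change of the differential pressure
    flow_loss = (flow[0] - flow) / flow[0]         # relative decrease of the system flow
    labels[slope > slope_lo] = 2                   # label 2: dp increasing steadily
    labels[(slope > slope_hi) & (flow_loss > flow_drop)] = 3   # label 3: sharp dp rise, flow collapsing
    return labels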


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing


the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

or

[red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted without assuming that one category is more important, because all the actual classification labels should be predicted equally rather than a certain category being prioritised. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
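Both transforms are available in common ML libraries; the sketch below, using scikit-learn on made-up sensor values, is an assumption about how they could be applied and is not the thesis code.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = np.array([[0.42, 210.0, 1.3, 0.0],    # differential pressure, system flow, system pressure, backflush flow
              [0.55, 205.0, 1.4, 2.1]])   # (illustrative values only)
labels = np.array([[1], [2]])             # clogging labels

scaler = MinMaxScaler()                   # Equation 3.1
X_scaled = scaler.fit_transform(X)
X_restored = scaler.inverse_transform(X_scaled)   # the transform is easy to invert

onehot = OneHotEncoder().fit_transform(labels).toarray()   # label 1 -> [1, 0], label 2 -> [0, 1]
print(X_scaled, onehot, sep="\n")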

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, it dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded according to the time window. The expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
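The sequencing function is described but not listed in the thesis; a plausible sliding-window implementation, written here as an assumption of how Equations 3.2 and 3.3 could be realised, is:

import numpy as np

def sequence(data, n_past=5):
    # data: array of shape (samples, features); n_past previous observations per prediction
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # window covering the previous 25 seconds (5 x 5 s)
        y.append(data[t])              # the observation one time step (5 s) ahead
    return np.array(X), np.array(y)    # X: (samples - n_past, n_past, features)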


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by

• Samples - the number of data points
• Time steps - the points of observation of the samples
• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before passing it on to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data in order to adjust the weights towards a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
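A Keras-style definition matching this description could look as follows; the two LSTM layers of 32 neurons, the sigmoid output and the early-stopping patience of 150 epochs follow the text, while the use of Keras, the optimiser and the feature count are assumptions.

from tensorflow.keras import layers, models, callbacks

n_past, n_features = 5, 5     # 5 past time steps; the feature count is an assumption

model = models.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_past, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # one-step-ahead prediction of one parameter
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE variant

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1500, callbacks=[early_stop])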

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just as for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20% respectively of the original amount of data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
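The corresponding Keras-style definition of the CNN described above could be sketched as follows; the filter count, kernel size, pool size and layer widths follow the text, while the use of Keras and the optimiser choice are assumptions.

from tensorflow.keras import layers, models, callbacks

n_past, n_features, n_future = 12, 5, 6     # 60 s of history, 6 predictions covering 30 s

model = models.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_past, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(n_future),                  # the 6 future values of one parameter
])
model.compile(optimizer="adam", loss="mae")  # or "mse"

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150)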

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20% respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. With MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE lets outliers play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes results in a bad score with MSE, while MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to them.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together, the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                      Chapter 4

                      Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on the datasets from both the MAE and the MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


                      Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function achieved better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network, while that of the MAE network appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, which is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as such a regression model is particularly sensitive to outliers.

The high r^2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r^2-score of 0.694, would be expected to perform badly on data outside the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be observed more closely in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r^2-scores were 0.876 and 0.843 respectively, with overall lower scores on all of the other metrics for the MAE network. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A point to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier gives a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the capability of NNs to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                      Chapter 6

                      Future Work

                      In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

                      For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

                      On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

Lastly, the time criticality of the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                      Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1_score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                        CHAPTER 2 FRAME OF REFERENCE

Figure 2.1: An overview of a basket filter.¹

The pressure vessel has one inlet for incoming water and one outlet for the filtrate. The differential pressure ∆p over the filter is measured as the difference between the incoming and the outgoing water pressure, obtained through two pressure transducers.

2.1.2 Self-Cleaning Basket Filters

Basket filters also exist in a self-cleaning variety. A self-cleaning basket filter, shown in Figure 2.2, features a backwashing (also referred to as backflush) mechanism which automatically cleans the filter, avoiding the need to physically remove the filter in order to clean it. The backwashing mechanism comes with the inclusion of a rotary shaft through the centre axis of the basket filter that is connected to a motor for rotation of the shaft [3]. The rotary shaft holds a mouthpiece that is connected to a second outlet, which allows for the removal of particles caught by the filter.

¹ Source: http://www.filter-technics.be


Figure 2.2: An overview of a basket filter with self-cleaning.²

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter can only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. However, for a self-cleaning basket filter, the pressure transducers are connected to an electric control system that switches the backwash on and off.

² Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identifying clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified in order to begin generating a dataset containing the necessary information.
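As a minimal sketch of how such a labelling rule could be expressed in code, the function below assigns the three states from the differential pressure and flow signals. The threshold values are hypothetical placeholders; in the thesis the labels were ultimately set and verified by visual inspection of the measured curves.

```python
import numpy as np

def label_clogging(dp, flow, dp_slope_lin=0.001, dp_slope_exp=0.01, flow_drop=0.2):
    """Illustrative clogging-state labelling from differential pressure and flow.

    dp, flow : 1-D arrays sampled at fixed intervals.
    The slope and flow-drop thresholds are assumptions for illustration only.
    """
    labels = np.ones(len(dp), dtype=int)      # 1 = no/little clogging
    dp_rate = np.gradient(dp)                 # local slope of delta-p
    flow_rel = flow / max(flow[0], 1e-9)      # flow relative to its starting value

    steady_rise = dp_rate > dp_slope_lin      # linear increase in delta-p
    sharp_rise = dp_rate > dp_slope_exp       # rapid increase in delta-p
    big_drop = flow_rel < (1.0 - flow_drop)   # drastic decrease in flow

    labels[steady_rise] = 2                   # moderate clogging
    labels[sharp_rise & big_drop] = 3         # fully clogged
    return labels
```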

Figure 2.3: Visualization of the clogging states.³

³ Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8] and can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \, \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} \, Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1 - \varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 \, V_s \mu \, (1 - \varepsilon)^2 L}{D_p^2 \, \varepsilon^3} + \frac{1.75 \, (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effect. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
Vs         Superficial (empty-tower) velocity    m/s
µ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           –
Dp         Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations of the variables affect the final differential pressure.
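To make the role of each variable in Equation 2.4 concrete, the sketch below evaluates the Ergun pressure drop for a given set of cake and fluid properties. The numeric values in the usage example are made up for illustration and do not come from the thesis.

```python
def ergun_pressure_drop(L, Vs, mu, eps, Dp, rho):
    """Pressure drop over a filter cake according to the Ergun equation (2.4).

    L   : cake height [m]              Vs  : superficial velocity [m/s]
    mu  : fluid viscosity [kg/(m s)]   eps : cake porosity [-]
    Dp  : particle diameter [m]        rho : fluid density [kg/m^3]
    """
    viscous = 150.0 * Vs * mu * (1.0 - eps) ** 2 * L / (Dp ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * Vs ** 2 * L / (eps ** 3 * Dp)
    return viscous + inertial  # [Pa]

# Hypothetical example values, only to show the call signature
dp = ergun_pressure_drop(L=0.005, Vs=0.05, mu=1.0e-3, eps=0.4, Dp=50e-6, rho=998.0)
```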

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach to prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                              Prediction
                    Positive                 Negative
Actual  Positive    True Positive (TP)       False Negative (FN)
        Negative    False Positive (FP)      True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of samples n. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

                        Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. The two rates are represented by Equations 2.6 and 2.7, respectively.

\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)

\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)

The sensitivity on the y-axis and the specificity on the x-axis then give the AUC plot, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC area is limited to the range 0 to 1, where a higher value means a well performing model.

                        F1 Score

The F1 score is a measurement for evaluating how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classification [22]. Precision, recall and F1 score are obtained through

\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)

\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)

                        11

                        CHAPTER 2 FRAME OF REFERENCE

Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited to the range 0 to 1.
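As a compact illustration of Equations 2.6 to 2.10, the sketch below computes sensitivity, specificity, precision, recall and F1 directly from the four confusion-matrix counts. It is a generic implementation written for this text, not code from the thesis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Classification error metrics from confusion-matrix counts (Eq. 2.6-2.10)."""
    sensitivity = tp / (tp + fn)        # true positive rate (equal to recall)
    specificity = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)
    recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "recall": recall, "f1": f1}
```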

                        Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                        Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

                        Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

                        Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged.

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

                        Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

                        Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the prediction values. Like r², MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter whether the output values are large or small, the value will always be within that range. A low r² score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r².

                        Adjusted r2

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables, the adjusted r² is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
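The regression metrics above can be summarised in a few lines of code. The sketch below is a straightforward NumPy implementation of Equations 2.12 to 2.19 and is only meant to make the definitions concrete; it assumes y_true contains no zeros (for MAPE) and more samples than predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, k=1):
    """MAE, MSE, RMSE, MAPE, r2 and adjusted r2 for predictions y_pred.

    k is the number of predictors, used only for adjusted r2 (Eq. 2.19).
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                                  # Eq. 2.12
    mse = np.mean(err ** 2)                                     # Eq. 2.13
    rmse = np.sqrt(mse)                                         # Eq. 2.14
    mape = 100.0 * np.mean(np.abs(err / y_true))                # Eq. 2.17
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2                 # Eq. 2.18
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)               # Eq. 2.19
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "MAPE": mape, "r2": r2, "r2_adj": r2_adj}
```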

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                        Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                        Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary on variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary on both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
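As a point of reference, a SARIMA model of the kind discussed here can be fitted with the statsmodels library. The model orders below are hypothetical placeholders and would have to be chosen from the data; the thesis does not prescribe them.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_sarima(series: pd.Series):
    """Fit a SARIMA model to a univariate series, e.g. differential pressure
    sampled every 5 seconds. The (p, d, q) and seasonal (P, D, Q, s) orders
    are placeholders for illustration only."""
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
    fitted = model.fit(disp=False)
    forecast = fitted.forecast(steps=6)   # predict the next 6 samples (30 s)
    return fitted, forecast
```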

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
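Equation 2.20 translates directly into code. The sketch below implements a single perceptron forward pass; the example weights and bias are arbitrary values chosen to behave like an AND gate and are not taken from the thesis.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron output according to Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: two inputs with equal weights and a negative bias (AND-like behaviour)
out = perceptron(x=np.array([1, 1]), w=np.array([0.6, 0.6]), b=-1.0)  # -> 1
```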

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                        Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_j w_j \cdot x_j + b \qquad (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                        Rectified Function

                        The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^+ = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

                        Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet and for Mobile NASNet-A by 0.9% and 0.6%, respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
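For completeness, the three activation functions discussed above can be written out as follows; β is treated as a constant here, although the Swish paper also allows it to be trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # Eq. 2.21

def relu(x):
    return np.maximum(0.0, x)            # Eq. 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)         # Eq. 2.24
```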

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                        Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                        Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

                        Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, the previous LSTM-block output at the previous time step h_{t−1}, the input at the current time step x_t, and the respective gate bias b_x:

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o) \qquad (2.26)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
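A minimal NumPy sketch of the gate computations in Equation 2.26 is given below. Real implementations, such as the LSTM layers used later in the thesis, also maintain a cell state and candidate values, which are omitted here for brevity; the function only shows the three gate activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    """Input, output and forget gate activations for one time step (Eq. 2.26)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ concat + b_i)        # input gate
    o_t = sigmoid(w_o @ concat + b_o)        # output gate
    f_t = sigmoid(w_f @ concat + b_f)        # forget gate
    return i_t, o_t, f_t
```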

                        Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                        Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                        Chapter 3

                        Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

or

\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritise a certain category. Seger [49] has shown the precision of one hot encoding to be on par with other equally simple encoding techniques, while Potdar et al. [50] show that one hot encoding achieves notably higher accuracy than simpler encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.
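A minimal sketch of the label transform, assuming a scikit-learn implementation (the thesis does not state which library was used):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

clog_labels = np.array([[1], [2], [1], [2]])       # clogging labels as a column vector
encoder = OneHotEncoder(sparse=False)              # dense binary output
onehot = encoder.fit_transform(clog_labels)        # label 1 -> [1, 0], label 2 -> [0, 1]
restored = encoder.inverse_transform(onehot)       # the encoding is reversible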

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
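A minimal sketch of the scaler transform, again assuming scikit-learn; the numbers are made up for illustration only:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[0.2, 110.0],                      # e.g. [differential pressure, system flow]
                 [0.8, 95.0],
                 [1.5, 60.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)                 # every feature now lies in [0, 1]
original = scaler.inverse_transform(scaled)         # easy to revert after processing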

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should be matched to a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. By processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The effect of this feature expansion is described by Equations 3.2 and 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t), V_2(t), \ldots, V_{n-1}(t), V_n(t) \right] \qquad (3.2)

X(t) = \left[ V_1(t-5), V_2(t-5), \ldots, V_{n-1}(t), V_n(t) \right] \qquad (3.3)
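A sketch of what such a sequencing function could look like is given below. The function name and the assumption that the data are held in a 2-D numpy array (rows are 5-second time steps, columns are the system variables) are illustrative only.

import numpy as np

def sequence_data(data, n_past=5):
    """Pair the n_past previous rows of all variables with the row one step
    ahead, mirroring the expansion in Equations 3.2 and 3.3."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i, :])   # window of past observations (the features)
        y.append(data[i, :])              # the value to predict one time step ahead
    return np.array(X), np.array(y)

# X has shape (samples, 5, n_variables); the dataset shrinks by n_past rows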


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by the following (a sketch of the split and reshape is given after the list):

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step
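A sketch of the split and reshape, assuming the sequenced inputs are held in an array X and the targets in y (both hypothetical names), with 5 time steps and an assumed 4 features per step:

n_steps, n_features = 5, 4                        # 4 system variables is an assumption
split = int(0.8 * len(X))                         # chronological 80/20 split
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# reshape into the (samples, time steps, features) form the LSTM expects
X_train = X_train.reshape((X_train.shape[0], n_steps, n_features))
X_val = X_val.reshape((X_val.shape[0], n_steps, n_features))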

The network consists of two LSTM layers with the ReLU activation function that process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights towards a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.
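A minimal sketch of this architecture, assuming a Keras/TensorFlow implementation (the thesis does not list its code, so layer arguments beyond those stated above, such as the optimizer and feature count, are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 5, 4                        # 4 features is an assumed value
model = Sequential([
    LSTM(32, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)),
    LSTM(32, activation='relu'),
    Dense(1, activation='sigmoid'),               # one neuron for the predicted parameter
])
model.compile(optimizer='adam', loss='mae')       # 'mse' was evaluated as well; optimizer assumed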

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not improved for 150 subsequent epochs. Limiting the training in this way helps to ensure that the network is not overfitted to the training data.
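With Keras, this training scheme could be expressed roughly as follows; restoring the best weights is an assumption on top of what the thesis states:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=150, restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=1500,
                    callbacks=[early_stop])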

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
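A sketch of such a sequence splitting function is shown below; which column is treated as the forecast target is an assumption made for the example.

import numpy as np

def split_sequences(data, n_in=12, n_out=6):
    """Build samples of 12 past observations (60 s) paired with the 6
    following observations (30 s) of the target column."""
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i:i + n_in, :])                   # input window, all variables
        y.append(data[i + n_in:i + n_in + n_out, 0])    # future values of one target column (assumed)
    return np.array(X), np.array(y)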

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.
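A minimal Keras sketch of the described topology; activation functions and the optimizer are assumptions, as are the number of input features:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

n_steps_in, n_features, n_steps_out = 12, 4, 6    # 4 features is an assumed value
cnn = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu', input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dense(n_steps_out),                           # one node per predicted future time step
])
cnn.compile(optimizer='adam', loss='mae')         # 'mse' was evaluated as well; optimizer assumed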

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not improved for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, the MSE encourages predictions at the mean of two modes, which is a bad prediction, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. The loss function therefore has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                        Chapter 4

                        Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                      Prediction
                      Label 1    Label 2
Actual    Label 1     109        1
          Label 2     3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                      Label 1    Label 2
Actual    Label 1     82         29
          Label 2     38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                      Label 1    Label 2
Actual    Label 1     69         41
          Label 2     11         659


                        Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not surprising as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be capable of learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                        Chapter 6

                        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                        Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.





Figure 2.2: An overview of a basket filter with self-cleaning.

The backwashing flow can either be controlled by a pump or be completely dependent on the existing overpressure in the pressure vessel, which in turn depends on how clogged the filter is. In the latter case, backwashing of the filter may only be done when there are enough particles in the water for the filter to begin to clog.

2.1.3 Manometer

Briefly mentioned in 2.1.1, the manometer is an analogue display pressure gauge that shows the differential pressure over the filter. The displayed value is the difference between the pressures obtained by the transducers before and after the filter. Each filter comes with an individually set threshold p_set.

When the measured differential pressure is greater than p_set, the filter has to be cleaned. For a regular basket filter, the operator or the service engineer has to keep an eye on the manometer during operation. For a self-cleaning basket filter, however, the pressure transducers are connected to an electric control system that switches the backwash on and off.

Source: http://www.directindustry.com


2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4–6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles from the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter manifests itself in the following ways:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady state ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states.

Source: Eker et al. [6]


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \, \frac{(1 - \varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. These are considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable    Description                           Unit
∆p          Pressure drop                         Pa
L           Total height of filter cake           m
V_s         Superficial (empty-tower) velocity    m/s
µ           Viscosity of the fluid                kg/(m·s)
ε           Porosity of the filter cake           m³/m³
D_p         Diameter of the spherical particle    m
ρ           Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers deeper insight into how alterations to the variables affect the final differential pressure.
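As an illustration of Equation 2.4, the sketch below evaluates the pressure drop for a set of made-up parameter values in SI units; none of the numbers are measurements from the thesis.

def ergun_pressure_drop(V_s, mu, rho, D_p, eps, L):
    """Pressure drop over the filter cake according to Equation 2.4:
    the first term captures viscous losses, the second inertial losses."""
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# illustrative values only: water at room temperature through a thin filter cake
dp = ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, D_p=1.0e-4, eps=0.4, L=0.002)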

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                      Prediction
                      Positive               Negative
Actual    Positive    True Positive (TP)     False Negative (FN)
          Negative    False Positive (FP)    True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by Konig [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of samples n. However, by using the overall accuracy as an error metric, two flaws may arise. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data are the true class distribution data and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems in order to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

                          Classification error metrics assume that the used input data are not optimised andthat basic classification assumptions are rarely true for real world problems

                          Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. Both rates are represented by Equations 2.6 and 2.7, respectively.

\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)

\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)

The sensitivity on the y-axis and 1 − specificity on the x-axis then give the ROC plot, where every correctly classified positive generates a step in the y-direction and every misclassified negative generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well performing model.

                          F1 Score

The F1 score is a measurement used to evaluate how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples, and recall refers to the percentage of actual correct classifications [22]. Precision, recall and F1 score are obtained through

\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)

\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)

                          11

                          CHAPTER 2 FRAME OF REFERENCE

Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are hard to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

                          Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The log loss can be calculated through

\text{LogLoss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)
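The classification metrics above can be computed directly from the confusion matrix counts. A minimal numpy sketch is given below; averaging the log loss over all observations is an implementation choice rather than something stated in Equation 2.11.

import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, recall and F1 as in Equations 2.6-2.10."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return sensitivity, specificity, precision, recall, f1

def log_loss(y_onehot, p_pred, eps=1e-15):
    """Multi-class log loss as in Equation 2.11, averaged over the observations."""
    p = np.clip(p_pred, eps, 1 - eps)             # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))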

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                          Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

                          MAE = 1n

                          nsumi=1|yi minus yi| (212)

                          Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the absolute difference between the predicted and the actual values, it takes the average of the squared difference. By squaring, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on reducing large prediction errors. However, if a single prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (2.13)

                          Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score, as in Equation 2.15. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged.

∂RMSE/∂ŷ_i = (1 / (2·sqrt(MSE))) · ∂MSE/∂ŷ_i    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a poor metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and reasonably well protected against outliers given a sufficiently large number of samples n.

                          Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared absolute errors, while MSPE considers the squared relative error [27].

MSPE = (100/n) Σ_{i=1}^{n} ((y_i − ŷ_i) / y_i)²    (2.16)

                          Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is the average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = (100/n) Σ_{i=1}^{n} |(y_i − ŷ_i) / y_i|    (2.17)


                          Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. In contrast to MSE and RMSE, r2 is scale-free and bounded between −∞ and 1, so regardless of whether the output values are large or small the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r2 = ( Σ_{i=1}^{n} (y_i − mean(y))(ŷ_i − mean(ŷ)) )² / ( Σ_{i=1}^{n} (y_i − mean(y))² · Σ_{i=1}^{n} (ŷ_i − mean(ŷ))² )    (2.18)

r2 has some drawbacks. It does not take into account whether the coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r2.

                          Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms, or predictors, in the model. If variables that prove to be useless are added the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables the adjusted r2 is calculated through

r2_adj = 1 − ( (1 − r2)(n − 1) / (n − k − 1) )    (2.19)

Adjusted r2 can therefore more accurately show the proportion of variation in the dependent variable that is explained by the independent variables. Furthermore, adding independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
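As a small illustration, the regression metrics above can be computed directly with NumPy; the values of y, y_hat and the number of predictors k below are hypothetical and only serve to show the formulas in code.

    import numpy as np

    y     = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # actual values
    y_hat = np.array([11.0, 11.5, 14.5, 15.0, 18.5])   # predicted values
    n, k  = len(y), 3

    mae   = np.mean(np.abs(y - y_hat))                  # Equation 2.12
    mse   = np.mean((y - y_hat) ** 2)                   # Equation 2.13
    rmse  = np.sqrt(mse)                                # Equation 2.14
    mape  = 100 / n * np.sum(np.abs((y - y_hat) / y))   # Equation 2.17
    r2    = np.corrcoef(y, y_hat)[0, 1] ** 2            # Equation 2.18
    r2adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # Equation 2.19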

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                          Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error with a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                          Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
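A minimal sketch of fitting a seasonal ARIMA model with the statsmodels library; the series and the (p, d, q) and seasonal orders below are hypothetical placeholders, not values used in this thesis.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Hypothetical univariate series with a trend and some noise
    y = np.linspace(0, 5, 200) + np.random.normal(0, 0.1, 200)

    model  = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
    result = model.fit(disp=False)        # estimate the model parameters
    print(result.forecast(steps=6))       # predict the next 6 data points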

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programs, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important that input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = 0 if w · x + b ≤ 0,  1 if w · x + b > 0    (2.20)

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
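A minimal sketch of the perceptron rule in Equation 2.20, with hypothetical weights, bias and input:

    import numpy as np

    def perceptron(x, w, b):
        # Weighted sum of the inputs plus the bias decides the binary output
        return 1 if np.dot(w, x) + b > 0 else 0

    x = np.array([1, 0, 1])          # binary inputs
    w = np.array([0.7, -0.2, 0.5])   # one weight per input
    b = -0.9                         # bias term

    print(perceptron(x, w, b))       # -> 1, since 0.7 + 0.5 - 0.9 > 0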

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The step function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = σ(z) = 1 / (1 + e^(−z))    (2.21)

for

z = Σ_j w_j · x_j + b    (2.22)


Using the sigmoid function as the activation function allows outputs to be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x⁺ = max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, a unit whose input is negative outputs zero and receives no gradient, which can cause it to stop learning, also known as the dying ReLU problem [34].

                          Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x · sigmoid(βx)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used benchmarks such as ImageNet, by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
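A minimal NumPy sketch of the three activation functions discussed above; β is treated as a constant here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # Equation 2.21

    def relu(x):
        return np.maximum(0.0, x)          # Equation 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)       # Equation 2.24

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(z), relu(z), swish(z), sep="\n")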

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

                          Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                          Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using a chain of simpler functions,

f(x) = f^(n)( ... f^(2)(f^(1)(x)) ... )    (2.25)

where each function represents a layer and together they describe a process. The purpose of adding layers is to break up the main function into many functions, so that each function does not have to be all-descriptive but instead only captures certain behaviour. Adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results on many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially shown to be true by the differences in utilisation of the two SNNs presented above, and is further illustrated by a number of NN configurations below.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes vanishingly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                          Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM block output at the previous time step (h_{t−1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = σ(ω_i [h_{t−1}, x_t] + b_i)
o_t = σ(ω_o [h_{t−1}, x_t] + b_o)
f_t = σ(ω_f [h_{t−1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
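A minimal NumPy sketch of the three gates in Equation 2.26; the weights, biases, previous output and current input are all hypothetical.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    h_prev = np.array([0.1, -0.2])            # previous block output h_(t-1)
    x_t    = np.array([0.5, 0.3, -0.1])       # current input x_t
    hx     = np.concatenate([h_prev, x_t])    # concatenation [h_(t-1), x_t]

    rng = np.random.default_rng(0)
    w_i, w_o, w_f = (rng.normal(size=hx.size) for _ in range(3))
    b_i, b_o, b_f = 0.0, 0.0, 0.0

    i_t = sigmoid(w_i @ hx + b_i)   # input gate
    o_t = sigmoid(w_o @ hx + b_o)   # output gate
    f_t = sigmoid(w_f @ hx + b_f)   # forget gate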

                          Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the two remaining gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                          Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                          Chapter 3

                          Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified. A sketch of such a labelling rule is shown below.
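A hedged sketch of how such a labelling script could assign clogging labels per time step; the threshold values below are purely hypothetical and not the values used by Alfa Laval or in this thesis.

    import numpy as np

    def clogging_label(dp_slope, flow_change):
        """Label one time step from the local trend of the differential
        pressure (dp_slope) and the relative change in system flow."""
        if dp_slope > 0.010 and flow_change < -0.05:
            return 3   # rapidly increasing pressure, drastically decreasing flow
        if dp_slope > 0.002 and flow_change > -0.02:
            return 2   # steadily increasing pressure, roughly constant flow
        return 1       # linear / below initial differential pressure

    dp   = np.array([0.20, 0.21, 0.25, 0.32])   # hypothetical differential pressure
    flow = np.array([250, 249, 248, 246])       # hypothetical system flow rate
    labels = [clogging_label(dp[i] - dp[i - 1],
                             (flow[i] - flow[i - 1]) / flow[i - 1])
              for i in range(1, len(dp))]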


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

1          1 0 0
2    →     0 1 0
3          0 0 1

or

red        1 0 0
blue   →   0 1 0
green      0 0 1

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to equally predict all the actual classification labels rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves considerably higher accuracy than simple encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

x_scaled = (x_i − min(x)) / (max(x) − min(x))    (3.1)

Using the min-max scaler to normalise the data is useful because it helps avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
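A minimal sketch of the two transforms with scikit-learn; the example arrays below are hypothetical and not the thesis data (newer scikit-learn versions name the encoder argument sparse_output instead of sparse).

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

    labels = np.array([[1], [2], [2], [1]])         # clogging labels
    encoder = OneHotEncoder(sparse=False)
    onehot = encoder.fit_transform(labels)          # 1 -> [1, 0], 2 -> [0, 1]

    features = np.array([[0.2, 250.0], [0.4, 240.0], [0.8, 200.0]])
    scaler = MinMaxScaler()                         # Equation 3.1, range [0, 1]
    scaled = scaler.fit_transform(features)
    original = scaler.inverse_transform(scaled)     # easy to invert back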

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should be matched to a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement at one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The expansion of the features can be described by Equation 3.2 and Equation 3.3, and a sketch of such a sequencing function is shown after the equations. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]    (3.3)
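A hedged sketch of a sequencing function that builds windows of the 5 previous samples as input and the next time step as target; the function name and the random data are illustrative, not the thesis implementation.

    import numpy as np

    def make_sequences(data, n_past=5):
        """data: array of shape (samples, features). Returns X with the
        n_past previous time steps per sample and y with the next step."""
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i])   # window of previous measurements
            y.append(data[i])              # measurement one step ahead
        return np.array(X), np.array(y)

    data = np.random.rand(100, 4)          # 100 samples, 4 sensor variables
    X, y = make_sequences(data)            # X: (95, 5, 4), y: (95, 4)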


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data. A sketch of how such a network could be defined is shown below.
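A hedged sketch of the described LSTM network using the Keras API; the layer sizes, activations and early-stopping patience follow the text, while the optimizer, the number of input features and the variable names are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_steps, n_features = 5, 4      # 5 past samples, assumed 4 sensor variables

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True,
             input_shape=(n_steps, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),   # one neuron for parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    # model.fit(X_train, y_train, epochs=1500,
    #           validation_data=(X_val, y_val), callbacks=[early_stop])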

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6. A sketch of such an architecture is shown at the end of this section.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
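A hedged sketch of the described CNN using the Keras API; the layer sizes follow the text, while the optimizer, the number of input features and the variable names are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_steps_in, n_features, n_steps_out = 12, 4, 6   # 60 s history, 30 s forecast

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu",
               input_shape=(n_steps_in, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(n_steps_out),                          # 6 future predictions
    ])
    model.compile(optimizer="adam", loss="mae")

    early_stop = EarlyStopping(monitor="val_loss", patience=150)
    # model.fit(X_train, y_train, epochs=1500,
    #           validation_data=(X_val, y_val), callbacks=[early_stop])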

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks for predicting future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they often come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                          Chapter 4

                          Results

                          This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                        Prediction
                        Label 1    Label 2
Actual    Label 1       109        1
          Label 2       3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                        Prediction
                        Label 1    Label 2
Actual    Label 1       82         29
          Label 2       38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                        Prediction
                        Label 1    Label 2
Actual    Label 1       69         41
          Label 2       11         659


                          Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each step is one time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                          Chapter 6

                          Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                          Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, April 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, March 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, October 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, November 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at: https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at: https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. Available at: http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki - Fast.ai, 2019. Available at: http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, March 2018, Kuala Lumpur, Malaysia.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. Also available as abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, July 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, June 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at: https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at: https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at: http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, April 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at: https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, November 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at: http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, October 2017.


TRITA-ITM-EX 2019:606

www.kth.se



2.1.4 The Clogging Phenomena

To predict the clogging phenomena, some indicators of clogging have to be identified. Indicators of clogging in fuel filters have been investigated and discussed in a series of papers by Eker et al. [4-6]. A fuel filter shares a lot of similarities with a basket filter in the sense that they both remove particles in the supplied liquid in order to get a filtrate. Two indicators were especially taken into consideration, namely the differential pressure over the filter, ∆p, and the flow rate after the filter, Q. The results from the papers show that clogging of a filter occurs due to the following:

1. a steady increase in ∆p over time, due to an increase over time in incoming pressure p_in

2. a decrease in Q as a result of an increase in ∆p

These conclusions suggest that a modelling approach to identify clogging is possible. By observing the two variables from the start of a pumping process, the following clogging states can be identified:

1. steady ∆p and Q → no/little clogging

2. linear increase in ∆p and steady Q → moderate clogging

3. exponential increase in ∆p and drastic decrease in Q → fully clogged

With the established logic of classification in place, each individual pumping sequence can be classified to begin generating a dataset containing the necessary information.

Figure 2.3: Visualization of the clogging states (source: Eker et al. [6]).


2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction, and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p    (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L    (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \frac{(1 - \varepsilon)^2 L}{\varepsilon^3}    (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effect in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1 - \varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1 - \varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p}    (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable   Description                           Unit
∆p         Pressure drop                         Pa
L          Total height of filter cake           m
V_s        Superficial (empty-tower) velocity    m/s
µ          Viscosity of the fluid                kg/(m·s)
ε          Porosity of the filter cake           -
D_p        Diameter of the spherical particle    m
ρ          Density of the liquid                 kg/m³


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.
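As an illustration of Equation 2.4, the following Python sketch computes the viscous and inertial contributions to the pressure drop; the cake and fluid properties in the example call are made-up placeholders, not values measured in this work.

def ergun_pressure_drop(V_s, mu, eps, L, D_p, rho):
    """Pressure drop over a filter cake per Ergun's equation (2.4).

    V_s: superficial velocity [m/s], mu: fluid viscosity [kg/(m*s)],
    eps: porosity [-], L: cake height [m], D_p: particle diameter [m],
    rho: fluid density [kg/m^3]. Returns the pressure drop in Pa.
    """
    viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
    inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
    return viscous + inertial

# Example with hypothetical values: 1 mm particles, 20% porosity, 2 mm cake.
print(ergun_pressure_drop(V_s=0.05, mu=1e-3, eps=0.2, L=0.002, D_p=1e-3, rho=1000.0))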

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining, and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13], and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15-17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs, which can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                          Prediction
                          Positive                Negative
Actual   Positive         True Positive (TP)      False Negative (FN)
         Negative         False Positive (FP)     True Negative (TN)

The definition of accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases}    (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve a higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.
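A minimal Python sketch of Equation 2.5, comparing predicted and actual labels; the label values in the example are hypothetical.

def accuracy(y_true, y_pred):
    """Classification accuracy per Equation 2.5: the share of samples
    where the predicted label equals the actual label."""
    assert len(y_true) == len(y_pred)
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return correct / len(y_true)

print(accuracy([1, 2, 2, 1, 2], [1, 2, 2, 2, 2]))  # 0.8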


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

                            Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while specificity refers to the true negative rate (one minus the false positive rate). Both rates are represented by Equations 2.6 and 2.7, respectively:

sensitivity = \frac{TP}{TP + FN}    (2.6)

specificity = \frac{TN}{TN + FP}    (2.7)

Plotting the sensitivity on the y-axis against 1 − specificity (the false positive rate) on the x-axis then gives the ROC plot, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a well-performing model.

                            F1 Score

The F1 score is a measurement used to evaluate how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall, and F1 score are obtained through

precision = \frac{TP}{TP + FP}    (2.8)

recall = \frac{TP}{TP + FN}    (2.9)

F1 = 2 \times \frac{precision \times recall}{precision + recall}    (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited to the range 0 to 1.
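The following sketch computes precision, recall and F1 score from hypothetical confusion-matrix counts, following Equations 2.8-2.10.

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts (Eqs. 2.8-2.10)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
print(precision_recall_f1(tp=90, fp=10, fn=30))  # (0.9, 0.75, ~0.818)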

                            Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
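A small numpy sketch of Equation 2.11 for a single observation; the one-hot target and predicted probabilities are made up for illustration.

import numpy as np

def log_loss(y_onehot, p_pred, eps=1e-15):
    """Multi-class log loss per Equation 2.11 for one observation.

    y_onehot: one-hot vector for the true class, p_pred: predicted class
    probabilities. Probabilities are clipped to avoid log(0)."""
    p_pred = np.clip(p_pred, eps, 1.0)
    return -np.sum(y_onehot * np.log(p_pred))

print(log_loss(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # ~0.357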

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                            Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

                            Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, the larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

                            Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can project all the model errors, RMSE is still valid and well protected against outliers with a large enough number of samples n.

                            Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors, while MSPE considers the relative error [27]:

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

                            Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


Coefficient of Determination (r2)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free, in comparison to MSE and RMSE, and bounded between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r2 has some drawbacks. It does not take into account whether the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

                            Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
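The regression metrics above can be computed directly with numpy. The sketch below follows Equations 2.12-2.18, computing r2 as the squared correlation between actual and predicted values; the sample vectors are placeholders, not data from this work.

import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, MAPE and r^2 as defined in Equations 2.12-2.18."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y_true))
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    r2 = corr ** 2                      # squared correlation, Equation 2.18
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "r2": r2}

print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]))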

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                            Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                            Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
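For reference, a SARIMA model of the kind discussed here could be fitted with the statsmodels library roughly as sketched below; the series, model orders and forecast horizon are illustrative assumptions and not choices made in this thesis.

# Minimal SARIMA forecasting sketch; the (p, d, q) and seasonal orders would
# have to be chosen from ACF/PACF analysis of the actual sensor data.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def forecast_sarima(series: pd.Series, steps: int = 6):
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
    fitted = model.fit(disp=False)
    return fitted.forecast(steps=steps)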

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis, and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector, and b is the perceptron's individual bias.
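A minimal Python sketch of the perceptron rule in Equation 2.20; the weights and bias in the example are arbitrary and happen to implement an AND gate.

def perceptron_output(x, w, b):
    """Perceptron rule from Equation 2.20: weighted sum plus bias,
    thresholded at zero."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

print(perceptron_output([1, 1], w=[0.6, 0.6], b=-1.0))  # 1
print(perceptron_output([1, 0], w=[0.6, 0.6], b=-1.0))  # 0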

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                            Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                            Rectified Function

                            The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it outputs zero for negative inputs and inputs approaching zero, which can prevent such neurons from updating, also known as the dying ReLU problem [34].

                            Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets and models such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
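For comparison, the activation functions discussed in this section can be sketched in a few lines of numpy as follows; the input values are arbitrary examples.

import numpy as np

def step(z):               # perceptron threshold, Equation 2.20
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):            # Equation 2.21
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):               # Equation 2.23
    return np.maximum(0.0, x)

def swish(x, beta=1.0):    # Equation 2.24, beta constant or trainable
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")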

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                            Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                            Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple chained functions

f(x) = f^{(n)}(\dots f^{(2)}(f^{(1)}(x)))    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well matched architectures and existing training procedures of the deep networks.

As mentioned in section 2.3.1, the configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \quad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

Long Short-Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t), and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω_x, previous LSTM-block output at the previous time step h_{t-1}, input at the current time step x_t, and respective gate bias b_x, as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
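A minimal numpy sketch of the gate activations in Equation 2.26; the full cell-state and hidden-state updates of the LSTM are omitted, and the weight shapes are whatever matches the chosen hidden and input sizes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    """Input, output and forget gate activations per Equation 2.26.

    Each weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)
    o_t = sigmoid(W_o @ concat + b_o)
    f_t = sigmoid(W_f @ concat + b_f)
    return i_t, o_t, f_t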

                            Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                            Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allows the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
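As an illustration of the layer sequence described above (convolution, pooling, flattening, dense output), a Keras-style model could be sketched as below; the window length, feature count, filter sizes and output dimension are assumptions for illustration, not the architecture used in this work.

# Illustrative 1D CNN: two conv/pool stages, then flatten and dense layers.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Conv1D(filters=32, kernel_size=3, activation="relu", input_shape=(30, 4)),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=16, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(16, activation="relu"),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()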


                            Chapter 3

                            Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
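A minimal sketch of how such a labelling rule could be expressed in Python is given below; the slope and flow-drop thresholds are illustrative placeholders, not the values used for the actual labelling, and label 3 never occurred in the recorded data.

import numpy as np

def label_clogging(dp, flow, dp_slope_lo=0.0, dp_slope_hi=0.5, flow_drop=0.3):
    """Assign a clogging label (1-3) to a window of differential pressure (dp)
    and system flow samples, following the states in section 2.1.4."""
    dp_slope = np.polyfit(np.arange(len(dp)), dp, 1)[0]     # linear trend of dp
    rel_flow_drop = (flow[0] - flow[-1]) / max(flow[0], 1e-9)
    if dp_slope <= dp_slope_lo:
        return 1          # no/little clogging
    if dp_slope > dp_slope_hi and rel_flow_drop > flow_drop:
        return 3          # fully clogged
    return 2              # moderate clogging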


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM, to initially test the suitability of the data for time series forecasting, and the CNN, for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label (encoder) transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the

                            26

                            32 MODEL GENERATION

variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted without bias, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding has been shown by Seger [49] to be on par with other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
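A brief sketch of the two transforms using scikit-learn and Keras is shown below; the array shapes and the placeholder data are assumptions made for the example, not the actual sensor data.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow.keras.utils import to_categorical

    features = np.random.rand(100, 4)            # placeholder for the four sensor variables
    labels = np.random.randint(1, 3, size=100)   # placeholder clogging labels (1 or 2)

    # Min-max scale every variable to the range [0, 1]; the transform can later be inverted.
    scaler = MinMaxScaler(feature_range=(0, 1))
    features_scaled = scaler.fit_transform(features)
    # features_original = scaler.inverse_transform(features_scaled)

    # One hot encode the clogging labels into binary indicator columns.
    labels_onehot = to_categorical(labels - 1, num_classes=3)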

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t) \right] \qquad (3.2)

X(t) = \left[ V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t) \right] \qquad (3.3)
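A minimal sketch of such a sequencing function is given below; the exact implementation used in the thesis is not specified, so this is only one possible interpretation of Equations 3.2 and 3.3.

    import numpy as np

    def sequence_data(data, n_past=5):
        """Pair every window of n_past past observations with the observation one step ahead.

        data: array of shape (n_samples, n_features), ordered in time.
        Returns X of shape (n_windows, n_past, n_features) and y of shape (n_windows, n_features).
        """
        X, y = [], []
        for i in range(len(data) - n_past):
            X.append(data[i:i + n_past])   # the time window of previous measurements
            y.append(data[i + n_past])     # the measurement to predict, one step ahead
        return np.array(X), np.array(y)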


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – The number of data points

• Time steps – The points of observation of the samples

• Features – The observed variables per time step
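Sketched below is one way to perform the split and make the three-dimensional input shape explicit; the 80/20 split follows the text, while the placeholder arrays and their sizes are assumptions for the example.

    import numpy as np

    # X and y as produced by a sequencing function like the sketch above (placeholder data here).
    X = np.random.rand(200, 5, 4)   # (samples, time steps, features)
    y = np.random.rand(200, 4)

    split = int(0.8 * len(X))
    X_train, X_val = X[:split], X[split:]
    y_train, y_val = y[:split], y[split:]
    # The LSTM expects its input to be three-dimensional: (samples, time steps, features).
    print(X_train.shape)   # (160, 5, 4)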

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture
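A sketch of the described architecture in Keras could look as follows; the choice of the Adam optimiser is an assumption, since the thesis does not state which optimiser was used, and the feature count is an example value.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    n_past, n_features = 5, 4   # 5 past time steps; the feature count is an assumed example value

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True, input_shape=(n_past, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),   # one output neuron for the predicted parameter
    ])
    model.compile(optimizer="adam", loss="mae")   # or loss="mse" for the MSE-trained variant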

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in this way helps ensure that the network is not overfitted to the training data.
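Continuing the sketch above, this training scheme can be expressed in Keras with an EarlyStopping callback; restoring the best weights is an assumption on top of the described stopping rule.

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
    history = model.fit(X_train, y_train,
                        epochs=1500,
                        validation_data=(X_val, y_val),
                        callbacks=[early_stop],
                        verbose=0)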

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. Like the data in the LSTM, the training set and the validation set contain input data as well as the corresponding correct output data.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture
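A corresponding Keras sketch of the described CNN is shown below; the ReLU activations in the convolutional and 50-node layers and the Adam optimiser are assumptions, as the thesis only specifies the layer sizes.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    n_in, n_out, n_features = 12, 6, 4   # 12 past steps, 6 predicted steps; feature count assumed

    cnn = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_in, n_features)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(n_out),                    # one output per future time step
    ])
    cnn.compile(optimizer="adam", loss="mae")   # or loss="mse"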

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
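As a minimal illustration (placeholder data, not the thesis code), the adjustment amounts to separating variable values from clogging labels before the split:

    import numpy as np

    # Stand-ins for the scaled variables and one hot encoded labels from the earlier sketch.
    features_scaled = np.random.rand(100, 4)
    labels_onehot = np.eye(2)[np.random.randint(0, 2, size=100)]

    # Inputs: only the values of the variables. Outputs: only the clogging labels.
    X_cls, y_cls = features_scaled, labels_onehot
    split = int(0.8 * len(X_cls))
    X_train_c, X_val_c = X_cls[:split], X_cls[split:]
    y_train_c, y_val_c = y_cls[:split], y_cls[split:]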

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for a classification problem than for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they typically come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                            Chapter 4

                            Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                      Prediction
                      Label 1    Label 2
Actual    Label 1     109        1
          Label 2     3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                      Label 1    Label 2
Actual    Label 1     82         29
          Label 2     38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                      Label 1    Label 2
Actual    Label 1     69         41
          Label 2     11         659


                            Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not surprising as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                            Chapter 6

                            Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN when data containing all clogging labels are available.

On the contrary, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                            Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Faith Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the f-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (mae) and the root mean square error (rmse) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)? – arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (relu). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.




2.1.5 Physics-based Modelling

The pressure drop over the filter has been identified as a key indicator of clogging. To better understand what effect certain parameters have on the filter, a model has to be created. Roussel et al. [7] identify filter clogging as a probability of the presence of particles. Furthermore, they identify the clogging process as a function of a set of variables: the ratio of particle to mesh hole size, the solid fraction and the number of grains arriving at each mesh hole during one test.

Filtration of liquids and filter clogging have been tested for various types of flows. Laminar flow through a permeable medium has been investigated by Wakeman [8], and it can be described by Darcy's equation [9] as

Q_L = \frac{K A}{\mu L} \Delta p \qquad (2.1)

rewritten as

\Delta p = \frac{\mu L}{K A} Q_L \qquad (2.2)

A more recent and commonly used equation for absolute permeability is the Kozeny-Carman equation. The equation was derived by Kozeny and Carman [10] and reads

\Delta p = \frac{k V_s \mu}{\Phi^2 D_p^2} \frac{(1-\varepsilon)^2 L}{\varepsilon^3} \qquad (2.3)

Equation 2.3 is flawed in the sense that it does not take into account the inertial effects in the flow. This is considered by the later Ergun equation [11]:

\Delta p = \frac{150 V_s \mu (1-\varepsilon)^2 L}{D_p^2 \varepsilon^3} + \frac{1.75 (1-\varepsilon) \rho V_s^2 L}{\varepsilon^3 D_p} \qquad (2.4)

where the first term in Equation 2.4 represents the viscous effects and the second term represents the inertial effects. An explanation of the variables can be found in Table 2.1.

Table 2.1: Variable explanation for Ergun's equation

Variable    Description                           Unit
Δp          Pressure drop                         Pa
L           Total height of filter cake           m
Vs          Superficial (empty-tower) velocity    m/s
μ           Viscosity of the fluid                kg/(m·s)
ε           Porosity of the filter cake           m²
Dp          Diameter of the spherical particle    m
ρ           Density of the liquid                 kg/m³
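To make the role of each term concrete, a small Python sketch of Equation 2.4 is given below; the numerical inputs are arbitrary example values, not data from the thesis.

    def ergun_pressure_drop(V_s, mu, rho, D_p, eps, L):
        """Pressure drop over a filter cake according to the Ergun equation (Equation 2.4).

        V_s: superficial velocity [m/s], mu: fluid viscosity [kg/(m*s)],
        rho: fluid density [kg/m^3], D_p: particle diameter [m],
        eps: porosity, L: cake height [m]. Returns the pressure drop in Pa.
        """
        viscous = 150.0 * V_s * mu * (1.0 - eps) ** 2 * L / (D_p ** 2 * eps ** 3)
        inertial = 1.75 * (1.0 - eps) * rho * V_s ** 2 * L / (eps ** 3 * D_p)
        return viscous + inertial

    # Example: water-like fluid through a thin cake of fine particles (illustrative values only).
    print(ergun_pressure_drop(V_s=0.05, mu=1.0e-3, rho=1000.0, D_p=1.0e-4, eps=0.4, L=5.0e-3))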


Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to the variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining and ML in order to analyse current and past information to make predictions on future events. Having been applied to other areas such as credit scoring [12], healthcare [13] and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs, which can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                      Prediction
                      Positive                Negative
Actual    Positive    True Positive (TP)      False Negative (FN)
          Negative    False Positive (FP)     True Negative (TN)

The definition of the accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad \text{where } j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance, due to ignored FPs and FNs.


In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real world problems.

                              Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity. The two rates are represented by Equations 2.6 and 2.7, respectively.

\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.6)

\text{specificity} = \frac{TN}{TN + FP} \qquad (2.7)

With the sensitivity on the y-axis and the specificity on the x-axis, the AUC plot is obtained, where every correctly classified true positive generates a step in the y-direction and every correctly classified false positive generates a step in the x-direction. The AUC is limited to the range 0 to 1, where a higher value means a well performing model.
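As a small illustration (toy values, not data from the thesis), the AUC can be computed directly from predicted probabilities, for instance with scikit-learn:

    from sklearn.metrics import roc_auc_score

    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual class labels
    y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # predicted probabilities of class 1
    print(roc_auc_score(y_true, y_score))                # 1.0: every positive ranked above every negative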

                              F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of the predicted positives that are correctly classified, and recall refers to the percentage of the actual positives that are correctly classified [22]. Precision, recall and F1 score are obtained through

\text{precision} = \frac{TP}{TP + FP} \qquad (2.8)

\text{recall} = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing very well. The F1 score itself is limited to the range 0 to 1.
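A small worked example (toy data only) of Equations 2.8 to 2.10 via scikit-learn:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]       # TP = 3, FP = 1, FN = 1
    print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
    print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
    print(f1_score(y_true, y_pred))         # 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75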

                              Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)
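For illustration, Equation 2.11 averaged over a few toy observations can be computed with scikit-learn's log_loss (example values only, not from the thesis):

    from sklearn.metrics import log_loss

    y_true = [0, 1, 2]                    # correct class per observation
    y_prob = [[0.8, 0.1, 0.1],            # predicted probability per class; each row sums to one
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]]
    print(log_loss(y_true, y_prob))       # confident but wrong predictions are penalised the hardest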

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                              Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over predicted or under predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

                              Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the average of the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of large errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

                              Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{2.14}$$

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

$$\frac{\partial\,\text{RMSE}}{\partial \hat{y}_i} = \frac{1}{\sqrt{\text{MSE}}}\,\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} \tag{2.15}$$

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

                              Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27]:

$$\text{MSPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \tag{2.16}$$

                              Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r², MAPE is scale free and is obtained through

$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2.17}$$


Coefficient of Determination (r²)

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r² is scale free, in comparison to MSE and RMSE, and bound between −∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r² score means that the model is bad at fitting the data.

$$r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \tag{2.18}$$

r² has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r²-score will always increase, simply because the new fit will have more terms. These issues are handled by the adjusted r².

Adjusted r²

Adjusted r², just like r², indicates how well terms fit a curve or a line. The difference is that adjusted r² adjusts for the number of terms, or predictors, in the model. If variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r² to always be less than or equal to r². For n observations and k variables the adjusted r² is calculated through

$$r^2_{adj} = 1 - \left[\frac{(1-r^2)(n-1)}{n-k-1}\right] \tag{2.19}$$

Adjusted r² can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
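As an illustration of how these regression metrics relate to one another, a minimal NumPy sketch is given below; the target and prediction vectors are hypothetical, and r² is computed here in the squared-correlation form of Equation 2.18.

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """Compute the regression error metrics of section 2.2.2.
    k is the number of predictors, used for the adjusted r2."""
    n = len(y_true)
    err = y_true - y_pred
    mae  = np.mean(np.abs(err))                      # Equation 2.12
    mse  = np.mean(err ** 2)                         # Equation 2.13
    rmse = np.sqrt(mse)                              # Equation 2.14
    mspe = 100.0 / n * np.sum((err / y_true) ** 2)   # Equation 2.16
    mape = 100.0 / n * np.sum(np.abs(err / y_true))  # Equation 2.17
    r = np.corrcoef(y_true, y_pred)[0, 1]
    r2 = r ** 2                                      # Equation 2.18 (squared correlation)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Equation 2.19
    return mae, mse, rmse, mspe, mape, r2, r2_adj

# Hypothetical target and prediction vectors.
y_true = np.array([1.0, 1.2, 1.5, 1.9, 2.4])
y_pred = np.array([1.1, 1.2, 1.4, 2.0, 2.3])
print(regression_metrics(y_true, y_pred, k=4))
```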

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                              Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                              Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.
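A minimal sketch of this pre-processing step is shown below, assuming the time series is held in a pandas Series; the values are hypothetical and only illustrate the log transform followed by differencing.

```python
import numpy as np
import pandas as pd

# Hypothetical non-stationary series, e.g. a slowly rising differential pressure.
series = pd.Series([0.10, 0.11, 0.13, 0.16, 0.21, 0.27, 0.35, 0.46])

log_series = np.log(series)               # stabilise the variance
stationary = log_series.diff().dropna()   # remove the trend (stationary mean)

# The transform is easily inverted for forecasts on the original scale:
# y_hat = np.exp(log_series.iloc[-1] + predicted_difference)
print(stationary)
```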

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different


properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2.20}$$

In the above equation, x is the input vector, w the weight vector and b the perceptron's individual bias.
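A minimal sketch of this rule in NumPy is shown below; the weights, bias and input are hypothetical values chosen only to illustrate Equation 2.20.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron following Equation 2.20."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical input, weights and bias.
x = np.array([1, 0, 1])
w = np.array([0.6, 0.4, -0.2])
b = -0.3
print(perceptron(x, w, b))  # -> 1, since 0.6 - 0.2 - 0.3 = 0.1 > 0
```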

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                              Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.21}$$

for

$$z = \sum_{j} w_j \cdot x_j + b \tag{2.22}$$


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                              Rectified Function

                              The rectifier activation function is defined as the positive part of its argument [34]

$$f(x) = x^+ = \max(0, x) \tag{2.23}$$

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

                              Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

$$f(x) = x \cdot \text{sigmoid}(\beta x) \tag{2.24}$$

where β is a trainable parameter or simply a constant. Swish has proved to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
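The three activation functions above can be written compactly in NumPy, as in the small sketch below; the β value for Swish is an assumed constant, not a trained parameter.

```python
import numpy as np

def sigmoid(z):
    """Equation 2.21."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    """Equation 2.23."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Equation 2.24; beta is trainable in the original formulation,
    a fixed constant of 1.0 is assumed here."""
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), swish(z), sep="\n")
```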

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

                              Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                              Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

$$f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \tag{2.25}$$

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons are able to feed information from the previous pass of data back to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

$$x_1 = \begin{bmatrix}0 & 0 & 1 & 1 & 0 & 0 & 0\end{bmatrix} \qquad x_2 = \begin{bmatrix}0 & 0 & 0 & 1 & 1 & 0 & 0\end{bmatrix}$$

where x₁ and x₂ are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

                              Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

$$\begin{aligned} i_t &= \sigma(\omega_i\,[h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o\,[h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f\,[h_{t-1}, x_t] + b_f) \end{aligned} \tag{2.26}$$

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
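As an illustration of Equation 2.26, the sketch below computes the three gate activations of a single LSTM cell in NumPy; the dimensions, weights and inputs are hypothetical and randomly initialised, not taken from any trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3

# Hypothetical gate weights and biases; each gate acts on [h_{t-1}, x_t].
w_i, w_o, w_f = (rng.normal(size=(n_hidden, n_hidden + n_input)) for _ in range(3))
b_i, b_o, b_f = (np.zeros(n_hidden) for _ in range(3))

h_prev = np.zeros(n_hidden)          # previous block output h_{t-1}
x_t = rng.normal(size=n_input)       # current input x_t
hx = np.concatenate([h_prev, x_t])   # concatenated [h_{t-1}, x_t]

i_t = sigmoid(w_i @ hx + b_i)        # input gate
o_t = sigmoid(w_o @ hx + b_o)        # output gate
f_t = sigmoid(w_f @ hx + b_f)        # forget gate
print(i_t, o_t, f_t, sep="\n")
```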

                              Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                              Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
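The three operations illustrated in Figures 2.4 to 2.6 can be sketched in a few lines of NumPy; the input sequence, kernel and pool size below are hypothetical and chosen only to mirror the figures.

```python
import numpy as np

x = np.array([3., 1., 4., 1., 5., 9., 2., 6.])   # hypothetical 1-D input
kernel = np.array([0.25, 0.5, 0.25])              # kernel of size 3

# Convolution: slide the kernel over the input (stride 1, no padding).
conv = np.array([np.dot(x[i:i + 3], kernel) for i in range(len(x) - 2)])

# Max pooling with pool size 2: keep the maximum of each pair.
pooled = conv[:len(conv) // 2 * 2].reshape(-1, 2).max(axis=1)

# Flattening: collapse the (already 1-D) feature map into the dense-layer input.
flattened = pooled.ravel()
print(conv, pooled, flattened, sep="\n")
```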


                              Chapter 3

                              Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush, in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test  | Samples | Points labelled clog-1 | Points labelled clog-2
I     | 685     | 685                    | 0
II    | 220     | 25                     | 195
III   | 340     | 35                     | 305
IV    | 210     | 11                     | 199
V     | 375     | 32                     | 343
VI    | 355     | 7                      | 348
VII   | 360     | 78                     | 282
VIII  | 345     | 19                     | 326
IX    | 350     | 10                     | 340
X     | 335     | 67                     | 268
XI    | 340     | 43                     | 297
Total | 3915    | 1012                   | 2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

$$\begin{Bmatrix} 1 \\ 2 \\ 3 \end{Bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \text{or} \qquad \begin{Bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{Bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to equally predict all the actual classification labels rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
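A minimal sketch of this label transform is given below; a plain NumPy implementation is assumed here, although library helpers such as scikit-learn's OneHotEncoder or Keras' to_categorical perform the same task, and the label values are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Return one binary column per unique category (one hot encoding)."""
    categories = np.unique(labels)
    return (labels[:, None] == categories[None, :]).astype(float), categories

# Hypothetical clogging labels.
labels = np.array([1, 2, 2, 1, 2])
encoded, categories = one_hot(labels)
print(categories)  # [1 2]
print(encoded)     # [[1. 0.] [0. 1.] [0. 1.] [1. 0.] [0. 1.]]
```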

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
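A minimal sketch of the min-max transform and its inverse is given below; the feature values are hypothetical and a plain NumPy implementation is assumed rather than a specific library.

```python
import numpy as np

x = np.array([0.12, 0.45, 0.30, 0.90, 0.60])   # hypothetical sensor feature

x_min, x_max = x.min(), x.max()
scaled = (x - x_min) / (x_max - x_min)          # Equation 3.1, range [0, 1]
restored = scaled * (x_max - x_min) + x_min     # inverse transform

print(scaled)
print(np.allclose(restored, x))  # True
```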

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction; in this case, the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

$$X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2}$$

$$X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \dots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3}$$
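A minimal sketch of such a sequencing function is shown below, assuming the data are held in a NumPy array with one row per 5-second sample and one column per variable; the array contents and window length are illustrative only.

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Turn a (samples, features) array into LSTM input of shape
    (samples - n_past, n_past, features) with one-step-ahead targets."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # the 25-second window of past values
        y.append(data[t])              # the value 5 seconds ahead
    return np.array(X), np.array(y)

# Hypothetical dataset: 100 samples of 4 sensor variables.
data = np.random.rand(100, 4)
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)  # (95, 5, 4) (95, 4)
```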


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points
• Time steps – the points of observation of the samples
• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network training in such a way ensures that the network is not overfitted to the training data.
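A minimal Keras sketch of a regression network with the architecture described above could look as follows; the optimiser choice and the input dimensions are assumptions for illustration and are not taken from the thesis.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 5, 5   # assumed: 5 past time steps, 4 sensor variables + clogging label

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),   # one-step parameter prediction on min-max scaled data
])
model.compile(optimizer="adam", loss="mae")   # MAE or MSE, as in section 3.3

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```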

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with enough nodes to match the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
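Under the same caveat as for the LSTM sketch, a minimal Keras version of the described CNN could look like this; the dense-layer activation and the optimiser are assumptions, while the filter count, kernel size, pool size and layer widths follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features, n_future = 12, 5, 6   # assumed input/output dimensions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_future),                      # 6 future observations, i.e. 30 seconds ahead
])
model.compile(optimizer="adam", loss="mse")

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])
```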

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors would be more penalising as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE will allow for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                              Chapter 4

                              Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function | # of epochs | MSE   | RMSE  | R²    | MAE
MAE           | 738         | 0.001 | 0.029 | 0.981 | 0.016
MSE           | 665         | 0.014 | 0.119 | 0.694 | 0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs | Accuracy | ROC   | F1    | log-loss
190         | 99.5%    | 0.993 | 0.995 | 0.082

Table 4.3: LSTM confusion matrix

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 109               | 1
Actual Label 2 | 3                 | 669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function | # of epochs | MSE   | RMSE  | R²    | MAE
MAE           | 756         | 0.007 | 0.086 | 0.876 | 0.025
MSE           | 458         | 0.008 | 0.092 | 0.843 | 0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network | # of epochs | Accuracy | AUC   | F1    | log-loss
MAE                | 1203        | 91.4%    | 0.826 | 0.907 | 3.01
MSE                | 1195        | 93.3%    | 0.791 | 0.926 | 2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 82                | 29
Actual Label 2 | 38                | 631

Table 4.7: CNN confusion matrix for data from the MSE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 69                | 41
Actual Label 2 | 11                | 659


                              Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as the regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, though, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute to physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                              Chapter 6

                              Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting for seeing the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN given data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                              Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                                CHAPTER 2 FRAME OF REFERENCE

Comparing Darcy's Equation 2.2 to Ergun's Equation 2.4, the latter offers a deeper insight into how alterations to variables affect the final differential pressure.

2.2 Predictive Analytics

Using historical data to make predictions of future events is a field known as predictive analytics. Predictive analytics research covers statistical methods and techniques from areas such as predictive modelling, data mining, and ML in order to analyse current and past information to make predictions about future events. Having been applied to other areas such as credit scoring [12], healthcare [13], and retailing [14], a similar approach of prediction has also been investigated in predictive maintenance [15–17].

Predictive maintenance (PdM) includes methods and techniques that estimate and determine the condition of equipment or components to predict when maintenance is required, as opposed to traditional preventive maintenance, which is based on the idea of performing routinely scheduled checks to avoid failures or breakdowns. The quality of predictive methods and algorithms is ensured by measuring the accuracy of the model in terms of correctly labelling the input data to its respective output, also known as classification. Every prediction comes with four possible outputs that can be visualised in a table known as a confusion matrix, as shown in Table 2.2.

Table 2.2: Outputs of a confusion matrix

                          Actual
                          Positive               Negative
Prediction   Positive     True Positive (TP)     False Positive (FP)
             Negative     False Negative (FN)    True Negative (TN)

The definition of accuracy is the percentage of instances where a sample is classified correctly, and it can be obtained as done by König [18]:

ACC = \frac{\sum_{i=1}^{n} j_i}{n}, \quad j_i = \begin{cases} 1 & \text{if } \hat{y}_i = y_i \\ 0 & \text{if } \hat{y}_i \neq y_i \end{cases} \qquad (2.5)

by comparing the actual value y_i and the predicted value ŷ_i for a group of n samples. However, using the overall accuracy as an error metric has two flaws. Provost et al. [19] argue that accuracy as an error metric and classification tool assumes that the supplied data follow the true class distribution and that the penalty of misclassification is equal for all classes. The same claims are backed by Spiegel et al. [20], who present that ignoring the severity of individual problems to achieve higher accuracy of failure classification may have a direct negative impact on the economic cost of maintenance due to ignored FPs and FNs.
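As a minimal illustration of Equation 2.5 (the labels below are hypothetical and not taken from the thesis data), the accuracy reduces to the fraction of matching label pairs:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Equation 2.5: fraction of samples where the prediction equals the actual label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# Toy example: three of four predictions are correct.
print(accuracy([1, 2, 2, 1], [1, 2, 2, 2]))  # 0.75
```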


In order to better evaluate all data, various error metrics have been developed. The metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

                                Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, while specificity denotes the true negative rate (one minus the false positive rate). Both rates are represented by Equations 2.6 and 2.7 respectively.

sensitivity = \frac{TP}{TP + FN} \qquad (2.6)

specificity = \frac{TN}{TN + FP} \qquad (2.7)

Plotting the sensitivity on the y-axis against the false positive rate (one minus the specificity) on the x-axis gives the ROC plot, where every correctly classified true positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the ROC curve is limited to the range 0 to 1, where a higher value means a better performing model.
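In practice the ROC curve and its area are rarely computed by hand. A short sketch using scikit-learn (an assumption on my part; the thesis does not state its tooling) with hypothetical labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.45, 0.6]

auc = roc_auc_score(y_true, y_score)                # area under the ROC curve, in [0, 1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # the points of the ROC curve itself
print(auc)
```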

                                F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying a number of samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples and recall refers to the percentage of actual correct classifications [22]. Precision, recall, and the F1 score are obtained through

precision = \frac{TP}{TP + FP} \qquad (2.8)

recall = \frac{TP}{TP + FN} \qquad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are difficult to classify. The F1 score attempts to balance the precision and the recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
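Equations 2.8–2.10 map directly onto standard library calls; a hedged sketch with scikit-learn and made-up binary labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP), Equation 2.8
print(recall_score(y_true, y_pred))     # TP / (TP + FN), Equation 2.9
print(f1_score(y_true, y_pred))         # harmonic mean of the two, Equation 2.10
```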

                                Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase in classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The Log Loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                                Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.12)

                                Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of larger errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.13)

                                Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \, \frac{\partial MSE}{\partial \hat{y}_i} \qquad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.
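The three metrics side by side, as a small numpy sketch with made-up targets; note how the single large error on the last sample dominates MSE and RMSE much more than MAE, matching the discussion above:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))       # Equation 2.12

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)        # Equation 2.13

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))           # Equation 2.14

# Hypothetical regression targets and predictions; the last prediction is far off.
y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2, 7.0])
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```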

                                Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27].

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \qquad (2.16)

                                Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (2.17)


                                Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r2 is scale-free, in contrast to MSE and RMSE, and bounded between −∞ and 1, so regardless of whether the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \qquad (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r2.

                                Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If variables that prove to be useless are added, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \qquad (2.19)

Adjusted r2 can therefore more accurately show the percentage of variation in the dependent variable that is explained by the independent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
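Both quantities are straightforward to compute; a short numpy sketch of Equations 2.18 and 2.19 (the example numbers n = 100 observations and k = 4 predictors are invented for illustration):

```python
import numpy as np

def r2(y_true, y_pred):
    """Equation 2.18: squared correlation between actual and predicted values."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return (np.sum(yt * yp) / np.sqrt(np.sum(yt ** 2) * np.sum(yp ** 2))) ** 2

def adjusted_r2(r2_score, n, k):
    """Equation 2.19: r2 penalised for the number of predictors k given n observations."""
    return 1 - ((1 - r2_score) * (n - 1)) / (n - k - 1)

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.array([2.2, 3.9, 6.3, 7.5, 10.4])
score = r2(y_true, y_pred)
print(score, adjusted_r2(score, n=100, k=4))
```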

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                                Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance, and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The particular strength of ARIMA and SARIMA is their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
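As a rough sketch of how such a model could be fitted in practice (assuming the statsmodels library, a synthetic stand-in for a univariate differential-pressure series, and placeholder model orders, none of which come from the thesis):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical stand-in for a univariate differential-pressure series (5-second sampling).
dp = pd.Series(np.cumsum(np.random.randn(500)))

# The (p, d, q) and seasonal (P, D, Q, s) orders below are placeholders, not tuned values;
# with the seasonal order set to zero this reduces to a plain ARIMA model.
model = SARIMAX(dp, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=6))   # six 5-second steps, i.e. 30 seconds ahead
```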

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis, and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2.20)

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
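Equation 2.20 translates into a few lines of code; the weights and bias below are hypothetical values chosen so that the perceptron behaves like an AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """Equation 2.20: binary output from the weighted input sum plus bias."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.6, 0.6])   # hypothetical weights
b = -1.0                   # hypothetical bias
print(perceptron(np.array([1, 1]), w, b))  # 1: weighted sum exceeds the bias
print(perceptron(np.array([1, 0]), w, b))  # 0: weighted sum does not
```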

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, a requirement that the step function is also unable to fulfil.

                                Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.21)

for

z = \sum_{j} w_j \cdot x_j + b \qquad (2.22)


Only by using the sigmoid function as the activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                                Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x) \qquad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because the ReLU outputs zero for all non-positive inputs, neurons can get stuck in a state where they always output zero and stop updating their weights, also known as the dying ReLU problem [34].

                                Swish Function

Proposed by Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x) \qquad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on ImageNet for widely used models such as Mobile NASNet-A and Inception-ResNet-v2, by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
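The activation functions discussed above are simple elementwise operations; a compact numpy sketch of Equations 2.20, 2.21, 2.23, and 2.24 (the test vector z is arbitrary):

```python
import numpy as np

def step(z):                      # perceptron activation, Equation 2.20
    return np.where(z > 0, 1, 0)

def sigmoid(z):                   # Equation 2.21
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):                      # Equation 2.23
    return np.maximum(0.0, x)

def swish(x, beta=1.0):           # Equation 2.24; beta may be trainable or constant
    return x * sigmoid(beta * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")
```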

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                                Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis function network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                                Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \qquad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break the main function up into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results on many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configurations of the hidden layers depend on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0, 0, 1, 1, 0, 0, 0]
x_2 = [0, 0, 0, 1, 1, 0, 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

                                Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t), and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or by allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t−1}), the input at the current time step (x_t), and the respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f) \qquad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
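To make Equation 2.26 concrete, a small numpy sketch of the three gate activations for a single time step (the hidden size of 4 units, input size of 3 features, and random weights are all hypothetical choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    """Equation 2.26: input, output, and forget gate activations for one time step.
    Each weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ hx + b_i)
    o_t = sigmoid(W_o @ hx + b_o)
    f_t = sigmoid(W_f @ hx + b_f)
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=4), rng.normal(size=3)      # 4 hidden units, 3 features
W_i, W_o, W_f = (rng.normal(size=(4, 7)) for _ in range(3))
zeros = np.zeros(4)
i_t, o_t, f_t = lstm_gates(h_prev, x_t, W_i, W_o, W_f, zeros, zeros, zeros)
print(i_t, o_t, f_t, sep="\n")
```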

                                Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate fewer, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.
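A sketch of this convolution–pooling–flattening stack for 1-dimensional (time series) input, assuming Keras; the input shape of 12 time steps with 5 features and the layer sizes are illustrative assumptions, not the thesis's configuration:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Conv1D(filters=32, kernel_size=3, activation="relu", input_shape=(12, 5)),
    MaxPooling1D(pool_size=2),      # max pooling halves the temporal dimension
    Flatten(),                      # flattening layer before the dense layers
    Dense(16, activation="relu"),
    Dense(1),                       # single regression output
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```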


                                Chapter 3

                                Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered to be 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples    Points labelled clog-1    Points labelled clog-2
I       685        685                       0
II      220        25                        195
III     340        35                        305
IV      210        11                        199
V       375        32                        343
VI      355        7                         348
VII     360        78                        282
VIII    345        19                        326
IX      350        10                        340
X       335        67                        268
XI      340        43                        297

Total   3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values for each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them, and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

    [1]          [1 0 0]          [red]            [1 0 0]
    [2]    →     [0 1 0]    or    [blue]     →     [0 1 0]
    [3]          [0 0 1]          [green]          [0 0 1]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
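A small sketch of the encoding using pandas (an assumed tool choice; the clogging labels below are invented examples of the two labels present in the data):

```python
import pandas as pd

# Hypothetical clogging labels: 1 = no clogging, 2 = beginning to clog.
labels = pd.Series([1, 2, 2, 1, 2], name="clog_label")

# One binary column per label value; each row has exactly one 1.
onehot = pd.get_dummies(labels, prefix="clog")
print(onehot)
```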

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \quad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
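A minimal sketch of the scaling step, assuming scikit-learn's MinMaxScaler (the sensor values below are illustrative), could be:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Illustrative sensor matrix: rows are time steps, columns are features
    # (e.g. differential pressure, system pressure, flow rates)
    X = np.array([[0.12, 2.1, 240.0],
                  [0.15, 2.0, 238.0],
                  [0.35, 2.2, 230.0]])

    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)               # every feature mapped into [0, 1]
    X_restored = scaler.inverse_transform(X_scaled)  # back to the original units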

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should be matched to a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The difference resulting from the expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases in proportion to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t),\; V_2(t),\; \ldots,\; V_{n-1}(t),\; V_n(t) \right] \quad (3.2)

X(t) = \left[ V_1(t-5),\; V_2(t-5),\; \ldots,\; V_{n-1}(t),\; V_n(t) \right] \quad (3.3)
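A sketch of what such a sequencing function could look like in Python is given below; the function name and the assumption that the data are held in a NumPy array are illustrative, not taken from the thesis implementation:

    import numpy as np

    def make_sequences(data, n_past=5):
        # data: 2D array of shape (time_steps, features).
        # Returns X of shape (samples, n_past, features) and y of shape (samples, features).
        X, y = [], []
        for i in range(n_past, len(data)):
            X.append(data[i - n_past:i, :])   # the past 5 observations (25 s window)
            y.append(data[i, :])              # the value one step (5 s) ahead
        return np.array(X), np.array(y)

Note that, as stated above, the number of usable samples shrinks by n_past compared to the original dataset.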


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture
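A minimal Keras-style sketch consistent with this description (two stacked 32-neuron LSTM layers with ReLU and a single sigmoid output) is shown below; the optimiser, the feature count, and the use of tensorflow.keras are assumptions, since the thesis does not specify the implementation details:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    n_past, n_features = 5, 5          # assumed window length and feature count

    model = Sequential([
        LSTM(32, activation='relu', return_sequences=True,
             input_shape=(n_past, n_features)),    # first LSTM layer
        LSTM(32, activation='relu'),                # second LSTM layer
        Dense(1, activation='sigmoid'),             # one output value per prediction
    ])
    model.compile(optimizer='adam', loss='mae')     # or loss='mse'

Since the inputs are min-max scaled to [0, 1], a sigmoid output can represent the scaled target directly.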

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
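In Keras-style code the early stop could be expressed with a callback; the training-set variable names are illustrative, and restoring the best weights is an added convenience not stated in the thesis:

    from tensorflow.keras.callbacks import EarlyStopping

    # Stop when the validation loss has not improved for 150 consecutive epochs
    early_stop = EarlyStopping(monitor='val_loss', patience=150,
                               restore_best_weights=True)

    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=1500, callbacks=[early_stop])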

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output data.
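A sketch of such a sequence splitting function, under the assumption that the data are a NumPy array and that a single target column is forecast (both assumptions made for illustration only), could be:

    import numpy as np

    def split_sequences(data, n_in=12, n_out=6):
        # data: 2D array (time_steps, features); builds samples of
        # 12 past observations (60 s) and 6 future values (30 s)
        X, y = [], []
        for i in range(len(data) - n_in - n_out + 1):
            X.append(data[i:i + n_in, :])                  # past window
            y.append(data[i + n_in:i + n_in + n_out, 0])   # future values of one assumed target column
        return np.array(X), np.array(y)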

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters (kernels) to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture
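A Keras-style sketch matching this description (64 filters, kernel size 4, pool size 2, a 50-node dense layer and 6 outputs) is given below; the hidden-layer activations and the optimiser are assumptions not stated in the thesis:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    n_in, n_features = 12, 5          # assumed input window and feature count

    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation='relu',
               input_shape=(n_in, n_features)),   # feature map over the time axis
        MaxPooling1D(pool_size=2),                # reduce the feature map
        Flatten(),
        Dense(50, activation='relu'),
        Dense(6),                                 # one output per predicted future step
    ])
    model.compile(optimizer='adam', loss='mae')   # or 'mse'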

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed given the input data. MAE allows outliers to play a smaller role and produces a good score even if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score under MSE, while MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function that minimises the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT), and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                Chapter 4

                                Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                       Prediction
                   Label 1    Label 2
Actual   Label 1      109          1
         Label 2        3        669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression respectively.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                       Prediction
                   Label 1    Label 2
Actual   Label 1       82         29
         Label 2       38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                       Prediction
                   Label 1    Label 2
Actual   Label 1       69         41
         Label 2       11        659


                                Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely, as such a regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would consequently prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification yields an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                Chapter 6

                                Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score, Wikipedia, the free encyclopedia, 2019. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss, Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, 03 2018. Kuala Lumpur, Malaysia.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


2.2 Predictive Analytics

In order to better evaluate all data, various error metrics have been developed. The various metrics can be placed in two different categories: classification error metrics and regression error metrics.

2.2.1 Classification Error Metrics

Classification error metrics assume that the used input data are not optimised and that basic classification assumptions are rarely true for real-world problems.

                                  Area Under Curve (AUC)

AUC observes the area under a receiver operating characteristic curve, also known as a ROC curve. A ROC curve measures the relationship between the true positive rate and the false positive rate and plots them against each other [18]. The true positive rate is in ML literature commonly referred to as sensitivity, and the true negative rate is referred to as specificity, so the false positive rate equals 1 − specificity. Sensitivity and specificity are given by Equations 2.6 and 2.7, respectively.

sensitivity = \frac{TP}{TP + FN} \quad (2.6)

specificity = \frac{TN}{TN + FP} \quad (2.7)

The true positive rate on the y-axis and the false positive rate on the x-axis then give the ROC plot, where every correctly classified positive generates a step in the y-direction and every false positive generates a step in the x-direction. The area under the curve is limited to the range 0 to 1, where a higher value means a well-performing model.
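As an illustration, the AUC and the points of the ROC curve can be computed with scikit-learn; the label and probability arrays below are illustrative:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 1, 0])                # actual binary labels
    y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.3])   # predicted probabilities

    auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_score)    # false/true positive rates per threshold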

                                  F1 Score

The F1 score is a measurement that evaluates how many samples the classifier classifies correctly and how robust it is against misclassifying samples [21]. For the F1 score, precision refers to the percentage of correctly classified samples, and recall refers to the percentage of actual positives that are correctly classified [22]. Precision, recall, and F1 score are obtained through

precision = \frac{TP}{TP + FP} \quad (2.8)

recall = \frac{TP}{TP + FN} \quad (2.9)

F_1 = 2 \times \frac{precision \times recall}{precision + recall} \quad (2.10)


Higher precision but lower recall means a very accurate prediction, but the classifier would then miss instances that are difficult to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.
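A small sketch of how precision, recall, and the F1 score can be computed, assuming scikit-learn (the labels are illustrative):

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]   # actual classes
    y_pred = [1, 0, 1, 0, 0, 1]   # predicted classes

    precision = precision_score(y_true, y_pred)   # Equation 2.8
    recall = recall_score(y_true, y_pred)         # Equation 2.9
    f1 = f1_score(y_true, y_pred)                 # Equation 2.10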

                                  Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower value of Log Loss means an increase of classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that an observation o belongs to the class c [23]. The log loss can be calculated through

LogLoss = -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}) \quad (2.11)

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or a pattern between a set of inputs and an outcome.

                                  Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \quad (2.12)

                                  Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the average of the squared difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that focuses more on large prediction errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (2.13)

                                  Root Mean Squared Error (RMSE)

RMSE is simply the square root of MSE. Taking the square root scales the error back to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (2.14)

The major difference between MSE and RMSE lies in their gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a scaling factor that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4), the two metrics cannot simply be interchanged:

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}}\,\frac{\partial MSE}{\partial \hat{y}_i} \quad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given enough samples n.
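For illustration, the three metrics above can be computed as follows, assuming scikit-learn and NumPy (the values are illustrative):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = np.array([0.30, 0.42, 0.55, 0.61])   # actual values
    y_pred = np.array([0.28, 0.45, 0.50, 0.66])   # predicted values

    mae = mean_absolute_error(y_true, y_pred)     # Equation 2.12
    mse = mean_squared_error(y_true, y_pred)      # Equation 2.13
    rmse = np.sqrt(mse)                           # Equation 2.14, same units as the target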

                                  Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE, where every sample weight is inversely proportional to the square of its respective target. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27].

MSPE = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \quad (2.16)

                                  Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \quad (2.17)


                                  Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r2 is scale-free, in contrast to MSE and RMSE, and bound between −∞ and 1, so whether the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2 \quad (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r2-score will always increase, simply because the new fit has more terms. These issues are handled by adjusted r2.

                                  Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right] \quad (2.19)

Adjusted r2 can therefore accurately show the percentage of variation in the independent variables that affects the dependent variable. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
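A minimal sketch of the adjusted r2 calculation, building on scikit-learn's r2_score (the helper function and example values are illustrative):

    from sklearn.metrics import r2_score

    def adjusted_r2(y_true, y_pred, k):
        # Equation 2.19 for n observations and k independent variables
        n = len(y_true)
        r2 = r2_score(y_true, y_pred)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1], k=2))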

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                                  Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                  Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods, by differencing the log-transformed data, makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA lies particularly in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layers may have very different


properties. The result is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important that input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = 0 if w · x + b ≤ 0
output = 1 if w · x + b > 0        (2.20)

In the above equation x is the input vector, w the weight vector, and b the perceptron's individual bias.
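A minimal sketch of the perceptron rule in Equation 2.20; the weights and bias below are arbitrary illustration values.

import numpy as np

def perceptron(x, w, b):
    # Output 1 if the weighted sum plus bias is positive, otherwise 0 (Equation 2.20).
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])          # binary inputs
w = np.array([0.6, 0.4, -0.2])   # one weight per input
b = -0.5                         # the bias sets how easily the neuron fires
print(perceptron(x, w, b))       # 0, since 0.6 - 0.2 - 0.5 = -0.1 <= 0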

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training with backpropagation, which requires a differentiable activation function, something the step function also fails to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = σ(z) = 1 / (1 + e^(−z))        (2.21)

for

z = Σ_j w_j · x_j + b        (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its use as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x⁺ = max(0, x)        (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit it cannot usefully process inputs that are negative or approach zero, since the unit then outputs zero and stops updating, which is known as the dying ReLU problem [34].

Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x · sigmoid(βx)        (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
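The activation functions above can be summarised in a few lines of NumPy; β is treated as a plain constant here, whereas in [35] it may also be trained.

import numpy as np

def step(z):
    # Perceptron step function: strictly binary output.
    return np.where(z > 0, 1, 0)

def sigmoid(z):
    # Equation 2.21: smooth output between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    # Equation 2.23: the positive part of the argument.
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # Equation 2.24: x * sigmoid(beta * x).
    return x * sigmoid(beta * x)

z = np.linspace(-3, 3, 7)
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")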

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to processing strictly 2D-arrays as input data.

Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f⁽¹⁾ + f⁽²⁾ + ... + f⁽ⁿ⁾        (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that no single function has to be all-descriptive but instead only needs to capture certain behaviour. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in section 2.3.4, and is further supported by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also


weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes vanishingly small, thus preventing the neuron from updating its weights. The result is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_(t−1)), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = σ(ω_i [h_(t−1), x_t] + b_i)
o_t = σ(ω_o [h_(t−1), x_t] + b_o)
f_t = σ(ω_f [h_(t−1), x_t] + b_f)        (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information might seem odd at first, but for sequencing it can be valuable when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
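A minimal NumPy sketch of the three gates in Equation 2.26 for a single time step; the weight shapes and random values are arbitrary, and a complete LSTM cell would also include the cell-state update that these gates control.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    # Input, output and forget gate activations (Equation 2.26).
    concat = np.concatenate([h_prev, x_t])   # [h_(t-1), x_t]
    i_t = sigmoid(W_i @ concat + b_i)
    o_t = sigmoid(W_o @ concat + b_o)
    f_t = sigmoid(W_f @ concat + b_f)
    return i_t, o_t, f_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=hidden), rng.normal(size=inputs)
W = lambda: rng.normal(size=(hidden, hidden + inputs))
b = lambda: np.zeros(hidden)
print(lstm_gates(h_prev, x_t, W(), W(), W(), b(), b(), b()))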

                                  Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                  Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45]. A minimal numerical sketch of both operations is given after Figure 2.5.


Figure 2.5: A max pooling layer with pool size 2 pooling an input
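A minimal numerical sketch of a 1-D convolution followed by max and average pooling, illustrating the operations in Figures 2.4 and 2.5; the kernel and input values are arbitrary.

import numpy as np

def conv1d(x, kernel):
    # Slide the kernel over the 1-D input and return the convolved feature.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def pool1d(x, size=2, mode="max"):
    # Reduce the convolved feature with non-overlapping max or average pooling.
    windows = [x[i:i + size] for i in range(0, len(x) - size + 1, size)]
    reduce_fn = np.max if mode == "max" else np.mean
    return np.array([reduce_fn(w) for w in windows])

x = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 1.0, 4.0])
feature = conv1d(x, kernel=np.array([0.5, 1.0, 0.5]))  # kernel of size 3
print(feature)                     # convolved feature
print(pool1d(feature, 2, "max"))   # max pooling keeps the strongest activation
print(pool1d(feature, 2, "mean"))  # average pooling keeps the mean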

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map


                                  Chapter 3

                                  Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to obtain complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified. A hedged sketch of such a rule-based labelling is given below.
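The exact thresholds used in the labelling script are not given in the thesis, so the following is only a sketch of how such a rule-based labelling could look; the threshold values (dp_slope_limit, flow_drop_limit) are made-up placeholders.

import numpy as np

def label_clogging(dp, flow, dp_slope_limit=0.02, flow_drop_limit=0.10):
    # Label 1: differential pressure stays roughly linear / below its start value.
    # Label 2: differential pressure rises steadily, flow constant or slightly receding.
    # Label 3: differential pressure rises sharply while flow drops drastically.
    # The thresholds are illustrative placeholders, not the values used in the thesis.
    labels = np.ones(len(dp), dtype=int)
    dp_slope = np.gradient(dp)
    flow_drop = (flow[0] - flow) / flow[0]
    labels[(dp_slope > dp_slope_limit) & (flow_drop < flow_drop_limit)] = 2
    labels[(dp_slope > 5 * dp_slope_limit) & (flow_drop >= flow_drop_limit)] = 3
    return labels

# Example on synthetic data sampled every 5 seconds.
dp = np.concatenate([np.full(50, 0.2), 0.2 + 0.05 * np.arange(50)])
flow = np.full(100, 300.0)
print(np.bincount(label_clogging(dp, flow)))  # counts per label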


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing


the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples    Points labelled clog-1    Points labelled clog-2
I       685        685                       0
II      220        25                        195
III     340        35                        305
IV      210        11                        199
V       375        32                        343
VI      355        7                         348
VII     360        78                        282
VIII    345        19                        326
IX      350        10                        340
X       335        67                        268
XI      340        43                        297

Total   3915       1012                      2903

When preprocessing was finished the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the


variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1 0 0], [0 1 0], [0 0 1]]

or

[red, blue, green] → [[1 0 0], [0 1 0], [0 0 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))        (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
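A minimal scikit-learn sketch of the two transforms, assuming the clogging labels and sensor readings are available as NumPy arrays; the example values and column order are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical clogging labels and sensor readings
# (differential pressure, system flow, system pressure, backflush flow).
labels = np.array([[1], [2], [2], [1]])
features = np.array([[0.20, 310.0, 1.9, 0.0],
                     [0.65, 295.0, 2.1, 0.3],
                     [0.80, 280.0, 2.2, 0.4],
                     [0.25, 305.0, 1.9, 0.0]])

encoder = OneHotEncoder()                    # label transform
scaler = MinMaxScaler(feature_range=(0, 1))  # scaler transform, Equation 3.1

onehot_labels = encoder.fit_transform(labels).toarray()  # 1 -> [1, 0], 2 -> [0, 1]
scaled_features = scaler.fit_transform(features)         # every column mapped to [0, 1]

# The min-max transform is easy to invert after processing.
restored = scaler.inverse_transform(scaled_features)
print(onehot_labels, scaled_features, restored, sep="\n")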

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in line with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step. A minimal sketch of such a sequencing function is given after the equations.

X(t) = [V_1(t), V_2(t), ..., V_(n−1)(t), V_n(t)]        (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_(n−1)(t), V_n(t)]        (3.3)
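A minimal sketch of such a sequencing function, assuming the pre-processed data are held in a 2-D NumPy array with one row per 5-second sample and one column per variable; this mirrors Equations 3.2 and 3.3 but is not the thesis's own implementation.

import numpy as np

def make_sequences(data, n_past=5):
    # Build inputs from the n_past previous samples and targets one step ahead.
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t].flatten())  # the past 25 seconds of all variables
        y.append(data[t])                       # the values 5 seconds ahead
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)          # 100 samples, 5 variables (assumed)
X, y = make_sequences(data, n_past=5)
print(X.shape, y.shape)                # (95, 25) (95, 5)

# For the LSTM the inputs are reshaped to (samples, time steps, features).
X_lstm = X.reshape(len(X), 5, data.shape[1])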


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the amount of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data. A Keras-style sketch of this set-up is given below.
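A hedged Keras sketch of the regression network described above (two 32-neuron LSTM layers and a single sigmoid output neuron, trained with MAE or MSE loss and early stopping after 150 epochs without improvement). The framework, the optimiser and the number of input features are assumptions, not details taken from the thesis.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

n_steps, n_features = 5, 5   # 5 past time steps; the feature count is assumed

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE variant

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])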

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20% respectively of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data describing the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. A Keras-style sketch of this architecture is given below.
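A hedged Keras sketch of the CNN described above (64 filters with kernel size 4, max pooling with pool size 2, a flattening layer, a 50-node dense layer and a 6-output layer for the 30-second horizon). The framework, the optimiser and the number of input features are assumptions, not details taken from the thesis.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

n_steps_in, n_steps_out, n_features = 12, 6, 5   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_steps_out),                      # one output per future time step
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE variant

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])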

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20% respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating classification than they are for a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of. A sketch of how this and the other classification metrics reported in Chapter 4 can be computed is given after Figure 3.7.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)
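A minimal scikit-learn sketch of how the classification metrics reported in Chapter 4 (accuracy, ROC AUC, F1 and log loss) can be computed, assuming true labels and predicted probabilities are available; the example values are illustrative.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

# Hypothetical validation labels (1 = no clogging, 2 = beginning to clog)
# and predicted probabilities for label 2.
y_true = np.array([1, 2, 2, 1, 2, 2, 1, 2])
p_label2 = np.array([0.1, 0.9, 0.8, 0.3, 0.7, 0.95, 0.2, 0.6])
y_pred = np.where(p_label2 > 0.5, 2, 1)

print("accuracy", accuracy_score(y_true, y_pred))
print("ROC AUC ", roc_auc_score(y_true == 2, p_label2))
print("F1      ", f1_score(y_true, y_pred, pos_label=2))
print("log loss", log_loss(y_true == 2, p_label2))
print(confusion_matrix(y_true, y_pred))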

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                  Chapter 4

                                  Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                        Prediction
                        Label 1   Label 2
Actual   Label 1        109       1
         Label 2        3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on the datasets from both the MAE and the MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                        Prediction
                        Label 1   Label 2
Actual   Label 1        82        29
         Label 2        38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                        Prediction
                        Label 1   Label 2
Actual   Label 1        69        41
         Label 2        11        659


                                  Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network, while the MAE loss appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each step is one time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with overall better values on the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distribution.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                  Chapter 6

                                  Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data covering all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN with data containing all clogging labels.

On the other hand, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                                  Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

                                  48

                                  BIBLIOGRAPHY

                                  [22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

                                  [23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

                                  [24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

                                  [25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

                                  [26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

                                  [27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

                                  [28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

                                  [29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

                                  [30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

                                  [31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

                                  [32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

                                  [33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

                                  49

                                  BIBLIOGRAPHY

                                  models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                  [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                  [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                  [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                  [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                  [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                  [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                  [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                  [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                  [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                  [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                  50

                                  BIBLIOGRAPHY

                                  [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                  [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                  [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                  [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                  [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                  [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                  [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                  51

TRITA-ITM-EX 2019:606

www.kth.se



Higher precision but lower recall means a very accurate prediction, but the classifier would miss instances that are hard to classify. The F1 score attempts to balance precision and recall, and a higher F1 score means that the model is performing well. The F1 score itself is limited to the range 0 to 1.

                                    Logarithmic Loss (Log Loss)

For multi-class classification, Log Loss is especially useful as it penalises false classification. A lower Log Loss value means a higher classification accuracy for the multi-class dataset. The Log Loss is determined through a binary indicator y of whether the class label c is the correct classification for an observation o, and the probability p, which is the model's predicted probability that observation o belongs to class c [23]. The log loss can be calculated through

\mathrm{LogLoss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2.11)
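As an illustration of Equation 2.11, the following is a minimal NumPy sketch of multi-class log loss; it is not the implementation used in the thesis, and averaging over observations is an assumed convention.

import numpy as np

def log_loss(y_true_onehot, y_pred_proba, eps=1e-15):
    # Clip predicted probabilities to avoid log(0)
    p = np.clip(y_pred_proba, eps, 1 - eps)
    # Sum -y_{o,c} * log(p_{o,c}) over the M classes, then average over observations
    return np.mean(-np.sum(y_true_onehot * np.log(p), axis=1))

# Two observations, three classes; confident and correct predictions give a low loss
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
print(log_loss(y_true, y_pred))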

2.2.2 Regression Error Metrics

Regression error metrics evaluate how well the predicted value matches the actual value, based on the idea that there is a relationship or pattern between a set of inputs and an outcome.

                                    Mean Absolute Error (MAE)

Comparing the average difference between the actual values and the predicted values gives the mean absolute error. The MAE gives a score of how far away all predictions are from the actual values [24]. While not giving any insight into whether the data are being over-predicted or under-predicted, MAE is still a good tool for overall model estimation. Mathematically, the MAE is expressed as

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2.12)

                                    Mean Squared Error (MSE)

The MSE is similar to the MAE, but rather than taking the average of the difference between the predicted and the actual results, it takes the square of the difference instead. Using the squared values, larger errors become more prominent in comparison to smaller errors, resulting in a model that can better focus on the prediction error of large errors. However, if a certain prediction turns out to be very bad, the overall model error will be skewed towards being worse than what may actually be true [25]. The MSE is calculated through


\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2.13)

                                    Root Mean Squared Error (RMSE)

RMSE is simply the square root of the MSE. The introduction of the square root scales the error to the same scale as the targets:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient-based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged:

\frac{\partial \mathrm{RMSE}}{\partial \hat{y}_i} = \frac{1}{\sqrt{\mathrm{MSE}}} \frac{\partial \mathrm{MSE}}{\partial \hat{y}_i}    (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can capture all the model errors, RMSE is still valid and well protected against outliers given a large enough number of samples n.

                                    Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with the squared errors, while MSPE considers the relative error [27]:

\mathrm{MSPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2    (2.16)

                                    Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale-free and is obtained through

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (2.17)


                                    Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is than a constant baseline. r2 is scale-free, in contrast to MSE and RMSE, and bounded between minus infinity and 1, so it does not matter whether the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r2 has some drawbacks. It does not take into account whether the fitted coefficient estimates and predictions are biased, and when additional predictors are added to the model, the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by the adjusted r2.

                                    Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalise the model accuracy by lowering the score [29].
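To make the error metrics above concrete, the following NumPy sketch implements MAE, MSE, RMSE, MAPE and adjusted r2 (using the 1 - SSres/SStot form of r2, which may differ slightly from the squared-correlation form in Equation 2.18); it is illustrative only, not the evaluation code used in the thesis.

import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                      # Equation 2.12

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                       # Equation 2.13

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))                          # Equation 2.14

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))        # Equation 2.17

def r2_adjusted(y, y_hat, k):
    # k predictors, n observations; r2 computed as 1 - SS_res / SS_tot
    n = len(y)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)        # Equation 2.19

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(mae(y, y_hat), rmse(y, y_hat), mape(y, y_hat), r2_adjusted(y, y_hat, k=1))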

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time, to generate distributions of potential outcomes.

                                    Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error and a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                    Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance, and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
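For illustration, a SARIMA model of the kind discussed here can be fitted with the statsmodels library; this is a hedged sketch on synthetic data, and the model orders are placeholders rather than values from the thesis.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic positive series standing in for e.g. a differential pressure signal
raw_series = np.exp(np.cumsum(np.random.normal(0, 0.01, 200)) + 1.0)

series = np.log(raw_series)          # log transform to make the variance stationary
model = SARIMAX(series,
                order=(1, 1, 1),     # (p, d, q): the differencing term removes the trend
                seasonal_order=(0, 0, 0, 0))  # plain ARIMA; use e.g. (1, 1, 1, s) for SARIMA
fit = model.fit(disp=False)
forecast = fit.forecast(steps=6)     # next 6 points, still on the log scale
print(np.exp(forecast))              # invert the log transform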

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification, in both supervised and unsupervised learning. The NN passes inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer depend on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight for every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
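A minimal sketch of the perceptron rule in Equation 2.20, with made-up weights and bias purely for illustration:

import numpy as np

def perceptron(x, w, b):
    # Weighted sum plus bias, followed by the binary step activation
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1, 0, 1])          # binary inputs
w = np.array([0.5, -0.2, 0.8])   # weights: importance of each input
b = -1.0                         # bias: how easily the perceptron outputs a 1
print(perceptron(x, w, b))       # 0.5 + 0.8 - 1.0 > 0, so the output is 1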

2.3.3 Activation Functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                                    Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

                                    Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are negative or that approach zero, which is known as the dying ReLU problem [34].

                                    Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
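The three activation functions above can be sketched in NumPy as follows (illustrative only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # Equation 2.21

def relu(x):
    return np.maximum(0.0, x)           # Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)        # Equation 2.24; beta trainable or constant

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), swish(z))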

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                                    Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

                                    Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well-matched architectures and existing training procedures of the deep networks.

Mentioned in Section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented above, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{bmatrix}

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes extremely small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                                    Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (it), output (ot) and forget (ft). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function σ, weights for each respective gate's neurons ω, the LSTM block output at the previous time step h(t−1), the input at the current time step x(t), and the respective gate bias b, as

\begin{aligned} i_t &= \sigma(\omega_i [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(\omega_o [h_{t-1}, x_t] + b_o) \\ f_t &= \sigma(\omega_f [h_{t-1}, x_t] + b_f) \end{aligned}    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need for forgetting information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
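As a simplified illustration of Equation 2.26, the three gates can be computed as below; the hidden size and the random weights are assumptions made purely for the sketch, and this is not how the thesis networks were implemented.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, features = 4, 3                     # assumed sizes for the sketch
rng = np.random.default_rng(0)
w_i, w_o, w_f = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
b_i = b_o = b_f = np.zeros(hidden)

h_prev = np.zeros(hidden)                   # previous block output h_{t-1}
x_t = rng.normal(size=features)             # current input x_t
hx = np.concatenate([h_prev, x_t])          # concatenation [h_{t-1}, x_t]

i_t = sigmoid(w_i @ hx + b_i)   # input gate: how much new information is stored
o_t = sigmoid(w_o @ hx + b_o)   # output gate: how much of the state is exposed
f_t = sigmoid(w_f @ hx + b_f)   # forget gate: how much of the old state is dismissed
print(i_t, o_t, f_t)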

                                    Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                    Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
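A small NumPy sketch of 1-dimensional max and average pooling with pool size 2, in the spirit of Figure 2.5 (illustrative, not the thesis code):

import numpy as np

def pool1d(x, size=2, mode="max"):
    x = x[:len(x) - len(x) % size].reshape(-1, size)   # split the signal into windows
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)

signal = np.array([1.0, 3.0, 2.0, 8.0, 0.0, 4.0])
print(pool1d(signal, mode="max"))    # [3. 8. 4.]
print(pool1d(signal, mode="mean"))   # [2. 5. 2.]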

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data, and is the primary reason that CNNs are used for purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                    Chapter 3

                                    Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of two weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. A data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush, in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it has yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the dataset can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples   Points labelled clog-1   Points labelled clog-2
I          685            685                        0
II         220             25                      195
III        340             35                      305
IV         210             11                      199
V          375             32                      343
VI         355              7                      348
VII        360             78                      282
VIII       345             19                      326
IX         350             10                      340
X          335             67                      268
XI         340             43                      297

Total     3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
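A minimal NumPy sketch of one hot encoding of the clogging labels (the actual encoder implementation used in the thesis is not specified here, so this is illustrative only):

import numpy as np

labels = np.array([1, 2, 1, 2])                   # clogging labels
classes = np.unique(labels)                       # [1, 2]
onehot = (labels[:, None] == classes).astype(int)
print(onehot)                                     # 1 -> [1, 0], 2 -> [0, 1]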

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
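An illustrative NumPy sketch of the min-max transform in Equation 3.1 and its inverse:

import numpy as np

def minmax_fit_transform(x):
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo), (lo, hi)

def minmax_inverse(x_scaled, params):
    lo, hi = params
    return x_scaled * (hi - lo) + lo

data = np.array([[0.2, 100.0], [0.5, 140.0], [0.9, 180.0]])   # e.g. pressure and flow
scaled, params = minmax_fit_transform(data)
print(scaled)                          # every feature now lies in [0, 1]
print(minmax_inverse(scaled, params))  # recovers the original values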

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The effect of the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t),\ V_2(t),\ \dots,\ V_{n-1}(t),\ V_n(t) \right]    (3.2)

X(t) = \left[ V_1(t-5),\ V_2(t-5),\ \dots,\ V_{n-1}(t),\ V_n(t) \right]    (3.3)
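A hedged sketch of what such a sequencing function can look like, using 5 past time steps (25 seconds) to predict the next step; the function name and details are assumptions made for illustration, not the thesis code.

import numpy as np

def to_sequences(data, n_past=5):
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])    # window of past observations
        y.append(data[t])               # value one time step (5 s) ahead
    return np.array(X), np.array(y)

# data: (samples, features), e.g. differential pressure, system flow, system pressure, backflush flow
data = np.random.rand(100, 4)
X, y = to_sequences(data)
print(X.shape, y.shape)   # (95, 5, 4) and (95, 4): (samples, time steps, features)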


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data in order to adjust the weights towards a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the network training in such a way ensures that the network is not overfitted to the training data.
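A hedged Keras-style sketch of the regression LSTM described above (two LSTM layers of 32 neurons with ReLU, a sigmoid output neuron, and early stopping with a patience of 150 epochs); the framework, the Adam optimiser and the placeholder data are assumptions, not confirmed details of the thesis implementation.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features = 4   # differential pressure, system flow, system pressure, backflush flow
# Placeholder data with the (samples, time steps, features) shape described above
X_train, y_train = np.random.rand(80, 5, n_features), np.random.rand(80)
X_val, y_val = np.random.rand(20, 5, n_features), np.random.rand(20)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(5, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # MAE or MSE, cf. Section 3.3

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)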

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
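A hedged Keras-style sketch of the CNN described above (64 filters with kernel size 4, max pooling of size 2, flattening, a 50-node dense layer and a 6-node output); the framework, the ReLU activations, the Adam optimiser and the placeholder data are assumptions made for illustration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_features, n_past, n_future = 4, 12, 6            # 12 past observations (60 s), 6 predicted (30 s)
X_train = np.random.rand(80, n_past, n_features)   # placeholder training data
y_train = np.random.rand(80, n_future)

model = Sequential([
    Conv1D(64, kernel_size=4, activation="relu", input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_future),                               # 6 future observations
])
model.compile(optimizer="adam", loss="mse")

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_split=0.2,
          epochs=1500, callbacks=[early_stop], verbose=0)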

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are more heavily penalised as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                    Chapter 4

                                    Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


                                    Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly; on this validation set, always predicting Label 2 would already yield roughly 86% accuracy. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                    Chapter 6

                                    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs are better known for older statistical models than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                    Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (2.13)

Root Mean Squared Error (RMSE)

RMSE is simply the root of the MSE. The introduction of the square root scales the error to the same scale as the targets.

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \quad (2.14)

The major difference between MSE and RMSE is the flow of the gradients. Travelling along the gradient of the MSE is equal to travelling along the gradient of the RMSE times a flow variable that depends on the MSE score. This means that when using gradient based methods (further discussed in Section 2.3.4) the two metrics cannot simply be interchanged.

\frac{\partial RMSE}{\partial \hat{y}_i} = \frac{1}{2\sqrt{MSE}} \cdot \frac{\partial MSE}{\partial \hat{y}_i} \quad (2.15)

Just like MSE, RMSE has a hard time dealing with outliers and has for that reason been considered a bad metric [18]. However, Chai et al. [26] argue that while no single metric can present all of the model errors, RMSE is still valid and well protected against outliers given enough samples n.

Mean Square Percentage Error (MSPE)

The mean square percentage error can be thought of as a weighted version of the MSE. Every sample weight is inversely proportional to its respective target square. The difference between MSE and MSPE is that MSE works with squared errors while MSPE considers the relative error [27].

MSPE = \frac{100\%}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2 \quad (2.16)

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error is one of the most commonly and widely used measures of forecast and prediction accuracy [28]. The measurement is an average of the absolute percentage errors between the actual values and the predicted values. Like r2, MAPE is scale free and is obtained through

MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \quad (2.17)


                                      Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bounded between −∞ and 1, so it does not matter if the output values are large or small, the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left(\frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}\right)^2 \quad (2.18)

r2 has some drawbacks. It does not take into account if the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

                                      Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 adjusts for the number of terms or predictors in the model. If more variables are added that prove to be useless, the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables, the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[\frac{(1 - r^2)(n - 1)}{n - k - 1}\right] \quad (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affects the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
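As an illustration of how the regression metrics above relate to each other, the following sketch computes Equations 2.13–2.19 with NumPy; the sample values and the single-predictor assumption (k = 1) are made up for the example:

import numpy as np

def regression_metrics(y, y_hat, k):
    # y: actual values, y_hat: predicted values, k: number of predictor variables
    n = len(y)
    mse = np.mean((y - y_hat) ** 2)                      # Eq. 2.13
    rmse = np.sqrt(mse)                                  # Eq. 2.14
    mspe = 100.0 / n * np.sum(((y - y_hat) / y) ** 2)    # Eq. 2.16
    mape = 100.0 / n * np.sum(np.abs((y - y_hat) / y))   # Eq. 2.17
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2                # squared-correlation form of Eq. 2.18
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # Eq. 2.19
    return mse, rmse, mspe, mape, r2, r2_adj

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])
print(regression_metrics(y_true, y_pred, k=1))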

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

                                      Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                      Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary on variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary on both mean and variance and allows for the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. ARIMA's and SARIMA's strength is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
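A sketch of how such a model could be fitted with the statsmodels library; the series and the (p, d, q)(P, D, Q, s) orders are illustrative assumptions and would have to be identified from the actual data, e.g. through ACF/PACF analysis:

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical univariate series, e.g. a log-transformed differential pressure signal.
series = np.log(1.0 + np.abs(np.random.randn(200)))

# Example orders only: non-seasonal (1, 1, 1) and seasonal (0, 1, 1) with period 12.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=30)   # predict the next 30 time steps
print(forecast)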

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \quad (2.20)

In the above equation, x is the input vector, w the weight vector and b is the perceptron's individual bias.
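As a small illustration of Equation 2.20, the rule can be evaluated directly; the weights, bias and input below are made-up values:

import numpy as np

def perceptron(x, w, b):
    # Perceptron rule from Equation 2.20: output 1 if w . x + b > 0, otherwise 0.
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.7, -0.4, 0.2])   # illustrative weight vector
b = -0.1                         # illustrative bias
x = np.array([1, 0, 1])          # binary input vector
print(perceptron(x, w, b))       # prints 1, since 0.7 + 0.2 - 0.1 = 0.8 > 0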

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest of activation functions and is only able to produce strictly binary results. The activation function is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

                                      Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \quad (2.21)

for

z = \sum_{j} w_j \cdot x_j + b \quad (2.22)


Only by using the sigmoid function as the activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

                                      Rectified Function

                                      The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x^{+} = \max(0, x) \quad (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, which is also known as the dying ReLU problem [34].

                                      Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x) \quad (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve classification accuracy on widely used datasets such as ImageNet, by 0.9% and 0.6% for Mobile NASNet-A and Inception-ResNet-v2 respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
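The activation functions discussed above can be summarised in a few lines; this is a plain NumPy sketch of Equations 2.20–2.24, not code taken from the thesis:

import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)          # binary step used by the perceptron (Eq. 2.20)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid activation (Eq. 2.21)

def relu(x):
    return np.maximum(0, x)               # rectified linear unit (Eq. 2.23)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)          # Swish; beta trainable or a constant (Eq. 2.24)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), relu(z), swish(z), sep="\n")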

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for and their individual differences.

                                      Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

                                      Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)} \quad (2.25)

where each function represents a layer and together they describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be because of well matched architectures and existing training procedures of the deep networks.

Mentioned in Section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in Section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                                      Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), LSTM-block output at the previous time step (h_{t-1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f) \quad (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
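A toy evaluation of the three gates in Equation 2.26, where the weight matrices act on the concatenation [h_{t-1}, x_t]; the dimensions and random values below are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ concat + b_i)        # input gate
    o_t = sigmoid(w_o @ concat + b_o)        # output gate
    f_t = sigmoid(w_f @ concat + b_f)        # forget gate
    return i_t, o_t, f_t

rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=2), rng.normal(size=3)        # hidden size 2, input size 3
w_i, w_o, w_f = (rng.normal(size=(2, 5)) for _ in range(3))
b = np.zeros(2)
print(lstm_gates(h_prev, x_t, w_i, w_o, w_f, b, b, b))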

                                      Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                      Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map
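As an aside that is not part of the thesis itself, the convolution, max pooling, and flattening steps sketched in Figures 2.4 to 2.6 can be illustrated for a one-dimensional signal as follows; the numbers are arbitrary:

import numpy as np

signal = np.array([1., 3., 2., 5., 4., 6., 1., 2.])   # toy 1-D input
kernel = np.array([0.25, 0.5, 0.25])                  # size-3 filter (cf. Figure 2.4)

# Convolution: slide the kernel over the signal (valid positions only)
conv = np.array([signal[i:i + kernel.size] @ kernel
                 for i in range(signal.size - kernel.size + 1)])

# Max pooling with pool size 2 (cf. Figure 2.5): keep the largest value per window
pooled = conv[:conv.size // 2 * 2].reshape(-1, 2).max(axis=1)

# Flattening (trivial in 1-D, shown for completeness, cf. Figure 2.6)
flat = pooled.ravel()
print(conv, pooled, flat)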


                                      Chapter 3

                                      Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
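The exact labelling script is not reproduced here; a hypothetical sketch of such a rule, using an assumed column name diff_pressure and invented slope thresholds, could look as follows:

import pandas as pd

def label_clogging(df, window=12, slope_lo=0.001, slope_hi=0.01):
    """Assign clogging labels 1-3 from the rolling trend of the differential
    pressure; thresholds and column names are hypothetical, and the thesis
    labels were additionally verified by visual inspection."""
    dp_slope = df["diff_pressure"].diff().rolling(window).mean()
    labels = pd.Series(1, index=df.index)    # 1: linear / below start value
    labels[dp_slope > slope_lo] = 2          # 2: steady increase
    labels[dp_slope > slope_hi] = 3          # 3: rapid (exponential) increase
    return labels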


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297

Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, for example

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    or    [red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
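For illustration, such an encoding can be produced with scikit-learn's OneHotEncoder; the label values below are hypothetical and the thesis does not state which library was used:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [1], [2]])            # hypothetical clogging labels
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()   # e.g. 1 -> [1, 0], 2 -> [0, 1]
print(onehot)
print(encoder.inverse_transform(onehot))           # recover the original labels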

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))        (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
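A sketch of this transform using scikit-learn's MinMaxScaler, with made-up sensor values, is shown below; the inverse transform recovers the original scale:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.2, 210.0], [0.8, 195.0], [1.5, 180.0]])  # hypothetical sensor values

scaler = MinMaxScaler()                        # maps every feature to [0, 1], Equation 3.1
X_scaled = scaler.fit_transform(X)
X_back = scaler.inverse_transform(X_scaled)    # revert to the original values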

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded according to the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]        (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]        (3.3)


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data in order to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way helps ensure that the network is not overfitted to the training data.
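A sketch of how such a network could be defined is shown below, assuming Keras/TensorFlow and the Adam optimiser (neither framework nor optimiser is stated explicitly here); layer sizes, activations, and the early-stopping patience follow the description above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

def build_lstm(timesteps, n_features):
    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True,
             input_shape=(timesteps, n_features)),    # first LSTM layer, 32 neurons
        LSTM(32, activation="relu"),                   # second LSTM layer, 32 neurons
        Dense(1, activation="sigmoid"),                # single-neuron output layer
    ])
    model.compile(optimizer="adam", loss="mae")        # or loss="mse"
    return model

# Early stop after 150 epochs without validation-loss improvement, max 1500 epochs
early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])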

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just as for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations, and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the original data. As for the LSTM, the training set and the validation set contain input data as well as the corresponding correct output data.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
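Under the same assumptions as for the LSTM sketch (Keras/TensorFlow, Adam optimiser, and a hypothetical number of input features), the described CNN could be defined as:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

def build_cnn(timesteps=12, n_features=5, n_out=6):
    model = Sequential([
        Conv1D(filters=64, kernel_size=4, activation="relu",
               input_shape=(timesteps, n_features)),   # 64 filters, kernel of 4 time steps
        MaxPooling1D(pool_size=2),                      # pool size 2 reduces the feature map
        Flatten(),                                      # flattening layer
        Dense(50, activation="relu"),                   # fully connected layer, 50 nodes
        Dense(n_out),                                   # 6 future predictions
    ])
    model.compile(optimizer="adam", loss="mae")         # or "mse"
    return model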

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the network as a classifier than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)
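For reference, the error metrics used above and reported in Chapter 4 can be computed with scikit-learn; the arrays below are made-up numbers that only illustrate the calls, and the single outlier in the regression example also shows why MSE penalises outliers harder than MAE:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, f1_score, roc_auc_score, log_loss)

# Hypothetical regression results with one outlier in the last prediction
y_true = np.array([0.20, 0.22, 0.25, 0.30])
y_pred = np.array([0.21, 0.23, 0.24, 0.80])
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)     # grows more slowly than MSE with the outlier
r2 = r2_score(y_true, y_pred)

# Hypothetical classification results for the clogging labels
labels_true = np.array([1, 1, 2, 2, 2])
labels_pred = np.array([1, 2, 2, 2, 2])
prob_label2 = np.array([0.1, 0.6, 0.8, 0.9, 0.7])     # predicted probability of label 2
acc = accuracy_score(labels_true, labels_pred)
f1 = f1_score(labels_true, labels_pred, pos_label=2)
auc = roc_auc_score(labels_true == 2, prob_label2)
bce = log_loss(labels_true == 2, prob_label2)          # binary cross-entropy / log loss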

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT), and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                      Chapter 4

                                      Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                     Label 1    Label 2
Actual    Label 1    109        1
          Label 2    3          669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function    # of epochs    MSE      RMSE     R2       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets, from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                     Label 1    Label 2
Actual    Label 1    82         29
          Label 2    38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                     Label 1    Label 2
Actual    Label 1    69         41
          Label 2    11         659


                                      Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843, respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10, and 4.11 show the values predicted by the CNNs to follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target, then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3%, respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                      Chapter 6

                                      Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and the CNN by having data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that their ins and outs are better known for older statistical models than they are for ML models.

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                      Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA: TRITA-ITM-EX 2019:606

www.kth.se


                                        CHAPTER 2 FRAME OF REFERENCE

                                        Coefficient of Determination r2

To determine the proportion of variability in the data captured by the evaluated model, the coefficient of determination can be used [18]. This allows for comparing how much better the model is in comparison to a constant baseline. r2 is scale-free in comparison to MSE and RMSE and bound between -∞ and 1, so it does not matter if the output values are large or small; the value will always be within that range. A low r2 score means that the model is bad at fitting the data.

r^2 = \left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right)^2    (2.18)

r2 has some drawbacks. It does not take into account if the fit coefficient estimates and predictions are biased, and when additional predictors are added to the model the r2-score will always increase, simply because the new fit will have more terms. These issues are handled by adjusted r2.

                                        Adjusted r2

Adjusted r2, just like r2, indicates how well terms fit a curve or a line. The difference is that adjusted r2 will adjust for the number of terms or predictors in the model. If more variables are added that prove to be useless the score will decrease, while the score will increase if useful variables are added. This leads adjusted r2 to always be less than or equal to r2. For n observations and k variables the adjusted r2 is calculated through

r^2_{adj} = 1 - \left[ \frac{(1 - r^2)(n - 1)}{n - k - 1} \right]    (2.19)

Adjusted r2 can therefore accurately show the percentage of variation of the independent variables that affect the dependent variables. Furthermore, adding additional independent variables that do not fit the model will penalize the model accuracy by lowering the score [29].
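As an illustration of the two scores (the thesis does not prescribe an implementation), the sketch below uses the common 1 - SS_res/SS_tot definition, which is what scikit-learn's r2_score implements and which matches the bound between -∞ and 1 described above; for an ordinary least-squares fit with an intercept it coincides with Equation 2.18.

import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination as 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(y_true, y_pred, k):
    # Equation 2.19 with n observations and k predictors
    n = len(y_true)
    return 1.0 - (1.0 - r2_score(y_true, y_pred)) * (n - 1) / (n - k - 1)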

2.2.3 Stochastic Time Series Models

Stochastic models are tools for estimating probability distributions. The models allow for random variation in one or more input variables at a time to generate distributions of potential outcomes.

Autoregressive Moving Average (ARMA)

The ARMA model is a combination of an autoregressive model (AR) and a moving average model (MA). An AR model assumes that a future value of a variable can be predicted as a linear combination of past values of that variable, plus a random error with a constant term. The MA model is like the AR model in that its output value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                        Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. The strength of ARIMA and SARIMA is particularly identified in their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.
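As a brief illustration of how such a model could be fitted in practice (the thesis does not use ARIMA/SARIMA), a SARIMA model from statsmodels can be applied to a univariate series; the series and the order parameters below are placeholders, not values from the filter tests.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder series standing in for e.g. the differential pressure signal
series = pd.Series(np.cumsum(np.random.randn(500)))

model = SARIMAX(series,
                order=(1, 1, 1),              # (p, d, q): one differencing step removes the trend
                seasonal_order=(0, 0, 0, 0))  # no seasonal component assumed here
fit = model.fit(disp=False)

print(fit.forecast(steps=6))  # forecast the next 6 data points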

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields such as computer vision, predictive analytics, medical diagnosis and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation x is the input vector, w the weight vector and b the perceptron's individual bias.
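A minimal sketch of the perceptron rule in Equation 2.20 is given below; the weights and bias are chosen arbitrarily to realise a logical AND of two binary inputs.

import numpy as np

def perceptron(x, w, b):
    # Step-function output from Equation 2.20
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = np.array([1.0, 1.0]), -1.5
print([perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]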

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It can be used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, so the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something the step function is also unable to fulfil.

                                        Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)

Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                                        Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because the unit outputs zero for any input that is negative or approaches zero, such a unit can stop updating its weights entirely, which is known as the dying ReLU problem [34].

                                        Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \text{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9 % and 0.6 % respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
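The three activation functions above can be summarised in a few lines of Python; this is purely illustrative and not code used in the experimental work.

import numpy as np

def sigmoid(z):                 # Equation 2.21
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):                    # Equation 2.23
    return np.maximum(0.0, x)

def swish(x, beta=1.0):         # Equation 2.24, beta trainable or a constant
    return x * sigmoid(beta * x)

z = np.linspace(-4.0, 4.0, 9)
print(sigmoid(z), relu(z), swish(z), sep="\n")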

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D arrays as input data.

                                        Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. Explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions,

f(x) = f^{(n)}(\ldots f^{(2)}(f^{(1)}(x)))    (2.25)

where each function represents a layer and they all together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4 and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. This allows the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0, 0, 1, 1, 0, 0, 0]
x_2 = [0, 0, 0, 1, 1, 0, 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that the neuron is prevented from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                                        Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t-1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could be perceived as odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
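A sketch of how the three gates in Equation 2.26 act on the concatenation of the previous hidden state and the current input is shown below; it is illustrative only and omits the cell-state update and candidate values that a complete LSTM cell also computes, and the dimensions and random weights are made up for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
    # Equation 2.26: each gate sees [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(w_i @ z + b_i)   # input gate
    o_t = sigmoid(w_o @ z + b_o)   # output gate
    f_t = sigmoid(w_f @ z + b_f)   # forget gate
    return i_t, o_t, f_t

hidden, features = 4, 3
rng = np.random.default_rng(0)
w = [rng.standard_normal((hidden, hidden + features)) for _ in range(3)]
b = [np.zeros(hidden) for _ in range(3)]
print(lstm_gates(np.zeros(hidden), rng.standard_normal(features), *w, *b))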

                                        Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                        Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, as well as removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
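To make the two operations concrete, the short sketch below (illustrative only, not code from the thesis) applies a size-3 kernel and max pooling with pool size 2 to a small 1-dimensional signal, mirroring Figures 2.4 and 2.5; the signal and kernel values are arbitrary.

import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0])   # 1-D input signal with 6 samples
kernel = np.array([0.5, 1.0, 0.5])              # kernel of size 3 (cf. Figure 2.4)

# The kernel slides across the input, producing a convolved feature of length 4
convolved = np.array([np.dot(x[i:i + 3], kernel) for i in range(len(x) - 2)])

# Max pooling with pool size 2 (cf. Figure 2.5) keeps the largest value per window
pooled = convolved.reshape(-1, 2).max(axis=1)

print(convolved)   # [4.5 6.  8.  7. ]
print(pooled)      # [6. 8.]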

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even more. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                        Chapter 3

                                        Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297

Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

1            1 0 0
2      →     0 1 0
3            0 0 1

or

red          1 0 0
blue   →     0 1 0
green        0 0 1

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to equally predict all the actual classification labels rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
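For illustration, the same encoding can be obtained with scikit-learn's OneHotEncoder; the label values below are placeholders for the clogging labels.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [1], [2], [2]])    # clogging labels as a column vector
encoder = OneHotEncoder(sparse_output=False)    # use sparse=False on scikit-learn < 1.2
print(encoder.fit_transform(labels))            # one binary column per label value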

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to inverse, which makes it possible to revert back to the original values after processing.
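A sketch of the scaling step with scikit-learn's MinMaxScaler is given below; the feature values are made up for the example.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.20, 210.0],
              [0.45, 195.0],
              [0.90, 160.0]])          # hypothetical rows of [differential pressure, system flow]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)               # applies Equation 3.1 to every feature
X_restored = scaler.inverse_transform(X_scaled)  # easy to revert to the original values
print(X_scaled)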

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset is decreased proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \ldots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \ldots, V_{n-1}(t), V_n(t)]    (3.3)


When sequenced, the data is split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM (a sketch of the sequencing and reshaping is given after the list below). The reshaping ensures that the three dimensions of the data are defined by:

• Samples - The amount of data points

• Time steps - The points of observation of the samples

• Features - The observed variables per time step
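The sketch below illustrates the sequencing and reshaping; it is illustrative only, and the random array simply stands in for the scaled dataset.

import numpy as np

def make_sequences(data, n_past=5):
    # Build windows of the 5 previous time steps (25 s) for every prediction target
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t, :])   # previous measurements of all features
        y.append(data[t, :])              # values to predict at time t
    return np.array(X), np.array(y)

data = np.random.rand(3915, 5)            # placeholder for the scaled dataset
X, y = make_sequences(data)
print(X.shape, y.shape)                   # (3910, 5, 5) -> (samples, time steps, features)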

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
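A Keras sketch of the two-layer LSTM regressor and the early stopping criterion described above is given below; the optimizer, the dummy data and any hyperparameters not stated in the text are assumptions for the example, not settings taken from the thesis work.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy data standing in for the sequenced, scaled dataset: 5 past time steps, 5 features
X = np.random.rand(1000, 5, 5).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(5, 5)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),    # output layer predicting one parameter
])
model.compile(optimizer="adam", loss="mae")   # MAE loss; MSE was evaluated as well

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=1500, callbacks=[early_stop], verbose=0)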

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80 % and 20 % respectively of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with nodes to equally match the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
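A corresponding Keras sketch of the CNN described above is shown below; the activation functions, optimizer and dummy data are assumptions made for the example.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy data standing in for the sequence-split dataset: 12 past steps (60 s), 5 features,
# and 6 future values (30 s) of one parameter as the prediction target
X = np.random.rand(1000, 12, 5).astype("float32")
y = np.random.rand(1000, 6).astype("float32")

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(12, 5)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(6),                          # one output node per predicted future observation
])
model.compile(optimizer="adam", loss="mae")

early_stop = EarlyStopping(monitor="val_loss", patience=150, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=1500, callbacks=[early_stop], verbose=0)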

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80 % of the data and a validation set consisting of 20 % of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80 % and 20 % respectively. The testing set was split into the same fractions, but only the fraction of 20 % was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification network than they are for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors would be more penalizing as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE would allow outliers to play a smaller role and produce a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function for minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).
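For reference, the three loss functions discussed above can be written out as below (illustrative only); in Keras they correspond to the losses "mse", "mae" and "binary_crossentropy".

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # y_true in {0, 1}, p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))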

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud to a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values to a precision of whole integers, and the backflush flow meter does so to a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                        Chapter 4

                                        Results

                                        This chapter presents the results for all the models presented in the previous chapter

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5 %     0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                        Prediction
                        Label 1   Label 2
Actual     Label 1      109       1
           Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4 %     0.826   0.907   3.01
MSE                  1195          93.3 %     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                        Prediction
                        Label 1   Label 2
Actual     Label 1      82        29
           Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                        Prediction
                        Label 1   Label 2
Actual     Label 1      69        41
           Label 2      11        659


                                        Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels, using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed due to the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is of an overall better quality for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                        Chapter 6

                                        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                        Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


value depends on current and past values. The difference between the two is that while the AR model intends to model and predict the observed variable, the MA model intends to model the error term as a linear combination of the error terms that occur simultaneously and at different past times.

                                          Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a generalisation of the autoregressive moving average model (ARMA) and can be fitted to time series data in order to obtain a better understanding of the data, or be used as a forecasting method to predict future data points. While ARMA requires the data to be completely stationary, i.e. the mean and variance do not change over time, ARIMA can process non-stationary time series by removing the non-stationary nature of the data. This means that non-stationary time series data must be processed before they can be modelled. Removing the trend makes the mean value of the data stationary, something that is done by simply differencing the series. To make the series stationary in variance, one of the best methods is to apply a log transform to the series. Combining the two methods by differencing the log-transformed data makes the entire series stationary in both mean and variance and allows the dataset to be processed by the model.
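To make the differencing and log-transform idea concrete, the following is a minimal sketch in Python using pandas and statsmodels; the thesis does not prescribe any particular library, and the flow-rate series here is a synthetic placeholder rather than real sensor data.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Hypothetical, slowly decaying flow-rate signal with noise (not the thesis data).
flow = pd.Series(np.linspace(300.0, 250.0, 200) + rng.normal(0.0, 2.0, 200))

log_flow = np.log(flow)                  # log transform stabilises the variance
stationary = log_flow.diff().dropna()    # first difference removes the trend in the mean

# Equivalent: let the model difference internally by setting d = 1 in (p, d, q).
fit = ARIMA(log_flow, order=(1, 1, 0)).fit()
forecast = fit.forecast(steps=6)         # six future data points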

A further developed version of the ARIMA model is the Seasonal ARIMA (SARIMA) model. The SARIMA model applies a seasonal differencing of the necessary order to remove non-stationarity from the time series. ARIMA's and SARIMA's strength is particularly identified as their ability to predict future data points for univariate time series. In a comparison published by Adhikari et al. [30], a SARIMA model is seen to outperform both neural networks and support-vector machines in forecast estimation.

2.3 Neural Networks

2.3.1 Overview

NNs exist as a subgroup of the ML domain and are used in a range of fields, such as computer vision, predictive analytics, medical diagnosis, and more [31]. An NN is a powerful tool for data analysis that, similar to other ML programmes, performs its tasks based on inference and patterns rather than explicitly set instructions. The cognitive capabilities of NNs have been used for regression analysis and classification in both supervised and unsupervised learning. The NN passes some inputs from an input layer to one or more hidden layers and then to an output layer. The sizes of the input and the output layer are dependent on the input and the output data. Each node in the input layer corresponds to the available input data x, and each node in the output layer corresponds to the desired output y. The nodes are often referred to as neurons, and while the neurons in the input and output layers always represent the supplied data, the neurons in the hidden layer may have very different properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important that input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.20)

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
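As a small illustration of Equation 2.20, the perceptron rule can be written in a few lines of Python; the weights and bias below are arbitrary example values, not taken from the thesis.

import numpy as np

def perceptron(x, w, b):
    # Equation 2.20: output 1 if w . x + b > 0, otherwise 0.
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1.0, 0.0, 1.0])    # binary inputs
w = np.array([0.5, -0.2, 0.3])   # example weights
b = -0.6                         # example bias
print(perceptron(x, w, b))       # 0.5 + 0.3 - 0.6 = 0.2 > 0, so the output is 1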

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, and thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something that the step function is also unable to fulfil.

Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (2.21)

for

z = \sum_{j} w_j \cdot x_j + b    (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its use as an activation function is the vanishing gradient problem, which is further discussed in Section 2.3.4.

                                          Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x^{+} = \max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit, or ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or approaching zero, which is known as the dying ReLU problem [34].

                                          Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x \cdot \mathrm{sigmoid}(\beta x)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9% and 0.6% respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
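The three activation functions discussed above can be summarised in a short sketch (plain NumPy, purely illustrative):

import numpy as np

def sigmoid(z):                   # Equation 2.21
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):                      # Equation 2.23, the positive part of the argument
    return np.maximum(0.0, x)

def swish(x, beta=1.0):           # Equation 2.24, beta trainable or a constant
    return x * sigmoid(beta * x)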

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed-forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                                          Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^{(1)} + f^{(2)} + \dots + f^{(n)}    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The proposed strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures of the deep networks.

Mentioned in Section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilisation of the two SNNs presented in Section 2.3.4, and it is further completed by the following NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0\;0\;1\;1\;0\;0\;0] \qquad x_2 = [0\;0\;0\;1\;1\;0\;0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem, there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

                                          Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_{t-1}), input at the current time step (x_t), and respective gate bias (b_x), as

i_t = \sigma(\omega_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(\omega_o [h_{t-1}, x_t] + b_o)
f_t = \sigma(\omega_f [h_{t-1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it may be necessary to forget some of the characters from the previous chapter [43].
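A minimal NumPy sketch of Equation 2.26 is given below; the dimensions and weight values are arbitrary and only serve to show how the three gates are computed from h_{t-1} and x_t.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    # Each gate applies the sigmoid to a weighted combination of the
    # previous block output h_{t-1} and the current input x_t (Eq. 2.26).
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)   # input gate
    o_t = sigmoid(W_o @ z + b_o)   # output gate
    f_t = sigmoid(W_f @ z + b_f)   # forget gate
    return i_t, o_t, f_t

h_prev, x_t = np.zeros(3), np.ones(2)       # example state and input
W = np.random.rand(3, 5)                    # example weights (shared here for brevity)
b = np.zeros(3)                             # example bias
i_t, o_t, f_t = lstm_gates(h_prev, x_t, W, W, W, b, b, b)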

                                          Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate less, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalisation, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].

                                          Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and it also removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
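The difference between the two pooling modes can be shown with a small NumPy sketch over a hypothetical 1-D feature map (non-overlapping windows, stride equal to the pool size):

import numpy as np

def pool1d(x, size=2, mode="max"):
    # Split the feature map into windows of `size` and keep either the
    # maximum or the average of each window.
    windows = x[: len(x) // size * size].reshape(-1, size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

feature_map = np.array([1.0, 3.0, 2.0, 0.5, 4.0, 4.5])
print(pool1d(feature_map, mode="max"))    # [3.   2.   4.5 ]
print(pool1d(feature_map, mode="mean"))   # [2.   1.25 4.25]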

Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47], and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.

                                          Chapter 3

                                          Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.

During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
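The labelling rule can be sketched as a small Python function; the qualitative trend categories below are hypothetical illustrations of the rule described above, not the actual script used in the thesis.

def clogging_label(dp_trend, flow_trend):
    # dp_trend / flow_trend: qualitative trends of the differential pressure and
    # the system flow over a window, e.g. "flat", "rising", "exponential", "falling".
    if dp_trend == "flat":
        return 1                                   # no clogging
    if dp_trend == "rising" and flow_trend in ("flat", "slightly_falling"):
        return 2                                   # beginning to clog
    if dp_trend == "exponential" and flow_trend == "falling":
        return 3                                   # heavy clogging (never observed in the data)
    return 1

print(clogging_label("rising", "flat"))            # -> 2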

Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685         685                        0
II        220          25                      195
III       340          35                      305
IV        210          11                      199
V         375          32                      343
VI        355           7                      348
VII       360          78                      282
VIII      345          19                      326
IX        350          10                      340
X         335          67                      268
XI        340          43                      297
Total    3915        1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
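A minimal sketch of one hot encoding the clogging labels, assuming scikit-learn is available (the thesis does not name the library used):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [2], [1]])                  # clogging labels as a column
onehot = OneHotEncoder().fit_transform(labels).toarray()
print(onehot)   # [[1. 0.], [0. 1.], [0. 1.], [1. 0.]] -- one column per label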

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)}    (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert to the original values after processing.
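A sketch of the min-max scaling in Equation 3.1, assuming scikit-learn; the sample values are made up, and the inverse transform recovers the original scale after prediction.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.2, 300.0],
              [0.8, 250.0],
              [1.4, 180.0]])                  # e.g. differential pressure and flow (assumed values)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)            # every feature now lies in [0, 1]
X_back = scaler.inverse_transform(X_scaled)   # reverts to the original values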

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The difference resulting from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), \dots, V_{n-1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t-5), V_2(t-5), \dots, V_{n-1}(t), V_n(t)]    (3.3)
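A sketch of the sequencing idea in Equations 3.2 and 3.3, written as a plain NumPy windowing function; the variable names and the random placeholder data are illustrative only.

import numpy as np

def make_sequences(data, n_past=5):
    # Each sample X[i] holds the n_past previous time steps of every feature,
    # and y[i] is the observation one step (5 s) ahead.
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])
        y.append(data[i])
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)        # (samples, features): four sensor variables assumed
X, y = make_sequences(data)
print(X.shape, y.shape)              # (95, 5, 4) (95, 4)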

When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before passing them to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights towards achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
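A sketch of the described LSTM network, assuming a Keras/TensorFlow implementation (the thesis does not state which framework was used); the layer sizes, activations, and early-stopping patience follow the text, while everything else, including the optimiser, is assumed.

from tensorflow.keras import callbacks, layers, models

n_features = 4   # assumed: the four sensor variables
model = models.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(5, n_features)),   # 5 past time steps (25 s window)
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # one predicted parameter
])
model.compile(optimizer="adam", loss="mae")     # MAE or MSE, cf. Section 3.3

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, epochs=1500,
#           validation_data=(X_val, y_val), callbacks=[early_stop])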

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just as for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF are the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
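A corresponding sketch of the CNN, again assuming Keras/TensorFlow; the filter count, kernel size, pool size, dense-layer sizes, and early-stopping patience follow the text, while the activations and optimiser are assumptions.

from tensorflow.keras import callbacks, layers, models

n_features = 4   # assumed: the four sensor variables
model = models.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(12, n_features)),   # 12 past observations (60 s)
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                               # 6 future observations (30 s ahead)
])
model.compile(optimizer="adam", loss="mse")        # MSE or MAE, cf. Section 3.3
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=150)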

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks on a classification problem than on a regression problem, and vice versa. This led to different loss functions being used when training the network to predict future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to them.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
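The binary cross-entropy (log loss) itself can be written out in a few lines; the probabilities below are invented purely to show how confident but wrong predictions are penalised.

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Log loss: the mean negative log-likelihood of the true labels.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])
print(binary_cross_entropy(y_true, y_pred))   # approximately 0.30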

                                          30

                                          34 HARDWARE SPECIFICATIONS

                                          Figure 37 Overview of how identical values belong to both class 1 (asterisk) andclass 2 (dot)

                                          34 Hardware Specifications

                                          The testing rig consists of an external pump that regulates the inflow of waterFollowing the pump is the filter housing containing the basket filter mesh Thefilter housing is equipped with two pressure transducers one before the filter meshand one after the filter mesh The two transducers give the differential pressure overthe filter and the transducer before the filter give the system pressure The flowindicator transmitter that records the system flow rate is mounted on the outletfrom the filter housing The backflush flow meter is connected to the backflushoutlet The inflow of water and the flow through the backflush outlet are bothcontrolled by individual valves

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                          Chapter 4

                                          Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5 %     0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                    Prediction
                    Label 1   Label 2
Actual   Label 1    109       1
         Label 2    3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4 %     0.826   0.907   3.01
MSE                  1195          93.3 %     0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    82        29
         Label 2    38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    69        41
         Label 2    11        659


                                          Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time in coping with a large change in value that occurs in the differential pressure data, which isn't unlikely as this regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted data do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon in comparison to the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                          Chapter 6

                                          Future Work

In this thesis work, it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. if classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN when data containing all clogging labels are available.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                          Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score. Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss. Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                                            CHAPTER 2 FRAME OF REFERENCE

properties. The result of this is a range of different hidden layers with varying characteristics. The use and configuration of these hidden layers in turn depend on what the NN should be able to achieve.

2.3.2 The Perceptron

The simplest neuron is the perceptron. The perceptron takes several binary inputs from the input layer to create a single binary output. What a perceptron outputs is based on the weighted sum of the perceptron's inputs and their respective weights, as well as an individual bias. There is a corresponding weight to every input that determines how important the input is for the output of the perceptron. Meanwhile, the bias dictates how easy it is for the perceptron to output either a 0 or a 1. These conditions give the rule of the perceptron as [32]

output = 0 if w · x + b ≤ 0
         1 if w · x + b > 0          (2.20)

In the above equation, x is the input vector, w the weight vector, and b the perceptron's individual bias.
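
A minimal sketch of such a perceptron, assuming NumPy is available (the weights and bias below are arbitrary illustration values):

    import numpy as np

    def perceptron(x, w, b):
        # Equation 2.20: weighted sum plus bias, passed through a binary step.
        return 1 if np.dot(w, x) + b > 0 else 0

    x = np.array([1, 0, 1])         # binary inputs
    w = np.array([0.6, 0.4, 0.2])   # weights: how important each input is
    b = -0.5                        # bias: how easy it is to output a 1

    print(perceptron(x, w, b))      # 1, since 0.6 + 0.2 - 0.5 = 0.3 > 0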

2.3.3 Activation functions

The output of a neuron is determined by an activation function. For the perceptron, the activation function is a simple step function, as can be seen in Equation 2.20. The step function is the simplest activation function and is only able to produce strictly binary results. It is used for classification of linearly separable data in single-layer perceptrons like the one in Equation 2.20. Its binary nature means that a small change in weight or bias can flip the output of the neuron completely, resulting in false classification. Furthermore, as most networks consist of either multiple perceptrons in a layer or multiple layers, the data will not be linearly separable, thus the step function will not properly separate and classify the input data. The separation problem is solved by training using backpropagation, which requires a differentiable activation function, something the step function is also unable to fulfil.

                                            Sigmoid Function

The sigmoid function emerged as a solution to the flaws of the step function. The sigmoid function can take any value between 0 and 1 and is determined by

f(z) = σ(z) = 1 / (1 + e^(−z))          (2.21)

for

z = Σ_j w_j · x_j + b          (2.22)


Only by using the sigmoid function as activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                                            Rectified Function

                                            The rectifier activation function is defined as the positive part of its argument [34]

f(x) = x⁺ = max(0, x)          (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

                                            Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x · sigmoid(βx)          (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the classification accuracy on widely used datasets such as ImageNet and Mobile NASNet-A by 0.9 % and 0.6 % respectively [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
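
The three activation functions above can be sketched in a few lines of NumPy (a simple illustration of Equations 2.21, 2.23 and 2.24, not code from the thesis):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))       # Equation 2.21

    def relu(x):
        return np.maximum(0.0, x)             # Equation 2.23

    def swish(x, beta=1.0):
        return x * sigmoid(beta * x)          # Equation 2.24, beta trainable or constant

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z))   # approximately [0.119, 0.5, 0.881]
    print(relu(z))      # [0., 0., 2.]
    print(swish(z))     # approximately [-0.238, 0., 1.762]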

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                                            Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                                            Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. Explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^(n)( ... f^(2)(f^(1)(x)) ... )          (2.25)

where each function represents a layer and they all together describe a process. The purpose of adding additional layers is to break up the main function into many functions, so that certain functions do not have to be all-descriptive but instead only take into consideration certain behaviour. The proposed strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but the results suggest that the strength of deep learning (and DNNs) may be due to well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN-configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons allow the neurons to feed information from the previous pass of data to themselves. The keeping of previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly huge that the weights are impossible to adjust.

                                            Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), previous LSTM-block output at the previous time step (h_(t−1)), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = σ(ω_i [h_(t−1), x_t] + b_i)
o_t = σ(ω_o [h_(t−1), x_t] + b_o)          (2.26)
f_t = σ(ω_f [h_(t−1), x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could be perceived as odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
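
A minimal sketch of the three gate activations in Equation 2.26, using NumPy with toy dimensions (the weights are random illustration values; this is not a full LSTM cell):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_gates(h_prev, x_t, w_i, w_o, w_f, b_i, b_o, b_f):
        # Each gate sees the previous block output h_(t-1) concatenated with
        # the current input x_t, as in Equation 2.26.
        hx = np.concatenate([h_prev, x_t])
        i_t = sigmoid(w_i @ hx + b_i)   # input gate: how much new information is stored
        o_t = sigmoid(w_o @ hx + b_o)   # output gate: how much of the cell state is exposed
        f_t = sigmoid(w_f @ hx + b_f)   # forget gate: how much old information is dismissed
        return i_t, o_t, f_t

    # Toy sizes: hidden state of 2, input of 3, so each weight matrix is 2 x 5.
    rng = np.random.default_rng(0)
    h_prev, x_t = rng.normal(size=2), rng.normal(size=3)
    w = [rng.normal(size=(2, 5)) for _ in range(3)]
    b = [np.zeros(2) for _ in range(3)]
    print(lstm_gates(h_prev, x_t, *w, *b))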

                                            Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                            Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers to allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel will be the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value, and also removes noise by reducing the dimensionality of the data. For average pooling, the noise remains within the data, although it is reduced somewhat through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input

The procedure of passing the data through a convolutional layer and a pooling layer can be done repeatedly to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map
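
The chain of convolution, max pooling and flattening described above can be sketched for the 1-dimensional case (illustrative NumPy code mirroring Figures 2.4 to 2.6, not an implementation from the thesis):

    import numpy as np

    def conv1d(x, kernel):
        # Slide a kernel across a 1-dimensional input (no padding), producing
        # the convolved feature as in Figure 2.4.
        k = len(kernel)
        return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

    def max_pool1d(x, pool_size=2):
        # Keep the maximum value within each pool, as in Figure 2.5.
        n = len(x) // pool_size
        return np.array([x[i * pool_size:(i + 1) * pool_size].max() for i in range(n)])

    x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0, 2.0])
    feature = conv1d(x, kernel=np.array([0.5, 1.0, 0.5]))  # convolved feature, length 6
    pooled = max_pool1d(feature, pool_size=2)              # reduced spatial size, length 3
    flat = pooled.flatten()                                # flattening layer -> network input
    print(feature, pooled, flat)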


                                            Chapter 3

                                            Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
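
A simplified sketch of what such a labelling script could look like is given below; the slope window and thresholds are illustrative placeholders, not the values used in the thesis, where the labelling was verified by visual inspection:

    import numpy as np

    def label_clogging(diff_pressure, system_flow, window=12,
                       steady_slope=0.002, rapid_slope=0.02, flow_drop=0.8):
        # Assign clogging labels 1-3 from the trends described above.
        labels = np.ones(len(diff_pressure), dtype=int)
        start_dp = diff_pressure[0]
        start_flow = system_flow[0]
        for t in range(window, len(diff_pressure)):
            slope = (diff_pressure[t] - diff_pressure[t - window]) / window
            if slope > rapid_slope and system_flow[t] < flow_drop * start_flow:
                labels[t] = 3   # rapid increase and drastically decreasing flow
            elif slope > steady_slope and diff_pressure[t] > start_dp:
                labels[t] = 2   # steady increase, flow constant or slightly receding
        return labels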


Figure 3.3: Test data from one test labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the amount of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks and evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model being forced to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data is within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

1, 2, 3  →  [1 0 0], [0 1 0], [0 0 1]

or

red, blue, green  →  [1 0 0], [0 1 0], [0 0 1]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because the aim is to predict all the actual classification labels equally rather than to prioritise a certain category. Seger [49] has shown the precision of one hot encoding to be equal to that of other equally simple encoding techniques, while Potdar et al. [50] show that one hot encoding achieves noticeably higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
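As an illustration, the clogging labels could be one hot encoded with scikit-learn as sketched below. The thesis does not state which implementation was used, so this is only one possible realisation.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([[1], [2], [1], [2]])           # clogging labels as a column vector
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels).toarray()  # 1 -> [1, 0], 2 -> [0, 1]
restored = encoder.inverse_transform(onehot)      # the transform is reversible
```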

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
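A short sketch of how the min-max scaling could be applied with scikit-learn, using made-up example values; the scaler would be fitted on the training data and the inverse transform applied to the network output:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[0.20, 310.0],      # e.g. [differential pressure, system flow]
                   [0.55, 305.0],
                   [1.40, 280.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)        # every feature mapped into [0, 1]
restored = scaler.inverse_transform(scaled)  # easy to revert to the original units
```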

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case the function dictates the scale of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The difference from the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]    (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]    (3.3)
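Such a sequencing function is essentially a sliding window over the dataset. The sketch below is an assumed implementation, not the author's actual code; the windowed array is later reshaped into the three dimensions the LSTM expects.

```python
import numpy as np

def to_sequences(data, n_past=5):
    """Pair each observation with the n_past observations preceding it.

    data has shape (samples, features); X gets shape (samples - n_past,
    n_past, features) and y is the observation one time step ahead.
    """
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # window of past measurements
        y.append(data[i])              # value at the next time step
    return np.array(X), np.array(y)

X, y = to_sequences(np.random.rand(100, 5), n_past=5)   # X: (95, 5, 5), y: (95, 5)
```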

When sequenced, the data is split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network is compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way helps ensure that the network is not overfitted to the training data.
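A Keras sketch of a network with this architecture is given below. The layer sizes, activations, epoch limit and early-stopping patience follow the text, while the optimizer, the placeholder data and the variable names are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 5                     # 5 past observations per prediction

# Placeholder arrays standing in for the sequenced, scaled sensor data.
X_train, y_train = np.random.rand(640, n_steps, n_features), np.random.rand(640, 1)
X_val, y_val = np.random.rand(160, n_steps, n_features), np.random.rand(160, 1)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True, input_shape=(n_steps, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),            # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")    # MAE and MSE were both evaluated

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)
```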

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20% respectively of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the amount of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
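A corresponding Keras sketch of the CNN described above follows. The filter count, kernel size, pooling size, dense-layer sizes and early-stopping settings are taken from the text, whereas the convolutional activation, optimizer and placeholder data are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features, n_future = 12, 5, 6        # 60 s of history, 30 s of predictions

# Placeholder arrays standing in for the windowed, scaled sensor data.
X_train, y_train = np.random.rand(640, n_past, n_features), np.random.rand(640, n_future)
X_val, y_val = np.random.rand(160, n_past, n_features), np.random.rand(160, n_future)

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu", input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),                 # reduce the feature map
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_future),                           # 6 future observations of one parameter
])
model.compile(optimizer="adam", loss="mae")    # MAE and MSE were both evaluated

early_stop = EarlyStopping(monitor="val_loss", patience=150)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1500, callbacks=[early_stop], verbose=0)
```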

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20% respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of correct classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
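For reference, the regression metrics reported in Chapter 4 can be computed as in the following sketch, with placeholder arrays standing in for the actual and predicted values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([0.30, 0.42, 0.55, 0.61])    # placeholder actual values
y_pred = np.array([0.28, 0.45, 0.52, 0.66])    # placeholder predicted values

mse = mean_squared_error(y_true, y_pred)       # penalizes outlier errors heavily
rmse = np.sqrt(mse)                            # error in the units of the variable
mae = mean_absolute_error(y_true, y_pred)      # more robust to outliers
r2 = r2_score(y_true, y_pred)                  # coefficient of determination
```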

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
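Binary cross-entropy over N samples, with true labels y_i in {0, 1} and predicted probabilities p_i, is defined as follows (standard definition, not written out in the text):

```latex
\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]
```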

Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).

The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.

                                            Chapter 4

                                            Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.

Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.

Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.

Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                    Predicted Label 1    Predicted Label 2
Actual Label 1      109                  1
Actual Label 2      3                    669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.

Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.

Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Predicted Label 1    Predicted Label 2
Actual Label 1      82                   29
Actual Label 2      38                   631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Predicted Label 1    Predicted Label 2
Actual Label 1      69                   41
Actual Label 2      11                   659


                                            Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with a large change in value that occurs in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead, it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                            Chapter 6

                                            Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The plus-side of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, the time criticality of the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                            Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 11 pages, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.




Only by using the sigmoid function as the activation function can outputs be properly and accurately estimated for classification of probabilities in deep neural nets [33]. However, the sigmoid function is not flawless, and an issue that arises with its usage as an activation function is the vanishing gradient problem, which is further discussed in section 2.3.4.

                                              Rectified Function

The rectifier activation function is defined as the positive part of its argument [34]:

f(x) = x+ = max(0, x)    (2.23)

for a neuron input x. In comparison to the earlier activation functions, a unit utilising the rectified activation function (also known as a rectified linear unit or a ReLU unit) is more computationally efficient. However, because of the structure of the ReLU unit, it cannot properly process inputs that are either negative or that approach zero, also known as the dying ReLU problem [34].

                                              Swish Function

Proposed in Ramachandran et al. [35] is a replacement for ReLU called the Swish function. The Swish function activates a neuron through

f(x) = x · sigmoid(βx)    (2.24)

where β is a trainable parameter or simply a constant. Swish has been shown to improve the top-1 classification accuracy on ImageNet by 0.9% for the Mobile NASNet-A model and by 0.6% for Inception-ResNet-v2 [35]. However, results have also shown that Swish, in comparison to other activation functions, has a severely increased training time and that it occasionally performs worse in terms of accuracy [36].
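A direct transcription of Equations 2.23 and 2.24 into code, using NumPy purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)        # f(x) = max(0, x), Equation 2.23

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)     # f(x) = x * sigmoid(beta * x), Equation 2.24
```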

2.3.4 Neural Network Architectures

The following section presents an overview of various existing types of neural networks, what they are primarily used for, and their individual differences.

                                              Shallow Neural Networks (SNN)

SNNs are the first generation of NNs and contain only one hidden layer. The hidden layer is fully connected, i.e. it connects every neuron in one layer (the input layer) to every neuron in another layer (the output layer). Two varieties of SNNs exist: the feed forward neural network (FF) and the radial basis network (RBF). An FF uses the sigmoid function as its activation function, while the RBF uses the radial basis function, which estimates how far off the estimation is from the actual target. To further distinguish the two, FFs are mainly used for classification and decision making, whereas RBFs are typically used for function approximation and machine tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                                              Deep Neural Networks (DNN)

DNNs contain more hidden layers in comparison to an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^(1) + f^(2) + ... + f^(n)    (2.25)

where each function represents a layer and all of them together describe a process. The purpose of adding additional layers is to break the main function up into many functions, so that certain functions do not have to be all-descriptive but instead only take certain behaviour into consideration. The strategy of adding up towards thousands of layers has proved to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well matched architectures and existing training procedures of the deep networks.

Mentioned in section 2.3.1 is that the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in section 2.3.4, and is further completed by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. These state-based neurons are able to feed information from the previous pass of data to themselves. Keeping previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x1 = [0 0 1 1 0 0 0]
x2 = [0 0 0 1 1 0 0]

where x1 and x2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron individually is also weight dependent.


The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information as the weights become saturated through time, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                                              Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM-block output at the previous time step (h_{t−1}), the input at the current time step (x_t) and the respective gate bias (b_x), as

i_t = σ(ω_i [h_{t−1}, x_t] + b_i)
o_t = σ(ω_o [h_{t−1}, x_t] + b_o)
f_t = σ(ω_f [h_{t−1}, x_t] + b_f)    (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it can be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
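As an illustration of Equation 2.26, the sketch below computes the three gate activations for a single time step in NumPy; the function and variable names are illustrative and not taken from the thesis implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
        # Concatenate previous block output and current input, then squash each
        # gate's weighted sum into (0, 1) with the sigmoid, as in Equation 2.26.
        z = np.concatenate([h_prev, x_t])
        i_t = sigmoid(W_i @ z + b_i)   # input gate
        o_t = sigmoid(W_o @ z + b_o)   # output gate
        f_t = sigmoid(W_f @ z + b_f)   # forget gate
        return i_t, o_t, f_t

    rng = np.random.default_rng(0)
    h_prev, x_t = rng.normal(size=4), rng.normal(size=3)   # hidden size 4, input size 3
    W = lambda: rng.normal(size=(4, 7))
    print(lstm_gates(h_prev, x_t, W(), W(), W(), np.zeros(4), np.zeros(4), np.zeros(4)))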

                                              Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                              Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
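To make the difference between the two pooling modes concrete, here is a minimal NumPy sketch of non-overlapping 1-D pooling; the function name and the example values are illustrative only.

    import numpy as np

    def pool_1d(x, pool_size=2, mode="max"):
        # Split the convolved feature into non-overlapping windows and reduce
        # each window to its maximum or its mean.
        trimmed = x[: len(x) // pool_size * pool_size]
        windows = trimmed.reshape(-1, pool_size)
        return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

    feature = np.array([1.0, 3.0, 2.0, 0.5, 4.0, 4.0])
    print(pool_1d(feature, 2, "max"))   # [3.  2.  4.]
    print(pool_1d(feature, 2, "mean"))  # [2.   1.25 4.  ]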

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data, and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                              Chapter 3

                                              Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Data points were sampled every 5 seconds and contain sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
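A labelling script along these lines could look like the sketch below; the thresholds, names and slope heuristics are purely illustrative assumptions and not the rules used in the thesis, which are stated qualitatively above.

    import numpy as np

    def label_clogging(dp, flow, slope_thresh=0.02, flow_drop_thresh=0.25):
        # Assign clogging labels 1-3 from differential pressure and system flow.
        # All thresholds are hypothetical placeholders; assumes a non-zero starting flow.
        labels = np.ones(len(dp), dtype=int)          # label 1: no clogging
        dp_slope = np.gradient(dp)                    # change per 5 s sample
        flow_drop = (flow[0] - flow) / flow[0]        # relative drop from the start
        steady_rise = (dp_slope > slope_thresh) & (dp > dp[0])
        labels[steady_rise] = 2                       # label 2: beginning to clog
        severe = (dp_slope > 5 * slope_thresh) & (flow_drop > flow_drop_thresh)
        labels[severe] = 3                            # label 3: heavy clogging
        return labels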


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels together with the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters.


A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test     Samples   Points labelled clog-1   Points labelled clog-2
I        685       685                      0
II       220       25                       195
III      340       35                       305
IV       210       11                       199
V        375       32                       343
VI       355       7                        348
VII      360       78                       282
VIII     345       19                       326
IX       350       10                       340
X        335       67                       268
XI       340       43                       297

Total    3915      1012                     2903

After preprocessing, the dataset contains 3915 samples in total, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables.


The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    or    [red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritize a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simpler encoding techniques, but that there are also more sophisticated options available that achieve even higher accuracy.
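A minimal NumPy sketch of the idea, with illustrative clogging labels (the helper name is hypothetical):

    import numpy as np

    def one_hot(labels):
        # One binary column per unique category, a single 1 per row.
        categories = np.unique(labels)
        return (labels[:, None] == categories[None, :]).astype(float), categories

    labels = np.array([1, 2, 1, 2, 2])
    encoded, categories = one_hot(labels)
    print(categories)  # [1 2]
    print(encoded)     # [[1. 0.] [0. 1.] [1. 0.] [0. 1.] [0. 1.]]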

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))    (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
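For example, scikit-learn's MinMaxScaler applies Equation 3.1 per feature and supports the inverse transform; the numbers below are placeholder sensor values, not data from the tests.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    data = np.array([[0.12, 250.0],     # e.g. differential pressure, system flow
                     [0.30, 240.0],
                     [0.55, 190.0]])

    scaler = MinMaxScaler()                       # default feature range [0, 1]
    scaled = scaler.fit_transform(data)           # Equation 3.1 column by column
    restored = scaler.inverse_transform(scaled)   # back to the original values
    print(scaled)
    print(np.allclose(restored, data))            # True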

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the amount of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The expansion of the features is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V1(t), V2(t), ..., Vn−1(t), Vn(t)]    (3.2)

X(t) = [V1(t−5), V2(t−5), ..., Vn−1(t), Vn(t)]    (3.3)
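A sequencing function of this kind is essentially a sliding window over the multivariate series. The sketch below is an assumed minimal version: with the 5 s sampling used here, n_past=5 gives the 25-second window and a one-step (5 s) look-ahead; the function and argument names are not from the thesis code.

    import numpy as np

    def make_sequences(data, n_past=5, n_future=1):
        # data: array of shape (time steps, features), one row per 5 s sample.
        X, y = [], []
        for i in range(n_past, len(data) - n_future + 1):
            X.append(data[i - n_past:i, :])        # window of past observations
            y.append(data[i + n_future - 1, :])    # values n_future steps ahead
        return np.array(X), np.array(y)

    series = np.random.rand(100, 5)                # placeholder sensor data + label
    X, y = make_sequences(series)
    print(X.shape, y.shape)                        # (95, 5, 5) (95, 5)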


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the amount of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer, which uses the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
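Put together, a Keras sketch of this setup could look as follows. The optimizer, the placeholder data and the variable names are assumptions; only the layer sizes, activations, loss choices, epoch count and early-stopping patience come from the description above.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_steps, n_features = 5, 5                      # 25 s window, illustrative feature count
    X_train = np.random.rand(200, n_steps, n_features)   # placeholder data
    y_train = np.random.rand(200, 1)
    X_val, y_val = X_train[:40], y_train[:40]

    model = Sequential([
        LSTM(32, activation="relu", return_sequences=True,
             input_shape=(n_steps, n_features)),
        LSTM(32, activation="relu"),
        Dense(1, activation="sigmoid"),             # single-parameter prediction
    ])
    model.compile(optimizer="adam", loss="mae")     # the second run swaps in loss="mse"

    stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, epochs=1500, validation_data=(X_val, y_val),
              callbacks=[stop], verbose=0)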

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the amount of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds.


The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the amount of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired amount of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
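A corresponding Keras sketch of the CNN described above is given below; again, the optimizer, hidden activations and placeholder data are assumptions, while the filter count, kernel size, pool size, dense layer sizes, epoch count and early-stopping patience follow the text.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    n_steps, n_features, n_out = 12, 5, 6           # 60 s of past data, 6 future steps
    X_train = np.random.rand(200, n_steps, n_features)   # placeholder data
    y_train = np.random.rand(200, n_out)

    model = Sequential([
        Conv1D(64, kernel_size=4, activation="relu",
               input_shape=(n_steps, n_features)),  # 64 filters over 4 time steps
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation="relu"),
        Dense(n_out),                               # one node per predicted step
    ])
    model.compile(optimizer="adam", loss="mae")     # or loss="mse" for the second run

    stop = EarlyStopping(monitor="val_loss", patience=150)
    model.fit(X_train, y_train, epochs=1500, validation_split=0.2,
              callbacks=[stop], verbose=0)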

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20% respectively.


The testing set was split into the same fractions, but only the fraction of 20% was kept to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model Evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the network as a classifier than for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce what clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
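For reference, binary cross-entropy can be written out in a few lines; this is a generic sketch of the metric, not code from the thesis, and the probabilities below are made up.

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        # Average log loss; predictions are clipped to avoid log(0).
        p = np.clip(y_pred, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

    y_true = np.array([1, 0, 1, 1])
    y_pred = np.array([0.9, 0.2, 0.8, 0.6])
    print(binary_cross_entropy(y_true, y_pred))   # lower is better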


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                              Chapter 4

                                              Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                    Prediction
                 Label 1   Label 2
Actual  Label 1      109         1
        Label 2        3       669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Prediction
                 Label 1   Label 2
Actual  Label 1       82        29
        Label 2       38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Prediction
                 Label 1   Label 2
Actual  Label 1       69        41
        Label 2       11       659


                                              Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained with the MAE loss function managed to achieve better performance than the network trained with the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate.


The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations.


A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead.


Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                              Chapter 6

                                              Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                                              Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki, fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification


models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

                                              [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                              [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                              [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                              [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                              [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv:1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography


tuning [37, 38]. As for data preparation, an SNN input layer is restricted to strictly processing 2D-arrays as input data.

                                                Deep Neural Networks (DNN)

DNNs contain more hidden layers than an SNN. As explained in Goodfellow et al. [39], the multiple layers of the network can be thought of as approximating a main function using multiple functions

f(x) = f^(n)( ... f^(2)( f^(1)(x) ) ... )                                (2.25)

where each function f^(i) represents a layer and the layers together describe the complete mapping. The purpose of adding additional layers is to break up the main function into many simpler functions, so that no single function has to be all-descriptive but instead only needs to capture certain behaviour. The strategy of adding up towards thousands of layers has been shown to continuously improve the performance of DNNs, as presented by Zagoruyko et al. [40]. However, adding layers does not necessarily mean that the obtained performance is the best in terms of accuracy and efficiency, as the same paper shows better results for many benchmark datasets using only 16 layers. The same contradiction is explored and validated by Ba et al. [41]. The reason for these contradictory results is unclear, but they suggest that the strength of deep learning (and DNNs) may come from well-matched architectures and existing training procedures for the deep networks.

As mentioned in Section 2.3.1, the configuration of the hidden layers depends on the objective of the NN. The statement is partially proven true by the differences in utilization of the two SNNs presented in Section 2.3.4, and is further illustrated by a number of NN configurations.

Recurrent Neural Networks (RNN)

Unlike the neurons in SNNs and DNNs, an RNN has neurons which are state-based. The state-based neurons can feed information from the previous pass of data back to themselves. Retaining previous information allows the network to process and evaluate the context of how the information sent through the network is structured. An example is the ability to differentiate the two vectors

x_1 = [0 0 1 1 0 0 0]
x_2 = [0 0 0 1 1 0 0]

where x_1 and x_2 are structurally different. Were the sequence of the elements in the vectors to be ignored, there would be no difference between them. Looking at context and sequencing thereby enables the analysis of events and data over time. The addition of state-based neurons leads to more weights being updated through each iteration, as the adjustment of each state-based neuron is also individually weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function can lead to the vanishing gradient problem [42], which occurs when the gradient becomes so small that it prevents the neuron from updating its weights. The result is a rapid loss of information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so large that the weights are impossible to adjust.

                                                Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). The inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM-block output at the previous time step (h_{t−1}), the input at the current time step (x_t) and the respective gate bias (b_x):

i_t = σ(ω_i · [h_{t−1}, x_t] + b_i)
o_t = σ(ω_o · [h_{t−1}, x_t] + b_o)                                      (2.26)
f_t = σ(ω_f · [h_{t−1}, x_t] + b_f)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information may seem odd at first, but for sequencing it can be valuable when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].
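The gate computations in Equation 2.26 can be illustrated with a short NumPy sketch. This is not the code used in the thesis; the weight shapes, the sigmoid helper and the layer sizes are assumptions made purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W_i, W_o, W_f, b_i, b_o, b_f):
    # Concatenate the previous block output and the current input, [h_{t-1}, x_t]
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ hx + b_i)  # input gate
    o_t = sigmoid(W_o @ hx + b_o)  # output gate
    f_t = sigmoid(W_f @ hx + b_f)  # forget gate
    return i_t, o_t, f_t

# Hypothetical sizes: 3 hidden units, 4 input features
rng = np.random.default_rng(0)
h_prev, x_t = rng.normal(size=3), rng.normal(size=4)
gates = lstm_gates(h_prev, x_t,
                   rng.normal(size=(3, 7)), rng.normal(size=(3, 7)), rng.normal(size=(3, 7)),
                   np.zeros(3), np.zeros(3), np.zeros(3))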

                                                Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one gate fewer, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job as the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRUs instead of LSTMs has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                                Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. Input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is returned, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and also removes noise by reducing the dimensionality of the data. With average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].
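To make the difference between the two pooling operations concrete, the following sketch applies max and average pooling with pool size 2 to a 1-dimensional sequence; the array values are arbitrary.

import numpy as np

def pool1d(x, pool_size=2, mode="max"):
    # Trim the sequence so it divides evenly into pooling windows
    n = (len(x) // pool_size) * pool_size
    windows = x[:n].reshape(-1, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(pool1d(x, mode="max"))      # [0.9 0.3 0.8]
print(pool1d(x, mode="average"))  # [0.5 0.25 0.6]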


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                                Chapter 3

                                                Experimental Development

This chapter presents the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
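The labelling script itself is not reproduced here; the sketch below only illustrates the kind of rule described above. The pandas column names and the slope thresholds are hypothetical stand-ins.

import pandas as pd

def label_clogging(cycle, dp_slope_limit=0.02, flow_drop_limit=0.5):
    # cycle: one test cycle sampled every 5 seconds, with columns
    # 'diff_pressure' and 'system_flow' (hypothetical names)
    dp_trend = cycle["diff_pressure"].diff().rolling(12).mean()
    flow_trend = cycle["system_flow"].diff().rolling(12).mean()

    labels = pd.Series(1, index=cycle.index)                  # 1: no clogging
    labels[dp_trend > dp_slope_limit] = 2                     # 2: beginning to clog
    labels[(dp_trend > 10 * dp_slope_limit) &
           (flow_trend < -flow_drop_limit)] = 3               # 3: clogged (never observed)
    return labels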


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing.

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685            685                        0
II        220             25                      195
III       340             35                      305
IV        210             11                      199
V         375             32                      343
VI        355              7                      348
VII       360             78                      282
VIII      345             19                      326
IX        350             10                      340
X         335             67                      268
XI        340             43                      297

Total    3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the label transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

    1          1 0 0
    2    →     0 1 0
    3          0 0 1

or

    red        1 0 0
    blue   →   0 1 0
    green      0 0 1

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted equally, without assuming that one category is more important, since the aim is to predict all the actual classification labels equally rather than prioritize a certain category. Seger [49] has shown the precision of one hot encoding to be equal to that of other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simpler encoding techniques, but also that there are more sophisticated options available that achieve higher accuracy.
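As a minimal sketch of the one hot transform, here using pandas (the thesis does not state which library was used):

import pandas as pd

labels = pd.Series([1, 2, 1, 2, 2], name="clog_label")
onehot = pd.get_dummies(labels, prefix="clog")   # columns clog_1 and clog_2
print(onehot.head())                             # each row is e.g. [1, 0] or [0, 1]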

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

x_scaled = (x_i − min(x)) / (max(x) − min(x))                            (3.1)

Using the min-max scaler to normalize the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
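A sketch of the scaling step with scikit-learn's MinMaxScaler follows; the library choice is an assumption and the numbers are made up. The inverse transform shows how predictions can be mapped back to the original units.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.10, 250.0],
              [0.35, 240.0],
              [0.80, 180.0]])          # e.g. differential pressure and flow (made-up values)

scaler = MinMaxScaler()                # applies Equation 3.1 to every feature
X_scaled = scaler.fit_transform(X)     # every column now lies between 0 and 1
X_restored = scaler.inverse_transform(X_scaled)   # back to the original values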

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. This means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded in accordance with the time window. The effect of this feature expansion is described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]                         (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]                     (3.3)
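The sequencing function is not listed in the thesis; the sketch below shows one possible NumPy implementation that turns a multivariate series into windows of past observations with a one-step target, assuming the 5 past time steps used by the LSTM model.

import numpy as np

def make_sequences(data, n_past=5):
    # data: array of shape (time steps, features), sampled every 5 seconds
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t, :])   # the previous n_past observations
        y.append(data[t, :])              # the observation to predict
    return np.array(X), np.array(y)

series = np.random.rand(100, 5)           # stand-in for 100 time steps of 5 variables
X, y = make_sequences(series)
print(X.shape, y.shape)                   # (95, 5, 5) and (95, 5)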


Once sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
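The thesis does not list the implementation or the framework; assuming a Keras/TensorFlow stack, the described architecture and early-stopping set-up could look roughly as follows. The layer sizes, activations and patience are taken from the text, while the input dimensions, optimizer and everything else are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 5, 5                 # 5 past time steps, 5 variables (assumed)

model = keras.Sequential([
    layers.LSTM(32, activation="relu", return_sequences=True,
                input_shape=(n_steps, n_features)),
    layers.LSTM(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # one neuron for parameter prediction
])
model.compile(optimizer="adam", loss="mae")  # MAE or MSE, as discussed in Section 3.3

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])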

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF is the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument that decides the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
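Under the same assumption of a Keras/TensorFlow stack, a sketch matching the description of the CNN (64 filters, kernel size 4, pool size 2, a 50-node dense layer and 6 outputs) might look as follows; the input dimensions and the optimizer are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features = 12, 5               # 12 past observations (60 s), 5 variables (assumed)

model = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=4, activation="relu",
                  input_shape=(n_steps, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(50, activation="relu"),
    layers.Dense(6),                       # 6 future values, i.e. 30 seconds ahead
])
model.compile(optimizer="adam", loss="mae")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150)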

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. This adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
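A rough sketch of this classification set-up is given below: the inputs are the sensor variables, the targets are the one hot encoded clogging labels, and an 80/20 split is used. The variable names and the use of scikit-learn are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

values = np.random.rand(3915, 4)                    # stand-in for the sensor variables
labels = np.random.randint(1, 3, size=3915)         # stand-in clogging labels (1 or 2)
labels_onehot = np.stack([labels == 1, labels == 2], axis=1).astype(float)

X_train, X_val, y_train, y_val = train_test_split(
    values, labels_onehot, test_size=0.20)          # 80 % training, 20 % validation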

3.3 Model evaluation

During training of both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training them to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalized more, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
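For reference, the two loss functions follow their standard definitions over n samples, with targets y_i and predictions ŷ_i:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2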

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
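The binary cross-entropy also follows its standard form, with y_i the true label (0 or 1) and p_i the predicted probability of the positive class:

L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]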


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                                Chapter 4

                                                Results

This chapter presents the results for all the models described in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis.

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      109       1
         Label 2      3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659


                                                Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN network using the MSE loss function shows signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. Instead it is the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.

                                                44

                                                Chapter 6

                                                Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

Conversely, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration, both when deciding on the type of statistical model, type of network, network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



weight dependent. The updating of the weights depends on the activation function, and an ill-chosen activation function could lead to the vanishing gradient problem [42], which occurs when the gradient becomes incredibly small, thus preventing the neuron from updating its weights. The result of this is a rapid loss in information, as the weights through time become saturated, causing previous state information to be of no informatory value. Similar to the vanishing gradient problem there is also the exploding gradient problem, where the gradient instead becomes so incredibly large that the weights are impossible to adjust.

                                                  Long Short Term Memory (LSTM) Networks

In an effort to combat the vanishing/exploding gradient problem, LSTM networks were developed with a dedicated memory cell. The memory cell has three gates: input (i_t), output (o_t) and forget (f_t). Inclusion of these gates allows for safeguarding the information that passes through the network, either by stopping or allowing the flow of information through the cell. The gates can be represented by Equation 2.26, with activation function (σ), weights for each respective gate's neurons (ω_x), the LSTM block's output at the previous time step (h_{t−1}), input at the current time step (x_t) and respective gate bias (b_x), as

i_t = σ(ω_i [h_{t−1}, x_t] + b_i)
o_t = σ(ω_o [h_{t−1}, x_t] + b_o)
f_t = σ(ω_f [h_{t−1}, x_t] + b_f)          (2.26)

The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate determines how much the next layer gets to know about the state of the memory cell, and the forget gate allows for complete dismissal of information. The need to forget information could seem odd at first, but for sequencing it could be of value when learning something like a book: when a new chapter begins, it could be necessary to forget some of the characters from the previous chapter [43].

                                                  Gated Recurrent Units (GRU)

GRUs are similar to LSTMs but have one less gate, and the remaining two gates act slightly differently. The first gate of the GRU is the update gate, which dictates how much information is kept from the past state and how much information is let in from the previous layer. The second gate is a reset gate, which does a similar job to the forget gate in an LSTM. Unlike LSTMs, GRU cells always output their full state, meaning there is no limitation on the information being fed through the network. Using GRU instead of LSTM has achieved faster convergence in both parameter updating and generalization, as well as convergence in CPU time, on some sequencing datasets for polyphonic music data and raw speech [44].


                                                  Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allows the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and removes noise by reducing the dimensionality of the data. With average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing a lot better than average pooling [45].
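As a minimal illustration of the two pooling variants described above (the values are made up and not taken from the thesis data), a pool size of 2 reduces a 1-D convolved feature as follows:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 9.0, 0.0, 4.0])   # a toy 1-D convolved feature
pool = 2

# Group the feature into non-overlapping windows of size 2 and reduce each window.
max_pooled = x.reshape(-1, pool).max(axis=1)    # array([3. , 9. , 4. ])
avg_pooled = x.reshape(-1, pool).mean(axis=1)   # array([2. , 5.5, 2. ])
```

Note how the isolated low value 0.0 disappears entirely under max pooling but still pulls the average down under average pooling, which is the noise-suppressing behaviour referred to above.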


Figure 2.5: A max pooling layer with pool size 2 pooling an input.

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, however, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                                  Chapter 3

                                                  Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service, Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is considered as 3 when the change in differential pressure experiences exponential increase and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
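A minimal sketch of such a labelling script is shown below, assuming a Python/NumPy implementation. The function name, the window length and the slope/flow thresholds are illustrative placeholders only; they are not the criteria used by Alfa Laval or stated in the thesis.

```python
import numpy as np

def label_clogging(dp, flow, window=12, slopes=(0.001, 0.01)):
    """Assign a clogging label (1, 2 or 3) per sample from the differential
    pressure (dp) and system flow rate (flow) time series.
    Thresholds are hypothetical and only illustrate the three rules above."""
    dp = np.asarray(dp, dtype=float)
    flow = np.asarray(flow, dtype=float)
    labels = np.ones(len(dp), dtype=int)              # label 1: no clogging
    low, high = slopes
    baseline_flow = np.median(flow[:window])
    for i in range(window, len(dp)):
        # local trend of the differential pressure over the last `window` samples
        slope = np.polyfit(np.arange(window), dp[i - window:i], 1)[0]
        if slope > high and flow[i] < 0.9 * baseline_flow:
            labels[i] = 3                             # rapid dp increase, flow drops drastically
        elif slope > low and dp[i] > dp[0]:
            labels[i] = 2                             # steady dp increase above the start value
    return labels
```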


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters. A summary containing


the number of data points and respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I         685            685                        0
II        220             25                      195
III       340             35                      305
IV        210             11                      199
V         375             32                      343
VI        355              7                      348
VII       360             78                      282
VIII      345             19                      326
IX        350             10                      340
X         335             67                      268
XI        340             43                      297

Total    3915           1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter, as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks for evaluating multivariate time series, two network models were used: the LSTM and the CNN. The LSTM to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can force a model to learn large weights, resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data is within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

[1, 2, 3] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   or   [red, blue, green] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted without bias, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.
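A minimal NumPy sketch of this transform is given below; the thesis does not state which encoder implementation was used, so this is only an illustration of the mapping shown above.

```python
import numpy as np

labels = np.array([1, 2, 3, 2, 1])              # clogging labels as integers

categories = np.unique(labels)                  # array([1, 2, 3])
one_hot = (labels[:, None] == categories).astype(int)
# one_hot:
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
```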

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1, by applying the following transform to every feature:

(x_i − min(x)) / (max(x) − min(x))          (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
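A small sketch of Equation 3.1 and its inverse, assuming a NumPy implementation (an equivalent off-the-shelf option would be a min-max scaler from a library such as scikit-learn; the thesis does not name the implementation used):

```python
import numpy as np

def minmax_fit(x):
    """Per-feature min and max of a (samples, features) array, as in Equation 3.1."""
    return x.min(axis=0), x.max(axis=0)

def minmax_transform(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def minmax_inverse(x_scaled, x_min, x_max):
    # The transform is easy to invert, so predictions can be mapped back
    # to the original physical units after processing.
    return x_scaled * (x_max - x_min) + x_min
```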

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction; in this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5 second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The effect of the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = [V_1(t), V_2(t), ..., V_{n−1}(t), V_n(t)]                  (3.2)

X(t) = [V_1(t−5), V_2(t−5), ..., V_{n−1}(t), V_n(t)]              (3.3)
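A minimal sketch of such a sequencing function, assuming a Python/NumPy implementation (the function name and signature are illustrative, not taken from the thesis):

```python
import numpy as np

def make_sequences(data, n_past=5):
    """Pair `n_past` past samples with the sample one step ahead.
    `data` has shape (samples, features); the returned X has shape
    (samples - n_past, n_past, features) and y has shape
    (samples - n_past, features)."""
    X, y = [], []
    for i in range(n_past, len(data)):
        X.append(data[i - n_past:i])   # the 25-second window of past measurements
        y.append(data[i])              # the measurement 5 seconds ahead
    return np.array(X), np.array(y)
```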

                                                  27

                                                  CHAPTER 3 EXPERIMENTAL DEVELOPMENT

When sequenced, the data is split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network is compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training in such a way ensures that the network is not overfitted to the training data.
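A sketch of how this architecture and early-stopping scheme could look, assuming a Keras/TensorFlow implementation (the framework and optimiser are assumptions, not stated in this section of the thesis; `X_train`, `y_train`, `X_val`, `y_val` are the sequenced arrays from the previous step):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# X_train, y_train, X_val, y_val: sequenced and scaled arrays (assumed available)
n_past, n_features = 5, X_train.shape[2]

model = Sequential([
    LSTM(32, activation='relu', return_sequences=True,
         input_shape=(n_past, n_features)),
    LSTM(32, activation='relu'),
    Dense(1, activation='sigmoid'),     # one output neuron for the predicted parameter
])
model.compile(optimizer='adam', loss='mae')   # 'mse' is the alternative loss discussed later

early_stop = EarlyStopping(monitor='val_loss', patience=150)
model.fit(X_train, y_train, epochs=1500,
          validation_data=(X_val, y_val), callbacks=[early_stop])
```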

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction, as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data for the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.
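A sketch of the SSF, under the same Python/NumPy assumption as before (the name `split_sequences` and its signature are illustrative):

```python
import numpy as np

def split_sequences(data, n_past=12, n_future=6):
    """Sketch of the SSF: 12 past observations (60 s) as input and the
    next 6 observations (30 s) as output. `data` has shape (samples, features)."""
    X, y = [], []
    for i in range(n_past, len(data) - n_future + 1):
        X.append(data[i - n_past:i])
        y.append(data[i:i + n_future])
    return np.array(X), np.array(y)
```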

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
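Under the same Keras/TensorFlow assumption as for the LSTM sketch, the described CNN could look roughly as follows. Here `X_train`/`y_train` are taken to hold the 12-step input windows and the 6 future values of a single parameter from the SSF; the hidden-layer activations and the optimiser are assumptions, not quoted from the thesis.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# X_train, y_train, X_val, y_val: windows from the SSF (assumed available)
n_past, n_future, n_features = 12, 6, X_train.shape[2]

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu',
           input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dense(n_future),                    # one prediction per future time step
])
model.compile(optimizer='adam', loss='mae')

early_stop = EarlyStopping(monitor='val_loss', patience=150)
model.fit(X_train, y_train, epochs=1500,
          validation_data=(X_val, y_val), callbacks=[early_stop])
```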

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimized according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating a classification problem than they would be for a regression problem, and vice versa. This led to different loss functions being used when training the network for predicting future clogging labels than when training the network to predict future values of system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
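For reference, the standard definitions of the two regression losses over n samples, with y_i the true value and ŷ_i the prediction, are (these are the textbook forms, not quoted from the thesis):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```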

For the clogging labels, the network used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore, the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
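The standard form of the binary cross-entropy over n samples, with y_i the true label encoded as 0/1 and p̂_i the predicted probability, is (again the textbook definition rather than a quotation from the thesis):

```latex
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}
\Bigl[\, y_i \log \hat{p}_i + \left(1 - y_i\right)\log\bigl(1 - \hat{p}_i\bigr) \Bigr]
```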


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                                  Chapter 4

                                                  Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs    MSE      RMSE     R²       MAE
MAE             738            0.001    0.029    0.981    0.016
MSE             665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                          Prediction
                     Label 1    Label 2
Actual   Label 1        109          1
         Label 2          3        669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs    MSE      RMSE     R²       MAE
MAE             756            0.007    0.086    0.876    0.025
MSE             458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs    Accuracy    AUC      F1       log-loss
MAE                  1203           91.4%       0.826    0.907    3.01
MSE                  1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                          Prediction
                     Label 1    Label 2
Actual   Label 1         82         29
         Label 2         38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                          Prediction
                     Label 1    Label 2
Actual   Label 1         69         41
         Label 2         11        659


                                                  Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network,


while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall better score on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted


is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                  Chapter 6

                                                  Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done at roughly the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when the data contain all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                  Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855-1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18-21, 2018.

[4] OF Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14-17, 2013.

[5] OF Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395-412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32-34, 2007.

[9] Wikipedia. Darcy's law. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny-Carman equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623-630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381-386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153-158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87-90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812-820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445-453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score. Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss. Deep Learning Course Wiki, fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45-79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247-1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669-679, 07 2016.

[29] Wikipedia. Coefficient of determination. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498-505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654-2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107-116, 04 1998.

[43] Wikipedia. Long short-term memory. Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92-101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122-129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7-9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

                                                    CHAPTER 2 FRAME OF REFERENCE

                                                    Convolutional Neural Networks (CNN)

The CNN structure was primarily inspired by the architecture and connectivity pattern of the animal visual cortex. The visual cortex contains different sets of cells and neurons that respond differently and individually to stimuli in the visual field. Likewise, the architecture of the CNN is designed to contain a special set of layers that allow the network to process restricted but overlapping parts of the data in such a way that the entirety of the data is still processed. This specifically allows the network to process huge amounts of data by reducing the spatial size of the dataset without losing the features in the data.

The input data are initially fed through a convolutional layer to perform the convolution operation. The convolution operation employs a kernel, which is a function that acts as a filter. The kernel slides across the input data while continuously applying the filter to the data, effectively reducing the input data. The input data that have been processed by a convolution layer are known as a convolved feature.

Figure 2.4: A kernel of size 3 sliding over a 1-dimensional convolutional layer.
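As a concrete illustration of the sliding operation in Figure 2.4, the following minimal NumPy sketch applies a size-3 kernel to a short 1-dimensional input. The input values and kernel weights are made up for illustration; strictly speaking the layer computes a cross-correlation, which is what neural network libraries call convolution.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])   # 1-D input sequence (made up)
w = np.array([0.5, 1.0, -0.5])                 # kernel (filter) of size 3

# Slide the kernel over every position where it fully overlaps the input,
# producing a shorter convolved feature.
convolved = np.array([np.dot(x[i:i + 3], w) for i in range(len(x) - 2)])
print(convolved)                               # [2.5 1.  4.  3.5]
```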

Following a convolutional layer is typically a pooling layer. The pooling layer is similar to the convolution layer in that it further reduces the size of the data by reducing the spatial size of the convolved feature. In doing so, the computational power required to process the data can be reduced, as the dimensionality of the data decreases. Pooling can be done in two ways: max pooling or average pooling. If a max pooling layer is applied, the maximum value contained within the kernel is the returned value, whereas for an average pooling layer the average of all values within the kernel is returned. The nature of max pooling allows it to act as a noise suppressant, as it ignores noisy activations by only extracting the maximum value and removes noise by reducing the dimensionality of the data. For average pooling the noise remains within the data, although it is somewhat reduced through the dimensionality reduction. The removal of noise in the data results in max pooling performing considerably better than average pooling [45].


Figure 2.5: A max pooling layer with pool size 2 pooling an input.
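The pooling step in Figure 2.5 can be sketched the same way; the snippet below applies max pooling and average pooling with pool size 2 to a short convolved feature. The numbers are illustrative only.

```python
import numpy as np

feature = np.array([2.5, 1.0, 4.0, 3.5, 0.5, 6.0])   # a convolved feature (made up)
windows = feature.reshape(-1, 2)                      # non-overlapping windows of size 2

print(windows.max(axis=1))    # max pooling:     [2.5 4.  6. ]
print(windows.mean(axis=1))   # average pooling: [1.75 3.75 3.25]
```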

The procedure of passing the data through a convolutional layer and a pooling layer can be repeated to reduce the dimensionality of the data even further. Lastly, before feeding the data into the neurons of the neural network, the pooled feature map is flattened in the flattening layer. This results in a significantly reduced input array in comparison to the original data and is the primary reason that CNNs are used for multiple purposes such as image recognition and classification [46], natural language processing [47] and time series analysis [48].

Figure 2.6: A flattening layer flattening the feature map.


                                                    Chapter 3

                                                    Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjön over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle.


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data.

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow decreases drastically. No tests were conducted where a clogging label of 3 was identified.
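The labelling script itself is not reproduced in the thesis text, but the rules above could be sketched roughly as follows. The slope thresholds (SLOW_RISE, FAST_RISE, FLOW_DROP) and the helper name clogging_label are hypothetical placeholders, not values from the actual script.

```python
import numpy as np

# Hypothetical thresholds in sensor units per sample; the real script's
# criteria and values are not given in the thesis text.
SLOW_RISE = 0.001    # steady increase in differential pressure
FAST_RISE = 0.010    # exponential-like increase in differential pressure
FLOW_DROP = -1.0     # drastic decrease in system flow rate

def clogging_label(dp_window, flow_window):
    """Assign a clogging label from recent differential pressure and flow."""
    t = np.arange(len(dp_window))
    dp_slope = np.polyfit(t, dp_window, 1)[0]      # linear trend of the dp readings
    flow_slope = np.polyfit(t, flow_window, 1)[0]  # linear trend of the flow readings
    if dp_slope > FAST_RISE and flow_slope < FLOW_DROP:
        return 3     # clogged: rapidly rising dp, collapsing flow
    if dp_slope > SLOW_RISE:
        return 2     # beginning to clog: steadily rising dp
    return 1         # no clogging: dp roughly linear or below its start value

# Example with made-up readings sampled every 5 seconds
print(clogging_label([0.10, 0.10, 0.10, 0.10], [50, 50, 49, 50]))   # -> 1 (no clogging)
```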


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot).

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot).

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as the data cannot be entirely separated into two clusters. A summary containing the amount of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test    Samples   Points labelled clog-1   Points labelled clog-2
I       685       685                      0
II      220       25                       195
III     340       35                       305
IV      210       11                       199
V       375       32                       343
VI      355       7                        348
VII     360       78                       282
VIII    345       19                       326
IX      350       10                       340
X       335       67                       268
XI      340       43                       297
Total   3915      1012                     2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, and thus in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one hot encoding. One hot encoding takes categorical variables, removes them and generates a binary representation of the variables. The encoding can be done both for integers and for tags, such as

\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so that each new column corresponds to a different value of the initial variable. One hot encoding ensures that each category is treated and predicted indifferently, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. Seger [49] has shown the precision of one hot encoding to be equal to that of other equally simple encoding techniques. Potdar et al. [50] show that one hot encoding achieves higher accuracy than simpler encoding techniques, but also that there are more sophisticated options available that achieve even higher accuracy.
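A minimal sketch of what the one hot encoding step can look like in practice is given below, here with pandas; the thesis does not state which library was used, so this is an assumption rather than the actual pre-processing code.

```python
import pandas as pd

labels = pd.Series([1, 2, 2, 1, 2], name="clogging_label")        # made-up labels
onehot = pd.get_dummies(labels, prefix="label").astype(int)       # one binary column per category
print(onehot)
#    label_1  label_2
# 0        1        0
# 1        0        1
# ...
```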

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it lies between 0 and 1 by applying the following transform to every feature:

\frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (3.1)

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
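In practice the transform in Equation 3.1 is typically applied with a ready-made scaler. The sketch below uses scikit-learn's MinMaxScaler as an assumed implementation; the example values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.10, 50.0],
                 [0.25, 48.0],
                 [0.80, 35.0]])                   # e.g. differential pressure, flow

scaler = MinMaxScaler()                           # maps each feature to the range [0, 1]
scaled = scaler.fit_transform(data)
restored = scaler.inverse_transform(scaled)       # the transform is easy to invert
print(scaled)
print(np.allclose(restored, data))                # True
```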

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that corresponds to one value is expanded accordingly with the time window. The effect of the expansion of the features is described by Equation 3.2 and Equation 3.3, and a sketch of such a sequencing function is given after the equations. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

X(t) = \left[ V_1(t),\; V_2(t),\; \dots,\; V_{n-1}(t),\; V_n(t) \right] \qquad (3.2)

X(t) = \left[ V_1(t-5),\; V_2(t-5),\; \dots,\; V_{n-1}(t),\; V_n(t) \right] \qquad (3.3)
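A rough sketch of such a sequencing function is shown below; the function name and the use of NumPy are assumptions, but the windowing logic follows the description above: the previous 5 observations of every feature form the network input, and the following observation is the target.

```python
import numpy as np

def make_sequences(data, window=5):
    """data: array of shape (samples, features) -> (X, y) for the LSTM."""
    X, y = [], []
    for t in range(window, len(data)):
        X.append(data[t - window:t])      # features from t-5 ... t-1
        y.append(data[t])                 # values to predict at time t
    return np.array(X), np.array(y)

data = np.random.rand(100, 4)             # 4 sensor variables, 100 samples (placeholder)
X, y = make_sequences(data, window=5)
print(X.shape, y.shape)                   # (95, 5, 4) (95, 4) - the dataset shrinks by the window size
```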


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples - the number of data points

• Time steps - the points of observation of the samples

• Features - the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function that initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights towards achieving a better output. Each LSTM layer contains 32 neurons and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture.

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
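Expressed in Keras-style code, the described LSTM could be sketched as below. The layer sizes, activations, epoch limit and early-stopping patience follow the text, while the optimizer, batch size and placeholder data are assumptions rather than the thesis implementation.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps, n_features = 5, 4                        # 25 s window over 4 sensor variables
X = np.random.rand(500, n_steps, n_features)      # placeholder sequenced input
y = np.random.rand(500, 1)                        # placeholder scaled target

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),      # first 32-neuron LSTM layer
    LSTM(32, activation="relu"),                  # second 32-neuron LSTM layer
    Dense(1, activation="sigmoid"),               # single output neuron
])
model.compile(optimizer="adam", loss="mae")       # the MSE loss was also evaluated

model.fit(X, y, validation_split=0.2, epochs=1500,
          callbacks=[EarlyStopping(monitor="val_loss", patience=150)],
          verbose=0)
```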

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6 observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument deciding the number of filters (kernels) to pass over the input data. In this case 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture.

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs.
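A corresponding Keras-style sketch of the described CNN is given below. The filter count, kernel size, pool size, dense layer sizes and training limits follow the text; the activation functions, optimizer and placeholder data are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_steps_in, n_steps_out, n_features = 12, 6, 4    # 60 s of input, 30 s of predictions
X = np.random.rand(500, n_steps_in, n_features)   # placeholder input windows
y = np.random.rand(500, n_steps_out)              # placeholder future values

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_steps_in, n_features)), # 64 filters, kernel size 4
    MaxPooling1D(pool_size=2),                    # pool size 2
    Flatten(),                                    # flattening layer
    Dense(50, activation="relu"),                 # fully connected, 50 nodes
    Dense(n_steps_out),                           # 6 predictions, up to 30 s ahead
])
model.compile(optimizer="adam", loss="mae")       # the MSE loss was also evaluated

model.fit(X, y, validation_split=0.2, epochs=1500,
          callbacks=[EarlyStopping(monitor="val_loss", patience=150)],
          verbose=0)
```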

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided directly in the network which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and 20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capability of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited for evaluating the networks as classifiers than they are for a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE would indicate that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good MAE score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers, while MSE is more sensitive to outliers.
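The difference in outlier sensitivity can be seen with a small numerical example: one large error inflates the MSE far more than the MAE. The numbers are made up for illustration.

```python
import numpy as np

actual    = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
predicted = np.array([1.1, 0.9, 1.0, 1.1, 3.0])    # the last prediction is an outlier

mae = np.mean(np.abs(actual - predicted))          # 0.46
mse = np.mean((actual - predicted) ** 2)           # 0.806, dominated by the single outlier
print(mae, mse)
```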

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot).
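For reference, the binary cross-entropy that the classification networks minimise can be computed as in the short sketch below, with labels 1 and 2 mapped to 0 and 1; the probabilities are illustrative only.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])               # labels 1/2 mapped to 0/1
y_prob = np.array([0.1, 0.9, 0.8, 0.3, 0.6])     # predicted probability of class "2"

eps = 1e-12                                      # avoid log(0)
bce = -np.mean(y_true * np.log(y_prob + eps)
               + (1 - y_true) * np.log(1 - y_prob + eps))
print(bce)                                       # about 0.26 for these numbers
```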

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. Together, the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM).


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                                    Chapter 4

                                                    Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM.


Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                     Prediction
                     Label 1   Label 2
Actual   Label 1     109       1
         Label 2     3         669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from the MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     82        29
         Label 2     38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                     Label 1   Label 2
Actual   Label 1     69        41
         Label 2     11        659


                                                    Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unlikely, as the regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification accuracy of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
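To make the one-step-ahead nature of the classifier concrete, a minimal sketch of the pairing between a 5-step sensor window and the clogging label at time t is given below (variable and function names are illustrative, not the thesis pre-processing code):

import numpy as np

def window_labels(values, labels, window=5):
    # Pair the sensor values from t-5 .. t-1 with the clogging label at time t.
    X, y = [], []
    for t in range(window, len(values)):
        X.append(values[t - window:t])   # shape (window, n_features)
        y.append(labels[t])              # the very next clogging label
    return np.array(X), np.array(y)

# 4 sensor channels sampled every 5 s -> a 25 s look-back per classified label
sensors = np.random.rand(200, 4)
clog = np.random.randint(1, 3, size=200)
X, y = window_labels(sensors, clog)
print(X.shape, y.shape)   # (195, 5, 4) (195,)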

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower error on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as the loss function, or that the data become more normally distributed through the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side-effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.
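The effect can be reproduced with scikit-learn on a small made-up label set of a similar shape to Tables 4.6 and 4.7 (the counts and pseudo-probabilities below are illustrative, not the thesis data):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

# Imbalanced two-class set: Label 2 is the clear majority, as in the thesis data.
y_true = np.array([1] * 110 + [2] * 670)
y_pred = np.array([1] * 82 + [2] * 28 + [1] * 38 + [2] * 632)
y_prob = np.where(y_pred == 2, 0.9, 0.1)   # pseudo-probabilities for class 2

print(confusion_matrix(y_true, y_pred))                  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))                    # high, driven by the majority class
print(f1_score(y_true, y_pred, pos_label=2))             # also high for the majority class
print(roc_auc_score((y_true == 2).astype(int), y_prob))  # noticeably lower than the F1-score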

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting to one particular class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
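As a minimal sketch of what those two strategies amount to (illustrative code with assumed names, not the thesis implementation), the same windowing idea can produce either a single-step target, as used for the LSTM, or a multi-step target, as used for the CNN:

import numpy as np

def split_sequences(data, n_in, n_out):
    # n_in past samples as input, n_out future samples as the prediction target.
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i:i + n_in])
        y.append(data[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

series = np.random.rand(500, 5)   # e.g. 4 sensor channels + clogging label, 5 s sampling
X_lstm, y_lstm = split_sequences(series, n_in=5, n_out=1)   # 25 s history -> next 5 s
X_cnn, y_cnn = split_sequences(series, n_in=12, n_out=6)    # 60 s history -> next 30 s
print(X_lstm.shape, y_lstm.shape)   # (495, 5, 5) (495, 1, 5)
print(X_cnn.shape, y_cnn.shape)     # (483, 12, 5) (483, 6, 5)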


                                                    Chapter 6

                                                    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data covering all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN given data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.
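As an example of what such a baseline could look like (a sketch with synthetic data and an arbitrary model order, not a tested configuration), statsmodels can fit an ARIMA model to a differential-pressure-like series and forecast the next 30 seconds:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic, slowly rising differential-pressure-like series (5 s sampling interval).
dp = np.cumsum(np.random.normal(0.01, 0.05, size=400))

# The order (p, d, q) is a placeholder and would need tuning on real data.
fit = ARIMA(dp, order=(2, 1, 1)).fit()
print(fit.forecast(steps=6))   # point forecasts for the next 6 samples, i.e. 30 s ahead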

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                                                    Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                                                      Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

                                                      Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

                                                      38

                                                      42 CNN PERFORMANCE

                                                      Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

                                                      Table 45 Evaluation metrics for the CNN during classification analysis

                                                      Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                    Predicted Label 1    Predicted Label 2
Actual Label 1      82                   29
Actual Label 2      38                   631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                    Predicted Label 1    Predicted Label 2
Actual Label 1      69                   41
Actual Label 2      11                   659


                                                      Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely, as the regression model is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate.


The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t−5 to time t−1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would therefore prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with overall lower scores on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is detrimental for a network using MAE as its loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations.


A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. Instead, it's the contrary: the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even


up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. However, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS are required to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                      Chapter 6

                                                      Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN by having data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of such older statistical models are better known than they are for ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                      Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P. J. G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se


                                                        Chapter 3

                                                        Experimental Development

This chapter aims to present the pre-processing of the data and how the experimental platform was set up.

3.1 Data Gathering and Processing

The filtration data were obtained from filter tests done by Alfa Laval at the lake Malmasjon over a span of 2 weeks. A total of 11 test cycles were recorded. Data were gathered for the duration of a complete test cycle of the filter, which lasted for a runtime of at least 40 minutes. Each data point was sampled every 5 seconds and contains sensor data for the differential pressure over the filter, the fluid flow in the entire system, the fluid pressure in the entire system, and the fluid flow in the backflush mechanism. All data were then stored in Alfa Laval's cloud service Connectivity.

Figure 3.1: A complete test cycle


During the filter testing and the gathering of the data, the backflush was manually interrupted after a runtime of circa 20 minutes, causing an increase in both differential pressure and system fluid flow, as can be seen in Figure 3.1. This was done primarily to see how long the filter would cope with the dirtiness of the water during operation without its self-cleaning capabilities. After discussions with Alfa Laval about how to label such an external interference, it was decided to remove the sections of the data containing the stopping of the backflush, in order to get complete test cycles. At this point the data are unlabelled in terms of clogging, and the clogging labelling is done by visual inspection of the differential pressure and the system fluid flow.

Figure 3.2: A test cycle with the backflush stop cut from the data

The data are labelled through a programming script in accordance with the three clogging states discussed in Section 2.1.4 and verified by visual inspection. Labelling the current clogging state as 1 implies that the differential pressure remains linear or that it is yet to pass its initial starting value. The clogging label is changed to 2 when the differential pressure begins to increase steadily and the system flow either remains constant or experiences minor receding effects. The label is set to 3 when the change in differential pressure increases exponentially and the system flow is decreasing drastically. No tests were conducted where a clogging label of 3 was identified.
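The labelling script itself is not included in the thesis, so the following Python sketch only illustrates how such rule-based labelling could be expressed; the function name, slope thresholds and window length are hypothetical.

import numpy as np

def label_clogging(diff_pressure, system_flow, window=12,
                   slope_clog=0.001, slope_severe=0.01):
    """Assign clogging labels 1-3 from trends in differential pressure and flow."""
    labels = np.ones(len(diff_pressure), dtype=int)   # label 1: no clogging
    for t in range(window, len(diff_pressure)):
        # Local slopes over the last `window` samples (one sample every 5 seconds)
        dp_slope = np.polyfit(np.arange(window), diff_pressure[t - window:t], 1)[0]
        q_slope = np.polyfit(np.arange(window), system_flow[t - window:t], 1)[0]
        if dp_slope > slope_severe and q_slope < 0:
            labels[t] = 3    # rapidly rising pressure, falling flow: severe clogging
        elif dp_slope > slope_clog:
            labels[t] = 2    # steadily rising pressure: beginning to clog
    return labels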


Figure 3.3: Test data from one test, labelled for no clogging (asterisk) and beginning to clog (dot)

Figure 3.3 shows the clogging labels and the corresponding differential pressure, system flow rate, and existing system pressure for one test cycle. A complete overview of all labelled points in the data set can be seen in Figure 3.4.

Figure 3.4: Test data from all tests, labelled for no clogging (asterisk) and beginning to clog (dot)

As is observable, a majority of the unclogged data is clustered around low differential pressure, while beginning clogging is more frequently found at higher differential pressure. Furthermore, as the two groups of labels have overlapping data points, it can be noted that a linear classifier is not enough to distinguish the true label of the data, as it cannot be entirely separated into two clusters.


A summary containing the number of data points and the respective clogging labels for each test cycle can be found in Table 3.1.

Table 3.1: Amount of data available after preprocessing

Test     Samples    Points labelled clog-1    Points labelled clog-2
I        685        685                       0
II       220        25                        195
III      340        35                        305
IV       210        11                        199
V        375        32                        343
VI       355        7                         348
VII      360        78                        282
VIII     345        19                        326
IX       350        10                        340
X        335        67                        268
XI       340        43                        297
Total    3915       1012                      2903

When preprocessing was finished, the entire dataset contained 3915 samples, with 1012 samples labelled as clogging label 1 and 2903 samples labelled as clogging label 2.

3.2 Model Generation

In order to obtain the data required for predicting clogging, the pre-processed data were put through two neural networks to be evaluated as a regression problem. The regression analysis allows for gathering and preparing a set of predicted values of each parameter as well as the corresponding clogging label. From the concept generation phase and the current use of neural networks to evaluate multivariate time series, two network models were used: the LSTM and the CNN. The LSTM was used to initially test the suitability of the data for time series forecasting, and the CNN for multi-step time series forecasting.

Before the regression analysis could begin, the pre-processed data had to be processed further to increase network accuracy and time efficiency. Because large values in the input data can result in a model forced to learn large weights, thus resulting in an unstable model, a label transform and a scaler transform are applied to the input data. The purpose of the encoder transform is to retain the difference between the determined clogging labels, and the scaler transform ensures that the data are within an appropriate scale range.

The label transform applied is known as one-hot encoding. One-hot encoding takes categorical variables, removes them, and generates a binary representation of the variables. The encoding can be done for both integers and tags, such as

\[
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
\rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\qquad \text{or} \qquad
\begin{bmatrix} \text{red} \\ \text{blue} \\ \text{green} \end{bmatrix}
\rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

so that each new column corresponds to a different value of the initial variable. One-hot encoding ensures that each category is treated and predicted without preference, without assuming that one category is more important, because we want to predict all the actual classification labels equally rather than prioritise a certain category. The precision of one-hot encoding in comparison to other equally simple encoding techniques has been shown by Seger [49] to be equal. Potdar et al. [50] show that one-hot encoding achieves sufficiently higher accuracy than simple encoding techniques, but that there are also more sophisticated options available that achieve higher accuracy.

The scaler transform used is the min-max scaler. The min-max scaler shrinks the range of the dataset so that it is between 0 and 1 by applying the following transform to every feature:

\[
\frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{3.1}
\]

Using the min-max scaler to normalise the data is useful because it helps to avoid the generation of large weights. The transform is also easy to invert, which makes it possible to revert back to the original values after processing.
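As an illustration, the two transforms can be sketched as follows; the thesis does not name the library used, so the choice of scikit-learn and the variable names are assumptions.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical sensor matrix: one row per 5-second sample, columns for
# differential pressure, system flow rate, system pressure and backflush flow.
X = np.array([[0.12, 250.0, 1.4, 3.1],
              [0.15, 248.0, 1.4, 3.0],
              [0.21, 245.0, 1.5, 2.9]])
labels = np.array([[1], [1], [2]])            # clogging labels

scaler = MinMaxScaler()                        # maps every feature into [0, 1]
X_scaled = scaler.fit_transform(X)

encoder = OneHotEncoder()                      # one binary column per label value
labels_onehot = encoder.fit_transform(labels).toarray()

# The min-max transform is easily inverted to recover the original values.
X_restored = scaler.inverse_transform(X_scaled)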

3.2.1 Regression Processing with the LSTM Model

Before the data are sent through the LSTM, each variable is processed by a sequencing function (SF). The SF decides the number of past values that should match a future prediction. In this case, the function dictates the size of the time window of previous measurements used to predict the measurement of one time step. The LSTM model uses 5 previous values per prediction, making the time window 25 seconds long and the prediction a 5-second foresight. Each categorical variable in the original dataset is considered a feature in the data. That means that by processing the data through the sequencing function, the set of features that correspond to one value is expanded accordingly with the time window. The effect of the expansion of the features can be described by Equation 3.2 and Equation 3.3. It should be noted that while the set of features per time step increases, the size of the dataset decreases proportionally to how many past time steps are used, as more measurements are required per time step.

\[
X(t) = \begin{bmatrix} V_1(t) & V_2(t) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.2}
\]

\[
X(t) = \begin{bmatrix} V_1(t-5) & V_2(t-5) & \cdots & V_{n-1}(t) & V_n(t) \end{bmatrix} \tag{3.3}
\]
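A minimal sketch of such a sequencing function is given below; the function name and implementation are assumptions, but it reproduces the 5-step window described above.

import numpy as np

def make_sequences(data, n_past=5):
    """data: array of shape (n_samples, n_features), sampled every 5 seconds.
    Returns X with shape (samples, n_past, n_features) and y with shape
    (samples, n_features): 5 past observations predict the next time step."""
    X, y = [], []
    for t in range(n_past, len(data)):
        X.append(data[t - n_past:t])   # the 25-second history window
        y.append(data[t])              # the observation 5 seconds ahead
    return np.array(X), np.array(y)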


When sequenced, the data are split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data. Both the training set and the validation set contain input data and the corresponding true output data. Once split, the data can finally be reshaped to be put through the LSTM. The reshaping ensures that the three dimensions of the data are defined by:

• Samples – the number of data points

• Time steps – the points of observation of the samples

• Features – the observed variables per time step

The network consists of two LSTM layers with the ReLU activation function, which initially process the input data before it is passed to the output layer with the sigmoid activation function. There, the data output by the network are compared to the true output data to adjust the weights for achieving a better output. Each LSTM layer contains 32 neurons, and the output layer contains 1 neuron for parameter prediction.

Figure 3.5: An overview of the LSTM network architecture

The training of the network is run for 1500 epochs, but with a forced early stop when the validation loss has not seen any improvement for 150 subsequent epochs. Limiting the training of the network in such a way ensures that the network is not overfitted to the training data.
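A Keras-style sketch of this regression network is shown below. The two 32-neuron LSTM layers with ReLU, the single sigmoid output neuron, the 1500-epoch limit and the 150-epoch early-stopping patience follow the description above; the choice of framework, optimizer and feature count are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features = 5, 5   # 5 time steps; 4 sensor variables plus the clogging label (assumed)

model = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(1, activation="sigmoid"),          # one neuron per predicted parameter
])
model.compile(optimizer="adam", loss="mae")  # or loss="mse" for the MSE variant

early_stop = EarlyStopping(monitor="val_loss", patience=150,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])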

3.2.2 Regression Processing with the CNN Model

As with the LSTM, the input data require some additional processing before they can be fed through the CNN. The dataset is fed through a sequence splitting function (SSF) that extracts samples from the dataset to give the data the correct dimensions. Just like for the LSTM, the dimensions are samples, time steps, and features. Specified in the SSF are the time window of past observations to be used for prediction as well as the number of observations to be predicted. The time window for past observations encompasses 12 observations and therefore uses observations from the past 60 seconds, whereas the time window for future predictions is set to 6


observations, giving the predicted clogging state and rate for the coming 30 seconds. The dataset is then split into training and validation sets of 80% and 20%, respectively, of the amount of original data. Like the data in the LSTM, the training set and the validation set contain input data as well as data for what is the correct output for that input.

The architecture of the network can be seen in Figure 3.6. The convolutional layer takes an argument to decide the number of filters/kernels to pass over the input data. In this case, 64 different filters are used and passed over the data with a kernel size of 4 time steps to generate the feature map. The feature map then passes through the max pooling layer with a pool size of 2, further reducing the map. The map is then flattened before it is passed through two fully connected layers, one with 50 nodes and the last one with as many nodes as the desired number of predictions, in this case 6.

Figure 3.6: An overview of the CNN architecture

Similarly to the LSTM, the CNN is set to be trained for 1500 epochs, but with a forced early stop when the validation loss hasn't seen any improvement for 150 subsequent epochs.
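The corresponding Keras-style sketch of the CNN is given below; the 64 filters with kernel size 4, the pool size of 2, the 50-node dense layer and the 6-value output follow the text, while the activation functions, optimizer and framework are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

n_past, n_features, n_future = 12, 5, 6   # 60 s of history, 30 s of predictions

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation="relu",
           input_shape=(n_past, n_features)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(n_future),                      # 6 future values of the predicted parameter
])
model.compile(optimizer="adam", loss="mae")   # or loss="mse"

early_stop = EarlyStopping(monitor="val_loss", patience=150)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1500, callbacks=[early_stop])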

3.2.3 Label Classification

With the data from the regression analysis, the label classification could be done. For classification with the LSTM, the same network structure was used as for regression, as it can be decided in the network directly which variable the network should predict and/or evaluate. The data were again split into a training set consisting of 80% of the data and a validation set consisting of 20% of the data.

For the CNN, two sets of data were extracted from the CNN networks used in the regression analysis. The data consisted of the true observations y and the predicted observations ŷ for each parameter in the original dataset. The true observations were used for training and validation when creating the network, and the predicted observations were used for evaluating the accuracy of the clogging label classification. Likewise, the training and validation data were split into parts of 80% and


20%, respectively. The testing set was split into the same fractions, but only the fraction of 20% was kept, to equally match the size of the validation set.

For classification, the input data and output data were adjusted so that the input data only contained the values of the variables and the output data only contained the clogging labels. The adjustment teaches the network that certain values of the variables correspond to a specific clogging label. The classification CNN was trained on the training data and validated on the validation set. The data from the testing set were then fed through the network and compared to the validation set.
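For the classification step, a sketch of the LSTM variant could look as follows, with the scaled variable values as input and the one-hot encoded clogging labels as targets; the layer sizes mirror the regression network and the remaining details are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_past, n_features, n_labels = 5, 4, 2    # 4 sensor variables, 2 clogging labels in the data

clf = Sequential([
    LSTM(32, activation="relu", return_sequences=True,
         input_shape=(n_past, n_features)),
    LSTM(32, activation="relu"),
    Dense(n_labels, activation="sigmoid"),  # one output per one-hot label column
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])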

3.3 Model evaluation

During training on both networks, the weights in the layers are tweaked and optimised according to a loss function. The loss function is selected to improve and evaluate the network's capabilities of achieving a high rate of classification or regression on certain variables. In essence, some loss functions are better suited to a classification problem than they would be to a regression problem, and vice versa. This led to different loss functions being used when training the networks to predict future clogging labels than when training the networks to predict future values of the system variables.

For the regression analysis, both MSE and MAE were used. When using MSE, large errors are penalised more heavily, as they come from outliers, and an overall low MSE indicates that the output is normally distributed based on the input data. MAE allows outliers to play a smaller role and produces a good score if the distribution is multimodal. For a multimodal distribution, a prediction at the mean of two modes would result in a bad score, as is generated by the MSE, while the MAE allows for predictions at each individual mode. To summarise, MAE is more robust to outliers while MSE is more sensitive to outliers.

For the clogging labels, the networks used a loss function minimising the binary cross-entropy (also known as log loss). As can be seen in Figure 3.7, identical values of the same variable can belong to different clogging labels. Therefore the loss function has to be able to deduce which clogging label a particular data point belongs to, which is something that binary cross-entropy is capable of.
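For reference, the loss functions referred to here have their standard definitions, with $y_i$ the true value, $\hat{y}_i$ the prediction and $n$ the number of samples:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert,
\]

\[
\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr].
\]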


Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after the filter mesh. The two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing. The backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are both controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)


The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimals. The flow indicator transmitter, which measures the system flow rate, submits values with a precision of whole integers, and the backflush flow meter does so with a precision of two decimals.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


                                                        Chapter 4

                                                        Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both loss functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM


Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              738            0.001    0.029    0.981    0.016
MSE              665            0.014    0.119    0.694    0.032

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix

                    Predicted Label 1    Predicted Label 2
Actual Label 1      109                  1
Actual Label 2      3                    669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both loss functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


                                                        Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                        Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function | # of epochs | MSE   | RMSE  | R2    | MAE
MAE           | 756         | 0.007 | 0.086 | 0.876 | 0.025
MSE           | 458         | 0.008 | 0.092 | 0.843 | 0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network | # of epochs | Accuracy | AUC   | F1    | log-loss
MAE                | 1203        | 91.4%    | 0.826 | 0.907 | 3.01
MSE                | 1195        | 93.3%    | 0.791 | 0.926 | 2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 |                82 |                29
Actual Label 2 |                38 |               631

Table 4.7: CNN confusion matrix for data from the MSE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 |                69 |                41
Actual Label 2 |                11 |               659


                                                        Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network while it appears to keep decreasing for the MAE network, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.
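
The early-stopping criterion described in Chapter 3, i.e. halting training when the validation loss has not improved for 150 consecutive epochs, is the guard against exactly this kind of overfitting. Below is a minimal sketch of that setup for the two-layer LSTM described in Chapter 3; the choice of Keras and the toy data are assumptions made for illustration, not taken from the thesis.

    import numpy as np
    from tensorflow import keras

    # Toy stand-in for the sequenced sensor data: 100 samples, 5 time steps,
    # 4 features (differential pressure, system pressure, system flow, backflush flow).
    X = np.random.rand(100, 5, 4).astype("float32")
    y = np.random.rand(100, 1).astype("float32")

    # Two LSTM layers with 32 neurons and ReLU, sigmoid output, as described in Chapter 3.
    model = keras.Sequential([
        keras.layers.LSTM(32, activation="relu", return_sequences=True, input_shape=(5, 4)),
        keras.layers.LSTM(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mae")

    # Stop when the validation loss has not improved for 150 consecutive epochs.
    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=150,
                                               restore_best_weights=True)
    model.fit(X, y, validation_split=0.2, epochs=1500, verbose=0, callbacks=[early_stop])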

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a harder time coping with the large change in value that occurs in the differential pressure data, which is not surprising, as a model trained on MSE is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. Given the right training data, the network could therefore be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.
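
As a concrete illustration of this outlier sensitivity, consider five hypothetical one-step prediction errors where a single large error represents the jump between two test cycles; the numbers are invented for illustration only.

    import numpy as np

    # Five hypothetical one-step prediction errors, one of which is an outlier.
    errors = np.array([0.01, 0.02, 0.01, 0.02, 0.50])

    mae = np.mean(np.abs(errors))   # 0.112: the outlier contributes ~89% of the sum
    mse = np.mean(errors ** 2)      # 0.0502: the outlier contributes ~99.6% of the sum
    print(mae, mse)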

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be observed more directly in the confusion matrix in Table 4.3: out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is one time step. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted far ahead of time, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as large as in the case of the LSTM, the r2-scores were 0.876 and 0.843 respectively, with the MAE network also scoring lower on all of the error metrics. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as its loss function, or that it is due to the data being more normally distributed after the convolutional computations. It should also be noted that the predictions do not follow the actual data to the same oscillatory extent as was observable for the LSTMs, which is possibly a side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained with the MAE loss function experiences less overshoot and undershoot than the CNN trained with the MSE loss function. Overshooting and undershooting could lead to erroneous estimates of the variable values and thus to improper clogging estimation; undershooting in particular could result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier gives a classification accuracy of 91.4% on the data from the CNN trained with MAE and 93.3% on the data from the CNN trained with MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of the other. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. Such an imbalance does not mean that the classifier is doing a bad job; on the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distribution.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, a slight discrepancy can be noted where the loss is increasing while the accuracy is also increasing, particularly for the validation data. Just as with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy: since the majority of the labels are of type 2, the model can achieve good accuracy by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it is not overfitting to one particular class, a more balanced dataset would be required.
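
To put the imbalance in numbers, the validation set in Table 4.3 contains 110 samples of label 1 and 672 of label 2, so a classifier that always answers label 2 already reaches roughly 86% accuracy. The sketch below illustrates this baseline and one possible remedy, class-weighted training; the use of scikit-learn and the weighting approach are illustrative assumptions, not what was done in the thesis.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Label distribution from the validation set in Table 4.3 (110 vs 672 samples).
    labels = np.array([1] * 110 + [2] * 672)

    # A model that always predicts the majority class already reaches ~86% accuracy,
    # which is why accuracy alone says little for this class distribution.
    majority_baseline = np.mean(labels == 2)
    print(f"majority-class accuracy: {majority_baseline:.3f}")

    # One common remedy is to weight the loss per class; the weights below could,
    # for example, be passed to a Keras fit() call after re-encoding labels as 0/1.
    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.array([1, 2]), y=labels)
    print(weights)   # larger weight for the under-represented label 1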

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be reduced by having dedicated datasets for label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is overall better for the LSTM than for the CNN. That is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target the following predictions are impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would improve safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capability of ML models and NNs to detect full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
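
As an illustration of what such pre-processing amounts to, the sketch below frames a multivariate series with a sliding window, using the window lengths described in Chapter 3 (5 past steps to 1 future step for the LSTM, 12 past steps to 6 future steps for the CNN). The helper function and toy data are illustrative, not the thesis implementation.

    import numpy as np

    def split_sequences(data, n_in, n_out):
        """Turn a multivariate series (rows = time steps, columns = sensor variables)
        into supervised samples of n_in past steps and n_out future steps."""
        X, y = [], []
        for i in range(len(data) - n_in - n_out + 1):
            X.append(data[i:i + n_in])
            y.append(data[i + n_in:i + n_in + n_out])
        return np.array(X), np.array(y)

    # Toy series with 4 sensor variables sampled every 5 seconds.
    series = np.random.rand(200, 4)

    # LSTM-style framing: 5 past steps (25 s) -> 1 step ahead (5 s).
    X_lstm, y_lstm = split_sequences(series, n_in=5, n_out=1)

    # CNN-style framing: 12 past steps (60 s) -> 6 steps ahead (30 s).
    X_cnn, y_cnn = split_sequences(series, n_in=12, n_out=6)

    print(X_lstm.shape, y_lstm.shape)   # (195, 5, 4) (195, 1, 4)
    print(X_cnn.shape, y_cnn.shape)     # (183, 12, 4) (183, 6, 4)

With this framing, the number of features per sample grows with the window length while the number of usable samples shrinks, which matches the trade-off noted for the sequencing functions in Chapter 3.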


                                                        Chapter 6

                                                        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data covering all clogging states. Furthermore, as all tests were run at roughly the same system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, running tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see how model performance changes. As the presence of TSS greatly affects how quickly the filter clogs, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could predict multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN once data containing all clogging labels are available.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that their ins and outs are better understood than those of ML models.
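
Purely as an indication of what such a baseline could look like, the sketch below fits a SARIMA-type model to a made-up stand-in for the differential-pressure series using statsmodels; the model orders are placeholders and would have to be identified from the actual data (e.g. via ACF/PACF plots or an AIC search).

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Hypothetical stand-in for the differential-pressure series.
    dp = np.cumsum(np.random.rand(500) * 0.01)

    # Placeholder (p, d, q) and seasonal orders; no seasonality assumed here.
    model = SARIMAX(dp, order=(2, 1, 1), seasonal_order=(0, 0, 0, 0))
    fitted = model.fit(disp=False)

    # Forecast the next 6 steps (30 seconds at a 5-second sampling interval).
    print(fitted.forecast(steps=6))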

Lastly, if the models are to be used in the BWTS, the time criticality of classifying filter clogging before complete clogging occurs will have to be taken into consideration, both when deciding on the type of statistical model or type of network, the network architecture, and the amount of data to be processed at a time.


                                                        Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, 03, Kuala Lumpur, Malaysia, 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 2019, 14:45-79. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se

                                                          Chapter 4

                                                          Results

                                                          This chapter presents the results for all the models presented in the previous chapter

                                                          41 LSTM Performance

                                                          Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

                                                          Figure 41 MAE and MSE loss for the LSTM

                                                          33

                                                          CHAPTER 4 RESULTS

                                                          Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                          Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                          Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                          34

                                                          41 LSTM PERFORMANCE

                                                          Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                          Table 41 Evaluation metrics for the LSTM during regression analysis

                                                          Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

                                                          Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

                                                          Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                          35

                                                          CHAPTER 4 RESULTS

Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs | Accuracy | ROC   | F1    | log-loss
190         | 99.5 %   | 0.993 | 0.995 | 0.082

Table 4.3: LSTM confusion matrix

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 109               | 1
Actual Label 2 | 3                 | 669
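Analogously to the regression case, the classification metrics and the confusion matrix reported here have standard scikit-learn counterparts. The sketch below is only illustrative and assumes numpy arrays, with y_true holding the actual clogging labels encoded as 0/1 and y_prob the predicted probability of the second label; neither name comes from the thesis code.

    from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                                 log_loss, confusion_matrix)

    def classification_metrics(y_true, y_prob, threshold=0.5):
        # y_true: actual clogging labels (0/1), y_prob: predicted probability of label 2
        y_pred = (y_prob >= threshold).astype(int)
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "ROC AUC": roc_auc_score(y_true, y_prob),
            "F1": f1_score(y_true, y_pred),
            "log-loss": log_loss(y_true, y_prob),
            "confusion matrix": confusion_matrix(y_true, y_pred),
        }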

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function | # of epochs | MSE   | RMSE  | R²    | MAE
MAE           | 756         | 0.007 | 0.086 | 0.876 | 0.025
MSE           | 458         | 0.008 | 0.092 | 0.843 | 0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network | # of epochs | Accuracy | AUC   | F1    | log-loss
MAE                | 1203        | 91.4 %   | 0.826 | 0.907 | 3.01
MSE                | 1195        | 93.3 %   | 0.791 | 0.926 | 2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 82                | 29
Actual Label 2 | 38                | 631

Table 4.7: CNN confusion matrix for data from the MSE regression network

               | Predicted Label 1 | Predicted Label 2
Actual Label 1 | 69                | 41
Actual Label 2 | 11                | 659


                                                          Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected, as MSE-based regression is particularly sensitive to outliers.
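The outlier sensitivity referred to above can be illustrated with a small constructed example (the numbers below are synthetic and not taken from the test data): a single large residual, of the kind that appears at the transition between two test cycles, barely moves the MAE but inflates the MSE several times over.

    import numpy as np

    # 99 small residuals plus one large jump, mimicking a cycle transition
    residuals = np.array([0.02] * 99 + [0.50])

    mae = np.mean(np.abs(residuals))   # ~0.025, vs 0.020 without the jump
    mse = np.mean(residuals ** 2)      # ~0.0029, vs 0.0004 without the jump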

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, between the different test runs. The network could therefore, given the right training data, be capable of learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5 % fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictors of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
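A minimal sketch of the windowing described above is given below, assuming data is an array of shape (n_samples, n_features) with the sensor variables and labels holds the corresponding clogging labels; the function name and arguments are illustrative rather than the thesis implementation.

    import numpy as np

    def make_windows(data, labels, n_past=5):
        # Input: all variables from time t-5 to t-1; target: the clogging label at time t
        X, y = [], []
        for t in range(n_past, len(data)):
            X.append(data[t - n_past:t])
            y.append(labels[t])
        return np.array(X), np.array(y)   # X: (n, 5, n_features), y: (n,)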

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions impairs a network using MAE as loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.
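As a small clarification of what is plotted (a sketch under the assumption that the CNN output for one variable is an array y_pred of shape (n_samples, 6), where column k is the value forecast (k+1) time steps ahead): the plotted curves correspond to the last of the six forecast columns.

    # y_pred: multi-step CNN forecasts for one variable, shape (n_samples, 6)
    # column k is the prediction (k+1) steps, i.e. (k+1)*5 seconds, ahead
    furthest = y_pred[:, -1]   # the 30-second-ahead values shown in the plots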

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4 % on the data from the CNN using MAE and a classification accuracy of 93.3 % on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss increases while the accuracy also increases. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one certain class, a more balanced dataset would be required.
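The effect of the imbalance can be made concrete with the label counts behind Table 4.6 (111 samples of label 1 and 669 of label 2): a model that always predicts label 2 would already reach roughly 86 % accuracy without learning anything. The sketch below also shows one common mitigation, class weighting, which is not used in the thesis and is included only as an illustration.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    n1, n2 = 111, 669                      # label counts from the validation set
    majority_baseline = n2 / (n1 + n2)     # ~0.858 accuracy from always predicting label 2

    y = np.array([1] * n1 + [2] * n2)
    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.array([1, 2]), y=y)
    # weights ~ [3.51, 0.58]: errors on label 1 are penalised roughly six times harder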

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4 % and 93.3 % respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                          Chapter 6

                                                          Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN given data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of these methods is that the inner workings of older statistical models are better understood than those of ML models.
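As a starting point for such a comparison, a minimal statsmodels sketch is given below. The series dp stands in for the differential pressure signal and the (p, d, q) order is chosen purely for illustration; neither is a choice made in the thesis.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Placeholder series standing in for differential pressure sampled every 5 seconds
    dp = pd.Series(np.cumsum(np.random.normal(0.001, 0.01, size=500)))

    model = ARIMA(dp, order=(2, 1, 1))    # (p, d, q) chosen only for illustration
    fitted = model.fit()
    forecast = fitted.forecast(steps=6)   # a 30-second horizon, mirroring the CNN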

Lastly, if these models are to be used in the BWTS, the time criticality of the filter clogging classification needed to avoid complete clogging will have to be taken into consideration, both when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time.


                                                          Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


                                                            The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

                                                            The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

                                                            32

                                                            Chapter 4

                                                            Results

                                                            This chapter presents the results for all the models presented in the previous chapter

                                                            41 LSTM Performance

                                                            Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

                                                            Figure 41 MAE and MSE loss for the LSTM

                                                            33

                                                            CHAPTER 4 RESULTS

                                                            Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                            Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                            Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                            34

                                                            41 LSTM PERFORMANCE

                                                            Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                            Table 41 Evaluation metrics for the LSTM during regression analysis

                                                            Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

                                                            Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

                                                            Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                            35

                                                            CHAPTER 4 RESULTS

                                                            Table 42 Evaluation metrics for the LSTM during classification analysis

# of epochs    Accuracy    ROC      F1       log-loss
190            99.5%       0.993    0.995    0.082

Table 4.3: LSTM confusion matrix.

                    Prediction
                    Label 1    Label 2
Actual   Label 1        109          1
         Label 2          3        669
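
As a quick sanity check, the 99.5% accuracy in Table 4.2 follows directly from the counts in Table 4.3; the snippet below is purely illustrative.

# Confusion-matrix counts from Table 4.3 (rows are the actual labels).
correct_label1, missed_label1 = 109, 1
missed_label2, correct_label2 = 3, 669

total = correct_label1 + missed_label1 + missed_label2 + correct_label2   # 782 samples
accuracy = (correct_label1 + correct_label2) / total                      # (109 + 669) / 782
print(f"accuracy = {accuracy:.4f}")                                       # 0.9949, i.e. about 99.5%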

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis.

Loss function    # of epochs    MSE      RMSE     R²       MAE
MAE              756            0.007    0.086    0.876    0.025
MSE              458            0.008    0.092    0.843    0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from the MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                    Prediction
                    Label 1    Label 2
Actual   Label 1         82         29
         Label 2         38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                    Prediction
                    Label 1    Label 2
Actual   Label 1         69         41
         Label 2         11        659


                                                            Chapter 5

Discussion & Conclusion

This chapter discusses the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be seen from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network while it appears to keep decreasing for the MAE network, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although its validation loss did not decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a harder time coping with the large changes in value that occur in the differential pressure data, which is not unexpected, as the MSE loss is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite the notable differences in the starting and finishing values of the differential pressure, as well as of the system flow rate, between the different test runs. Given the right training data, the network could therefore be sufficient for learning and predicting the patterns of cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be observed more directly in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
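
To make the windowing explicit, a minimal sketch of this kind of sequencing step is shown below; the variable names and the dummy data are assumptions for illustration, not the thesis implementation.

import numpy as np

def make_windows(values, labels, window=5):
    # values: (n_samples, n_features) sensor readings, labels: (n_samples,) clogging labels.
    X, y = [], []
    for t in range(window, len(values)):
        X.append(values[t - window:t])   # variables from time t-5 up to t-1
        y.append(labels[t])              # clogging label at time t
    return np.array(X), np.array(y)

# Dummy example: 100 time steps with 4 sensor variables.
vals = np.random.rand(100, 4)
labs = np.random.randint(1, 3, size=100)
X, y = make_windows(vals, labs)
print(X.shape, y.shape)                  # (95, 5, 4) (95,)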

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the R²-scores were 0.876 and 0.843 respectively, with lower values on all of the other error metrics for the MAE network. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as its validation loss is increasing.
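
The thesis guards against this by stopping training after 150 epochs without validation-loss improvement; one common way to express such a guard, assuming a Keras/TensorFlow training loop (the framework and the options shown here are assumptions, not a description of the thesis code), is:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=150,                # tolerate 150 epochs without improvement
    restore_best_weights=True,   # optional: roll back to the best weights seen
)

# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=1500,
#           callbacks=[early_stop])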

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions is detrimental for a network using MAE as its loss function, or that it is due to the data becoming more normally distributed by the convolutional computations. It should also be noted that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly another side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimates of the variable values and thus improper clogging estimation. Furthermore, the undershooting could result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier gives a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. when there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. Such an imbalance does not, however, mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distribution.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced class distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one class, a more balanced dataset would be required.
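
To illustrate the point with the numbers already reported, the counts in Table 4.6 give a rough majority-class baseline: always predicting label 2 would be right for about 86% of the validation samples, which is why accuracy alone says little here.

# Counts taken from Table 4.6 (MAE regression data).
label1_total = 82 + 29            # actual label 1 samples
label2_total = 38 + 631           # actual label 2 samples

baseline = label2_total / (label1_total + label2_total)
print(f"majority-class baseline accuracy = {baseline:.3f}")   # about 0.858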

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target, the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as no samples were present in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                            Chapter 6

                                                            Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when data containing all clogging labels are available.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than they are for ML models.
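
As a pointer for such a comparison, a classical model could be fitted with statsmodels roughly as sketched below; the file name, column name and (p, d, q) order are placeholders and not taken from the thesis.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("bwts_log.csv")              # hypothetical export of the sensor data
series = df["differential_pressure"]          # hypothetical column name

model = ARIMA(series, order=(2, 1, 2))        # order chosen only for illustration
result = model.fit()
forecast = result.forecast(steps=6)           # 6 steps ahead, i.e. 30 s at the 5 s sampling interval
print(forecast)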

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                            Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss - Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019 (abs/1809.03006, 2018).

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Valerie Bourdes, Stephane Bonnevay, P.J.G. Lisboa, Remy Defrance, David Perol, Sylvie Chabaud, Thomas Bachelot, Therese Gargi, and Sylvie Negrier. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory - Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks - ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



                                                              The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

                                                              The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

                                                              32

                                                              Chapter 4

                                                              Results

                                                              This chapter presents the results for all the models presented in the previous chapter

                                                              41 LSTM Performance

                                                              Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

                                                              Figure 41 MAE and MSE loss for the LSTM

                                                              33

                                                              CHAPTER 4 RESULTS

                                                              Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                              Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                              Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                              34

                                                              41 LSTM PERFORMANCE

                                                              Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                              Table 41 Evaluation metrics for the LSTM during regression analysis

                                                              Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

                                                              Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

                                                              Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                              35

                                                              CHAPTER 4 RESULTS

                                                              Table 42 Evaluation metrics for the LSTM during classification analysis

                                                              of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

                                                              Table 43 LSTM confusion matrix

                                                              PredictionLabel 1 Label 2

                                                              Act

                                                              ual Label 1 109 1

                                                              Label 2 3 669

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN

Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function

Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1      82        29
         Label 2      38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1      69        41
         Label 2      11        659

                                                              Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network while the loss for the MAE network appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected, as that regression model is particularly sensitive to outliers.

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, between the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification accuracy of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
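
As a minimal illustration of that windowing (a hypothetical helper, not the thesis implementation), assuming the sampled variables and clogging labels are held in NumPy arrays with one row per 5-second time step:

    import numpy as np

    def label_windows(values, labels, n_past=5):
        # Pair the clogging label at time t with the variable values
        # from t-5 to t-1, mirroring the sequencing described above.
        X, y = [], []
        for t in range(n_past, len(values)):
            X.append(values[t - n_past:t])   # shape (n_past, n_features)
            y.append(labels[t])              # label for the very next time step
        return np.array(X), np.array(y)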

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with overall lower error scores on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy: as the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting to one particular class, a more balanced dataset would be required.
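
To make the imbalance concrete: with the class counts from Table 4.7, a classifier that always outputs Label 2 already reaches roughly 86% accuracy, which is why accuracy alone says little here. A sketch of that baseline and of one possible remedy, class weighting with scikit-learn (illustrative only; the thesis did not apply class weighting):

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Class counts from Table 4.7: 110 samples of Label 1 and 670 of Label 2
    y = np.array([1] * 110 + [2] * 670)

    majority_baseline = 670 / 780   # ~0.859 accuracy from always predicting Label 2

    # Weight the loss so that the minority class counts for more during training
    weights = compute_class_weight(class_weight="balanced", classes=np.array([1, 2]), y=y)
    # weights ~ [3.55, 0.58] for Label 1 and Label 2 respectively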

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be reduced by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification shows an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
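
As an outline of what such pre-processing can look like, the following is a sketch of the CNN-style sequence splitting described in Chapter 3, under the assumption that the sensor data is held in a NumPy array with one row per 5-second sample; the window sizes (12 past and 6 future observations) follow the thesis setup, but the helper itself is illustrative rather than the thesis code:

    import numpy as np

    def split_sequences(data, n_past=12, n_future=6):
        # data: array of shape (time_steps, n_features), one row every 5 seconds.
        # Returns inputs of shape (samples, n_past, n_features) and
        # targets of shape (samples, n_future, n_features).
        X, y = [], []
        for i in range(len(data) - n_past - n_future + 1):
            X.append(data[i:i + n_past])
            y.append(data[i + n_past:i + n_past + n_future])
        return np.array(X), np.array(y)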

                                                              Chapter 6

                                                              Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests with data including all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and CNN with data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
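
As a rough illustration of that alternative (hypothetical model order; the thesis did not evaluate this), a univariate ARIMA forecast of the differential pressure with statsmodels could look like the following:

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    def forecast_differential_pressure(diff_pressure: pd.Series, steps: int = 6):
        # diff_pressure: one value per 5-second sample; the order (p, d, q) is a guess
        model = ARIMA(diff_pressure, order=(2, 1, 1))
        fitted = model.fit()
        return fitted.forecast(steps=steps)   # next 30 seconds at 5 s per step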

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.

                                                              Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


                                                                The simulations and calculations were run on a regular commercial laptop with8192 MB of RAM and an i5-4210M CPU clocked at 260GHz

                                                                32

                                                                Chapter 4

                                                                Results

                                                                This chapter presents the results for all the models presented in the previous chapter

                                                                41 LSTM Performance

                                                                Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

                                                                Figure 41 MAE and MSE loss for the LSTM

                                                                33

                                                                CHAPTER 4 RESULTS

                                                                Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                34

                                                                41 LSTM PERFORMANCE

                                                                Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                Table 41 Evaluation metrics for the LSTM during regression analysis

                                                                Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

                                                                Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

                                                                Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                                35

                                                                CHAPTER 4 RESULTS

                                                                Table 42 Evaluation metrics for the LSTM during classification analysis

                                                                of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

                                                                Table 43 LSTM confusion matrix

                                                                PredictionLabel 1 Label 2

                                                                Act

                                                                ual Label 1 109 1

                                                                Label 2 3 669

                                                                42 CNN Performance

                                                                Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

                                                                Figure 47 MAE and MSE loss for the CNN

                                                                36

                                                                42 CNN PERFORMANCE

                                                                Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                37

                                                                CHAPTER 4 RESULTS

                                                                Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                Table 44 Evaluation metrics for the CNN during regression analysis

                                                                Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

                                                                Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

                                                                Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

                                                                38

                                                                42 CNN PERFORMANCE

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network    # of epochs    Accuracy    AUC      F1       log-loss
MAE                   1203           91.4%       0.826    0.907    3.01
MSE                   1195           93.3%       0.791    0.926    2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                     Prediction
                     Label 1    Label 2
Actual    Label 1    82         29
          Label 2    38         631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                     Prediction
                     Label 1    Label 2
Actual    Label 1    69         41
          Label 2    11         659


                                                                Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be seen to fluctuate for the MSE, while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which is not unexpected, as a model trained on the MSE loss is particularly sensitive to outliers.

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite notable differences in the starting and finishing values of the differential pressure, as well as of the system flow rate, between the different test runs. The network could therefore, given the right training data, be capable of learning and predicting the patterns for cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform poorly for data outside of the ordinary range, as such a model is better suited for normally distributed data with one single mode.
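
For reference, the regression metrics used in this comparison follow the standard definitions, where y_i is the actual value, \hat{y}_i the predicted value and \bar{y} the mean of the actual values over the n validation samples:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad RMSE = \sqrt{MSE}

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

The squaring in the MSE is what makes it penalise large deviations, and hence outliers, more heavily than the MAE, which is the behaviour discussed above.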

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to reach a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer step is one time step. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with lower error values on all of the other metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions is a disadvantage for a network using MAE as its loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. Overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the case with the AUC-score and F1-score, an imbalanced class distribution makes it easy to obtain a good accuracy: as the majority of the labels are of type 2, the model can achieve a good accuracy simply by predicting Label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting to one particular class, a more balanced dataset would be required.
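
The size of this effect can be read directly from the class counts in Tables 4.6 and 4.7: a trivial classifier that always outputs Label 2 would already reach close to 86% accuracy on this label distribution. A minimal check, using the counts from Table 4.6:

# Actual class counts from Table 4.6 (Table 4.7 gives almost identical counts).
n_label1 = 82 + 29    # 111 samples of Label 1
n_label2 = 38 + 631   # 669 samples of Label 2

# Accuracy obtained by always predicting the majority class, Label 2.
baseline_accuracy = n_label2 / (n_label1 + n_label2)
print(f"{baseline_accuracy:.1%}")   # ~85.8%

The 91.4% and 93.3% reported in Table 4.5 should therefore be judged against this baseline rather than against 50%, which is another reason a more balanced dataset would be valuable.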

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be reduced by having dedicated datasets for Label 1 and Label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, however, as the CNN predicts multiple steps ahead, and if the first prediction is off target then the following predictions will be affected by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
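
As an illustration of such a strategy, the sketch below shows a generic sequence-splitting step of the kind described in Chapter 3: the raw multivariate sensor log is cut into overlapping windows of past observations, each paired with the observations to be predicted. The window lengths shown (12 past and 6 future samples, i.e. 60 seconds of history and a 30 second horizon at the 5 second sampling interval) correspond to the CNN set-up; for the LSTM the output window is a single step. The function name and the random placeholder data are illustrative and not taken from the project code.

import numpy as np

def split_sequences(data, n_in=12, n_out=6):
    # Cut a (time steps, features) array into (past window, future window) pairs.
    X, y = [], []
    for start in range(len(data) - n_in - n_out + 1):
        X.append(data[start:start + n_in])                  # past n_in observations
        y.append(data[start + n_in:start + n_in + n_out])   # next n_out observations
    return np.array(X), np.array(y)

# Placeholder sensor log: 1000 samples of the four monitored variables
# (differential pressure, system pressure, system flow rate, backflush flow rate).
log = np.random.rand(1000, 4)
X, y = split_sequences(log)
print(X.shape, y.shape)   # (983, 12, 4) (983, 6, 4)

Changing n_in and n_out is what adapts the pre-processing to a different prediction interval, which is the dependency on the desired horizon noted above.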


                                                                Chapter 6

                                                                Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests would be required with data that include all clogging states. Furthermore, as all the tests were performed around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN when data containing all clogging labels are available.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better at predicting filter clogging. The upside of using these methods is that the inner workings of older statistical models are better understood than those of ML models.
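
A minimal sketch of what such a comparison could look like, using the SARIMAX implementation in statsmodels on one recorded variable; the model orders and the placeholder series are illustrative, not tuned or taken from the project data:

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder differential pressure series, sampled every 5 seconds.
diff_pressure = np.cumsum(np.random.randn(500)) * 0.01

# Fit a simple ARIMA(2,1,1) model (no seasonal part) and forecast the next
# 6 steps, i.e. 30 seconds, matching the CNN's prediction horizon.
model = SARIMAX(diff_pressure, order=(2, 1, 1), seasonal_order=(0, 0, 0, 0))
result = model.fit(disp=False)
print(result.forecast(steps=6))

A model of this kind only forecasts the continuous variables; the clogging label would still have to be assigned in a separate classification step, as was also done for the neural networks in this thesis.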

Lastly, if these models are to be used in the BWTS, the time criticality of the filter clogging classification, i.e. avoiding complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time.



TRITA-ITM-EX 2019:606

www.kth.se



                                                                  Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                                  35

                                                                  CHAPTER 4 RESULTS

                                                                  Table 42 Evaluation metrics for the LSTM during classification analysis

                                                                  of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

                                                                  Table 43 LSTM confusion matrix

                                                                  PredictionLabel 1 Label 2

                                                                  Act

                                                                  ual Label 1 109 1

                                                                  Label 2 3 669

                                                                  42 CNN Performance

                                                                  Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

                                                                  Figure 47 MAE and MSE loss for the CNN

                                                                  36

                                                                  42 CNN PERFORMANCE

                                                                  Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                  Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                  Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                  37

                                                                  CHAPTER 4 RESULTS

                                                                  Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                  Table 44 Evaluation metrics for the CNN during regression analysis

                                                                  Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

                                                                  Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

                                                                  Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

                                                                  38

                                                                  42 CNN PERFORMANCE

                                                                  Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

                                                                  Table 45 Evaluation metrics for the CNN during classification analysis

                                                                  Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

                                                                  Table 46 CNN confusion matrix for data from the MAE regression network

                                                                  PredictionLabel 1 Label 2

                                                                  Act

                                                                  ual Label 1 82 29

                                                                  Label 2 38 631

                                                                  Table 47 CNN confusion matrix for data from the MSE regression network

                                                                  PredictionLabel 1 Label 2

                                                                  Act

                                                                  ual Label 1 69 41

                                                                  Label 2 11 659

                                                                  39

                                                                  Chapter 5

                                                                  Discussion amp Conclusion

                                                                  This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

                                                                  51 The LSTM Network

                                                                  511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

                                                                  Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

                                                                  The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

                                                                  41

                                                                  CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                  while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

                                                                  512 Classification Analysis

                                                                  As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

                                                                  The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case of the LSTM, the R²-scores were 0.876 and 0.843 respectively, with the MAE network also scoring lower on all of the other error metrics. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case of the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. Overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. In particular, undershooting could result in an underestimation of the clogging severity.
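
For reference, a minimal Keras sketch of the CNN layout described in Chapter 3 (64 filters with a kernel size of 4, max pooling with a pool size of 2, a 50-node dense layer and 6 outputs); the input window shape, activations and optimizer are assumptions made for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Conv1D(filters=64, kernel_size=4, activation='relu', input_shape=(30, 4)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation='relu'),
    Dense(6)                                  # the block of future predictions
])
model.compile(optimizer='adam', loss='mae')   # MAE gave the better fit in Table 4.4
model.summary()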

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it is not overfitting for one particular class, a more balanced dataset would be required.

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having dedicated datasets for label 1 and label 2.
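
To illustrate the imbalance argument above, a small sketch with synthetic labels (not the thesis data) showing that always predicting the majority class yields a high accuracy while the AUC stays at chance level:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.choice([1, 2], size=780, p=[0.15, 0.85])   # imbalanced labels, mostly Label 2
y_majority = np.full_like(y_true, 2)                    # classifier that always outputs Label 2

print("accuracy:", accuracy_score(y_true, y_majority))  # around 0.85 without learning anything
print("AUC:", roc_auc_score(y_true == 2, np.full(len(y_true), 0.5)))   # 0.5, i.e. chance level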

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. That is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target the following predictions will be affected by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where the filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
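
As a rough sketch of how the two pre-processing strategies differ, the following builds one-step-ahead targets (the LSTM strategy) and a block of future steps (the CNN strategy) from the same series; the window lengths are placeholders rather than the exact settings used in the thesis:

import numpy as np

def windows(values, lookback, horizon):
    # X holds `lookback` past steps, Y the `horizon` following steps
    X, Y = [], []
    for t in range(lookback, len(values) - horizon + 1):
        X.append(values[t - lookback:t])
        Y.append(values[t:t + horizon])
    return np.array(X), np.array(Y)

series = np.random.rand(200, 4)                            # hypothetical sensor matrix
X_lstm, Y_lstm = windows(series, lookback=5, horizon=1)    # predict only the next step
X_cnn, Y_cnn = windows(series, lookback=5, horizon=6)      # predict a block of future steps
print(X_lstm.shape, Y_lstm.shape, X_cnn.shape, Y_cnn.shape)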


                                                                  Chapter 6

                                                                  Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and the CNN on data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
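
As a starting point for such a comparison, a hedged sketch of fitting an ARIMA baseline to a single sensor series with statsmodels; the series and the (p, d, q) order are placeholders and would have to be chosen from the real data:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

dp = np.cumsum(np.random.rand(300) * 0.01)   # stand-in for a differential pressure series
model = ARIMA(dp, order=(2, 1, 1)).fit()
forecast = model.forecast(steps=6)           # roughly the CNN's prediction horizon
print(forecast)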

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture, and amount of data to be processed at a time, if they are to be used in the BWTS.


                                                                  Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



                                                                    Chapter 5

                                                                    Discussion amp Conclusion

                                                                    This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

                                                                    51 The LSTM Network

                                                                    511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

                                                                    Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

                                                                    The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

                                                                    41

                                                                    CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                    while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

                                                                    512 Classification Analysis

                                                                    As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

                                                                    The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

                                                                    52 The CNN

                                                                    521 Regression Analysis

                                                                    The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

                                                                    Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

                                                                    42

                                                                    52 THE CNN

                                                                    is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

                                                                    Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

                                                                    522 Classification Analysis

                                                                    Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

                                                                    Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

                                                                    However the CNN is still doing a good job at predicting future clogging even

                                                                    43

                                                                    CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                    up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

                                                                    53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

                                                                    54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

                                                                    As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

                                                                    44

                                                                    Chapter 6

                                                                    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done at roughly the same system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would show how model performance changes. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could predict multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN with data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of these methods is that the behaviour of older statistical models is better understood than that of ML models.
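
As a hedged illustration of such a baseline, the sketch below fits an ARIMA model to a differential-pressure series with statsmodels and forecasts 30 steps ahead. The series, the (1, 1, 1) order and the horizon are assumptions chosen for the example, not values from the thesis.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical differential-pressure signal sampled once per second.
    dp = pd.Series(0.5 + 0.01 * np.cumsum(np.random.randn(600)))

    # In practice the (p, d, q) order would be chosen from ACF/PACF plots or by AIC.
    fitted = ARIMA(dp, order=(1, 1, 1)).fit()
    forecast = fitted.forecast(steps=30)   # 30-second-ahead forecast of the pressure

    print(forecast.tail())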

Lastly, if these models are to be used in the BWTS, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time.


                                                                    Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



                                                                      [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                      [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                      50

                                                                      BIBLIOGRAPHY

                                                                      [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                      [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                      [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                      [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                      [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                      [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                      [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                      51

                                                                      TRITA TRITA-ITM-EX 2019606

                                                                      wwwkthse

                                                                      • Introduction
                                                                        • Background
                                                                        • Problem Description
                                                                        • Purpose Definitions amp Research Questions
                                                                        • Scope and Delimitations
                                                                        • Method Description
                                                                          • Frame of Reference
                                                                            • Filtration amp Clogging Indicators
                                                                              • Basket Filter
                                                                              • Self-Cleaning Basket Filters
                                                                              • Manometer
                                                                              • The Clogging Phenomena
                                                                              • Physics-based Modelling
                                                                                • Predictive Analytics
                                                                                  • Classification Error Metrics
                                                                                  • Regression Error Metrics
                                                                                  • Stochastic Time Series Models
                                                                                    • Neural Networks
                                                                                      • Overview
                                                                                      • The Perceptron
                                                                                      • Activation functions
                                                                                      • Neural Network Architectures
                                                                                          • Experimental Development
                                                                                            • Data Gathering and Processing
                                                                                            • Model Generation
                                                                                              • Regression Processing with the LSTM Model
                                                                                              • Regression Processing with the CNN Model
                                                                                              • Label Classification
                                                                                                • Model evaluation
                                                                                                • Hardware Specifications
                                                                                                  • Results
                                                                                                    • LSTM Performance
                                                                                                    • CNN Performance
                                                                                                      • Discussion amp Conclusion
                                                                                                        • The LSTM Network
                                                                                                          • Regression Analysis
                                                                                                          • Classification Analysis
                                                                                                            • The CNN
                                                                                                              • Regression Analysis
                                                                                                              • Classification Analysis
                                                                                                                • Comparison Between Both Networks
                                                                                                                • Conclusion
                                                                                                                  • Future Work
                                                                                                                  • Bibliography

                                                                        34 HARDWARE SPECIFICATIONS

Figure 3.7: Overview of how identical values belong to both class 1 (asterisk) and class 2 (dot)

3.4 Hardware Specifications

The testing rig consists of an external pump that regulates the inflow of water. Following the pump is the filter housing containing the basket filter mesh. The filter housing is equipped with two pressure transducers, one before the filter mesh and one after it. Together the two transducers give the differential pressure over the filter, and the transducer before the filter gives the system pressure. The flow indicator transmitter that records the system flow rate is mounted on the outlet from the filter housing, and the backflush flow meter is connected to the backflush outlet. The inflow of water and the flow through the backflush outlet are each controlled by individual valves.

Figure 3.8: An overview of the system and the location of the pressure transducers (PT), flow indicator transmitter (FIT) and flow meter (FM)



The pressure transducers, which measure the differential pressure and the system pressure, submit values to the cloud with a precision of two decimal places. The flow indicator transmitter, which measures the system flow rate, submits values rounded to whole integers, while the backflush flow meter reports values to two decimal places.

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an Intel i5-4210M CPU clocked at 2.60 GHz.


                                                                        Chapter 4

                                                                        Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM



Figure 4.2: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs actual differential pressure and system flow rate using the MSE loss function



Figure 4.5: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
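
For reference, the regression metrics reported in Table 4.1 can be reproduced from a vector of actual and predicted values. The sketch below is only illustrative and is not the evaluation script used in the thesis; the array contents are made-up placeholders.

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    # Illustrative arrays; in practice y_true and y_pred would hold the
    # actual and predicted sensor values from the validation set.
    y_true = np.array([0.52, 0.55, 0.61, 0.66, 0.70])
    y_pred = np.array([0.50, 0.56, 0.60, 0.68, 0.69])

    mse = mean_squared_error(y_true, y_pred)       # mean squared error
    rmse = np.sqrt(mse)                            # root mean squared error
    mae = mean_absolute_error(y_true, y_pred)      # mean absolute error
    r2 = r2_score(y_true, y_pred)                  # coefficient of determination

    print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")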

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM



Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                 Prediction
                 Label 1   Label 2
Actual  Label 1  109       1
        Label 2  3         669
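
The classification metrics in Tables 4.2 and 4.3 can be computed in the same way from the predicted class probabilities. The snippet below is a minimal sketch with made-up labels and probabilities, not the thesis evaluation code.

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 log_loss, roc_auc_score)

    # Illustrative binary labels (0 = label 1, 1 = label 2) and predicted
    # probabilities for the positive class.
    y_true = np.array([0, 0, 1, 1, 1, 1, 0, 1])
    y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.6, 0.2, 0.95])
    y_pred = (y_prob >= 0.5).astype(int)           # threshold the probabilities

    accuracy = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)            # ROC AUC uses the probabilities
    f1 = f1_score(y_true, y_pred)
    bce = log_loss(y_true, y_prob)                 # binary cross-entropy / log-loss
    cm = confusion_matrix(y_true, y_pred)          # rows: actual, columns: predicted

    print(accuracy, auc, f1, bce)
    print(cm)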

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN



Figure 4.8: Predicted vs actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs actual differential pressure and system flow rate using the MSE loss function



Figure 4.11: Predicted vs actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data



Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                 Prediction
                 Label 1   Label 2
Actual  Label 1  82        29
        Label 2  38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                 Prediction
                 Label 1   Label 2
Actual  Label 1  69        41
        Label 2  11        659


                                                                        Chapter 5

Discussion & Conclusion

This chapter contains a discussion of the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function achieved better performance than the network trained on the MSE loss function, as can be seen from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network while it appears to keep decreasing for the MAE network, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although its validation loss did not decrease over the last 150 epochs.
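
One common way to guard against the kind of overfitting suspected for the MSE network is to stop training once the validation loss stops improving. The thesis does not state that early stopping was used; the Keras callback below is only a hedged sketch of how such a guard could be added, with a placeholder patience value.

    from tensorflow import keras

    # Stop training once the validation loss has not improved for 50 epochs,
    # and roll the weights back to the best epoch seen so far.
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=50,
        restore_best_weights=True,
    )

    # The callback would then be passed to training, e.g.:
    # model.fit(x_train, y_train,
    #           validation_data=(x_val, y_val),
    #           epochs=1000,
    #           callbacks=[early_stop])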

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit the actual data well, which is expected since the model only outputs a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a harder time coping with the large changes in value that occur in the differential pressure data, which is not surprising as that regression model is particularly sensitive to outliers.

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite notable differences in the starting and finishing values of the differential pressure, as well as of the system flow rate, between the different test runs. Given the right training data, the network could therefore be sufficient for learning and predicting the patterns of cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform poorly for data outside the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be observed in more detail in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where the integers denote time steps. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
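
The labelling scheme described above, a window of all variables from t-5 to t-1 paired with the clogging label at time t, can be expressed as a simple sliding-window transform. The sketch below is a minimal reconstruction of that idea; the window length of 5 follows the text, while the function and array names are illustrative assumptions.

    import numpy as np

    def make_windows(data, labels, window=5):
        """Build inputs from the sensor values at t-window..t-1 and use the
        clogging label at time t as the target."""
        x, y = [], []
        for t in range(window, len(data)):
            x.append(data[t - window:t])   # all variables from t-5 to t-1
            y.append(labels[t])            # clogging label at time t
        return np.array(x), np.array(y)

    # data:   (n_samples, n_features) array of sensor readings
    # labels: (n_samples,) array of clogging labels
    # x, y = make_windows(data, labels, window=5)
    # x.shape -> (n_samples - 5, 5, n_features), i.e. one window per LSTM input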

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as for the LSTM, the R2-scores were 0.876 and 0.843 respectively, with the MAE network also scoring lower on all of the other error metrics. As for the training and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as its validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in score suggests either that making many future predictions impairs a network using MAE as its loss function, or that it is due to the data becoming more normally distributed through the convolutional computations. It should also be noted that the predictions do not follow the actual data to the same oscillatory extent as was observable with the LSTMs, which is possibly also a side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.
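
For the CNN, each input window is thus paired with targets reaching up to 30 seconds ahead rather than a single next value. A hedged sketch of such multi-step target construction is shown below; the 30-step horizon follows the text, while the input window length and names are assumptions made for illustration only.

    import numpy as np

    def make_multistep_windows(data, window=30, horizon=30):
        """Pair each input window with the following `horizon` time steps."""
        x, y = [], []
        for t in range(window, len(data) - horizon + 1):
            x.append(data[t - window:t])    # the past `window` seconds of sensor data
            y.append(data[t:t + horizon])   # the next `horizon` seconds as targets
        return np.array(x), np.array(y)

    # x, y = make_multistep_windows(data)
    # y[i, -1] is the furthest target, i.e. the parameter values 30 seconds ahead.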

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier gives a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC ranking is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced class distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data and that it is not overfitting on one particular class, a more balanced dataset would be required.
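
Besides collecting a more balanced dataset, the imbalance between label 1 and label 2 could also be compensated for during training by weighting the loss per class. The thesis does not apply this; the snippet below is only a sketch of one common approach, using the approximate class counts from the confusion matrices as an illustrative example.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Illustrative training labels: 0 = label 1 (minority), 1 = label 2 (majority).
    y_train = np.array([0] * 110 + [1] * 669)

    classes = np.unique(y_train)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
    class_weight = dict(zip(classes.tolist(), weights))
    # -> roughly {0: 3.54, 1: 0.58}; the minority class gets the larger weight.

    # Keras accepts such a mapping during training:
    # model.fit(x_train, y_train, class_weight=class_weight, ...)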

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification is overall better for the LSTM than for the CNN. However, that is to be expected, as the CNN predicts multiple steps ahead, and if the first prediction is off target then the following predictions are impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and reduce the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                                        Chapter 6

                                                                        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN with data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than those of ML models.
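
As a point of comparison, a seasonal ARIMA baseline on, for example, the differential-pressure signal could be set up along the lines of the sketch below (using statsmodels). The model orders and names are placeholders, not tuned or validated choices.

    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def forecast_differential_pressure(dp: pd.Series, steps: int = 30):
        """Fit a (seasonal) ARIMA model to a differential-pressure series and
        forecast the next `steps` seconds. The orders are placeholders."""
        model = SARIMAX(dp, order=(2, 1, 2), seasonal_order=(0, 0, 0, 0))
        result = model.fit(disp=False)
        return result.forecast(steps=steps)

    # dp = pd.Series(...)  # one-second samples of the differential pressure
    # prediction = forecast_differential_pressure(dp)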

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture and amount of data to be processed at a time, if they are to be used in the BWTS.


                                                                          CHAPTER 3 EXPERIMENTAL DEVELOPMENT

                                                                          The pressure transducers which measure the differential pressure and the systempressure submit values to the cloud to a preciseness of two decimals The flow in-dicator transmitter which measures system flow rate submits values to a precisionof whole integers and the backflush flow meter does so to a precision of two decimals

The simulations and calculations were run on a regular commercial laptop with 8192 MB of RAM and an i5-4210M CPU clocked at 2.60 GHz.


Chapter 4

Results

This chapter presents the results for all the models presented in the previous chapter.

4.1 LSTM Performance

Figure 4.1 shows the respective loss for both functions when training the network on regression. Figures 4.2 and 4.3 show the predicted values against the actual values for the MAE loss function, and Figures 4.4 and 4.5 show the predicted values against the actual values for the MSE loss function. Table 4.1 shows the values of a number of tested regression metrics.

Figure 4.1: MAE and MSE loss for the LSTM



Figure 4.2: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.3: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.4: Predicted vs. actual differential pressure and system flow rate using the MSE loss function



Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
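
The thesis does not list the evaluation code; as a rough sketch of how the metrics in Table 4.1 (and later Table 4.4) can be computed, the following assumes scikit-learn and NumPy arrays of actual and predicted values (the array contents below are placeholders, not thesis data):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics reported in Tables 4.1 and 4.4."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),     # root mean squared error
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
    }

# Placeholder arrays standing in for actual and predicted sensor values
y_true = np.array([0.10, 0.12, 0.15, 0.20])
y_pred = np.array([0.11, 0.12, 0.16, 0.18])
print(regression_metrics(y_true, y_pred))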

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics, and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM



Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                    Prediction
                    Label 1   Label 2
Actual   Label 1    109       1
         Label 2    3         669
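
Again as a sketch rather than the thesis' own code, the classification metrics in Tables 4.2 and 4.3 can be obtained with scikit-learn from the true labels and the predicted clogging probabilities (the names and arrays below are illustrative):

import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             log_loss, confusion_matrix)

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute the metrics reported in Tables 4.2-4.3 from predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "ROC AUC": roc_auc_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
        "log-loss": log_loss(y_true, y_prob),
        "Confusion matrix": confusion_matrix(y_true, y_pred),
    }

# Placeholder example (replace with validation labels and model outputs)
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.9, 0.8, 0.3, 0.7])
print(classification_metrics(y_true, y_prob))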

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN



Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function



Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data



Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    82        29
         Label 2    38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                    Prediction
                    Label 1   Label 2
Actual   Label 1    69        41
         Label 2    11        659


Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE network, while the loss for the MAE network appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which isn't unlikely, as a regression model trained on MSE is particularly sensitive to outliers.
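
For reference, with $y_i$ the actual value, $\hat{y}_i$ the predicted value and $n$ the number of samples, the standard definitions of the two loss functions make this sensitivity explicit:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Because the residual is squared in the MSE, a single large jump between test cycles contributes quadratically to the loss, pulling the MSE-trained model harder towards such outliers than the MAE-trained model.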

The high R2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate.



The MSE network, while still achieving a good R2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.
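
The R2-score used throughout is the coefficient of determination in its usual form (cf. [29]):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the mean of the actual values; a score of 1 corresponds to a perfect fit, and values well below 1 indicate that much of the variance is left unexplained.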

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications, while highly accurate, are therefore only predictions of the very next value. The network would consequently prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
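
The exact pre-processing code is not reproduced here, but a minimal sketch of the sliding-window construction described above (five past time steps of every variable as input, the label at the current step as target) could look as follows, where data and labels are hypothetical NumPy arrays of sensor samples and clogging labels:

import numpy as np

def make_windows(data, labels, lookback=5):
    """Build (samples, lookback, features) inputs from t-lookback..t-1
    and use the clogging label at time t as the target."""
    X, y = [], []
    for t in range(lookback, len(data)):
        X.append(data[t - lookback:t])   # variables from t-5 to t-1
        y.append(labels[t])              # label at time t
    return np.array(X), np.array(y)

# Hypothetical example: 100 samples of 4 sensor variables
data = np.random.rand(100, 4)
labels = np.random.randint(0, 2, size=100)
X, y = make_windows(data, labels)
print(X.shape, y.shape)   # (95, 5, 4) (95,)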

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations.



A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.
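
As a counterpart to the one-step windowing sketched for the LSTM, the multi-step setup described here (predicting the variables up to 30 time steps ahead) could be arranged as in the sketch below; the lookback length and array names are assumptions for illustration, not taken from the thesis code:

import numpy as np

def make_multistep_windows(data, lookback=30, horizon=30):
    """Inputs of `lookback` past steps, targets of the next `horizon` steps."""
    X, Y = [], []
    for t in range(lookback, len(data) - horizon + 1):
        X.append(data[t - lookback:t])    # past window
        Y.append(data[t:t + horizon])     # future values, up to 30 s ahead
    return np.array(X), np.array(Y)

data = np.random.rand(200, 4)             # hypothetical sensor matrix
X, Y = make_multistep_windows(data)
print(X.shape, Y.shape)                   # (141, 30, 4) (141, 30, 4)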

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% using the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network in comparison to the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than there are of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.
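
As a reminder of how the F1-score relates to the per-class counts in the confusion matrices (standard definitions, cf. [22]):

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$

With Label 2 dominating the data, these per-class quantities can remain high for the majority class even when a sizeable share of the minority class is misclassified, which is consistent with the pattern seen in Tables 4.6 and 4.7.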

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
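
To make the imbalance concrete with the counts from Table 4.7: a trivial classifier that always outputs Label 2 would already reach roughly 86% accuracy on this validation set, so the reported accuracies should be read against that baseline.

# Counts taken from Table 4.7 (CNN, MSE regression data)
label_1 = 69 + 41          # actual Label 1 samples
label_2 = 11 + 659         # actual Label 2 samples
baseline = label_2 / (label_1 + label_2)
print(f"Always-predict-Label-2 accuracy: {baseline:.1%}")   # ~85.9%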

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead.



Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation is, during both regression and classification, of an overall better quality for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method is dependent on both the NN model of choice and how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of tuning the LSTM and CNN with data containing all clogging labels.

In contrast, it would also be of interest to see if more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
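
As a starting point for such a comparison (not part of the thesis work), fitting a univariate ARIMA model to the differential-pressure series could look like the sketch below, assuming statsmodels; the (p, d, q) order is an arbitrary placeholder that would have to be chosen through standard identification procedures:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical differential-pressure series sampled once per second
dp = np.cumsum(np.random.normal(0.001, 0.005, size=600))

model = ARIMA(dp, order=(2, 1, 1))     # placeholder (p, d, q) order
fit = model.fit()
forecast = fit.forecast(steps=30)      # 30-second-ahead forecast
print(forecast[:5])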

Lastly, time criticality in the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if these models are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, page 2014, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr. 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr. 26, 2019].



[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr. 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct. 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct. 03, 2019].



[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct. 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 175 60806 and 2408, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct. 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept. 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct. 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct. 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct. 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct. 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct. 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct. 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct. 03, 2019].



[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network." In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se

                                                                          • Introduction
                                                                            • Background
                                                                            • Problem Description
                                                                            • Purpose Definitions amp Research Questions
                                                                            • Scope and Delimitations
                                                                            • Method Description
                                                                              • Frame of Reference
                                                                                • Filtration amp Clogging Indicators
                                                                                  • Basket Filter
                                                                                  • Self-Cleaning Basket Filters
                                                                                  • Manometer
                                                                                  • The Clogging Phenomena
                                                                                  • Physics-based Modelling
                                                                                    • Predictive Analytics
                                                                                      • Classification Error Metrics
                                                                                      • Regression Error Metrics
                                                                                      • Stochastic Time Series Models
                                                                                        • Neural Networks
                                                                                          • Overview
                                                                                          • The Perceptron
                                                                                          • Activation functions
                                                                                          • Neural Network Architectures
                                                                                              • Experimental Development
                                                                                                • Data Gathering and Processing
                                                                                                • Model Generation
                                                                                                  • Regression Processing with the LSTM Model
                                                                                                  • Regression Processing with the CNN Model
                                                                                                  • Label Classification
                                                                                                    • Model evaluation
                                                                                                    • Hardware Specifications
                                                                                                      • Results
                                                                                                        • LSTM Performance
                                                                                                        • CNN Performance
                                                                                                          • Discussion amp Conclusion
                                                                                                            • The LSTM Network
                                                                                                              • Regression Analysis
                                                                                                              • Classification Analysis
                                                                                                                • The CNN
                                                                                                                  • Regression Analysis
                                                                                                                  • Classification Analysis
                                                                                                                    • Comparison Between Both Networks
                                                                                                                    • Conclusion
                                                                                                                      • Future Work
                                                                                                                      • Bibliography

                                                                            Chapter 4

                                                                            Results

                                                                            This chapter presents the results for all the models presented in the previous chapter

                                                                            41 LSTM Performance

                                                                            Figure 41 shows the respective loss for both functions when training the networkon regression Figures 42 and 43 show the predicted values against the actualvalues for the MAE loss function and Figures 44 and 45 show the predicted valuesagainst the actual values for the MSE loss function Table 41 shows the values ofa number of tested regression metrics

                                                                            Figure 41 MAE and MSE loss for the LSTM

                                                                            33

                                                                            CHAPTER 4 RESULTS

                                                                            Figure 42 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                            Figure 43 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                            Figure 44 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                            34

                                                                            41 LSTM PERFORMANCE

                                                                            Figure 45 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                            Table 41 Evaluation metrics for the LSTM during regression analysis

                                                                            Loss function of epochs MSE RMSE R2 MAEMAE 738 0001 0029 0981 0016MSE 665 0014 0119 0694 0032

                                                                            Figure 46 shows the change in binary cross-entropy loss and classification accuracyover each epoch for the LSTM when training the network using the binary cross-entropy loss function Table 42 contains the values of a number of classificationerror metrics and Table 43 contains the confusion matrix for classification on thedataset

                                                                            Figure 46 Binary cross-entropy loss and classification accuracy for the LSTM

                                                                            35

                                                                            CHAPTER 4 RESULTS

                                                                            Table 42 Evaluation metrics for the LSTM during classification analysis

                                                                            of epochs Accuracy ROC F1 log-loss190 995 0993 0995 0082

                                                                            Table 43 LSTM confusion matrix

                                                                            PredictionLabel 1 Label 2

                                                                            Act

                                                                            ual Label 1 109 1

                                                                            Label 2 3 669

                                                                            42 CNN Performance

                                                                            Figure 47 shows the respective loss for both functions when training the network onregression Figures 48 and 49 show the predicted values against the actual valuesfor the MAE loss function and Figures 410 and 411 show the predicted valuesagainst the actual values for the MSE loss function Table 44 shows the values ofa number of tested regression metrics

                                                                            Figure 47 MAE and MSE loss for the CNN

                                                                            36

                                                                            42 CNN PERFORMANCE

                                                                            Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                            Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                            Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                            37

                                                                            CHAPTER 4 RESULTS

                                                                            Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                            Table 44 Evaluation metrics for the CNN during regression analysis

                                                                            Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

                                                                            Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

                                                                            Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

                                                                            38

                                                                            42 CNN PERFORMANCE

                                                                            Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

                                                                            Table 45 Evaluation metrics for the CNN during classification analysis

                                                                            Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

                                                                            Table 46 CNN confusion matrix for data from the MAE regression network

                                                                            PredictionLabel 1 Label 2

                                                                            Act

                                                                            ual Label 1 82 29

                                                                            Label 2 38 631

                                                                            Table 47 CNN confusion matrix for data from the MSE regression network

                                                                            PredictionLabel 1 Label 2

                                                                            Act

                                                                            ual Label 1 69 41

                                                                            Label 2 11 659

                                                                            39

                                                                            Chapter 5

                                                                            Discussion amp Conclusion

                                                                            This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

                                                                            51 The LSTM Network

                                                                            511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

                                                                            Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

                                                                            The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

                                                                            41

                                                                            CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                            while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

                                                                            512 Classification Analysis

                                                                            As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

                                                                            The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

                                                                            52 The CNN

                                                                            521 Regression Analysis

                                                                            The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions penalises a network using MAE as loss function, or that the data become more normally distributed through the convolutional computations. It should also be noted that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case of the LSTMs, which is possibly a side effect of the feature mapping in the convolutional layer and the max pooling layer. Noteworthy is also that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function exhibits less overshoot and undershoot than the CNN using the MSE loss function. Overshooting and undershooting could lead to erroneous estimation of the variable values and thus improper clogging estimation; in particular, undershooting could result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two: the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy: as the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it is not overfitting on one particular class, a more balanced dataset would be required.
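One common mitigation for such imbalance is to weight the minority class more heavily during training, so that repeatedly predicting the majority label no longer minimises the loss. The sketch below shows this with a scikit-learn helper and a Keras-style class_weight argument; the label vector is hypothetical.

```python
# Derive balanced class weights from a (hypothetical) training label vector.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([2, 2, 2, 1, 2, 2, 1, 2, 2, 2])           # mostly Label 2
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([1, 2]), y=y_train)
class_weight = {0: weights[0], 1: weights[1]}                  # Keras expects 0-based indices
print(class_weight)
# model.fit(X_train, y_train - 1, class_weight=class_weight, ...)  # labels remapped to 0/1
```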

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be reduced by having specific datasets dedicated to Label 1 and Label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions are affected by the initial error. A key factor to take into account in the comparison, however, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the risk of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analyses show that ML can be used for estimating filter clogging states. With classification accuracies for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS need to be pre-processed according to the desired prediction interval before they can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.


                                                                            Chapter 6

                                                                            Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were carried out at roughly the same system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see how model performance changes. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN with data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
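As a starting point for such a comparison, the sketch below fits a SARIMAX baseline to the differential-pressure series with statsmodels and forecasts 30 steps ahead; the file name, column name and model orders are hypothetical and would have to be identified from the actual data.

```python
# Hypothetical SARIMAX baseline for the differential-pressure signal.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

dp = pd.read_csv("bwts_sensor_log.csv")["diff_pressure"]   # placeholder source

model = SARIMAX(dp, order=(2, 1, 1), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=30)    # roughly 30 s ahead if sampling is 1 Hz
print(forecast)
```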

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                                            Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F Eker, Fatih Camci, and Ian K Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N Roussel, Thi Lien Huong Nguyen, and P Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B Hadji Misheva, P Giudici, and V Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K Deepika and S Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A Kanawaday and A Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G A Susto, A Schirru, S Pampuri, S McLoone, and A Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance, abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M W Powers. What the f-measure doesn't measure: Features, flaws, fallacies and fixes, abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score, Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss, Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (mae) and the root mean square error (rmse) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. abs/1809.03006, 2018.

[26] T Chai and R R Draxler. Root mean square error (rmse) or mean absolute error (mae)? – arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010:11, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (relu), abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions, abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of swish vs. other activation functions on cifar-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R Hancock, Richard C Wilson, and William A P Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory, Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A Sufian, F Sultana, and P Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. Comparative study of CNN and RNN for natural language processing, abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA TRITA-ITM-EX 2019:606

www.kth.se





                                                                              BIBLIOGRAPHY

                                                                              models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                                                              [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                                                              [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                                                              [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                                                              [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                              [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                              [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                              [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                              [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                              [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                              [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                              50

                                                                              BIBLIOGRAPHY

                                                                              [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                              [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                              [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                              [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                              [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                              [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                              [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                              51




Figure 4.5: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.1: Evaluation metrics for the LSTM during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             738           0.001   0.029   0.981   0.016
MSE             665           0.014   0.119   0.694   0.032
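The report does not show the implementation behind these metrics, so as a point of reference, below is a minimal sketch of how such values can be computed, assuming Python with NumPy and scikit-learn (an assumption about tooling, not the thesis' actual code); the placeholder arrays stand in for the actual and predicted sensor values.

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    def regression_metrics(y_true, y_pred):
        mse = mean_squared_error(y_true, y_pred)
        return {
            "MSE": mse,
            "RMSE": np.sqrt(mse),                 # RMSE is the square root of MSE
            "R2": r2_score(y_true, y_pred),
            "MAE": mean_absolute_error(y_true, y_pred),
        }

    # Placeholder arrays standing in for actual and predicted sensor values
    y_true = np.array([0.52, 0.55, 0.61, 0.60])
    y_pred = np.array([0.53, 0.54, 0.59, 0.62])
    print(regression_metrics(y_true, y_pred))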

Figure 4.6 shows the change in binary cross-entropy loss and classification accuracy over each epoch for the LSTM when training the network using the binary cross-entropy loss function. Table 4.2 contains the values of a number of classification error metrics and Table 4.3 contains the confusion matrix for classification on the dataset.

Figure 4.6: Binary cross-entropy loss and classification accuracy for the LSTM.


Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                      Prediction
                      Label 1   Label 2
Actual   Label 1          109         1
         Label 2            3       669
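For reference, the metrics in Table 4.2 can be reproduced with standard library calls; the sketch below assumes scikit-learn and 0/1-encoded labels, which is an assumption rather than the thesis' actual implementation. The final line cross-checks Table 4.2 against Table 4.3: 109 + 669 correct classifications out of 782 samples gives roughly the reported 99.5%.

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

    def classification_metrics(y_true, y_prob, threshold=0.5):
        # y_true: 0/1 labels, y_prob: predicted probability of class 1
        y_pred = (y_prob >= threshold).astype(int)
        return {
            "Accuracy": accuracy_score(y_true, y_pred),
            "ROC AUC": roc_auc_score(y_true, y_prob),
            "F1": f1_score(y_true, y_pred),
            "log-loss": log_loss(y_true, y_prob),
        }

    # Placeholder predictions standing in for the validation set
    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
    y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.2, 0.6, 0.95])
    print(classification_metrics(y_true, y_prob))

    # Cross-check of Table 4.2 against the confusion matrix in Table 4.3
    print((109 + 669) / (109 + 1 + 3 + 669))  # ~0.9949, i.e. the reported 99.5%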

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN.


Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function.

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function.

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function.


Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function.

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R²      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data.


Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1           82        29
         Label 2           38       631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                      Prediction
                      Label 1   Label 2
Actual   Label 1           69        41
         Label 2           11       659
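As a quick consistency check, the accuracies reported in Table 4.5 follow directly from these confusion matrices:

(82 + 631) / (82 + 29 + 38 + 631) = 713/780 ≈ 0.914
(69 + 659) / (69 + 41 + 11 + 659) = 728/780 ≈ 0.933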


                                                                                Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to be fluctuating for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the result in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which is not unexpected as the squared-error loss makes a regression model particularly sensitive to outliers.
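This sensitivity follows from the definitions of the two losses,

MAE = (1/n) \sum_i |y_i - \hat{y}_i|,    MSE = (1/n) \sum_i (y_i - \hat{y}_i)^2,

so a residual ten times larger than the others contributes ten times more to the MAE but a hundred times more to the MSE; a single large jump between test cycles therefore dominates MSE-based training far more than MAE-based training.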

The high R²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure as well as the system flow rate for the different test runs. The network could therefore, given the right training data, be sufficient in learning and predicting the patterns for cycles that operate at higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good R²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.
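For reference, the R²-score used throughout is the coefficient of determination,

R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2,

so a value of 1 means the predictions account for all of the variance in the unseen data, while a value near 0 means the model does no better than always predicting the mean.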

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before experiencing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result shows a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently prove of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
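A minimal sketch of the windowing this implies is given below; the variable layout and window length are illustrative assumptions, but the principle is that each input sample is the multivariate sensor window from t-5 to t-1 and the target is the clogging label at t.

    import numpy as np

    def make_label_windows(data, labels, window=5):
        # data: (n_samples, n_features) sensor readings, labels: (n_samples,) clogging labels
        X, y = [], []
        for t in range(window, len(data)):
            X.append(data[t - window:t])  # variable values from t-5 to t-1
            y.append(labels[t])           # clogging label at time t
        return np.array(X), np.array(y)

    sensors = np.random.rand(1000, 4)               # placeholder for the sampled sensor channels
    clogging = np.random.randint(0, 2, size=1000)   # placeholder clogging labels
    X, y = make_label_windows(sensors, clogging)
    print(X.shape, y.shape)                         # (995, 5, 4) (995,)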

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the R²-scores were 0.876 and 0.843 respectively, with overall lower errors on all of the other metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. Like in the case with the LSTM networks, the CNN network using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs to be following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed after the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss function is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just like in the previous case with the AUC-score and F1-score, an imbalanced class distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
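To put a number on this: in Table 4.3, 672 of the 782 validation samples carry Label 2, so a trivial classifier that always outputs Label 2 would already reach 672/782 ≈ 86% accuracy; the corresponding majority baselines for Tables 4.6 and 4.7 are 669/780 and 670/780, also roughly 86%. The reported accuracies should therefore be read against that baseline.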

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. At the same time, a key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability in estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS needs to be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method is dependent both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
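As an illustration only (names, shapes and window lengths are assumptions, not the thesis' implementation), the two strategies can be viewed as the same windowing routine run with different horizons: one step ahead for the LSTM used here and 30 steps ahead for the CNN.

    import numpy as np

    def make_supervised(series, window, horizon):
        # series: (n_samples, n_features); inputs are `window` past steps,
        # targets are the following `horizon` steps
        X, Y = [], []
        for t in range(window, len(series) - horizon + 1):
            X.append(series[t - window:t])
            Y.append(series[t:t + horizon])
        return np.array(X), np.array(Y)

    data = np.random.rand(500, 4)                                  # placeholder sensor matrix
    X_lstm, Y_lstm = make_supervised(data, window=5, horizon=1)    # one-step-ahead targets
    X_cnn, Y_cnn = make_supervised(data, window=30, horizon=30)    # 30-step-ahead targets
    print(X_lstm.shape, Y_lstm.shape, X_cnn.shape, Y_cnn.shape)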


                                                                                Chapter 6

                                                                                Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be required to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. To add to this, performing tests or using data where the water source contains higher or lower amounts of TSS would be interesting, to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter that indicates that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be required to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN given data containing all clogging labels.

On the contrary, it would also be of interest to see if more traditional statistical models such as ARIMA and SARIMA perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than they are for ML models.
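As a starting point for such a comparison, an ARIMA baseline could be fitted to a single sensor channel; the sketch below assumes statsmodels and uses a placeholder series and an arbitrary model order.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    dp = pd.Series(np.random.rand(300))        # placeholder for a differential-pressure series
    model = ARIMA(dp, order=(2, 1, 1)).fit()   # (p, d, q) chosen only for illustration
    forecast = model.forecast(steps=30)        # 30-step-ahead forecast, matching the CNN horizon
    print(forecast.tail())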

Lastly, time criticality in the filter clogging classification, i.e. classifying early enough to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                                                Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Preprint abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P. J. G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014, 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se



Table 4.2: Evaluation metrics for the LSTM during classification analysis

# of epochs   Accuracy   ROC     F1      log-loss
190           99.5%      0.993   0.995   0.082

Table 4.3: LSTM confusion matrix

                   Predicted Label 1   Predicted Label 2
Actual Label 1                   109                   1
Actual Label 2                     3                 669
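As a point of reference, the metrics in Tables 4.2 and 4.3 can be reproduced from the validation predictions with standard library calls. The snippet below is a minimal sketch assuming scikit-learn; the arrays are placeholders for the actual validation labels and the probabilities output by the LSTM classifier, not the thesis data.

    # Sketch: computing the Table 4.2 metrics from predictions (placeholder data).
    import numpy as np
    from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                                 log_loss, confusion_matrix)

    y_true = np.array([1, 2, 2, 1, 2])              # actual clogging labels
    p_label2 = np.array([0.1, 0.9, 0.8, 0.4, 0.7])  # network output, P(label = 2)
    y_pred = np.where(p_label2 >= 0.5, 2, 1)        # thresholded predictions

    print(accuracy_score(y_true, y_pred))           # cf. the Accuracy column
    print(roc_auc_score(y_true == 2, p_label2))     # cf. the ROC column
    print(f1_score(y_true, y_pred, pos_label=2))    # cf. the F1 column
    print(log_loss(y_true == 2, p_label2))          # cf. the log-loss column
    print(confusion_matrix(y_true, y_pred, labels=[1, 2]))  # cf. Table 4.3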

4.2 CNN Performance

Figure 4.7 shows the respective loss for both functions when training the network on regression. Figures 4.8 and 4.9 show the predicted values against the actual values for the MAE loss function, and Figures 4.10 and 4.11 show the predicted values against the actual values for the MSE loss function. Table 4.4 shows the values of a number of tested regression metrics.

Figure 4.7: MAE and MSE loss for the CNN



Figure 4.8: Predicted vs. actual differential pressure and system flow rate using the MAE loss function

Figure 4.9: Predicted vs. actual system pressure and backflush flow rate using the MAE loss function

Figure 4.10: Predicted vs. actual differential pressure and system flow rate using the MSE loss function



Figure 4.11: Predicted vs. actual system pressure and backflush flow rate using the MSE loss function

Table 4.4: Evaluation metrics for the CNN during regression analysis

Loss function   # of epochs   MSE     RMSE    R2      MAE
MAE             756           0.007   0.086   0.876   0.025
MSE             458           0.008   0.092   0.843   0.037
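For completeness, a minimal sketch of how the regression metrics of Table 4.4 can be obtained is given below, assuming scikit-learn and NumPy; y_true and y_pred stand in for the actual and predicted (scaled) sensor values and are placeholders rather than thesis data.

    # Sketch: regression error metrics as in Table 4.4 (placeholder data).
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([0.40, 0.42, 0.45, 0.50])
    y_pred = np.array([0.41, 0.42, 0.44, 0.52])

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(mse, rmse, mae, r2)   # cf. the MSE, RMSE, MAE and R2 columns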

Figures 4.12 and 4.13 show the change in binary cross-entropy loss and classification accuracy over each epoch for the CNN when training the network using the binary cross-entropy loss function. Table 4.5 contains the values of a number of classification error metrics, and Tables 4.6 and 4.7 contain the confusion matrices for classification on both datasets from MAE and MSE regression.

Figure 4.12: Binary cross-entropy loss and classification accuracy for the CNN using MAE regression data



Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data

Table 4.5: Evaluation metrics for the CNN during classification analysis

Regression network   # of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203          91.4%      0.826   0.907   3.01
MSE                  1195          93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network

                   Predicted Label 1   Predicted Label 2
Actual Label 1                    82                  29
Actual Label 2                    38                 631

Table 4.7: CNN confusion matrix for data from the MSE regression network

                   Predicted Label 1   Predicted Label 2
Actual Label 1                    69                  41
Actual Label 2                    11                 659


                                                                                  Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss can be noted to fluctuate for the MSE while the MAE appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large change in value that occurs in the differential pressure data, which isn't unlikely as that regression model is particularly sensitive to outliers.
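This sensitivity follows directly from the two loss definitions: the squared residual in the MSE grows quadratically, so a single large jump in the differential pressure dominates the training signal, whereas the MAE penalises the same residual only linearly:

    \mathcal{L}_{\mathrm{MAE}} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
    \qquad
    \mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2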

The high r2-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite there being notable differences in the starting and finishing values of the differential pressure, as well as the system flow rate, for the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r2-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data with a single mode.

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy of the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.

The result makes a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where the integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would thus be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
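As a hedged illustration of this set-up (the function and variable names below are illustrative, not taken from the thesis code), the on-the-spot labelling corresponds to pairing a window of the five preceding time steps with the label of the current step:

    # Sketch: one-step-ahead classification samples (variables t-5..t-1 -> label at t).
    import numpy as np

    def one_step_windows(variables, labels, history=5):
        # variables: (T, n_features) array of scaled sensor data
        # labels:    (T,) array of clogging labels
        X, y = [], []
        for t in range(history, len(labels)):
            X.append(variables[t - history:t])   # the five preceding time steps
            y.append(labels[t])                  # clogging label at time t
        return np.array(X), np.array(y)

    # X gets shape (samples, 5, n_features), the usual (batch, time, feature)
    # layout expected by a Keras-style LSTM layer.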

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r2-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss of both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function is showing signs of overfitting on the training data, as the validation loss is shown to be increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show the values predicted by the CNNs following the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as its loss function, or that the data have become more normally distributed through the convolutional computations. A thing to be noted is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the predicted values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the values of the variables and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance in the dataset occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, good accuracy can be achieved by the model by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and that it isn't overfitting for one certain class, a more balanced dataset would be required.
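To make the imbalance concrete: from the row sums of Tables 4.6 and 4.7, roughly 110 of the ~780 validation samples carry Label 1 and about 670 carry Label 2, so a trivial classifier that always outputs Label 2 would already reach around 86% accuracy. The 91.4% and 93.3% figures should therefore be judged against that baseline rather than against 50%, as the small check below illustrates.

    # Majority-class baseline implied by the confusion matrix in Table 4.6.
    label1 = 82 + 29            # actual Label 1 samples
    label2 = 38 + 631           # actual Label 2 samples
    baseline = label2 / (label1 + label2)
    print(round(baseline, 3))   # ~0.858, versus the 0.914 reached by the classifier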

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability of estimating a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS needs to be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
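As a rough sketch of what such interval-dependent pre-processing can look like (the helper below and its parameter values are illustrative assumptions, not the exact strategies used in the thesis), the same windowing routine can serve both set-ups by varying the history length and the prediction horizon:

    # Sketch: window builder parameterised by history length and prediction horizon.
    import numpy as np

    def make_windows(series, history, horizon):
        # series: (T, n_features) array of scaled BWTS sensor data
        X, y = [], []
        for t in range(history, len(series) - horizon + 1):
            X.append(series[t - history:t])    # 'history' past observations
            y.append(series[t + horizon - 1])  # values 'horizon' steps ahead
        return np.array(X), np.array(y)

    # e.g. make_windows(data, history=5, horizon=1)   -> next-step targets (LSTM set-up)
    #      make_windows(data, history=30, horizon=30) -> targets 30 steps ahead (CNN set-up)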


                                                                                  Chapter 6

                                                                                  Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, additional tests would need to be performed with data that includes all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation problem of tuning the LSTM and the CNN with data containing all clogging labels.

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.

Lastly, time criticality in the filter clogging classification, to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.



TRITA-ITM-EX 2019:606

www.kth.se

                                                                                  • Introduction
                                                                                    • Background
                                                                                    • Problem Description
                                                                                    • Purpose Definitions amp Research Questions
                                                                                    • Scope and Delimitations
                                                                                    • Method Description
                                                                                      • Frame of Reference
                                                                                        • Filtration amp Clogging Indicators
                                                                                          • Basket Filter
                                                                                          • Self-Cleaning Basket Filters
                                                                                          • Manometer
                                                                                          • The Clogging Phenomena
                                                                                          • Physics-based Modelling
                                                                                            • Predictive Analytics
                                                                                              • Classification Error Metrics
                                                                                              • Regression Error Metrics
                                                                                              • Stochastic Time Series Models
                                                                                                • Neural Networks
                                                                                                  • Overview
                                                                                                  • The Perceptron
                                                                                                  • Activation functions
                                                                                                  • Neural Network Architectures
                                                                                                      • Experimental Development
                                                                                                        • Data Gathering and Processing
                                                                                                        • Model Generation
                                                                                                          • Regression Processing with the LSTM Model
                                                                                                          • Regression Processing with the CNN Model
                                                                                                          • Label Classification
                                                                                                            • Model evaluation
                                                                                                            • Hardware Specifications
                                                                                                              • Results
                                                                                                                • LSTM Performance
                                                                                                                • CNN Performance
                                                                                                                  • Discussion amp Conclusion
                                                                                                                    • The LSTM Network
                                                                                                                      • Regression Analysis
                                                                                                                      • Classification Analysis
                                                                                                                        • The CNN
                                                                                                                          • Regression Analysis
                                                                                                                          • Classification Analysis
                                                                                                                            • Comparison Between Both Networks
                                                                                                                            • Conclusion
                                                                                                                              • Future Work
                                                                                                                              • Bibliography

                                                                                    42 CNN PERFORMANCE

                                                                                    Figure 48 Predicted vs actual differential pressure and system flow rate using theMAE loss function

                                                                                    Figure 49 Predicted vs actual system pressure and backflush flow rate using theMAE loss function

                                                                                    Figure 410 Predicted vs actual differential pressure and system flow rate using theMSE loss function

                                                                                    37

                                                                                    CHAPTER 4 RESULTS

                                                                                    Figure 411 Predicted vs actual system pressure and backflush flow rate using theMSE loss function

                                                                                    Table 44 Evaluation metrics for the CNN during regression analysis

                                                                                    Loss function of epochs MSE RMSE R2 MAEMAE 756 0007 0086 0876 0025MSE 458 0008 0092 0843 0037

                                                                                    Figures 412 and 413 show the change in binary cross-entropy loss and classifica-tion accuracy over each epoch for the CNN when training the network using thebinary cross-entropy loss function Table 45 contains the values of a number ofclassification error metrics and Tables 46 and 47 contains the confusion matricesfor classification on both datasets from MAE and MSE regression

                                                                                    Figure 412 Binary cross-entropy loss and classification accuracy for the CNN usingMAE regression data

                                                                                    38

                                                                                    42 CNN PERFORMANCE

                                                                                    Figure 413 Binary cross-entropy loss and classification accuracy for the CNN usingMSE regression data

                                                                                    Table 45 Evaluation metrics for the CNN during classification analysis

                                                                                    Regression network of epochs Accuracy AUC F1 log-lossMAE 1203 914 0826 0907 301MSE 1195 933 0791 0926 26027

                                                                                    Table 46 CNN confusion matrix for data from the MAE regression network

                                                                                    PredictionLabel 1 Label 2

                                                                                    Act

                                                                                    ual Label 1 82 29

                                                                                    Label 2 38 631

                                                                                    Table 47 CNN confusion matrix for data from the MSE regression network

                                                                                    PredictionLabel 1 Label 2

                                                                                    Act

                                                                                    ual Label 1 69 41

                                                                                    Label 2 11 659

                                                                                    39

                                                                                    Chapter 5

                                                                                    Discussion amp Conclusion

                                                                                    This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

                                                                                    51 The LSTM Network

                                                                                    511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

                                                                                    Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

                                                                                    The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

                                                                                    41

                                                                                    CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                    while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

                                                                                    512 Classification Analysis

                                                                                    As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

                                                                                    The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

                                                                                    52 The CNN

                                                                                    521 Regression Analysis

                                                                                    The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

                                                                                    Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

                                                                                    42

                                                                                    52 THE CNN

                                                                                    is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

                                                                                    Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

                                                                                    522 Classification Analysis

                                                                                    Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

                                                                                    Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

                                                                                    However the CNN is still doing a good job at predicting future clogging even

                                                                                    43

                                                                                    CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                    up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

                                                                                    53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

                                                                                    54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN; a sketch of this windowing step is given below.
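As an illustration of that pre-processing step, the sketch below (assumed shapes and window lengths, not the exact implementation used in the thesis) builds input windows and future-horizon targets from a multivariate sensor log; a 5-step horizon corresponds to the LSTM setup and a 30-step horizon to the CNN setup.

import numpy as np

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """series: (n_samples, n_features). Returns X of shape (n, lookback, n_features)
    and y of shape (n, horizon, n_features), where y holds the values that follow
    each input window."""
    X, y = [], []
    for start in range(len(series) - lookback - horizon + 1):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback:start + lookback + horizon])
    return np.array(X), np.array(y)

# Example with placeholder data standing in for the sampled BWTS variables.
data = np.random.rand(1000, 4)
X_lstm, y_lstm = make_windows(data, lookback=60, horizon=5)    # LSTM-style interval
X_cnn,  y_cnn  = make_windows(data, lookback=60, horizon=30)   # CNN-style interval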


                                                                                    Chapter 6

                                                                                    Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done at roughly the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could predict multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN given data containing all clogging labels.
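A hedged sketch of what such a multi-step LSTM could look like, using the Keras API with placeholder layer sizes and the window shapes from the pre-processing sketch above (this is not the architecture evaluated in the thesis):

from tensorflow import keras
from tensorflow.keras import layers

lookback, horizon, n_features = 60, 30, 4
model = keras.Sequential([
    layers.Input(shape=(lookback, n_features)),
    layers.LSTM(64),                             # summarises the input window
    layers.Dense(horizon * n_features),
    layers.Reshape((horizon, n_features)),       # one value per variable and future step
])
model.compile(optimizer="adam", loss="mae")
# model.fit(X, y) would then be called with lookback-length windows and 30-step targets.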

It would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better at predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better understood than those of ML models.
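As a starting point for such a comparison, a minimal statsmodels sketch fitting a SARIMA-type model to a single clogging-related signal could look as follows; the series is a placeholder and the model orders are assumptions, not values tuned for the BWTS data.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

dp = np.random.rand(600)                          # placeholder differential-pressure series
model = SARIMAX(dp, order=(2, 1, 2), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=30)              # 30-step horizon, like the CNN
print(forecast[:5])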

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


                                                                                    Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O.F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O.F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Phys. Rev. Lett., 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].


[11] Wikipedia. Ergun equation — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard König. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].


[22] Wikipedia. F1 score — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435, 2019. [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss — Deep Learning Course Wiki - Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Also available as abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, P.J.G. Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory — Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].


[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jerome Pasquet, Nancy Rodriguez, Frederic Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se



                                                                                      Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

                                                                                      Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

                                                                                      However the CNN is still doing a good job at predicting future clogging even

                                                                                      43

                                                                                      CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                      up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

                                                                                      53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

                                                                                      54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

                                                                                      As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

                                                                                      44

                                                                                      Chapter 6

                                                                                      Future Work

                                                                                      In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

                                                                                      For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

                                                                                      On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

                                                                                      Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

                                                                                      45

                                                                                      Bibliography

                                                                                      [1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

                                                                                      [2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

                                                                                      [3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

                                                                                      [4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

                                                                                      [5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

                                                                                      [6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

                                                                                      [7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

                                                                                      [8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

                                                                                      [9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

                                                                                      [10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

                                                                                      47

                                                                                      BIBLIOGRAPHY

                                                                                      [11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

                                                                                      [12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

                                                                                      [13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

                                                                                      [14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

                                                                                      [15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

                                                                                      [16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

                                                                                      [17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

                                                                                      [18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

                                                                                      [19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

                                                                                      [20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

                                                                                      [21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

                                                                                      48

                                                                                      BIBLIOGRAPHY

                                                                                      [22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

                                                                                      [23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

                                                                                      [24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

                                                                                      [25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

                                                                                      [26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

                                                                                      [27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

                                                                                      [28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

                                                                                      [29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

                                                                                      [30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

                                                                                      [31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

                                                                                      [32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

                                                                                      [33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

                                                                                      49

                                                                                      BIBLIOGRAPHY

                                                                                      models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                                                                      [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                                                                      [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                                                                      [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                                                                      [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                                      [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                                      [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                                      [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                                      [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                                      [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                                      [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                                      50

                                                                                      BIBLIOGRAPHY

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. "Advancements in image classification using convolutional neural network". In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct. 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.


TRITA-ITM-EX 2019:606

www.kth.se


4.2 CNN Performance

Figure 4.13: Binary cross-entropy loss and classification accuracy for the CNN using MSE regression data.

Table 4.5: Evaluation metrics for the CNN during classification analysis.

Regression network   No. of epochs   Accuracy   AUC     F1      log-loss
MAE                  1203            91.4%      0.826   0.907   3.01
MSE                  1195            93.3%      0.791   0.926   2.6027

Table 4.6: CNN confusion matrix for data from the MAE regression network.

                         Prediction
                    Label 1    Label 2
Actual   Label 1       82         29
         Label 2       38        631

Table 4.7: CNN confusion matrix for data from the MSE regression network.

                         Prediction
                    Label 1    Label 2
Actual   Label 1       69         41
         Label 2       11        659


                                                                                        Chapter 5

Discussion & Conclusion

This chapter contains a discussion about the results, presents the conclusion of the thesis, and answers the research questions.

5.1 The LSTM Network

5.1.1 Regression Analysis

For the regression analysis, the network trained on the MAE loss function managed to achieve better performance than the network trained on the MSE loss function, as can be noted from the regression metrics in Table 4.1. As the epochs increase, the validation loss fluctuates for the MSE network, while for the MAE network it appears to continue to decrease, as seen in Figure 4.1. This is a potential indicator that the network trained on the MSE loss function may be beginning to overfit on the training data as the epochs increase. The network trained on the MAE loss function, however, could still be learning features in the data, although the validation loss didn't decrease over the last 150 epochs.
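For reference, the two loss functions compared here are the standard mean absolute error and mean squared error; the definitions below are the textbook formulations, not a restatement of the thesis implementation:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}.
\]

Since MSE squares every residual, a handful of large errors can dominate the loss, which is consistent with the outlier sensitivity noted in the next paragraph.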

Observing the results in Figures 4.2 and 4.3, the data predicted by the MAE model fit well to the actual data, as is expected since the model is only outputting a one-step prediction. Some difficulties can be seen when there is a large difference between the actual value and the previous value, as can occur when the dataset changes from one test cycle to the next. The MSE model had a more difficult time coping with the large changes in value that occur in the differential pressure data, which isn't unexpected as that regression model is particularly sensitive to outliers.

The high r²-score of 0.981 for the MAE network supports the claim that the model is particularly good at fitting the unseen data, despite notable differences in the starting and finishing values of the differential pressure, as well as in the system flow rate, between the different test runs. The network could therefore, given the right training data, be sufficient for learning and predicting the patterns of cycles that operate at a higher or lower differential pressure and/or system flow rate. The MSE network, while still achieving a good r²-score of 0.694, would be expected to perform badly for data outside of the ordinary range, as such a model is better suited for normally distributed data where there is one single mode.
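For reference, the r²-scores and error metrics discussed in this chapter can be reproduced from predicted and actual values with scikit-learn. The snippet below is a self-contained illustration on made-up numbers, not the evaluation code used in the thesis.

    import numpy as np
    from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

    # Hypothetical actual and one-step-predicted differential pressure values
    y_true = np.array([0.42, 0.44, 0.47, 0.51, 0.56])
    y_pred = np.array([0.41, 0.45, 0.46, 0.52, 0.58])

    print("r2 :", r2_score(y_true, y_pred))             # coefficient of determination
    print("MAE:", mean_absolute_error(y_true, y_pred))  # mean absolute error
    print("MSE:", mean_squared_error(y_true, y_pred))   # mean squared error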

5.1.2 Classification Analysis

As a classifier for the clogging labels using the binary cross-entropy loss function, the LSTM network was able to achieve a high classification rate of 99.5% fairly quickly, before showing signs of decreasing accuracy and increasing loss on the validation set, as seen in Figure 4.6. The ROC and F1 scores in Table 4.2 also indicate a high classification accuracy for the model, something that can be better observed in the confusion matrix in Table 4.3. Out of the 782 samples in the validation set, only 4 were wrongly classified.
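The classification metrics referred to here (accuracy, AUC, F1-score and the confusion matrix) can be computed as sketched below; the labels and probabilities are invented purely to show the scikit-learn calls, and the 0.5 decision threshold is an assumption rather than something taken from the thesis.

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

    # Hypothetical clogging labels recoded to 0/1 and predicted probabilities for class 1
    y_true = np.array([0, 0, 1, 1, 1, 1, 0, 1])
    y_prob = np.array([0.2, 0.6, 0.8, 0.9, 0.7, 0.95, 0.1, 0.85])
    y_pred = (y_prob >= 0.5).astype(int)

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("AUC     :", roc_auc_score(y_true, y_prob))
    print("F1      :", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class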

The result is a strong argument for using the LSTM as a clogging predictor. However, it must be remembered that each classification is the clogging label at time t, based on the values of all the variables from time t-5 to time t-1, where each integer is a time step. The classifications are therefore, while highly accurate, only predictions of the very next value. The network would consequently be of little use in a setting where the clogging label has to be predicted a long time ahead, but it could prove useful for on-the-spot clogging labelling.
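The one-step-ahead framing described above can be made concrete with a small windowing routine. The function below is a hypothetical sketch, not the pre-processing code from the thesis; only the five-step history is taken from the description above.

    import numpy as np

    def make_windows(features, labels, history=5):
        # The label at time t is paired with the feature values at times t-5 ... t-1
        X, y = [], []
        for t in range(history, len(features)):
            X.append(features[t - history:t])   # window of shape (history, n_features)
            y.append(labels[t])                 # clogging label at time t
        return np.array(X), np.array(y)

    # Hypothetical sensor matrix (e.g. differential pressure and flow rate) and labels 1/2
    features = np.random.rand(100, 2)
    labels = np.random.randint(1, 3, size=100)
    X, y = make_windows(features, labels)
    print(X.shape, y.shape)   # (95, 5, 2) (95,)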

5.2 The CNN

5.2.1 Regression Analysis

The regression analysis of the CNN showed that the MAE loss function performed better than the MSE loss function, as can be seen from the metrics in Table 4.4. While the margin between the two loss functions was not as significant as in the case with the LSTM, the r²-scores were 0.876 and 0.843 respectively, with an overall lower score on all of the other error metrics for the MAE network. As for the training loss and validation loss for both models, Figure 4.7 shows the potential for the validation loss to continue decreasing for the network using the MAE loss function. As in the case with the LSTM networks, the CNN using the MSE loss function shows signs of overfitting on the training data, as the validation loss is increasing.

Figures 4.8, 4.9, 4.10 and 4.11 show that the values predicted by the CNNs follow the actual values fairly well, with the difference that the network using MAE handles them better than the network using MSE. That result is not necessarily surprising, as the input data have not changed, but the fact that the two CNNs are much closer in scores suggests either that making many future predictions is impairing for a network using MAE as loss function, or that it is due to the data being more normally distributed as a result of the convolutional computations. A thing to be noted is that the predictions do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters would have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained with the MAE loss function shows less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the variable values and thus to improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier gives a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than of another. The confusion matrices in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance doesn't mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the two, with the loss increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy: as the majority of the labels are of type 2, the model can achieve good accuracy simply by predicting label 2 repeatedly. To ensure that the classification network is actually learning from the data, and that it isn't overfitting on one particular class, a more balanced dataset would be required.
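One way to quantify, and partly counteract, this imbalance without collecting an entirely new dataset is to weight the classes during training. The counts below are hypothetical (chosen to resemble the confusion matrices above), and the class-weight approach is a suggestion for such an experiment, not something evaluated in the thesis.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Hypothetical label distribution: far more samples of label 2 than of label 1
    labels = np.array([1] * 110 + [2] * 670)

    weights = compute_class_weight(class_weight="balanced", classes=np.array([1, 2]), y=labels)
    print(dict(zip([1, 2], weights)))   # roughly {1: 3.5, 2: 0.58}

    # Note: always predicting label 2 already gives 670/780, roughly 86 %, accuracy,
    # which is why accuracy alone says little about how well label 1 is learned.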

However, the CNN is still doing a good job at predicting future clogging, even up to 30 seconds ahead. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison, though, is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With the classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore the capabilities of ML models and NNs in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
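As a schematic illustration of how the window construction changes with the prediction horizon (the actual strategies are described earlier in the thesis), a single routine parameterised by the horizon could look as follows; a horizon of 1 mirrors the one-step LSTM set-up and a horizon of 30 the CNN set-up, while the history lengths are arbitrary placeholders.

    import numpy as np

    def make_supervised(data, history, horizon):
        # data: array of shape (time steps, variables); returns input windows of
        # `history` past steps and targets covering the next `horizon` steps
        X, Y = [], []
        for t in range(history, len(data) - horizon + 1):
            X.append(data[t - history:t])
            Y.append(data[t:t + horizon])
        return np.array(X), np.array(Y)

    series = np.random.rand(500, 3)                                   # hypothetical sensor matrix
    X_lstm, Y_lstm = make_supervised(series, history=5, horizon=1)    # next-step targets
    X_cnn,  Y_cnn  = make_supervised(series, history=30, horizon=30)  # 30-step-ahead targets
    print(X_lstm.shape, Y_lstm.shape)   # (495, 5, 3) (495, 1, 3)
    print(X_cnn.shape,  Y_cnn.shape)    # (441, 30, 3) (441, 30, 3)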


                                                                                        Chapter 6

                                                                                        Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation of the LSTM and the CNN given data containing all clogging labels.

On the other hand, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that their ins and outs are better known for older statistical models than they are for ML models.
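If such models were evaluated, a univariate ARIMA fit on, for example, the differential-pressure signal could be sketched as below with statsmodels. The synthetic series and the order (2, 1, 1) are placeholders only, not recommendations drawn from the thesis data.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical differential-pressure series sampled once per second
    pressure = np.cumsum(np.random.normal(0.001, 0.01, size=600))

    model = ARIMA(pressure, order=(2, 1, 1))   # (p, d, q) chosen only for illustration
    fitted = model.fit()
    forecast = fitted.forecast(steps=30)       # predict the next 30 seconds
    print(forecast[:5])
    # A SARIMA variant would additionally pass a seasonal_order=(P, D, Q, s) argument.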

Lastly, time criticality in the filter clogging classification, needed in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.





                                                                                          Chapter 5

                                                                                          Discussion amp Conclusion

                                                                                          This chapter contains a discussion about the results presents the conclusion of thethesis and answers the research questions

                                                                                          51 The LSTM Network

                                                                                          511 Regression AnalysisFor the regression analysis the network trained on the MAE loss function managedto achieve better performance than the network trained on the MSE loss functionas can be noted from the regression metrics in Table 41 As the epochs increase thevalidation loss can be noted to be fluctuating for the MSE while the MAE appearsto continue to decrease as seen in Figure 41 This is a potential indicator that thenetwork trained on the MSE loss function may be beginning to overfit on the train-ing data as the epochs increase The network trained on the MAE loss functionhowever could still be learning features in the data although the validation lossdidnrsquot decrease over the last 150 epochs

                                                                                          Observing the result in Figures 42 and 43 the data predicted by the MAE modelfit well to the actual data as is expected since the model is only outputting a onestep prediction Some difficulties can be seen when there is a large difference be-tween the actual value and the previous value as can occur when the dataset changesfrom one test cycle to the next The MSE model had a more difficult time in copingwith a large change in value that occurs in the differential pressure data which isnrsquotunlikely as the regression model is particularly sensitive to outliers

                                                                                          The high r2-score of 0981 for the MAE network supports the claim that the model isparticularly good at fitting the unseen data despite there being notable differencesto the starting and finishing values of the differential pressure as well as the systemflow rate for the different test runs The network could therefore given the righttraining data be sufficient in learning and predicting the patterns for cycles thatoperate at higher or lower differential pressure andor system flow rate The MSE

                                                                                          41

                                                                                          CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                          while still achieving a good r2-score of 0694 would be expected to perform badfor data outside of the ordinary range as such a model is better suited for normallydistributed data where there is one single mode

                                                                                          512 Classification Analysis

                                                                                          As a classifier for the clogging labels using the binary cross-entropy loss function theLSTM network was able to achieve a high classification rate of 995 fairly quicklybefore experiencing signs of decreasing accuracy and increasing loss on the valida-tion set as seen in Figure 46 The ROC and F1 scores in Table 42 also indicatea high classification accuracy of the model something that can be better observedin the confusion matrix in Table 43 Out of the 782 samples in the validation setonly 4 were wrongly classified

                                                                                          The result shows a strong argument for using the LSTM as a clogging predictorHowever it must be remembered that each classification is the clogging label attime t based on the values of all the variables from time t-5 to time t-1 where theinteger is a time step The classifications are therefore while highly accurate onlypredictors of the very next value The network would therefore prove little use ina setting where the clogging label has to be predicted for a long time ahead but itcould prove useful for on-the-spot clogging labelling

                                                                                          52 The CNN

                                                                                          521 Regression Analysis

                                                                                          The regression analysis of the CNN showed that the MAE loss function performedbetter than the MSE loss function as can be seen from the metrics in Table 44While the margin between the two loss functions were not as significant as in thecase with the LSTM the r2-scores were 0876 and 0843 respectively with an overalllower score on all of the other metrics for the MAE network As for the training lossand validation loss for both models Figure 47 shows the potential for the validationloss to continue decreasing for the network using the MAE loss function Like inthe case with the LSTM networks the CNN network using the MSE loss functionis showing signs of overfitting on the training data as the validation loss is shownto be increasing

                                                                                          Figures 48 49 410 and 411 show the values predicted by the CNNs to be follow-ing the actual values fairly well with the difference that the network using MAEhandles them better than the network using MSE That result is not necessarilysurprising as the input data have not changed but the fact that the two CNNs aremuch closer in scores suggests that making many future predictions is impairingfor a network using MAE as loss function or that it is due to the data being morenormally distributed due to the convolutional computations A thing to be noted

                                                                                          42

                                                                                          52 THE CNN

                                                                                          is that the predicted do not follow the actual data to the same oscillatory extentas was observable in the case with the LSTMs which is possibly also a side-effectfrom the feature mapping from the convolutional layer and the max pooling layerNoteworthy is also the fact that the predicted values shown in the plots are thefurthest predictions ie the predicted values the parameters would have in 30 sec-onds showing that while the extremes are never fully captured the models are stillgood at estimating the average system behaviour

                                                                                          Looking at the predicted differential pressure of both models the CNN trained usingthe MAE loss function experiences less overshoot and undershoot than the CNNusing the MSE loss function The overshooting and undershooting could poten-tially lead to erroneous estimation of the values of the variables and thus improperclogging estimation Furthermore the undershooting could potentially result in anunderestimation of the clogging severity

                                                                                          522 Classification Analysis

                                                                                          Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

                                                                                          Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

                                                                                          However the CNN is still doing a good job at predicting future clogging even

                                                                                          43

                                                                                          CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                          up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

                                                                                          53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

                                                                                          54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
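As an indication of what such pre-processing can look like, the following is a hedged sketch under assumptions of mine (one sample per second, a single target value per window) rather than the exact procedure used in the thesis; the lookback and horizon arguments play the role of the 5-second and 30-second prediction intervals discussed above.

```python
# Sketch of sliding-window pre-processing (assumed set-up, not the thesis code):
# each sample uses `lookback` past time steps as input and the sensor values
# `horizon` steps ahead as the target.
import numpy as np

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """series: (n_samples, n_features) array sampled once per second."""
    X, y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback:t])   # past `lookback` seconds of sensor data
        y.append(series[t + horizon - 1])  # values `horizon` seconds into the future
    return np.array(X), np.array(y)

# Example: 5-second history used to predict 30 seconds ahead.
data = np.random.rand(1000, 4)             # 4 hypothetical sensor channels
X, y = make_windows(data, lookback=5, horizon=30)
print(X.shape, y.shape)                    # (966, 5, 4) (966, 4)
```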


                                                                                          Chapter 6

                                                                                          Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the available data. To further evaluate the suitability of ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to observe the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to revisit the optimisation of the LSTM and the CNN with data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the inner workings of older statistical models are better understood than those of ML models.
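A seasonal ARIMA baseline of this kind could be set up with statsmodels as sketched below; the series, the (p, d, q) orders and the seasonal period are placeholders of mine and were not evaluated in the thesis.

```python
# Sketch of a SARIMA baseline (placeholder orders and synthetic data, not thesis results):
# fit one sensor channel, e.g. differential pressure, and forecast 30 steps ahead.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

dp = np.random.rand(600)   # hypothetical differential-pressure series, one sample per second

model = SARIMAX(dp, order=(2, 1, 1), seasonal_order=(1, 0, 1, 60))
result = model.fit(disp=False)
forecast = result.forecast(steps=30)   # 30-second-ahead point forecast
print(forecast[:5])
```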

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if these models are to be used in the BWTS.



TRITA-ITM-EX 2019:606

www.kth.se


                                                                                            Using the data from the regression analysis in the classifier we get a classification ac-curacy of 914 on the data from the CNN using MAE and a classification accuracyof 933 using the data from the CNN using MSE Looking further at the metricsin Table 45 it can be noted that while the accuracy and the F1-score is higherfor the data generated by the MSE regression network in comparison to the datagenerated by the MAE regression network the AUC is the other way around Whilethe AUC-score and the F1-score do not differ drastically in magnitude getting asignificantly lower AUC-score than F1-score is typically due to imbalance in thedataset Imbalance in the dataset occurs when the difference between the positivenumber and negative number of examples is large ie there is significantly moreinstances of one class than another The confusion matrices found in Tables 46 and47 confirm that there are significantly more instances of the class Label 2 than thereis Label 1 in the validation and test data However such an imbalance doesnrsquot meanthat the classifier is doing a bad job Instead itrsquos the contrary the classifier is doinga decent job but AUC-score is not an ideal metric for the existing class distributions

                                                                                            Looking at the binary cross-entropy loss and classification accuracy in Figures 412and 413 it can be noted that there is a slight discrepancy between the cross-entropyloss and classification accuracy where the loss function is increasing while the ac-curacy is also increasing This is particularly true for the validation data Just likein the previous case with the AUC-score and F1-score an imbalanced distributionallows to easily obtain good accuracy As the majority of the labels are of type 2good accuracy can be achieved by the model by predicting 2 repeatedly To ensurethat the classification network is actually learning from the data and that it isnrsquotoverfitting for one certain class a more balanced dataset would be required

                                                                                            However the CNN is still doing a good job at predicting future clogging even

                                                                                            43

                                                                                            CHAPTER 5 DISCUSSION amp CONCLUSION

                                                                                            up to 30 seconds away Some misclassification is present which could be improvedby having specific datasets dedicated to label 1 and label 2

                                                                                            53 Comparison Between Both NetworksThe parameter estimation during both regression and classification is of an overallbetter result for the LSTM than for the CNN However that is to be expected asthe CNN is predicting multiple steps ahead and if the first prediction is off targetthen the following predictions will be impacted by the initial error However a keyfactor to take into account in the comparison is that the CNN makes predictionsfor a more distant time horizon in comparison to the LSTM Being able to predictthe system variables and the state of clogging for 30 seconds or more ahead of timewould ensure higher safety and lessen the chance of pushing the filter to the extremeand potentially damaging or destroying it

                                                                                            54 ConclusionThe purpose of the main research question was to investigate and evaluate thepossibility of using ML to predictively estimate filter clogging The result of theregression and classification analysis show that ML can be used for estimating fil-ter clogging states With classification accuracy for the CNN reaching 914 and933 respectively NNs can be considered a substitute to physics-based modellingAlthough as there were no samples present in the supplied data where a filter wasfully clogged no conclusion can be drawn about the NNs capability in estimatinga fully clogged filter Further evaluation with appropriate data would be requiredto fully explore ML models and NNs capabilities in detecting full filter clogging

                                                                                            As for the second research question that regarded how an NN can be integratedwith a BWTS the conclusion is that the data from the BWTS is required to bepre-processed according to the desired prediction interval before it can be processedby the NN The pre-processing method is dependant on both the NN model of choiceand how many future predictions the model is supposed to generate so no generalsolution can be provided However two pre-processing strategies are proposed inthe thesis for an LSTM and a CNN

                                                                                            44

                                                                                            Chapter 6

                                                                                            Future Work

                                                                                            In this thesis work it was possible to classify two out of the three clogging categoriesfrom the used data To further evaluate the suitability of using ML for cloggingclassification it would be required to perform additional tests with data includingall clogging states Furthermore as all the tests were done around the same levelof system flow rate it would be interesting to see if model performance is enhancedor impaired by training one network on multiple datasets containing different lev-els of system flow rate To add to this performing tests or using data where thewater source contains higher or lower amounts of TSS would be interesting to seethe change in model performance As the presence of TSS greatly affects the filtersclogging speed an added sensor that measures TSS could prove useful for providingthe model with an independent parameter that indicates that filter clogging is morelikely regardless of the workload on the filter

                                                                                            For the neural networks it would be required to see how they perform with allclogging states available in the data as well as how they perform with a more bal-anced dataset ie if classification increases or decreases For the LSTM it wouldbe interesting to see how well it could perform in predicting multiple time stepsahead like the CNN It would also be interesting to evaluate the optimisation prob-lem in optimising the LSTM and CNN by having data containing all clogging labels

                                                                                            On the contrary it would also be of interest to see if more traditional statisticalmodels such as ARIMA and SARIMA perform better for predicting filter cloggingThe plus-side with using these methods is that the ins and outs are better knownfor older statistical models than they are for ML models

                                                                                            Lastly time criticality in the filter clogging classification to avoid complete clog-ging will have to be taken into consideration both when deciding on the type ofstatistical model type of network network architecture and amount of data to beprocessed at a time if they are to be used in the BWTS

                                                                                            45

                                                                                            Bibliography

                                                                                            [1] Maninder Kaur Meghna Dhalaria Pradip Sharma and Jong Park Supervisedmachine-learning predictive analytics for national quality of life scoring AppliedSciences 91613 04 2019

                                                                                            [2] William W Hsieh and Benyang Tang Applying neural network models toprediction and data analysis in meteorology and oceanography Bulletin of theAmerican Meteorological Society 79(9)1855ndash1870 1998

                                                                                            [3] Basket and self-cleaning filters Filtration and Separation 55(4)18 ndash 21 2018

                                                                                            [4] OF Eker Fatih Camci and Ian K Jennions Filter clogging data collection forprognostics In Proceedings of the Annual Conference of the Prognostics andHealth Management Society New Orleans USA pages 14ndash17 2013

                                                                                            [5] OF Eker Faith Camci and Ian K Jennions Physics-based degradation mod-elling for filter clogging In The 2nd European Conference of the Prognosticsand Health Management (PHM) Society Nantes France volume 5 page 20142014

                                                                                            [6] Omer F Eker Fatih Camci and Ian K Jennions Physics-based prognosticmodelling of filter clogging phenomena Mechanical Systems and Signal Pro-cessing 75395 ndash 412 2016

                                                                                            [7] N Roussel Thi Lien Huong Nguyen and P Coussot General probabilisticapproach to the filtration process Phys Rev Lett 98114502 Mar 2007

                                                                                            [8] Richard Wakeman Filter media Testing for liquid filtration Filtration andSeparation 44(3)32 ndash 34 2007

                                                                                            [9] Wikipedia Darcyrsquos law mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiDarcy27s_law[Accessed Apr 26 2019]

                                                                                            [10] Wikipedia KozenyndashCarman equation mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiKozenyE28093Carman_equation[Accessed Apr 26 2019]

                                                                                            47

                                                                                            BIBLIOGRAPHY

                                                                                            [11] Wikipedia Ergun equation mdash Wikipedia the free encyclopedia 2019 Avail-able at httpsenwikipediaorgwikiErgun_equation[Accessed Apr 262019]

                                                                                            [12] B Hadji Misheva P Giudici and V Pediroda Network-based models toimprove credit scoring accuracy In 2018 IEEE 5th International Conferenceon Data Science and Advanced Analytics (DSAA) pages 623ndash630 Oct 2018

                                                                                            [13] K Deepika and S Seema Predictive analytics to prevent and control chronicdiseases In 2016 2nd International Conference on Applied and TheoreticalComputing and Communication Technology (iCATccT) pages 381ndash386 Ben-galuru Karnataka India July 2016

                                                                                            [14] Hamza Belarbi Abdelali Tajmouati Hamid Bennis and M El Haj Tirari Pre-dictive analysis of big data in retail industry In Proceedings of the InternationalConference on Computing Wireless and Communication Systems Larache Mo-rocco 2016

                                                                                            [15] Daniela Borissova Ivan Mustakerov and Lyubka Doukovska Predictive main-tenance sensors placement by combinatorial optimization International Jour-nal of Electronics and Telecommunications 58(2)153 ndash 158 2012

                                                                                            [16] A Kanawaday and A Sane Machine learning for predictive maintenance ofindustrial machines using iot sensor data In 2017 8th IEEE InternationalConference on Software Engineering and Service Science (ICSESS) pages 87ndash90 Beijing China Nov 2017

                                                                                            [17] G A Susto A Schirru S Pampuri S McLoone and A Beghi Machinelearning for predictive maintenance A multiple classifier approach IEEETransactions on Industrial Informatics 11(3)812ndash820 June 2015

                                                                                            [18] Rikard Konig Predictive techniques and methods for decision support in situa-tions with poor data quality 2009 Available at httpurnkbseresolveurn=urnnbnsehbdiva-3517

                                                                                            [19] Foster Provost Tom Fawcett and Ron Kohavi The case against accuracyestimation for comparing induction algorithms In In Proceedings of the Fif-teenth International Conference on Machine Learning pages 445ndash453 MorganKaufmann 1998

                                                                                            [20] Stephan Spiegel Fabian Mueller Dorothea Weismann and John Bird Cost-sensitive learning for predictive maintenance abs180910979 2018 Availableat httpsarxivorgabs150306410[Accessed Oct 03 2019]

                                                                                            [21] David M W Powers What the f-measure doesnrsquot measure Features flawsfallacies and fixes abs150306410 2015 Available at httpsarxivorgabs150306410[Accessed Oct 03 2019]

                                                                                            48

                                                                                            BIBLIOGRAPHY

                                                                                            [22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

                                                                                            [23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

                                                                                            [24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

                                                                                            [25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

                                                                                            [26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

                                                                                            [27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

                                                                                            [28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

                                                                                            [29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

                                                                                            [30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

                                                                                            [31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

                                                                                            [32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

                                                                                            [33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

                                                                                            49

                                                                                            BIBLIOGRAPHY

                                                                                            models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                                                                            [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                                                                            [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                                                                            [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                                                                            [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                                            [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                                            [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                                            [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                                            [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                                            [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                                            [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                                            50

                                                                                            BIBLIOGRAPHY

                                                                                            [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                                            [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                                            [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                                            [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                                            [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                                            [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                                            [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

is that the predicted values do not follow the actual data to the same oscillatory extent as was observable in the case with the LSTMs, which is possibly also a side-effect of the feature mapping from the convolutional layer and the max pooling layer. Noteworthy is also the fact that the predicted values shown in the plots are the furthest predictions, i.e. the values the parameters are predicted to have in 30 seconds, showing that while the extremes are never fully captured, the models are still good at estimating the average system behaviour.

Looking at the predicted differential pressure of both models, the CNN trained using the MAE loss function experiences less overshoot and undershoot than the CNN using the MSE loss function. The overshooting and undershooting could potentially lead to erroneous estimation of the variable values and thus improper clogging estimation. Furthermore, the undershooting could potentially result in an underestimation of the clogging severity.
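The difference between the two loss functions can be illustrated with a small numerical sketch (hypothetical values, not data from the BWTS): MSE penalises one large miss, such as an uncaptured pressure peak, much harder than the same total error spread over many samples, which is consistent with the MSE-trained network being pulled towards the extremes.

import numpy as np

# Toy differential-pressure series (illustrative values only).
actual = np.array([0.20, 0.22, 0.21, 0.45, 0.23])

# Two candidate predictions with the same total absolute error (0.25):
# "spread" misses every point a little, "spike" misses only the pressure peak.
pred_spread = actual + np.array([0.05, 0.05, -0.05, -0.05, 0.05])
pred_spike  = actual + np.array([0.00, 0.00, 0.00, -0.25, 0.00])

mae = lambda y, p: np.mean(np.abs(y - p))
mse = lambda y, p: np.mean((y - p) ** 2)

print("spread:", mae(actual, pred_spread), mse(actual, pred_spread))  # 0.05, 0.0025
print("spike :", mae(actual, pred_spike),  mse(actual, pred_spike))   # 0.05, 0.0125
# MAE ranks the two equally; MSE penalises the concentrated miss at the peak
# five times harder, so an MSE-trained network is pushed to chase the extremes,
# while an MAE-trained one behaves more conservatively around them.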

5.2.2 Classification Analysis

Using the data from the regression analysis in the classifier, we get a classification accuracy of 91.4% on the data from the CNN using MAE and a classification accuracy of 93.3% on the data from the CNN using MSE. Looking further at the metrics in Table 4.5, it can be noted that while the accuracy and the F1-score are higher for the data generated by the MSE regression network than for the data generated by the MAE regression network, the AUC is the other way around. While the AUC-score and the F1-score do not differ drastically in magnitude, getting a significantly lower AUC-score than F1-score is typically due to imbalance in the dataset. Imbalance occurs when the difference between the number of positive and negative examples is large, i.e. there are significantly more instances of one class than another. The confusion matrices found in Tables 4.6 and 4.7 confirm that there are significantly more instances of the class Label 2 than of Label 1 in the validation and test data. However, such an imbalance does not mean that the classifier is doing a bad job. On the contrary, the classifier is doing a decent job, but the AUC-score is not an ideal metric for the existing class distributions.
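As a rough illustration of this effect (illustrative numbers only, not the thesis results), the sketch below scores a classifier on a heavily imbalanced binary label set with scikit-learn. Accuracy and F1 stay high because they are dominated by the majority class, while ROC AUC also rewards separation of the rare class and ends up noticeably lower.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)

# Hypothetical skew: 90 samples of the majority class (Label 2, encoded as 1)
# and 10 of the minority class (Label 1, encoded as 0).
y_true = np.array([1] * 90 + [0] * 10)

# A classifier that scores the majority class confidently but separates the
# minority class poorly.
scores = np.concatenate([rng.uniform(0.7, 1.0, 90), rng.uniform(0.4, 0.9, 10)])
y_pred = (scores >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1      :", f1_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, scores))
print(confusion_matrix(y_true, y_pred))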

Looking at the binary cross-entropy loss and classification accuracy in Figures 4.12 and 4.13, it can be noted that there is a slight discrepancy between the cross-entropy loss and the classification accuracy, where the loss is increasing while the accuracy is also increasing. This is particularly true for the validation data. Just as in the previous case with the AUC-score and F1-score, an imbalanced distribution makes it easy to obtain good accuracy. As the majority of the labels are of type 2, the model can achieve good accuracy simply by predicting 2 repeatedly. To ensure that the classification network is actually learning from the data and is not overfitting to one certain class, a more balanced dataset would be required.
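The majority-class effect is easy to quantify: a baseline that always outputs the dominant label already matches the class frequency in accuracy while being useless in practice. The sketch below uses a hypothetical 90/10 split; the class_weight line assumes a Keras-style fit API and is one common mitigation when a more balanced dataset cannot be collected.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 90 + [0] * 10)      # hypothetical 90/10 class split
y_baseline = np.ones_like(y_true)           # always predict the majority label

print(accuracy_score(y_true, y_baseline))           # 0.90 without learning anything
print(balanced_accuracy_score(y_true, y_baseline))  # 0.50, i.e. chance level

# One mitigation, assuming a Keras-style classifier, is to up-weight the rare
# class during training instead of (or in addition to) rebalancing the data:
# model.fit(X_train, y_train, epochs=20, class_weight={0: 9.0, 1: 1.0})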

However, the CNN is still doing a good job at predicting future clogging even up to 30 seconds away. Some misclassification is present, which could be improved by having specific datasets dedicated to label 1 and label 2.

5.3 Comparison Between Both Networks

The parameter estimation during both regression and classification gives an overall better result for the LSTM than for the CNN. However, that is to be expected, as the CNN is predicting multiple steps ahead, and if the first prediction is off target then the following predictions will be impacted by the initial error. A key factor to take into account in the comparison is that the CNN makes predictions for a more distant time horizon than the LSTM. Being able to predict the system variables and the state of clogging 30 seconds or more ahead of time would ensure higher safety and lessen the chance of pushing the filter to the extreme and potentially damaging or destroying it.
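How quickly an initial miss can grow over a 30-second horizon can be sketched with a toy calculation (hypothetical numbers, and a simplification of how the actual networks operate): if each prediction is fed back as input to the next step, a small per-step error is carried forward and accumulates with the horizon.

import numpy as np

step_error = 0.01                    # hypothetical relative error per one-step prediction
horizons = np.arange(1, 31)          # 1 s ... 30 s ahead

accumulated = step_error * horizons  # the miss is carried into every subsequent input
print(f"error at t+5s:  {accumulated[4]:.2f}")   # ~0.05
print(f"error at t+30s: {accumulated[29]:.2f}")  # ~0.30

# A 1% per-step miss grows to roughly 30% over 30 recursive steps, which is one
# reason longer-horizon forecasts are harder to keep on target than 5-second ones.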

5.4 Conclusion

The purpose of the main research question was to investigate and evaluate the possibility of using ML to predictively estimate filter clogging. The results of the regression and classification analysis show that ML can be used for estimating filter clogging states. With classification accuracy for the CNN reaching 91.4% and 93.3% respectively, NNs can be considered a substitute for physics-based modelling. However, as there were no samples present in the supplied data where a filter was fully clogged, no conclusion can be drawn about the NNs' capability to estimate a fully clogged filter. Further evaluation with appropriate data would be required to fully explore ML models' and NNs' capabilities in detecting full filter clogging.

As for the second research question, which regarded how an NN can be integrated with a BWTS, the conclusion is that the data from the BWTS must be pre-processed according to the desired prediction interval before it can be processed by the NN. The pre-processing method depends both on the NN model of choice and on how many future predictions the model is supposed to generate, so no general solution can be provided. However, two pre-processing strategies are proposed in the thesis, one for an LSTM and one for a CNN.
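A minimal sketch of such horizon-dependent pre-processing is given below. It assumes a generic sliding-window scheme (the window length of 60 samples and the 1 Hz sampling are illustrative assumptions, not the exact strategies used in the thesis): the same sensor log is sliced into supervised input/target pairs, with the target taken 5 or 30 samples ahead depending on which network it is meant for.

import numpy as np

def make_windows(series, input_width, horizon):
    # Slice a (time x features) array into supervised pairs: X holds input_width
    # past samples, y the sample located `horizon` steps after the window ends.
    X, y = [], []
    for start in range(len(series) - input_width - horizon + 1):
        end = start + input_width
        X.append(series[start:end])
        y.append(series[end + horizon - 1])
    return np.array(X), np.array(y)

# Hypothetical 1 Hz sensor log: flow rate, differential pressure, clogging label.
data = np.random.rand(600, 3)

X_lstm, y_lstm = make_windows(data, input_width=60, horizon=5)   # 5 s ahead
X_cnn,  y_cnn  = make_windows(data, input_width=60, horizon=30)  # 30 s ahead
print(X_lstm.shape, y_lstm.shape)   # (536, 60, 3) (536, 3)
print(X_cnn.shape,  y_cnn.shape)    # (511, 60, 3) (511, 3)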


Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the used data. To further evaluate the suitability of using ML for clogging classification, it would be necessary to perform additional tests with data including all clogging states. Furthermore, as all the tests were done around the same level of system flow rate, it would be interesting to see if model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would be interesting in order to see the change in model performance. As the presence of TSS greatly affects the filter's clogging speed, an added sensor that measures TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states available in the data, as well as how they perform with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could perform in predicting multiple time steps ahead, like the CNN. It would also be interesting to evaluate the optimisation problem of optimising the LSTM and the CNN with data containing all clogging labels.

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better for predicting filter clogging. The upside of using these methods is that the ins and outs of older statistical models are better known than they are for ML models.
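Such a baseline is straightforward to set up with an off-the-shelf library. The sketch below fits a SARIMAX model from statsmodels to a synthetic differential-pressure series; the series and the (1, 1, 1) order are placeholder assumptions and would need proper model identification on real BWTS data.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(2)
# Synthetic stand-in for a slowly rising differential-pressure signal at 1 Hz.
dp = 0.2 + np.cumsum(rng.normal(0.001, 0.01, 600))

model = SARIMAX(dp, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
result = model.fit(disp=False)

forecast = result.forecast(steps=30)   # 30-second-ahead baseline forecast
print(forecast[:5])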

Lastly, time criticality in the filter clogging classification, in order to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, type of network, network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.

                                                                                                                                      • The CNN
                                                                                                                                        • Regression Analysis
                                                                                                                                        • Classification Analysis
                                                                                                                                          • Comparison Between Both Networks
                                                                                                                                          • Conclusion
                                                                                                                                            • Future Work
                                                                                                                                            • Bibliography

Chapter 6

Future Work

In this thesis work it was possible to classify two out of the three clogging categories from the data used. To further evaluate the suitability of using ML for clogging classification, additional tests with data covering all clogging states would be required. Furthermore, as all tests were carried out around the same system flow rate, it would be interesting to see whether model performance is enhanced or impaired by training one network on multiple datasets containing different levels of system flow rate. In addition, performing tests, or using data, where the water source contains higher or lower amounts of TSS would show how model performance changes. Since the presence of TSS greatly affects the filter's clogging speed, an added sensor measuring TSS could prove useful for providing the model with an independent parameter indicating that filter clogging is more likely, regardless of the workload on the filter.

For the neural networks, it would be necessary to see how they perform with all clogging states present in the data, as well as with a more balanced dataset, i.e. whether classification accuracy increases or decreases. For the LSTM, it would be interesting to see how well it could predict multiple time steps ahead, as the CNN does. It would also be interesting to revisit the optimisation of the LSTM and the CNN once data containing all clogging labels is available.
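As a rough starting point for such an experiment, the sketch below shows one way a Keras LSTM could be made to emit several future time steps in a single forward pass (a direct multi-step strategy) instead of one step at a time. The window length, horizon, feature count, layer size and the synthetic data are illustrative assumptions, not the configuration used in this thesis.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

window, horizon, n_features = 60, 30, 4   # assumed sizes, not the thesis values

# Synthetic sensor readings standing in for the logged BWTS variables.
series = np.random.rand(5000, n_features).astype("float32")

# Build (input window -> flattened future horizon) training pairs.
n_samples = len(series) - window - horizon
X = np.stack([series[i:i + window] for i in range(n_samples)])
y = np.stack([series[i + window:i + window + horizon].reshape(-1) for i in range(n_samples)])

# One LSTM layer followed by a dense layer that outputs every future step at once.
model = Sequential([
    LSTM(64, input_shape=(window, n_features)),
    Dense(horizon * n_features),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

# Reshape the flat prediction back to (horizon, n_features) for inspection.
prediction = model.predict(X[:1]).reshape(horizon, n_features)
print(prediction.shape)

Predicting the whole horizon in one pass avoids feeding predictions back into the network recursively, where errors would otherwise accumulate over the horizon.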

Conversely, it would also be of interest to see whether more traditional statistical models, such as ARIMA and SARIMA, perform better at predicting filter clogging. The advantage of these methods is that the inner workings of such older statistical models are better understood than those of ML models.
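A minimal sketch of such a baseline, assuming statsmodels and a single differential-pressure signal as the clogging indicator, could look as follows; the synthetic series, the (p, d, q) order and the 30-step horizon are placeholders rather than tuned values.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic differential-pressure signal standing in for the real sensor data:
# a slow clogging trend plus measurement noise.
rng = np.random.default_rng(0)
t = np.arange(2000)
dp = pd.Series(0.01 * t + rng.normal(scale=0.5, size=t.size), name="diff_pressure")

train, test = dp[:-30], dp[-30:]

# Assumed (p, d, q) order; in practice it would be chosen from ACF/PACF plots or an AIC search.
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=30)   # 30-step-ahead prediction over an assumed horizon
mae = np.abs(forecast.values - test.values).mean()
print(f"MAE over the 30-step horizon: {mae:.3f}")

One practical difference from the neural networks is that a (S)ARIMA model of this form handles one signal at a time, so each sensor variable would need its own model, or an extension such as SARIMAX with exogenous regressors.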

Lastly, the time criticality of the filter clogging classification, needed to avoid complete clogging, will have to be taken into consideration when deciding on the type of statistical model, the type of network, the network architecture, and the amount of data to be processed at a time, if they are to be used in the BWTS.


Bibliography

[1] Maninder Kaur, Meghna Dhalaria, Pradip Sharma, and Jong Park. Supervised machine-learning predictive analytics for national quality of life scoring. Applied Sciences, 9:1613, 04 2019.

[2] William W. Hsieh and Benyang Tang. Applying neural network models to prediction and data analysis in meteorology and oceanography. Bulletin of the American Meteorological Society, 79(9):1855–1870, 1998.

[3] Basket and self-cleaning filters. Filtration and Separation, 55(4):18–21, 2018.

[4] O. F. Eker, Fatih Camci, and Ian K. Jennions. Filter clogging data collection for prognostics. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, USA, pages 14–17, 2013.

[5] O. F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based degradation modelling for filter clogging. In The 2nd European Conference of the Prognostics and Health Management (PHM) Society, Nantes, France, volume 5, 2014.

[6] Omer F. Eker, Fatih Camci, and Ian K. Jennions. Physics-based prognostic modelling of filter clogging phenomena. Mechanical Systems and Signal Processing, 75:395–412, 2016.

[7] N. Roussel, Thi Lien Huong Nguyen, and P. Coussot. General probabilistic approach to the filtration process. Physical Review Letters, 98:114502, Mar 2007.

[8] Richard Wakeman. Filter media: Testing for liquid filtration. Filtration and Separation, 44(3):32–34, 2007.

[9] Wikipedia. Darcy's law – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Darcy%27s_law [Accessed Apr 26, 2019].

[10] Wikipedia. Kozeny–Carman equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Kozeny%E2%80%93Carman_equation [Accessed Apr 26, 2019].

[11] Wikipedia. Ergun equation – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Ergun_equation [Accessed Apr 26, 2019].

[12] B. Hadji Misheva, P. Giudici, and V. Pediroda. Network-based models to improve credit scoring accuracy. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 623–630, Oct 2018.

[13] K. Deepika and S. Seema. Predictive analytics to prevent and control chronic diseases. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 381–386, Bengaluru, Karnataka, India, July 2016.

[14] Hamza Belarbi, Abdelali Tajmouati, Hamid Bennis, and M. El Haj Tirari. Predictive analysis of big data in retail industry. In Proceedings of the International Conference on Computing Wireless and Communication Systems, Larache, Morocco, 2016.

[15] Daniela Borissova, Ivan Mustakerov, and Lyubka Doukovska. Predictive maintenance sensors placement by combinatorial optimization. International Journal of Electronics and Telecommunications, 58(2):153–158, 2012.

[16] A. Kanawaday and A. Sane. Machine learning for predictive maintenance of industrial machines using IoT sensor data. In 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pages 87–90, Beijing, China, Nov 2017.

[17] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, and A. Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, June 2015.

[18] Rikard Konig. Predictive techniques and methods for decision support in situations with poor data quality, 2009. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-3517.

[19] Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann, 1998.

[20] Stephan Spiegel, Fabian Mueller, Dorothea Weismann, and John Bird. Cost-sensitive learning for predictive maintenance. abs/1809.10979, 2018. Available at https://arxiv.org/abs/1809.10979 [Accessed Oct 03, 2019].

[21] David M. W. Powers. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. abs/1503.06410, 2015. Available at https://arxiv.org/abs/1503.06410 [Accessed Oct 03, 2019].

[22] Wikipedia. F1 score – Wikipedia, the free encyclopedia, 2019. Available at http://en.wikipedia.org/w/index.php?title=F1%20score&oldid=874064435 [Online; accessed 08-May-2019].

[23] Deep Learning Course Wiki. Log Loss – Deep Learning Course Wiki, Fast.ai, 2019. Available at http://wiki.fast.ai/index.php/Log_Loss [Accessed Oct 03, 2019].

[24] Weijie Wang and Yanmin Lu. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conference Series: Materials Science and Engineering, 324:012049, Kuala Lumpur, Malaysia, 03 2018.

[25] Alexei Botchkarev. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdisciplinary Journal of Information, Knowledge, and Management, 14:45–79, 2019. Also available as abs/1809.03006, 2018.

[26] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3):1247–1250, 2014.

[27] Tom Fomby. Scoring measures for prediction problems. Presentation by T. Fomby at Southern Methodist University, Dallas, TX 75275, 2006.

[28] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32:669–679, 07 2016.

[29] Wikipedia. Coefficient of determination – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination [Accessed Oct 03, 2019].

[30] Ratnadip Adhikari and R. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013.

[31] Wikipedia. Artificial neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Artificial_neural_network [Accessed Sept 4, 2019].

[32] Wikipedia. Perceptron – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Perceptron [Accessed Oct 03, 2019].

[33] Bourdes Valerie, Stephane Bonnevay, Pjg Lisboa, Defrance Remy, David Perol, Chabaud Sylvie, Bachelot Thomas, Gargi Therese, and Negrier Sylvie. Comparison of artificial neural network with logistic regression as classification models for variable selection for prediction of breast cancer patient outcomes. Advances in Artificial Neural Systems, 2010, Article ID 309841, 06 2010.

[34] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). abs/1803.08375, 2018. Available at https://arxiv.org/abs/1803.08375 [Accessed Oct 10, 2019].

[35] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. abs/1710.05941, 2017. Available at https://arxiv.org/abs/1710.05941 [Accessed Oct 10, 2019].

[36] Tomasz Szandała. Benchmarking comparison of Swish vs. other activation functions on CIFAR-10 imageset. In Wojciech Zamojski, Jacek Mazurkiewicz, Jarosław Sugier, Tomasz Walkowiak, and Janusz Kacprzyk, editors, Engineering in Dependability of Computer Systems and Networks, pages 498–505, Cham, 2020. Springer International Publishing.

[37] Wikipedia. Feedforward neural network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Feedforward_neural_network [Accessed Oct 03, 2019].

[38] Wikipedia. Radial basis function network – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Radial_basis_function_network [Accessed Oct 03, 2019].

[39] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Available at http://www.deeplearningbook.org [Accessed Oct 03, 2019].

[40] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[41] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2654–2662, Cambridge, MA, USA, 2014. MIT Press.

[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.

[43] Wikipedia. Long short-term memory – Wikipedia, the free encyclopedia, 2019. Available at https://en.wikipedia.org/wiki/Long_short-term_memory [Accessed Oct 03, 2019].

[44] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems 2014 Workshop on Deep Learning, December 2014.

[45] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[46] A. Sufian, F. Sultana, and P. Dutta. Advancements in image classification using convolutional neural network. In 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 122–129, Nov 2018.

[47] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. abs/1702.01923, 2017. Available at http://arxiv.org/abs/1702.01923 [Accessed Oct 03, 2019].

[48] Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Burlingame, CA, United States, January 2019.

[49] Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018. Available at http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-237426.

[50] Kedar Potdar, Taher Pardawala, and Chinmay Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175:7–9, 10 2017.

TRITA-ITM-EX 2019:606

www.kth.se


                                                                                                      [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                                                      [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                                                      [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                                                      [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                                                      [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                                                      [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                                                      [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                                                      50

                                                                                                      BIBLIOGRAPHY

                                                                                                      [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                                                      [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                                                      [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                                                      [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                                                      [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                                                      [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                                                      [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                                                      51

TRITA-ITM-EX 2019:606

www.kth.se

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography

                                                                                                        BIBLIOGRAPHY

                                                                                                        [22] Wikipedia F1 score mdash Wikipedia the free encyclopedia httpenwikipediaorgwindexphptitle=F120scoreampoldid=874064435 2019[Online accessed 08-May-2019]

                                                                                                        [23] Deep Learning Course Wiki Log Loss mdash Deep Learning Course Wiki - Fastai 2019 Available at httpwikifastaiindexphpLog_Loss[Accessed Oct03 2019]

                                                                                                        [24] Weijie Wang and Yanmin Lu Analysis of the mean absolute error (mae) and theroot mean square error (rmse) in assessing rounding model IOP ConferenceSeries Materials Science and Engineering 324012049 03 Kuala LumpurMalaysia 2018

                                                                                                        [25] Alexei Botchkarev Performance metrics (error measures) in machine learningregression forecasting and prognostics Properties and typology Interdisci-plinary Journal of Information Knowledge and Management 2019 14 45-79abs180903006 2018

                                                                                                        [26] T Chai and R R Draxler Root mean square error (rmse) or mean absoluteerror (mae) ndash arguments against avoiding rmse in the literature GeoscientificModel Development 7(3)1247ndash1250 2014

                                                                                                        [27] Tom Fomby Scoring measures for prediction problems Presentation by TFomby at Southern Methodist University Dallas TX 75275 175 60806 and2408 2006

                                                                                                        [28] Sungil Kim and Heeyoung Kim A new metric of absolute percentage error forintermittent demand forecasts International Journal of Forecasting 32669ndash679 07 2016

                                                                                                        [29] Wikipedia Coefficient of Determination mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiCoefficient_of_determination[Accessed Oct 03 2019]

                                                                                                        [30] Ratnadip Adhikari and R Agrawal An Introductory Study on Time seriesModeling and Forecasting LAP LAMBERT Academic Publishing 2013

                                                                                                        [31] Wikipedia Artificial neural network mdash Wikipedia the free encyclope-dia 2019 Available at httpsenpediaorgwikiArtificial_neural_network[Accessed Sept 4 2019]

                                                                                                        [32] Wikipedia Perceptron mdash Wikipedia the free encyclopedia 2019 Available athttpsenwikipediaorgwikiPerceptron[Accessed Oct 03 2019]

                                                                                                        [33] Bourdes Valerie Stephane Bonnevay Pjg Lisboa Defrance Remy DavidPerol Chabaud Sylvie Bachelot Thomas Gargi Therese and Negrier SylvieComparison of artificial neural network with logistic regression as classification

                                                                                                        49

                                                                                                        BIBLIOGRAPHY

                                                                                                        models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                                                                                        [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                                                                                        [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                                                                                        [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                                                                                        [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                                                        [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                                                        [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                                                        [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                                                        [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                                                        [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                                                        [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                                                        50

                                                                                                        BIBLIOGRAPHY

                                                                                                        [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                                                        [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                                                        [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                                                        [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                                                        [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                                                        [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                                                        [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                                                        51

                                                                                                        TRITA TRITA-ITM-EX 2019606

                                                                                                        wwwkthse

                                                                                                        • Introduction
                                                                                                          • Background
                                                                                                          • Problem Description
                                                                                                          • Purpose Definitions amp Research Questions
                                                                                                          • Scope and Delimitations
                                                                                                          • Method Description
                                                                                                            • Frame of Reference
                                                                                                              • Filtration amp Clogging Indicators
                                                                                                                • Basket Filter
                                                                                                                • Self-Cleaning Basket Filters
                                                                                                                • Manometer
                                                                                                                • The Clogging Phenomena
                                                                                                                • Physics-based Modelling
                                                                                                                  • Predictive Analytics
                                                                                                                    • Classification Error Metrics
                                                                                                                    • Regression Error Metrics
                                                                                                                    • Stochastic Time Series Models
                                                                                                                      • Neural Networks
                                                                                                                        • Overview
                                                                                                                        • The Perceptron
                                                                                                                        • Activation functions
                                                                                                                        • Neural Network Architectures
                                                                                                                            • Experimental Development
                                                                                                                              • Data Gathering and Processing
                                                                                                                              • Model Generation
                                                                                                                                • Regression Processing with the LSTM Model
                                                                                                                                • Regression Processing with the CNN Model
                                                                                                                                • Label Classification
                                                                                                                                  • Model evaluation
                                                                                                                                  • Hardware Specifications
                                                                                                                                    • Results
                                                                                                                                      • LSTM Performance
                                                                                                                                      • CNN Performance
                                                                                                                                        • Discussion amp Conclusion
                                                                                                                                          • The LSTM Network
                                                                                                                                            • Regression Analysis
                                                                                                                                            • Classification Analysis
                                                                                                                                              • The CNN
                                                                                                                                                • Regression Analysis
                                                                                                                                                • Classification Analysis
                                                                                                                                                  • Comparison Between Both Networks
                                                                                                                                                  • Conclusion
                                                                                                                                                    • Future Work
                                                                                                                                                    • Bibliography

                                                                                                          BIBLIOGRAPHY

                                                                                                          models for variable selection for prediction of breast cancer patient outcomesAdvances in Artificial Neural Systems 201011 Article ID 309841 06 2010

                                                                                                          [34] Abien Fred Agarap Deep learning using rectified linear units (relu)abs180308375 2018 Available at httpsarxivorgabs180308375[Accessed Oct 10 2019]

                                                                                                          [35] Prajit Ramachandran Barret Zoph and Quoc V Le Searching for activationfunctions abs171005941 2017 Available at httpsarxivorgabs171005941[Accessed Oct 10 2019

                                                                                                          [36] Tomasz Szanda la Benchmarking comparison of swish vs other activation func-tions on cifar-10 imageset In Wojciech Zamojski Jacek Mazurkiewicz Jaros lawSugier Tomasz Walkowiak and Janusz Kacprzyk editors Engineering in De-pendability of Computer Systems and Networks pages 498ndash505 Cham 2020Springer International Publishing

                                                                                                          [37] Wikipedia Feedforward neural network mdash Wikipedia the free encyclopedia2019 Available at httpsenwikipediaorgwikiFeedforward_neural_network[Accessed Oct 03 2019]

                                                                                                          [38] Wikipedia Radial basis function network mdash Wikipedia the free encyclo-pedia 2019 Available at httpsenwikipediaorgwikiRadial_basis_function_network[Accessed Oct 03 2019]

                                                                                                          [39] Ian Goodfellow Yoshua Bengio and Aaron Courville Deep Learning MITPress 2016 Available at httpwwwdeeplearningbookorg[Accessed Oct03 2019

                                                                                                          [40] Sergey Zagoruyko and Nikos Komodakis Wide residual networks In EdwinR Hancock Richard C Wilson and William A P Smith editors Proceedingsof the British Machine Vision Conference (BMVC) pages 871ndash8712 BMVAPress September 2016

                                                                                                          [41] Lei Jimmy Ba and Rich Caruana Do deep nets really need to be deep InProceedings of the 27th International Conference on Neural Information Pro-cessing Systems - Volume 2 NIPSrsquo14 pages 2654ndash2662 Cambridge MA USA2014 MIT Press

                                                                                                          [42] Sepp Hochreiter The vanishing gradient problem during learning recurrentneural nets and problem solutions International Journal of Uncertainty Fuzzi-ness and Knowledge-Based Systems 6107ndash116 04 1998

                                                                                                          [43] Wikipedia Long short-term memory mdash Wikipedia the free encyclope-dia 2019 Available at httpsenwikipediaorgwikiLong_short-term_memory[Accessed Oct 03 2019]

                                                                                                          50

                                                                                                          BIBLIOGRAPHY

                                                                                                          [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                                                          [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                                                          [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                                                          [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                                                          [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                                                          [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                                                          [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                                                          51

                                                                                                          TRITA TRITA-ITM-EX 2019606

                                                                                                          wwwkthse

                                                                                                          • Introduction
                                                                                                            • Background
                                                                                                            • Problem Description
                                                                                                            • Purpose Definitions amp Research Questions
                                                                                                            • Scope and Delimitations
                                                                                                            • Method Description
                                                                                                              • Frame of Reference
                                                                                                                • Filtration amp Clogging Indicators
                                                                                                                  • Basket Filter
                                                                                                                  • Self-Cleaning Basket Filters
                                                                                                                  • Manometer
                                                                                                                  • The Clogging Phenomena
                                                                                                                  • Physics-based Modelling
                                                                                                                    • Predictive Analytics
                                                                                                                      • Classification Error Metrics
                                                                                                                      • Regression Error Metrics
                                                                                                                      • Stochastic Time Series Models
                                                                                                                        • Neural Networks
                                                                                                                          • Overview
                                                                                                                          • The Perceptron
                                                                                                                          • Activation functions
                                                                                                                          • Neural Network Architectures
                                                                                                                              • Experimental Development
                                                                                                                                • Data Gathering and Processing
                                                                                                                                • Model Generation
                                                                                                                                  • Regression Processing with the LSTM Model
                                                                                                                                  • Regression Processing with the CNN Model
                                                                                                                                  • Label Classification
                                                                                                                                    • Model evaluation
                                                                                                                                    • Hardware Specifications
                                                                                                                                      • Results
                                                                                                                                        • LSTM Performance
                                                                                                                                        • CNN Performance
                                                                                                                                          • Discussion amp Conclusion
                                                                                                                                            • The LSTM Network
                                                                                                                                              • Regression Analysis
                                                                                                                                              • Classification Analysis
                                                                                                                                                • The CNN
                                                                                                                                                  • Regression Analysis
                                                                                                                                                  • Classification Analysis
                                                                                                                                                    • Comparison Between Both Networks
                                                                                                                                                    • Conclusion
                                                                                                                                                      • Future Work
                                                                                                                                                      • Bibliography

                                                                                                            BIBLIOGRAPHY

                                                                                                            [44] Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio Em-pirical evaluation of gated recurrent neural networks on sequence modelingIn Neural Information Processing Systems 2014 Workshop on Deep LearningDecember 2014 2014

                                                                                                            [45] Dominik Scherer Andreas Muller and Sven Behnke Evaluation of pooling op-erations in convolutional architectures for object recognition In KonstantinosDiamantaras Wlodek Duch and Lazaros S Iliadis editors Artificial NeuralNetworks ndash ICANN 2010 pages 92ndash101 Berlin Heidelberg 2010 SpringerBerlin Heidelberg

                                                                                                            [46] A Sufian F Sultana and P Dutta ldquoadvancements in image classification us-ing convolutional neural networkrdquo In 2018 Fourth International Conference onResearch in Computational Intelligence and Communication Networks (ICR-CICN) page pp 122ndash129 Nov 2018

                                                                                                            [47] Wenpeng Yin Katharina Kann Mo Yu and Hinrich Schutze Comparativestudy of CNN and RNN for natural language processing abs170201923 2017Available at httparxivorgabs170201923[Accessed Oct 03 2019]

                                                                                                            [48] Anthony Brunel Johanna Pasquet Jerome Pasquet Nancy RodriguezFrederic Comby Dominique Fouchez and Marc Chaumont A CNN adaptedto time series for the classification of Supernovae In Electronic ImagingBurlingame CA United States January 2019

                                                                                                            [49] Cedric Seger An investigation of categorical variable encoding techniques inmachine learning binary versus one-hot and feature hashing 2018 Availableat httpurnkbseresolveurn=urnnbnsekthdiva-237426

                                                                                                            [50] Kedar Potdar Taher Pardawala and Chinmay Pai A comparative study ofcategorical variable encoding techniques for neural network classifiers Inter-national Journal of Computer Applications 1757ndash9 10 2017

                                                                                                            51

                                                                                                            TRITA TRITA-ITM-EX 2019606

                                                                                                            wwwkthse

                                                                                                            • Introduction
                                                                                                              • Background
                                                                                                              • Problem Description
                                                                                                              • Purpose Definitions amp Research Questions
                                                                                                              • Scope and Delimitations
                                                                                                              • Method Description
                                                                                                                • Frame of Reference
                                                                                                                  • Filtration amp Clogging Indicators
                                                                                                                    • Basket Filter
                                                                                                                    • Self-Cleaning Basket Filters
                                                                                                                    • Manometer
                                                                                                                    • The Clogging Phenomena
                                                                                                                    • Physics-based Modelling
                                                                                                                      • Predictive Analytics
                                                                                                                        • Classification Error Metrics
                                                                                                                        • Regression Error Metrics
                                                                                                                        • Stochastic Time Series Models
                                                                                                                          • Neural Networks
                                                                                                                            • Overview
                                                                                                                            • The Perceptron
                                                                                                                            • Activation functions
                                                                                                                            • Neural Network Architectures
                                                                                                                                • Experimental Development
                                                                                                                                  • Data Gathering and Processing
                                                                                                                                  • Model Generation
                                                                                                                                    • Regression Processing with the LSTM Model
                                                                                                                                    • Regression Processing with the CNN Model
                                                                                                                                    • Label Classification
                                                                                                                                      • Model evaluation
                                                                                                                                      • Hardware Specifications
                                                                                                                                        • Results
                                                                                                                                          • LSTM Performance
                                                                                                                                          • CNN Performance
                                                                                                                                            • Discussion amp Conclusion
                                                                                                                                              • The LSTM Network
                                                                                                                                                • Regression Analysis
                                                                                                                                                • Classification Analysis
                                                                                                                                                  • The CNN
                                                                                                                                                    • Regression Analysis
                                                                                                                                                    • Classification Analysis
                                                                                                                                                      • Comparison Between Both Networks
                                                                                                                                                      • Conclusion
                                                                                                                                                        • Future Work
                                                                                                                                                        • Bibliography

                                                                                                              TRITA TRITA-ITM-EX 2019606

                                                                                                              wwwkthse

• Introduction
  • Background
  • Problem Description
  • Purpose, Definitions & Research Questions
  • Scope and Delimitations
  • Method Description
• Frame of Reference
  • Filtration & Clogging Indicators
    • Basket Filter
    • Self-Cleaning Basket Filters
    • Manometer
    • The Clogging Phenomena
    • Physics-based Modelling
  • Predictive Analytics
    • Classification Error Metrics
    • Regression Error Metrics
    • Stochastic Time Series Models
  • Neural Networks
    • Overview
    • The Perceptron
    • Activation Functions
    • Neural Network Architectures
• Experimental Development
  • Data Gathering and Processing
  • Model Generation
    • Regression Processing with the LSTM Model
    • Regression Processing with the CNN Model
    • Label Classification
  • Model Evaluation
  • Hardware Specifications
• Results
  • LSTM Performance
  • CNN Performance
• Discussion & Conclusion
  • The LSTM Network
    • Regression Analysis
    • Classification Analysis
  • The CNN
    • Regression Analysis
    • Classification Analysis
  • Comparison Between Both Networks
  • Conclusion
• Future Work
• Bibliography
