
Master thesis, 300 hp
Master of Science in Engineering Physics
Spring term 2021

PHYSICS-INFORMED NEURAL NETWORKS FOR BIOPHARMA APPLICATIONS

Linnéa Cedergren


Physics-Informed Neural Networks for Biopharma Applications

Copyright © Linnéa Cedergren, 2021.

Author: Linnéa Cedergren, [email protected]

Supervisors: Rickard Sjögren, Sartorius Stedim Data Analytics AB
             Brandon Corbett, Sartorius Canada Inc.
             Adrian Hjältén, Umeå University

Examiner: Martin Servin, Umeå University

Master of Science Thesis in Engineering Physics, 30 ECTS
Department of Physics
Umeå University
SE-901 87 Umeå, Sweden


Abstract

Physics-Informed Neural Networks (PINNs) are hybrid models that incorporate differential equations into the training of neural networks, with the aim of bringing the best of both worlds. This project used a mathematical model describing a Continuous Stirred-Tank Reactor (CSTR) to test two possible applications of PINNs. The first type of PINN was trained to predict an unknown reaction rate law, based only on the differential equation and a time series of the reactor state. The resulting model was used inside a multi-step solver to simulate the system state over time. The results showed that the PINN could accurately model the behaviour of the missing physics also for new initial conditions. However, the model suffered from extrapolation error when tested on a larger reactor with a much lower reaction rate. Comparisons between using a numerical derivative or automatic differentiation in the loss equation indicated that the latter is more robust to noise, and thus likely the better choice for real applications. A second type of PINN was trained to forecast the system state one step ahead based on previous states and other known model parameters. An ordinary feed-forward neural network with an equal architecture was used as baseline. The second type of PINN did not outperform the baseline network. Further studies are needed to conclude if or when a physics-informed loss should be used in autoregressive applications.


Contents

1 Introduction

2 Theory
  2.1 Feed-Forward Neural Networks
  2.2 Training a Feed-Forward Neural Network
    2.2.1 The Forward Pass
    2.2.2 Computation of Loss
    2.2.3 Backpropagation
    2.2.4 Gradient Descent
    2.2.5 Division of Dataset into Training, Validation and Testing
    2.2.6 Feature Standardization
    2.2.7 Automatic Differentiation
  2.3 Physics-Informed Neural Networks
  2.4 Missing Physics in Differential Equations
  2.5 Numerical Methods for Ordinary Differential Equations
  2.6 Using a Neural Network to Predict One-Step-Ahead
  2.7 Statistical Inference

3 Method
  3.1 Continuous Stirred-Tank Reactor Model
  3.2 Datasets
  3.3 Networks for Predicting Missing Reaction Rate Law ra
    3.3.1 Comparison Between the ra Models
  3.4 One-Step-Ahead Network
    3.4.1 Using a Numerical Derivative in the Loss
    3.4.2 Comparisons Between the One-Step-Ahead Models

4 Results
  4.1 Predicting the Reaction Rate Law ra
    4.1.1 Simulations with New System Settings
    4.1.2 Using Fewer Measurements per Batch
    4.1.3 A Comparison Between Different Noise Levels
    4.1.4 Using Fewer Training Batches
  4.2 One-Step-Ahead Predictions
    4.2.1 Simulations for Multiple Time Steps

5 Discussion
  5.1 Predicting the Reaction Rate Law ra
  5.2 Predicting One-Step-Ahead
  5.3 Conclusions


1 Introduction

Deep learning methods based on artificial neural networks (ANNs) have proven to be useful in a variety of biopharma applications. Examples include biomedical imaging [1], drug discovery, pharmaceutical product development, clinical trial design, pharmaceutical manufacturing [2][3] and bioprocess monitoring [4].

An issue with ANNs is that the whole model is a black box and not interpretable [5]. This can be beneficial if no information about the system is available, but usually the mechanics are known to some extent. Fully mechanistic models, on the other hand, are fully interpretable, although they might not capture the whole complexity of the system. For example, incomplete knowledge about metabolite reactions in a bioreactor can result in large prediction errors [6]. Furthermore, some parts of the equations might be unknown and impossible to derive from first principles [7].

Combining mechanistic models with data-driven methods is referred to as hybrid modelling. Hybrid models have outperformed previously used statistical models, for example when predicting the time evolution of a bioprocess [8]. A new way of combining mechanistic equations and data-driven methods is the so-called Physics-Informed Neural Network (PINN). The main idea of PINNs is to include the information given by differential equations in the training of the neural network [9]. This can enable models to predict quantities that are not directly measurable, or help improve a model's generalizability by providing more information. Furthermore, the biopharma industry often works with small datasets, due to the high cost of data acquisition. PINNs could reduce the need for large training sets compared to previously used ANNs [10].

This project aims to investigate possible benefits and drawbacks of PINNs for problems relevant to the biopharma industry. More specifically, this project will use a Continuous Stirred-Tank Reactor model to test two possible applications. To begin with, a PINN will be implemented to predict an unknown and hidden reaction rate law. Furthermore, a PINN will be trained to predict future reactor states based on the previous ones. To further limit the complexity of the project, only feed-forward neural networks are used. While recurrent neural networks (RNNs) have proven to be useful for time series modelling [11] and for solving ordinary differential equations [12], the use of such networks is considered beyond the scope of this project.


2 Theory

The following sections will provide the theory behind artificial neural networks in general and physics-informed neural networks in particular.

2.1 Feed-Forward Neural Networks

A neural network is built up of simple units that take input values, apply a function to them and output the result. The units are often referred to as nodes and are arranged in layers, with each layer connected to the previous and following ones. Each node takes a weighted combination of the previous layer's outputs and applies a mathematical function to this linear combination. The resulting output from the node is then sent forward to the subsequent layer [13, p. 169]. Figure 1 shows a schematic picture of a node, which takes as input the linear combination

z = wᵀx + b,    (1)

where w are called the weights, b the bias and x is the previous layer's output. The output from the node is

y = φ(z),    (2)

where φ is a nonlinear function called the activation function. The non-linearity of the activation function enables the network to map more complex relations [13, p. 172].

Figure 1: A sketch of a single node which inputs the linear combination x1w1 + x2w2 + x3w3 + b and passes it through the activation function φ.

Common examples of activation functions are the Rectified Linear Unit (ReLU),

φ(z) = max(0, z)    (3)

and the Exponential Linear Unit (ELU) [14],

φ(z) = { exp(z) − 1,  z < 0
       { z,           z ≥ 0.    (4)

Figure 2 represents a simple form of a neural network, called a feed-forward neural network (FFNN), as a directed acyclic graph. The example in the figure has one input layer, two hidden layers and one output layer. The number of neurons in the hidden layers is called the width of the model [13, p. 169], giving the network in figure 2 a width of 3.


Figure 2: A graph of a FFNN with 2 input nodes, 2 hidden layers with 3 nodes each and 2 output nodes.
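As an illustration, the sketch below builds a small FFNN with the same layout as figure 2 in PyTorch (the library used later in this project). The layer sizes are taken from the figure and are not the ones used in the experiments.

    import torch
    import torch.nn as nn

    # A small FFNN matching figure 2: 2 inputs, two hidden layers of
    # width 3 with ELU activations, and 2 outputs.
    model = nn.Sequential(
        nn.Linear(2, 3),   # input layer -> first hidden layer, z = Wx + b
        nn.ELU(),          # activation function phi(z)
        nn.Linear(3, 3),   # second hidden layer
        nn.ELU(),
        nn.Linear(3, 2),   # output layer
    )

    x = torch.randn(5, 2)  # a batch of 5 input samples
    y = model(x)           # forward pass, output shape (5, 2)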

2.2 Training a Feed-Forward Neural Network

The goal of the training in a regression problem is for the network to approximate the true map between the input and output, with as high accuracy as possible [13, p. 168]. The weights are set randomly at initialization, and are adjusted as the model is trained to find the weights that minimize the error.

Training a FFNN is an iterative process, consisting of four main steps:

1. Forward pass

2. Loss computation

3. Backpropagation


4. Weight optimization.

The combination of all four steps constitutes an epoch.

2.2.1 The Forward Pass

The forward pass is when the input data is fed through the input layer and passed forward through the hidden layers before it finally passes the output layer. The output value is the network's prediction of the target value, which is then compared to the target. In other words, it is by feeding input forward through the network that the network makes a prediction.

2.2.2 Computation of Loss

The distance between the target value and the network prediction can be measured in several ways. A common loss function for regression problems is the mean squared error,

MSE = (1/N_x) ∑_{i=1}^{N_x} ( x_i^NN − x_i )²,    (5)

for the network prediction x^NN and the actual target value x. In this report, this type of loss is sometimes referred to as the MSE data loss, since it involves only the target value x and not a differential equation.

2.2.3 Backpropagation

During the backpropagation, or backward pass, one computes how the loss depends on the network weights. More specifically, one computes the gradients of the loss with respect to all the different network weights. This is done by utilizing the chain rule of calculus. Let y = f(g(x)) and z = g(x), then

dy/dx = (dy/dz)(dz/dx),    (6)

by the chain rule.

To begin with, consider a very simplified network example with only one node in each layer. The final node inputs the linear combination z and passes it through an activation function φ to give the output y,

z = xw + b,    (7)
y = φ(z).    (8)


Let the loss function be MSE, so that

L(w) = ( y(z(w)) − y* )²,    (9)

where y* is the target value. If we wish to compute the derivative of the loss with respect to the weight w, we can apply the chain rule,

∂L/∂w = (∂L/∂y)(∂y/∂z)(∂z/∂w).    (10)

Since z depends linearly on w, we have by inspection that ∂z/∂w = x. Furthermore, ∂y/∂z is just the derivative of the activation function with respect to its input. If we for simplicity choose ReLU,

∂y/∂z = ∂φ/∂z = { 0,  z < 0
                 { 1,  z ≥ 0.    (11)

Finally,

∂L/∂y = 2(y − y*),    (12)

which gives that

∂L/∂w = { 0,            z < 0
         { 2(y − y*)x,  z ≥ 0.    (13)

If the input x in turn is the output from a previous node, one can apply the same framework recursively.
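As a minimal sketch of this reasoning, the snippet below lets PyTorch's autograd reproduce equation (13) for the single-node example; the numerical values are arbitrary and only serve as a check.

    import torch

    # Single node: y = ReLU(x*w + b), L = (y - y_star)^2, equations (7)-(9)
    x, b, y_star = 2.0, 0.5, 1.0
    w = torch.tensor(0.3, requires_grad=True)

    z = x * w + b
    y = torch.relu(z)
    loss = (y - y_star) ** 2
    loss.backward()                      # backpropagation

    # Equation (13): dL/dw = 2*(y - y_star)*x when z >= 0, else 0
    manual = 2 * (y.item() - y_star) * x if z.item() >= 0 else 0.0
    print(w.grad.item(), manual)         # both equal 0.4 for these numbers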

Increasing the complexity slightly, let w ∈ ℝ^m, z ∈ ℝ^n, and let g : ℝ^m → ℝ^n and f : ℝ^n → ℝ. Then, similarly, for

z = g(w),
y = f(z),    (14)

the chain rule generalizes to vector form as

∇_w y = (∂z/∂w)ᵀ ∇_z y,    (15)

where ∂z/∂w ∈ ℝ^(n×m) is the Jacobian matrix of g. However, in reality the weights would have even higher dimensions. Fortunately, the chain rule generalizes also to tensors [13, pp. 206-207].


2.2.4 Gradient Descent

The gradients computed during backpropagation are then used to optimize the weights. The optimization algorithms derive from an iterative algorithm called gradient descent. In short, the algorithm proposes a new point w′ by

w′ = w − ε ∇_w L(w),    (16)

where ε is the chosen learning rate [13, p. 85]. The learning rate is a hyperparameter which has a great impact on the performance of the model. A learning rate that is too large can be numerically unstable and fail to find the minimum, while one that is too small will converge very slowly [15]. Figure 3 gives an example of how the loss might depend on one weight and how the weight is updated towards the minimum.

Figure 3: A plot of the loss over one weight, where the curve has one local minimum and one global minimum.

The classic gradient descent algorithm updates the weights after forwarding the whole training set. Stochastic gradient descent (SGD), on the other hand, uses one random training sample and then updates the weights. As the name suggests, this gives a more stochastic behaviour with a higher variance. The combination of the two is called mini-batch SGD, which uses a randomly sampled mini-batch smaller than the whole training set [13, p. 152].

Nowadays, the optimizers often use momentum and an adaptive learning rate. Acommonly used optimizer is Adam [16].
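A minimal sketch of one such mini-batch update step with Adam in PyTorch is shown below; the model and the mini-batch data are placeholders.

    import torch

    model = torch.nn.Linear(3, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate = epsilon

    xb = torch.randn(32, 3)              # one randomly sampled mini-batch
    yb = torch.randn(32, 1)

    optimizer.zero_grad()                # clear old gradients
    loss = torch.mean((model(xb) - yb) ** 2)
    loss.backward()                      # backpropagation: compute gradients
    optimizer.step()                     # weight update, cf. equation (16)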


2.2.5 Division of Dataset into Training, Validation and Testing

To assess the error of the final model, the dataset has to be divided into two parts, one used for training and one for testing. The test set should never be presented to the model during training. This enables testing of how well the model generalizes to new, unseen data. A model that fits the training data perfectly but has a high test error is said to be overfitting [13, p. 110]. To reduce the risk of overfitting, one can split the training set into one training set and one validation set. The validation set is not used to update the weights, but the validation loss can be used to determine when to stop the training. When the validation loss ceases to decrease, the training is stopped. This is referred to as early stopping [13, pp. 246-247]. The validation set is also used when optimizing the model hyperparameters [13, p. 198].

2.2.6 Feature Standardization

The data usually needs some pre-processing before entering the network. It is common to standardize the input and output by subtracting the mean value and dividing by the standard deviation,

x′ = (x − x̄)/s,    (17)

where x̄ and s are the sample mean and standard deviation of the training set. Once again, the values of the test set should not be disclosed. The standardization is done to make the inputs look more like standard normal variables, and is performed regardless of the underlying distribution [17]. If the inputs are on very different scales and not standardized, the training might not converge as fast and may need a much lower learning rate to find the global minimum [18].
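A small sketch of this standardization, using statistics computed from the training set only, could look as follows (the arrays are placeholders):

    import numpy as np

    x_train = np.random.rand(100, 3)     # placeholder training inputs
    x_test = np.random.rand(20, 3)

    mean = x_train.mean(axis=0)          # x_bar in equation (17)
    std = x_train.std(axis=0)            # s in equation (17)

    x_train_std = (x_train - mean) / std
    x_test_std = (x_test - mean) / std   # test-set statistics are never used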

2.2.7 Automatic Differentiation

Automatic differentiation is essentially the same as backpropagation, described in section 2.2.3. However, instead of taking the gradient of the loss with respect to the weights, one can take the gradient of the network output with respect to the network input.
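In PyTorch this is done with torch.autograd.grad. The sketch below differentiates the output of a placeholder network x_net with respect to its time input, which is the operation used for the PINNs in the following sections.

    import torch

    x_net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                                torch.nn.Linear(16, 1))   # placeholder for x_NN(t)

    t = torch.linspace(0.0, 1.0, 50).reshape(-1, 1).requires_grad_(True)
    x_pred = x_net(t)

    # dx_NN/dt at every time point, by automatic differentiation
    dxdt = torch.autograd.grad(x_pred, t,
                               grad_outputs=torch.ones_like(x_pred),
                               create_graph=True)[0]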

2.3 Physics-Informed Neural Networks

The aim of PINNs is to explicitly include differential equations into the training of a neural network. In practice, this is done by modifying the computation of the loss.

Consider a general formulation of an ordinary differential equation,

dx/dt = g(u, x, t),    t ∈ [0, T],
x(0) = x₀,    x₀ ∈ ℝ,    (18)


where g(u, x, t) is a known function. The vector u corresponds to the controllable and known parameters of the system and x is a state variable that describes the system state over time t. Assume that there are available measurements of x over several times t.

In the ordinary FFNN setting, one would train a network x^NN to map between the inputs u(t_i), t_i and the target x(t_i). The mean squared error data loss is then

L_x(θ) = (1/N_x) ∑_{i=1}^{N_x} ( x^NN(u_i, t_i; θ) − x(u_i, t_i) )²,    (19)

where N_x is the number of measured values of the system state x and θ are the network weights.

If one applies a function f(u, x, t) to the output x^NN, it can be seen as adding one extra network layer and thus defining a new network. The new final layer has weight 1, bias 0 and activation function f, see figure 4. In PINN articles, this is sometimes referred to mathematically as an auxiliary network f^NN, which shares network weights and parameters with x^NN [19].

Figure 4: A sketch which shows how one extra layer is added to the output of x^NN. The resulting network is defined as the auxiliary network f^NN.

Consider the function f as the left hand side minus the right hand side of equation (18),

f = dx/dt − g(u, x, t) = 0.    (20)

When applied to the output of x^NN it becomes

f^NN(u, x^NN, t) = dx^NN/dt − g(u, x^NN, t),    (21)


where the derivative dx^NN/dt can be obtained by automatic differentiation of x^NN with respect to t [10]. The loss of f^NN is computed as

L_f(θ) = (1/N_x) ∑_{i=1}^{N_x} ( f^NN( · ; θ) − 0 )²,    (22)

where θ are the weights of the x^NN network. Since f^NN is essentially the same network as x^NN, although with one extra layer, the network weights θ are shared. The combined loss is

L(θ) = w₁ L_x(θ) + w₂ L_f(θ),    (23)

where the weights w₁ and w₂ are additional adjustment parameters that are tuned with respect to how much they should influence the optimization [9]. Finally, the x^NN network is trained by minimizing the total loss,

θ̂ = arg min_θ L(θ),    (24)

with the help of stochastic gradient descent, usually with the optimizer Adam.

The above reasoning generalizes directly to systems of ODEs, where each available equation contributes a new loss term L_f1, L_f2, etc. [19]. The system state is then a vector x of state variables and x^NN has multiple outputs.
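A minimal sketch of the combined loss (19)-(23) for a scalar ODE is given below. The network x_net, the toy right-hand side g and the loss weights are placeholders; minimizing this loss over mini-batches with Adam corresponds to equation (24).

    import torch

    def g(u, x, t):
        return -u * x                        # toy right-hand side, for illustration only

    x_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                                torch.nn.Linear(32, 1))
    w1, w2 = 1.0, 1.0                        # loss weights in equation (23)

    def pinn_loss(t, u, x_meas):
        # t, u, x_meas are column tensors of shape (N, 1)
        t = t.requires_grad_(True)
        x_pred = x_net(torch.cat([u, t], dim=1))
        loss_x = torch.mean((x_pred - x_meas) ** 2)          # data loss, equation (19)
        dxdt = torch.autograd.grad(x_pred, t,
                                   grad_outputs=torch.ones_like(x_pred),
                                   create_graph=True)[0]
        loss_f = torch.mean((dxdt - g(u, x_pred, t)) ** 2)   # physics loss, equation (22)
        return w1 * loss_x + w2 * loss_f                     # combined loss, equation (23)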

2.4 Missing Physics in Differential Equations

Now let us define a new differential equation,

dx/dt = g(u, x, t, λ),    t ∈ [0, T],
x(0) = x₀,    x₀ ∈ ℝ,    (25)

where g is a known function, and λ is an unknown relation that can depend non-linearly on u, x and t. As before, we assume we know the controllable parameters u and have a time series of measurements of the state x. The values of λ, however, are unknown and not measurable. λ might also not be separable from the rest of the equation. It is often not possible to derive the constitutive relation λ from first principles [7].

In contrast to a PINN, an ordinary FFNN would need a measured or computed target value to train against. Let λ^NN be a network which takes u_i, x_i, t_i as input. Let f be the difference between the left hand side and the right hand side of equation (25), applied to x^NN, λ^NN as

f^NN(u, x^NN, t, λ^NN) = dx^NN/dt − g(u, x^NN, t, λ^NN).    (26)

Then, a possible loss function is

L_f(θ) = (1/N_x) ∑_{i=1}^{N_x} ( f^NN( · ; θ) − 0 )²,    (27)

where θ are the model weights of the λ^NN network.

In this context, x^NN preferably takes only t as input, since its only purpose is to provide the loss function with the derivative dx^NN/dt [7]. The x^NN is trained on the MSE data loss and can be used to interpolate between measurements x_i, although it cannot extrapolate or give valuable predictions for new initial conditions.

Furthermore, it might be possible to use a numerical derivative instead, so that no x-network is needed to train the λ network. In that case, the f^NN in (27) becomes

f^NN(u, x, λ^NN, t) = Δx/Δt − g(u, x, λ^NN, t).    (28)

By minimizing the loss of equation (27), the network λ^NN can be optimized to predict the hidden values of λ [7]. The trained λ^NN makes it possible to solve the ODE by standard methods, by substituting λ^NN into the right hand side of equation (25).

2.5 Numerical Methods for Ordinary Differential Equations

To derive a numerical solution to equation (25), the time domain is first divided into N subintervals. The function dx/dt is integrated on each subinterval [t_n, t_{n+1}],

x(t_{n+1}) − x(t_n) = ∫_{t_n}^{t_{n+1}} g(u, x(t), λ, t) dt,    (29)

for n ∈ [1, ..., N − 1]. Using the mean value theorem for integrals, equation (29) can be written as

x(t_{n+1}) − x(t_n) = h · g(u, x(ξ), λ, ξ),    (30)


where h denotes the length of a subinterval and ξ is a time in [t_n, t_{n+1}]. The function value at ξ is approximated by a linear combination of the function values at several time points,

x(t_{n+1}) = x(t_n) + h ∑_{i=1}^{m} c_i g(u, x(ξ_i), λ, ξ_i),    (31)

where the parameters c_i, m and ξ_i can be chosen in several ways to obtain methods of different order [20]. The Python package SciPy's function solve_ivp uses the Runge-Kutta method RK45, an explicit method of order 5(4), as its default [21].
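A minimal usage sketch of solve_ivp on a toy equation dx/dt = −x is shown below; the default RK45 method is used since no other method is specified.

    from scipy.integrate import solve_ivp

    sol = solve_ivp(lambda t, x: -x, t_span=(0.0, 5.0), y0=[1.0],
                    t_eval=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    print(sol.y[0])     # x(t) at the requested time points, close to exp(-t)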

2.6 Using a Neural Network to Predict One-Step-Ahead

A FFNN can be used to predict future states by training it to map the previous system states to the following one. Let x_n, n = 1, 2, ..., N be a time series of N measurements of the state variables x. A neural network can be trained to predict the target x_{n+1} from the previous k steps, x_{n−k}, x_{n−(k−1)}, ..., x_n [22]. The value of k is referred to as the lag; a model using the k previous values thus has k lagged values [23].
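A sketch of how such input/target pairs can be constructed from a measured time series is shown below; the state array is a placeholder, and k = 1 gives the two-consecutive-states input used later in section 3.4.

    import numpy as np

    def make_lagged_pairs(x, k):
        # inputs: [x_{n-k}, ..., x_n] flattened, target: x_{n+1}
        inputs, targets = [], []
        for n in range(k, len(x) - 1):
            inputs.append(x[n - k:n + 1].ravel())
            targets.append(x[n + 1])
        return np.array(inputs), np.array(targets)

    x = np.random.rand(100, 3)            # placeholder time series of 3 state variables
    X, Y = make_lagged_pairs(x, k=1)      # X has shape (98, 6), Y has shape (98, 3)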

2.7 Statistical Inference

There are several methods for statistical inference on paired samples, depending on the underlying assumptions about the distributions. If the populations cannot be assumed to be normally distributed, it is possible to use a nonparametric test such as the Wilcoxon signed-rank test. It tests whether the distribution of the differences between the pairs is symmetric around zero. The result of the test is a probability value (p-value) stating how probable the observations are given the assumptions of the test. It can be seen as the nonparametric equivalent of a paired t-test [24]. A commonly used threshold for when the test shows a significant difference between the populations is p < 0.05.
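In SciPy the test is available as scipy.stats.wilcoxon. The sketch below compares two sets of paired, per-batch errors; the numbers are placeholders, not results from this project.

    from scipy.stats import wilcoxon

    errors_model_a = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9]
    errors_model_b = [1.0, 1.4, 1.1, 1.2, 1.1, 1.3, 1.5, 1.0]

    # one-sided test: are model A's errors systematically lower than model B's?
    stat, p_value = wilcoxon(errors_model_a, errors_model_b, alternative="less")
    print(p_value)      # p < 0.05 is the commonly used significance threshold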


3 Method

3.1 Continuous Stirred-Tank Reactor Model

To investigate the different usages of PINNs, this project used a model describing a stirred-tank reactor. Figure 5 provides an overview of the main components of the CSTR model.

Figure 5: A schematic picture of the CSTR model of a tank with volume V. The concentration of reactant A is Cain at the inlet and Ca at the outlet. The flow rate is F and the outlet temperature is T. The cooling water flows at rate Fc, has inlet temperature Tcin and outlet temperature Tc.

Cain is the inlet concentration of reactant A, which flows at a variable rate F. Ca is the outlet concentration and T is the temperature of the outlet. The cooling water flows at a variable rate Fc and has inlet temperature Tcin and outlet temperature Tc. The model describes how Ca, T and Tc vary over time through the equations

dCa/dt = (F/V)(Cain − Ca) − ra,    (32a)

dT/dt = (F/V)(Tin − T) + Ua/(V · Cp · ρ) (Tc − T) + Hrxn/(ρ · Cp) ra,    (32b)

dTc/dt = (Fc/Vc)(Tcin − Tc) + Ua/(Vc · Cpc · ρ) (T − Tc),    (32c)


where ra is the reaction rate of reactant A. A list of the remaining parameters with units is shown in table 1, together with the values used for the simulations. The rates F and Fc are governed by equations depending on the adjustable set points Casp and Tsp, respectively. These set points were chosen to have a stochastic behaviour: every 10 time steps, a random number was drawn from a uniform distribution and added to the previous value of the set point. Due to this, different simulations with the same initial conditions give rise to very different time series.

The reaction rate law considered ground truth in this experiment is defined as

ra_true = k · Ca² · exp( (E/R)(1/Tref − 1/T) ).    (33)

For the experiment with missing physics, the reaction rate ra is assumed to be unknown and not measurable.

Table 1: The parameters and their units, together with the values used during the simulations.

Parameter   Value (unit)
Ua          50 (-)
Cp          4.2 (kJ/(kg·°C))
ρ           1.0 (kg/L)
Hrxn        63.9 (kJ/kmol)
Cpc         4.2 (kJ/(kg·°C))
E           110 (kJ/kmol)
k           0.8 (-)
R           8.314 (kJ/(kmol·K))
Tref        25 (°C)
Tin         45 (°C)
Tcin        20 (°C)
V           20.0 (L)
Vc          5.0 (L)
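To make the model concrete, the sketch below implements the right-hand side of equations (32a)-(32c) with the ground-truth rate law (33) and the parameter values from table 1. It treats the flow rates F and Fc as given inputs and uses an assumed inlet concentration Cain, since neither the set-point dynamics nor the value of Cain are fully specified above.

    import numpy as np

    Ua, Cp, rho, Hrxn, Cpc = 50.0, 4.2, 1.0, 63.9, 4.2
    E, k, R, Tref = 110.0, 0.8, 8.314, 25.0
    Tin, Tcin, V, Vc = 45.0, 20.0, 20.0, 5.0
    Cain = 2.0                                   # assumed value, not listed in table 1

    def ra_true(Ca, T):
        # ground-truth reaction rate law, equation (33)
        return k * Ca**2 * np.exp((E / R) * (1.0 / Tref - 1.0 / T))

    def cstr_rhs(t, x, F, Fc):
        Ca, T, Tc = x
        ra = ra_true(Ca, T)
        dCa = F / V * (Cain - Ca) - ra                                   # (32a)
        dT = (F / V * (Tin - T) + Ua / (V * Cp * rho) * (Tc - T)
              + Hrxn / (rho * Cp) * ra)                                  # (32b)
        dTc = Fc / Vc * (Tcin - Tc) + Ua / (Vc * Cpc * rho) * (T - Tc)   # (32c)
        return [dCa, dT, dTc]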

3.2 Datasets

A time series of x = [Ca, T, Tc] and u = [F, Fc, Tsp, Casp] at specified time stamps was generated from equation (32) with the Python package SciPy's solve_ivp. The duration was set to t ∈ [0, 20] hours with a time step of h = 1/6. A total of 40 batches were simulated, all with initial conditions x0 = [0.5, 60, 22].


The dataset was split into training, validation and testing sets and had the dimensions shown in table 2. This is referred to as dataset A.

Table 2: Dimensions of the full dataset used for the majority of the experiments in this project. This is named dataset A.

                         Training   Validation   Testing
No. batches              24         8            8
Total no. observations   2880       960          960

To make the data more realistic, white Gaussian noise was added to the dataset. The signal-to-noise ratio was set to 80, meaning that the standard deviation of the noise was x̄/80, where x̄ is the mean value vector of the state variables.

To be able to test whether the frequent measurements were needed, a second dataset was created by choosing every 6th measurement from dataset A. This dataset is referred to as dataset B and has the dimensions shown in table 3.

Table 3: Dimensions of the smaller dataset with every 6th value from dataset A. It is referred to as dataset B.

                         Training   Validation   Testing
No. batches              24         8            8
Total no. observations   480        160          160
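A sketch of how one noisy batch could be generated along these lines is shown below. It reuses the hypothetical cstr_rhs function above with constant flow rates (the thesis instead varies F and Fc through random set points), adds white Gaussian noise at an SNR of 80, and subsamples every 6th value as in dataset B.

    import numpy as np
    from scipy.integrate import solve_ivp

    t_eval = np.linspace(0.0, 20.0, 121)          # 20 hours with time step h = 1/6
    x0 = [0.5, 60.0, 22.0]

    # constant F = 10 and Fc = 5 are assumptions made for this sketch only
    sol = solve_ivp(cstr_rhs, (0.0, 20.0), x0, t_eval=t_eval, args=(10.0, 5.0))
    x = sol.y.T                                   # shape (121, 3): columns [Ca, T, Tc]

    rng = np.random.default_rng(0)
    noise_std = np.abs(x.mean(axis=0)) / 80.0     # SNR = 80
    x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)

    x_sparse = x_noisy[::6]                       # every 6th measurement, as in dataset B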

3.3 Networks for Predicting Missing Reaction Rate Law ra

Two separate PINNs were created with an identical network setup in PyTorch for Python. Each had an input layer of size 2, one hidden layer with 50 nodes and one output layer with one node. The hidden layer had a dropout probability of 0.2, meaning that 20 % of the connections were randomly shut off during training. The activation function was ELU and the mini-batch size was set to 120.

The first model, which is referred to as ra^NN_numder, had measured inputs [Ca_j, T_j] at time t_j, and was trained using the loss function

L_f(θ) = (1/N_x) ∑_{j=1}^{N_x} ( ΔCa/Δt |_j − (F_j/V)(Cain − Ca_j) + ra^NN_numder(Ca_j, T_j; θ) )².    (34)

The numerical derivative was computed with numpy.gradient.
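A sketch of this loss term is given below. The network ra_net stands in for ra^NN_numder (two inputs [Ca, T], one output, as described above, but without dropout), the batch arrays Ca, T, F, t are assumed to be 1-D NumPy arrays, and the value of Cain is an assumption since it is not listed in table 1.

    import numpy as np
    import torch

    ra_net = torch.nn.Sequential(torch.nn.Linear(2, 50), torch.nn.ELU(),
                                 torch.nn.Linear(50, 1))

    def loss_numder(Ca, T, F, t, Cain=2.0, V=20.0):
        # numerical derivative dCa/dt, as in equation (34)
        dCa_dt = torch.tensor(np.gradient(Ca, t), dtype=torch.float32)
        Ca_t = torch.tensor(Ca, dtype=torch.float32)
        T_t = torch.tensor(T, dtype=torch.float32)
        F_t = torch.tensor(F, dtype=torch.float32)
        ra_pred = ra_net(torch.stack([Ca_t, T_t], dim=1)).squeeze(1)
        residual = dCa_dt - (F_t / V * (Cain - Ca_t) - ra_pred)   # residual of (32a)
        return torch.mean(residual ** 2)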


The second model used a neural network to predict the derivative by automatic differentiation. A network Ca^NN was created with input size 1, two hidden layers with 50 nodes each and one output layer. The activation function was ELU and the dropout probability was set to 0.1. This network was pre-trained for 20 epochs using the loss

L_Ca = (1/N_Ca) ∑_{n=1}^{N_Ca} ( Ca^NN(t_n) − Ca_n )²,    (35)

where Ca_n corresponds to the measurement of Ca at time t_n. The second ra model, which is referred to as ra^NN_auto, used the loss function

L_f(θ) = (1/N_x) ∑_{n=1}^{N_x} ( dCa^NN/dt |_n − (F_n/V)(Cain − Ca^NN(t_n)) + ra^NN_auto(Ca_n, T_n; θ) )².    (36)

All models were optimized with the optimizer Adam and both ra models were trained with early stopping.
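For the second model, a sketch of the autograd-based loss could look as follows. Ca_net stands in for Ca^NN (input size 1, two hidden layers of 50 nodes, as described above, but without dropout) and is assumed to have been pre-trained on the data loss (35); the batch tensors t, Ca, T, F are 1-D float tensors, and Cain is again an assumed value.

    import torch

    Ca_net = torch.nn.Sequential(torch.nn.Linear(1, 50), torch.nn.ELU(),
                                 torch.nn.Linear(50, 50), torch.nn.ELU(),
                                 torch.nn.Linear(50, 1))
    ra_net = torch.nn.Sequential(torch.nn.Linear(2, 50), torch.nn.ELU(),
                                 torch.nn.Linear(50, 1))

    def loss_auto(t, Ca, T, F, Cain=2.0, V=20.0):
        t_col = t.reshape(-1, 1).requires_grad_(True)
        Ca_pred = Ca_net(t_col)
        # dCa_NN/dt by automatic differentiation, as in equation (36)
        dCa_dt = torch.autograd.grad(Ca_pred, t_col,
                                     grad_outputs=torch.ones_like(Ca_pred),
                                     create_graph=True)[0].squeeze(1)
        ra_pred = ra_net(torch.stack([Ca, T], dim=1)).squeeze(1)
        residual = dCa_dt - (F / V * (Cain - Ca_pred.squeeze(1)) - ra_pred)
        return torch.mean(residual ** 2)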

Since optimization of hyperparameters was not a priority, no optimization algorithm was applied to the dropout probability, the size of the network, etc.

3.3.1 Comparison Between the ra Models

The two ra^NN models were tested and compared in several respects. First of all, their ability to predict the ground truth ra in 8 unseen test batches was compared. Further testing included running simulations where the system of ODEs was solved using solve_ivp, with the ground truth ra exchanged for the network prediction at each time point. These simulations were made for different initial conditions and a doubled duration compared to the training data.

Moreover, the sensitivity to the noise level in the training data and the influence of the number of available training batches were investigated. The models were also trained on the smaller dataset B and compared. Finally, the simulation was run with a volume 100 times larger than the volume used when training the models.

3.4 One-Step-Ahead Network

To investigate the use of PINNs in forecasting models, a feed-forward network with autoregressive input was implemented. Figure 6 illustrates the network setup. The previous two time steps' measurements of x, u and time from dataset A were used as input. The target was the subsequent measurement x_{n+1}.

Figure 6: A sketch of the inputs and outputs of the one-step-ahead networks.

Two identical networks were created to compare the results of the PINN loss to a baseline with ordinary MSE loss. The networks had 16 nodes in the input layer, 5 hidden layers with 32 nodes each and an output layer with 3 nodes. The dropout probability for the hidden layers was 0.1 and the activation function was ELU. The mini-batch size was set to 500.

The baseline network was trained using the ordinary MSE data loss, with the network prediction x^NN_{n+1} and the measured value x_{n+1} from dataset A. The model was trained until convergence, and the learning rate was optimized by a parameter sweep. The function torch.autograd.grad was used for automatic differentiation, and the PINN was trained on the combined loss described in equation (23), with

L_f = (1/(N_x − 2)) ∑_{n=3}^{N_x} ( dx^NN_1/dt |_n − (F_n/V)(Cain − x^NN_{n,1}) + ra_n )²,    (37)

where x^NN_{n,1} corresponds to the prediction of the n-th value of Ca, and ra = ra_true from equation (33). The weight w₁ was set to 1 and w₂ was optimized together with the learning rate in a grid search, and set to w₂ = 10.

3.4.1 Using a Numerical Derivative in the Loss

A modified version of the models was also created, to compare the performance when using a numerical derivative instead of torch.autograd.grad. The second term of the loss function was

L_f = (1/(N_x − 2)) ∑_{n=3}^{N_x} ( Δx_1/Δt |_n − (F_n/V)(Cain − x^NN_{n,1}) + ra_n )²,    (38)

with Δx_1/Δt computed by numpy.gradient. These models did not have the explicit times t_n and t_{n−1} as input, since these were not needed for the loss computations.


3.4.2 Comparisons Between the One-Step-Ahead Models

The errors for the one-step-ahead predictions were computed as the mean absolute percentage error,

MAPE = (100/(N_x − 2)) ∑_{j=2}^{N_x−1} | (x^NN_{j+1} − x_{j+1}) / x_{j+1} |,    (39)

where each prediction x^NN_{j+1} was made based on the previous two measurements [x_j, u_j] and [x_{j−1}, u_{j−1}]. Furthermore, the models' ability to predict multiple steps in a row was investigated by letting them forecast again, based on their prior predictions. Each x^NN_{j+1} prediction was then made based on [x^NN_j, u_j] and [x^NN_{j−1}, u_{j−1}], stepping forward successively from the two initial values. The errors were computed by the MAPE for this comparison as well.
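A small sketch of this error measure is given below; the prediction and target arrays are placeholders with one column per state variable.

    import numpy as np

    def mape(pred, target):
        # mean absolute percentage error, equation (39)
        return 100.0 * np.mean(np.abs((pred - target) / target))

    pred = np.random.rand(118, 3) + 1.0       # placeholder one-step-ahead predictions
    target = np.random.rand(118, 3) + 1.0     # placeholder measured values
    print(mape(pred, target))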

4 Results

4.1 Predicting the Reaction Rate Law ra

A comparison with the ground truth ra values for the test data in figure 7 shows that both ra^NNs have a high predictive accuracy for the test batches of dataset A. The difference in precision between the models is visible primarily for the low ra values, in this context corresponding to the start of a new batch. The MSE for each of the 8 test batches and the mean MSE are presented in table 4. The MSE is significantly lower for ra^NN_numder, when tested with a one-sided Wilcoxon signed-rank test (p = 7.81 · 10⁻³).

Figure 7: Predictions and ground truth ra over all the observations in the test data, consisting of 8 batches with the same initial conditions as the training data.


Table 4: The MSE for the 8 test batches of dataset A.

Batch   ra^NN_numder   ra^NN_auto
1       3.62·10⁻⁴      2.00·10⁻³
2       5.64·10⁻³      4.79·10⁻³
3       1.72·10⁻³      3.99·10⁻³
4       2.48·10⁻⁴      4.68·10⁻³
5       3.42·10⁻⁴      1.28·10⁻³
6       1.92·10⁻³      3.24·10⁻³
7       1.99·10⁻⁴      1.47·10⁻³
8       2.39·10⁻⁴      4.20·10⁻³
Mean    1.33·10⁻³      3.21·10⁻³

The scatter plots in figure 8 reveal that the ra^NN_auto predictions are more widely spread around the true ra, compared to ra^NN_numder. However, the tail for Ca > 1.25 deviates slightly more for the latter. The range of the test Ca values in this plot is also representative of the values present in the training data. The minimum and maximum values of the state variables in the training set are presented in table 5. The error is expected to be low for predictions within the range of the training set, while predictions outside the range are associated with an extrapolation error.

Figure 8: Scatter plots with the predicted ra and ground truth ra over Ca, for 8 test batches with the same initial conditions as the training batches. The predictions made by ra^NN_numder are in general less spread around the ground truth compared to ra^NN_auto, apart from for high values of Ca.


Table 5: Minimum and maximum values of the two state variables that are input to the ra networks, in the training data with 1.25% noise.

Variable   Min      Max
Ca         0.483    1.432
T          55.409   69.252

Finally, a comparison between the derivatives shows that the numerical derivative varies a lot more than the autograd version. Figure 9 shows the values of all the derivatives for a randomly sampled training batch. In general, the numerical derivative captures more of the variability of the true derivative, which is computed as the right hand side of equation (18). For example, the numerical derivative has a higher amplitude in the initial peak, making it closer to the true value. However, there are also erroneous variations in the numerical derivative, for example in the final observation.

Since the system is excited with random set points, the 24 training batches are quite different, despite having the same initial conditions. As a result, the network Ca^NN is trained towards predicting the average time evolution of Ca. Therefore, its derivative is much smoother than the true derivative computed from the data with 1.25% noise. Most of the values of the autograd derivative in figure 9 are in the range around −1 · 10⁻³.

Figure 9: A plot of the derivatives over one training batch. The right hand side of equation (18) with the true ra is the ground truth.


4.1.1 Simulations with New System Settings

To further illustrate the differences between the network performances, figure 10 shows a simulation of the ODE with the ra networks. The initial conditions are changed compared to those used for the training data, and the duration is twice as long as the training data's. Both models follow the exact solution well and are at several points hard to distinguish from the true values.

Figure 10: A plot with three solutions to the ODE, where the ra term is either the ground truth ra_true, or predicted by ra^NN_auto or ra^NN_numder, respectively. The initial values are [Ca0, T0, Tc0] = [1, 80, 18] and the duration is twice that of the training data.

Figure 11 shows the solutions computed with a volume of V = 2000 instead of V = 20, with all other parameters equal to those of the training data. Both the solution with ra^NN_auto and that with ra^NN_numder diverge completely for Ca and T. Since both equation (32a) and (32b) contain ra, this could be expected if the predicted value of ra is inaccurate. As equation (32c) depends on T, an erroneous value of T should also give a slightly lower accuracy for Tc.


Figure 11: Simulation of the system with volume V = 2000, which is 100 times larger than the volume used to produce the training data. The values of Ca are well below the range of those present in the training data, giving a large extrapolation error.

4.1.2 Using Fewer Measurements per Batch

The scatter plot in figure 12 is based on dataset B, with fewer measurements per batch. While ra^NN_numder does seem to underestimate the ra value more for Ca > 1.25, there is no significant difference between the models in total, since p = 0.27 for a one-sided Wilcoxon signed-rank test. Figure 13 shows a simulation with new initial conditions and a doubled time interval. The performance of both models is almost in line with the full data experiment in section 4.1, indicating that the error is still at an acceptable level.


Figure 12: Scatter plots with the predicted ra and ground truth ra over Ca, for 8 test batches with the same initial conditions as the training batches. The ra^NN_numder underestimates the ra value by quite a lot when Ca > 1.25.

Figure 13: A simulation of the state variables for one test run. The initial conditions were x0 = [1, 80, 18] and the duration was doubled compared to the training data. The errors are in the same range as in the full data experiment.

4.1.3 A Comparison Between Different Noise Levels

Added white Gaussian noise was used to simulate noisy training and testing data. A comparison for noise levels of [2, 5, 10, 15, 20] %, corresponding to SNR = [50, 20, 10, 6.67, 5], is presented in figure 14. As previously, there are 8 test batches and the plot shows the average. The left hand side shows a log-log plot for noise levels > 5%. A polynomial fit revealed that the error grows quadratically when the noise is above 5 %. The MSE of ra^NN_auto is significantly lower, when tested with the Wilcoxon signed-rank test on all the MSE estimates (p = 1.3 · 10⁻³), indicating a higher robustness to noise.

Figure 14: Plots of the average MSE (over all 8 test batches) of the two models for different noise levels.

Figure 15 shows a simulation with the models trained on data with SNR = 10, corresponding to a 10% noise level. The predictions diverge at the end of the simulation for both models.

Figure 15: A plot of the simulated system, using the ra models trained on data with 10 % white Gaussian noise. The initial conditions were x0 = [1, 80, 18] and the duration was twice as long as for the training data.


4.1.4 Using Fewer Training Batches

To investigate the scenario of having a smaller dataset, fewer training batches were chosen from dataset A. Bar plots with the average MSE over the 8 test batches for different sizes of training data are seen in figure 16. The baseline for each model was its average MSE for the full dataset. When comparing all the 40 MSE estimates, the numerical derivative had a lower average error when tested with the Wilcoxon signed-rank test (p = 9.3 · 10⁻³). However, the results are the opposite when comparing the models to their baselines. The MSE of ra^NN_auto was stable around the baseline for all training set sizes apart from the smallest, where the error was doubled. In contrast, the ra^NN_numder MSE was doubled already for 12 training batches, and the error for the smallest training set was 4.5 times the baseline. This indicates that the performance of ra^NN_numder is less stable when the size of the training set is decreased.

Figure 16: Bar plots with the average MSE for the different numbers of training batches. The dashed line corresponds to the MSE obtained when using the full dataset A.

4.2 One-Step-Ahead Predictions

Figure 17 shows a plot of the numerical and autograd derivatives for a training batch after finished training. It is obvious that the autograd derivative is erroneous, as it is close to 0 for the whole batch. This is likely a consequence of the network setup with the previous values of x as input. Since there is more relevant input than the explicit time vector t, the gradient back to the time leaf node is almost 0. Therefore, the following PINN results were produced using a numerical derivative in the equation.

The one-step-ahead errors averaged over the test batches are presented in table 6. Note that for Ca and T, the models differ only slightly. However, for Tc the PINN's error was much higher, giving a higher average error. The Wilcoxon signed-rank test showed that the average error was in fact significantly higher for the PINN (p = 0.023).


Table 6: Results from one-step-ahead predictions on the whole test set of dataset A. The table shows the mean MAPE over all the test batches from dataset A, in percent.

        Baseline MAPE (%)   PINN MAPE (%)
Ca      1.32                1.29
T       0.50                0.54
Tc      1.00                1.18
Total   0.94                1.00

Figure 17: A plot of the two different versions of derivative approximations for dCa/dt over one training batch, after training.

4.2.1 Simulations for Multiple Time Steps

An example of multiple time step predictions is shown in figure 18. The mean error in % over the 8 batches is found in table 7. When comparing all test errors batchwise, there was no significant difference between the models (p = 0.95).


Figure 18: An example of multiple time step predictions for one of the 8 test batches. The relative performance of the models varies slightly between test batches.

Table 7: Results from multiple step predictions from the two initial values. The table shows the mean error over all the test batches from dataset A, in percent. The total error over all batches is presented together with the corresponding 95 % confidence interval limits.

        Baseline MAPE (%)   PINN MAPE (%)
Ca      10.15               9.41
T       2.29                3.18
Tc      6.27                6.71
Total   6.23                6.28

5 Discussion

5.1 Predicting the Reaction Rate Law ra

Both of the trained ra^NN networks could accurately model the missing reaction rate law for the test data. Simulations over time for different initial conditions were also successful, with solutions that were sometimes not distinguishable from the ground truth.

However, neither of the ra^NN networks performed well for the 100 times larger reactor, with ra values far lower than the networks were trained to predict. This was expected, since the region in which the model can extrapolate well is quite narrow. Including more training data with a broader spectrum of Ca values could make the predictions more accurate for the larger reactor. A hyperparameter optimization including the network architecture could also address this issue, and show whether a deeper network could generalize better.

In a real biopharma application, the dataset is likely both small and noisy. When using fewer training batches from dataset A, the numerical derivative still had a significantly lower error. However, when comparing how the errors scaled relative to the full-dataset MSE, the error grew much faster for the numerical derivative. Another problem with real-world datasets could be that the measurements are less frequent. A test with only 20 measurements per batch showed that while the MSE was lower for ra^NN_auto, the difference was not statistically significant. Even so, the fact that ra^NN_auto was significantly better for noisy data, and that its error did not grow as fast when decreasing the dataset size, suggests that it is the better option.

It is always valuable to question whether the model is over-engineered, and whether a simpler model is good enough. In this project, we tried to use a numerical derivative instead of the more complex automatic differentiation. The results show that the numerical derivative was in fact better for low-noise data, though it was less robust to noise. Given that the biopharma industry often deals with small and noisy datasets, automatic differentiation might be the more sensible choice in real-world applications.

5.2 Predicting One-Step-Ahead

The torch.autograd.grad derivative was not accurate when the network setup was changed to predict the future based on previous measurements. It was shown that if more inputs were added, the network prediction of the derivative collapsed to almost zero at all time points.

The one-step-ahead PINN with a numerical derivative still did not outperform the baseline. It even underperformed when predicting Tc, making the average MAPE significantly higher for single time step predictions. One reason is found in the very formulation of the loss, which has two MSE terms. The MSE data loss gives the errors in Ca, T and Tc equal weights. The second term only involves Ca and T, giving them a higher total weight. Since the network is optimized to find the lowest total loss, this might be a contributing factor to why the physics-informed loss did not improve the accuracy.

It is also worth noting that it is not possible to use standardized values in the physics-informed loss term. Since Ca and T were an order of magnitude different in scale, a small change in the weights could impact one of the outputs more than the other. This could also make it harder for the optimizer to find the global minimum of the loss.

Nevertheless, since there are no published articles using this particular network setup, it is hard to draw any conclusions about why it did not improve the accuracy. Using more complex network setups, such as RNNs, could possibly be more suitable than FFNNs. Exploring whether physics-informed losses can benefit other autoregressive applications or different network setups is left for future work.

5.3 Conclusions

To conclude, a physics-informed neural network could accurately model an unknown reaction rate law in a Continuous Stirred-Tank Reactor model. The PINN was successfully trained to predict the value of the missing relation, given only the system state and the ODE. Simulations of the system state over time also showed a high precision when the missing part of the equation was replaced by the network in a multi-step solver. More specifically, simulations for new initial conditions and a longer duration showed a high accuracy. However, when the volume was increased by a factor of 100, the simulation diverged due to extrapolation error. Comparisons between two derivative approximations showed that the network trained with automatic differentiation was more robust to noise. Moreover, it was less sensitive to fewer measurements in the training data. Future work on this part could include optimizing the network architecture, to see if the network could generalize better.

Furthermore, an attempt to create a PINN to forecast one step ahead with automatic differentiation did not succeed. The automatic differentiation time derivative collapsed to predicting a constant zero when more information was included as input. A second model with a numerical derivative in the equation underperformed the baseline network when predicting single time steps. For several steps, there was no significant difference between the models. Further studies would be needed to conclude why the added information from the equation did not improve the predictive accuracy.


Bibliography

[1] Meijering E. "A bird's-eye view of deep learning in bioimage analysis." In: Comput Struct Biotechnol J. 18 (2020), pp. 2312–2325. doi: 10.1016/j.csbj.2020.08.003.

[2] Debleena Paul et al. "Artificial intelligence in drug discovery and development." In: Drug Discovery Today 26 (2020), pp. 80–93. doi: 10.1016/j.drudis.2020.10.010.

[3] Vamathevan, J., Clark, D., Czodrowski, P. et al. "Applications of machine learning in drug discovery and development." In: Nature Reviews Drug Discovery 1 (2019), pp. 463–477. url: https://doi.org/10.1038/s41573-019-0024-5.

[4] Olivier Paquet-Durand, Supasuda Assawarajuwan, and Bernd Hitzmann. "Artificial neural network for bioprocess monitoring based on fluorescence measurements: Training without offline measurements". In: Engineering in Life Sciences 17.8 (2017), pp. 874–880. doi: https://doi.org/10.1002/elsc.201700044. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/elsc.201700044.

[5] Octavio Loyola-González. "Black-Box vs. White-Box: Understanding Their Advantages and Weaknesses From a Practical Point of View". In: IEEE Access 7 (Oct. 2019), pp. 154096–154113. doi: 10.1109/ACCESS.2019.2949286.

[6] Jens Smiatek, Alexander Jung, and Erich Bluhmki. "Towards a Digital Bioprocess Replica: Computational Approaches in Biopharmaceutical Development and Manufacturing". In: Trends in Biotechnology 38.10 (2020). Special Issue: Therapeutic Biomanufacturing, pp. 1141–1145. issn: 0167-7799. doi: https://doi.org/10.1016/j.tibtech.2020.05.008. url: http://www.sciencedirect.com/science/article/pii/S0167779920301426.

[7] Ramakrishna Tipireddy et al. A comparative study of physics-informed neural network models for learning unknown dynamics and constitutive relations. arXiv:1904.04058. 2019. arXiv: 1904.04058 [cs.LG].

[8] Harini Narayanan et al. "A new generation of predictive models: The added value of hybrid models for manufacturing processes of therapeutic proteins". In: Biotechnology and Bioengineering 116.10 (2019), pp. 2540–2541. doi: https://doi.org/10.1002/bit.27097.

[9] Enrui Zhang, Minglang Yin, and George Em Karniadakis. Physics-Informed Neural Networks for Nonhomogeneous Material Identification in Elasticity Imaging. arXiv:2009.04525. 2020. arXiv: 2009.04525 [cs.LG].


[10] Yibo Yang and Paris Perdikaris. "Adversarial uncertainty quantification in physics-informed neural networks". In: Journal of Computational Physics 394 (Oct. 2019), pp. 136–152. issn: 0021-9991. doi: 10.1016/j.jcp.2019.05.027. url: http://dx.doi.org/10.1016/j.jcp.2019.05.027.

[11] Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. "Recurrent Neural Networks for Time Series Forecasting: Current status and future directions". In: International Journal of Forecasting 37.1 (Jan. 2021), pp. 388–427. issn: 0169-2070. doi: 10.1016/j.ijforecast.2020.06.008. url: http://dx.doi.org/10.1016/j.ijforecast.2020.06.008.

[12] Renato G. Nascimento, Kajetan Fricke, and Felipe A.C. Viana. "A tutorial on solving ordinary differential equations using Python and hybrid physics-informed neural network". In: Engineering Applications of Artificial Intelligence 96 (2020), p. 103996. issn: 0952-1976. doi: https://doi.org/10.1016/j.engappai.2020.103996. url: http://www.sciencedirect.com/science/article/pii/S095219762030292X.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[14] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). 2016. arXiv: 1511.07289 [cs.LG].

[15] Yanzhao Wu, Ling Liu et al. "Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks". In: CoRR abs/1908.06477 (2019). arXiv: 1908.06477. url: http://arxiv.org/abs/1908.06477.

[16] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: International Conference on Learning Representations (Dec. 2014).

[17] F. Pedregosa et al. Scikit-learn: 6.3. Preprocessing data. url: https://scikit-learn.org/stable/modules/preprocessing.html.

[18] Zhang Zixuan. Understand Data Normalization in Machine Learning. url: https://towardsdatascience.com/understand-data-normalization-in-machine-learning-8ff3062101f0.

[19] M. Raissi, P. Perdikaris, and G.E. Karniadakis. "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations". In: Journal of Computational Physics 378 (2019), pp. 686–707. issn: 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.10.045. url: http://www.sciencedirect.com/science/article/pii/S0021999118307125.


[20] L. Zheng and X. Zhang. "Chapter 8 - Numerical Methods". In: Modeling and Analysis of Modern Fluid Problems. Ed. by Liancun Zheng and Xinxin Zhang. Mathematics in Science and Engineering. Academic Press, 2017, pp. 361–455. isbn: 978-0-12-811753-8. doi: https://doi.org/10.1016/B978-0-12-811753-8.00008-6. url: https://www.sciencedirect.com/science/article/pii/B9780128117538000086.

[21] SciPy documentation. scipy.integrate.solve_ivp. url: https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_ivp.html.

[22] Carter-Greaves, Laura E. Time Series prediction with Feed-Forward Neural Networks - A Beginners Guide and Tutorial for Neuroph. http://neuroph.sourceforge.net/Time%20Series%20prediction%20with%20Feed-Forward%20Neural%20Networks.pdf.

[23] Ka Chun Luk, James Ball, and Ashish Sharma. "A study of optimal model lag and spatial inputs to artificial neural network for rainfall forecasting". In: Journal of Hydrology 227 (Jan. 2000), pp. 56–65. doi: 10.1016/S0022-1694(99)00165-1.

[24] SciPy documentation. scipy.stats.wilcoxon. url: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.
