Självständigt arbete i informationsteknologi (Independent Project in Information Technology), 15 June 2020

Fault Detection AI For Solar Panels

Jonathan Kurén, Simon Leijon, Petter Sigfridsson, Hampus Widén

Civilingenjörsprogrammet i informationsteknologi
Master Programme in Computer and Information Engineering

Institutionen för informationsteknologi (Department of Information Technology)
Visiting address: ITC, Polacksbacken, Lägerhyddsvägen 2
Postal address: Box 337, 751 05 Uppsala
Website: https://www.it.uu.se
Abstract
Fault Detection AI For Solar Panels
Jonathan Kurén, Simon Leijon, Petter Sigfridsson, Hampus Widén
The increased usage of solar panels worldwide highlights the importance of being able to detect faults in systems that use these panels. In this project, the historical power output (kWh) from solar panels combined with meteorological data was used to train a machine learning model to predict the expected power output of a given solar panel system. Using the expected power output, a comparison was made between the expected and the actual power output to analyze if the system was exposed to a fault. The result was that, when applying the explained method, an expected output could be created which closely resembled the actual output of a given solar panel system, with some over- and undershooting. Consequently, when simulating a fault (a 50% decrease of the power output), it was possible for the system to detect all faults if analyzed over a two-week period. These results show that it is possible to model the predicted output of a solar panel system with a machine learning model (using meteorological data) and use it to evaluate if the system is producing as much power as it should be. Improvements can be made to the system, where adding additional meteorological data, increasing the precision of the meteorological data and training the machine learning model on more data are some of the options.
Handledare (Supervisors): Mats Daniels, Dilushi Piumwardane, Björn Victor and Tina Vrieler
Examinator (Examiner): Björn Victor
Sammanfattning
With an increasing use of solar panels around the world, the importance of being able to detect operational faults in the panels also grows. By utilizing the historical power output (kWh) from solar panels together with meteorological data, machine learning models are used to predict the expected power output for a given solar panel system. The expected power output is then compared with the actual power output to detect whether a fault has occurred in the system. The result of using this method is that an expected power output resembling the actual power output can be modelled. Consequently, when a fault is simulated (a 50% decrease in power output), it is possible for the system to find all introduced faults when analyzing over a time span of two weeks. These results show that it is possible to model the expected power output of a solar panel system with a machine learning model and to use it to evaluate whether the system is producing as much power as it should. The system can be improved in several ways, where adding more meteorological parameters, increasing the precision of the meteorological data and training the machine learning model on more data are some of the options.
Contents

1 Introduction
2 Background
   2.1 An Overview of Solar Cells
   2.2 Factors Affecting Power Output
   2.3 STRÅNG - A Solar Irradiance Model
   2.4 Machine Learning Concepts
      2.4.1 Regression vs Classification
      2.4.2 Data Preprocessing
      2.4.3 Evaluation Metrics
      2.4.4 Model Validation
3 Purpose, Aims, and Motivation
   3.1 Delimitations
4 Related Work
   4.1 A Statistical Method to Find Faults
   4.2 Comparing Simulated Output with Measured Output
5 Method
   5.1 Programming Language
   5.2 Machine Learning Model - Random Forest Regression
   5.3 Scoring Method and Validation Technique
   5.4 Meteorological Data
6 System Structure
7 Requirements and Evaluation Methods
   7.1 Regression Model Testing
   7.2 Fault Detection Testing
8 Data Gathering
   8.1 Weather Data from SMHI
   8.2 Solar Irradiance Data from STRÅNG
   8.3 Data Format
9 Data Preprocessing
   9.1 Removal of Data
   9.2 Constructing and Adding Data
10 Fault Detection
   10.1 Expected Output
   10.2 Finding Faults
   10.3 Simulating Faults
11 Evaluation Results
   11.1 Regression Model Results
   11.2 Fault Detection Results
12 Results and Discussion
   12.1 Regression Model
   12.2 Feature Importance
   12.3 Precision of the Meteorological Data
   12.4 Expected Output - Overshooting and Undershooting
   12.5 Fault Detection - Analysis
13 Conclusions
14 Future Work
A Libraries
B Explanatory Variables
C Models and Scalers
D Data Set
E Decrease Tables
1 Introduction
The usage of solar panels, also referred to as photovoltaic (PV) modules, has seen an almost exponential increase globally during the last decade [19]. How a system of PV modules performs depends heavily on weather [9], dirt covering the panels [2] and a multitude of other factors. Some factors that decrease the energy production of PV modules are natural and cannot be prevented by an owner, such as the angle of the sun, clouds and other weather-related causes. There are other sorts of energy production decreases that an owner could actually stop from happening. Examples of these could be a PV module breaking down, leaves covering the panels or other factors of a similar nature. The focus of this project was to develop a software-based fault detection system that can detect decreases of this kind in a photovoltaic system.
To detect severe decreases, hereinafter referred to as faults, in a PV system, the historical energy production of the system and meteorological data were utilized. The meteorological data is gathered in close geographical and temporal proximity to the PV system. The energy production of a PV system over a given time period (e.g. an hour) will in this report be referred to as power output, measured in the unit of kWh. The power output and meteorological data were then used to train a machine learning model which predicts the expected power output for a given PV system. The fault detection system uses the expected, or predicted, power output and the actual power output to detect if the PV system is faulty or not. A fault detection system of this kind will make it easy to diagnose and detect a faulty PV system, seeing as it can be done remotely. This could lead to faster repairs or necessary maintenance and more produced energy.
The aim of creating a fault detection system that can detect faults in PV systems by using machine learning together with historical power output and meteorological data was achieved. However, the extent of what the system can detect turned out to be dependent on three parameters: the size of the power decrease, the threshold and the time horizon. By simulating a fault (a 50% power output decrease) it was found that the fault detection system can detect all of the faults while analyzing over a two-week period. However, simulating lesser faults, decreasing the time horizon or decreasing the threshold may negatively impact the result, which is elaborated on later in the report.
2 Background
This section will begin with a brief description of solar cells, how they work and what affects the power output. General concepts regarding machine learning that are of importance to this project will also be covered.
2.1 An Overview of Solar Cells
Solar cells, or PV cells, generate electricity by being hit by light. A cell is made of semiconductors, often silicon, and the cell has conductors connected to its positive and negative side, forming an electric circuit. When the cell is hit by light, electrons will be released from the semiconductor material and create an electric current [15]. Multiple cells connected together are called a PV module, and multiple modules connected together are called a PV array. Figure 1 illustrates this.
Figure 1 Visualization of a PV array and its components
A module's efficiency is based on its direct current (DC) output under certain conditions called standard test conditions [12]. These conditions are specific conditions concerning a module's temperature and the solar irradiance the module is exposed to. Under these conditions modern PV modules have an efficiency of around 15%, which means that they can convert 15% of the sunlight into electric energy [32].
2.2 Factors Affecting Power Output
There are a lot of different factors that affect the output of a solar cell, and some major ones will be described here. One example is clouds, which reduce the amount of sunlight that can reach the PV modules, but do not remove all of it. On a cloudy day the power produced can be reduced by 75% [17].
Higher temperatures reduce the power output of PV modules. Each
module has a
property called the temperature coefficient pMax. This coefficient is provided by the manufacturer and it describes how much the maximum power will decrease when the temperature rises 1°C above 25°C. The reduction in power can range between 10-25% depending on the location of the PV system [6, 18]. For example, if pMax is -0.5 (percent per °C) and the temperature of the module is 45°C, the reduction in power will be (45 − 25) × 0.5% = 10%.
PV systems will degrade over time and normally the decrease in power will be 10% after 10-15 years and 20% after 20-25 years [22]. PV systems that were subjected to snow and wind had higher degradation rates than the ones that were not. Also, PV arrays that were mounted in the desert had high degradation rates [16].
2.3 STRÅNG - A Solar Irradiance Model
STRÅNG is a model that calculates different solar irradiance parameters over northern Europe [30]. It was created through a joint effort between SMHI, the Swedish Environmental Protection Agency (Naturvårdsverket) and the Swedish Radiation Safety Authority (Strålsäkerhetsmyndigheten). By using stations that can measure solar irradiance together with information about clouds, ozone and water vapor, STRÅNG can estimate the solar irradiance at a certain latitude and longitude combination. When using hourly model predictions the error can be up to 30% [31].
2.4 Machine Learning Concepts
Machine learning is the study of computer algorithms that improve automatically through experience [13]. In machine learning, the data can be preprocessed in an attempt to have the algorithm make better predictions or decisions. The algorithms used can be evaluated using different metrics, and also validated to see how well they behave on new data. The following sections describe machine learning concepts, such as those above, used in this report.
2.4.1 Regression vs Classification
Within machine learning two categories exist, supervised and unsupervised machine learning. In supervised learning the model is given input, called explanatory variables or features, and an output, called the response variable, whereas in unsupervised learning only explanatory variables are given [25, pp. 393-394]. Supervised machine learning has two subcategories, regression and classification. Both of these share the same concept of trying to learn the mapping function between the explanatory variables and the response variable.
Regression techniques predict a single, continuous value. An example could be to predict house prices based on explanatory variables such as location and size. On the other hand, classification techniques try to group the predicted value into a class. This results in the predicted value being discrete. An example of a classification problem could be to classify an image of an animal into classes such as cat or dog.
2.4.2 Data Preprocessing
Preprocessing data is done to prepare raw data for further processing [23]. Real-world data can be corrupt, inaccurate and sometimes data points can be missing. Missing data is a common occurrence and can threaten data quality. To deal with this there exist techniques such as imputing a value or simply excluding the entire record [24].
A scaler is sometimes applied to the data before feeding it to the machine learning model. There are several reasons to use a scaler. One aspect is to balance (scale, normalize, standardize) the data, which leads to a more even representation of the explanatory variables [10]. If this is not performed, the explanatory variables with large values will heavily impact the machine learning model, even if the explanatory variable in reality has little impact. Furthermore, scaling the data leads to faster convergence when training the machine learning model.
Furthermore, some data sets, more specifically time series data, can have seasonal or cyclical behavior. An example could be the correlation of time and temperature, with higher temperatures at midday than at midnight. Representing this behavior in the data sets can further improve the later analysis. A method to represent this as an explanatory variable in a data set is to translate the linearity of time into the cyclical behavior of sine and cosine.
Another technique that can be used when working with time series data and forecasting is the sliding window technique [3]. This technique is based on adding previous response variables to the next entry as explanatory variables. The number of previous response variables added as explanatory variables determines the window size.
2.4.3 Evaluation Metrics
Evaluating the accuracy of a machine learning model can be done in many different ways. Depending on whether it is a regression or classification problem, the methods will differ. For classification problems the F1 score is a common one. The F1 score is a measure between 0 and 1 and is calculated by using the predictions a model made, more specifically the true positives, false positives and false negatives. The closer the score is to 1, the better the model is [35].
For regression problems the R2 score and the root mean squared error (RMSE) are two often used evaluation metrics. The R2 score shows how close the predicted output is to the real data, where 1 indicates that the model can explain 100% of the variance in the output. The lower the score, the less the model can explain the variance in the output (the score can even be negative, meaning the model is worse than always predicting the mean) [1]. RMSE gives, as the name suggests, the root of the mean of the squared errors between the predicted output and the real output. A low value indicates that the difference between the predicted and real output is small and that the model is good [21].
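As a small illustration (our own sketch, not code from the project), both metrics can be computed with scikit-learn, the library used in this project, on made-up values:

    # Minimal sketch: R2 and RMSE with scikit-learn, on made-up example values.
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    y_true = np.array([0.0, 1.2, 3.4, 2.8, 0.5])  # actual power output (kWh)
    y_pred = np.array([0.1, 1.0, 3.0, 3.1, 0.4])  # predicted power output (kWh)

    r2 = r2_score(y_true, y_pred)                       # 1 means all variance explained
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same unit as the output (kWh)
    print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f} kWh")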
2.4.4 Model Validation
When a machine learning model has been trained it is of importance to validate it on new data to avoid issues like overfitting and selection bias. Below are the validation techniques that were considered for this project.
k-Fold Cross-validation
One common validation technique in machine learning is cross-validation. One method of cross-validation is what is called k-Fold cross-validation. The process of the k-Fold method consists of the following steps [25, pp. 32-33]:
1. Shuffle the data set
2. Split the data set into k smaller sets
3. Train on k-1 sets
4. Test the prediction accuracy on the leftover set for the
evaluation score
Then, the process can be repeated while holding out different test sets for the next iteration. Lastly, the model evaluation scores are summarized to represent the total score for the trained model.
LOOCV - Leave One Out Cross-validation
Another form of cross-validation is Leave One Out Cross-validation [4]. This is essentially an extreme case of k-Fold where k is chosen to be the number of data points in our data set. With the LOOCV method there is no grouping of data points into smaller sets like what is done with the k-Fold method. Instead, all data points except for one are used to train on, while the single remaining data point is used to test the accuracy of the model. This has the advantage of using more data to train on.
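Both techniques are available in scikit-learn; the following is a minimal sketch of ours, on randomly generated data rather than the project's data sets:

    # k-Fold and leave-one-out cross-validation of a random forest.
    # The data below is synthetic and only for illustration.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((100, 4))  # 100 samples, 4 explanatory variables
    y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.1, 100)

    model = RandomForestRegressor(n_estimators=100, random_state=0)

    # k-Fold: shuffle, split into k sets, train on k-1 sets, test on the leftover set.
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    print("5-fold mean R2:", cross_val_score(model, X, y, cv=kfold, scoring="r2").mean())

    # LOOCV: k equals the number of data points. R2 is undefined for a single
    # held-out point, so this LOOCV sketch is scored with an error metric instead.
    loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    print("LOOCV mean squared error:", -loo.mean())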
3 Purpose, Aims, and Motivation
The aim of this project was to develop a system that can detect faults in PV systems. By using the historical power output from PV systems combined with meteorological data, the system should be able to model the expected output of a PV system. The faults are detected by comparing the expected power output with the actual power output of the PV system.
Concretely, the goal of the system is to find as many real faults as possible while at the same time not stating that there are faults if there are none. The time it takes for these faults to be found should also be as low as possible.
The idea that motivated this project came from our stakeholder HPSolarTech, an Uppsala-based company working in the photovoltaics industry. They wanted to explore the possibility of creating a software-oriented solution that can detect faults in PV systems using machine learning. If this project accomplishes this, it would be a general improvement for the solar industry, which could result in an increased usage of PV systems in our society.
One of the reasons that this project is of importance is the fact that the installed PV systems across the world help to reduce CO2 emissions. Increasing the share of PV in the electricity mix will hence decrease the environmental impact of the countries that implement it [19]. If our project succeeds in making solar energy more appealing, it would therefore have an indirect, positive impact on the progress towards environmentally sustainable cities.
3.1 Delimitations
A delimitation, concerning the machine learning part of the system, is that the only available information about a PV system is the power output. This makes it hard to use one model for all PV systems, since the PV systems can differ in size and efficiency. Having one model for all PV systems would mean that the data for all PV systems could be used to train that one model. This would lead to the model having a lot more data to train on, rather than splitting the data amongst several models.
An integral part of this project is to be able to detect when there is a fault in a PV system in the form of a technical malfunction. However, the data available from the PV systems does not have recorded information about when faults have actually occurred. Consequently, there are two effects on this project. Firstly, this means that the model might be training on faults. For example, if one of the PV systems has had a fault for as long as data has been recorded, there is no way for the fault detection system to tell that there is a fault. Secondly, the faults the system is to make predictions on need to be simulated. While these simulations can be close to reality, they obviously do not exactly correspond to what an actual fault would look like.
During November, December and January the panels have a high risk of being covered by snow. The system considers snow covering the panels a fault and should alert when a panel is covered by it. The consequence of this is that if the model is trained on data from the mentioned months it will train on faulty data according to the definition of a fault. Therefore, these months were removed from the data sets, which significantly increased the performance of the model. The effect of doing this is that the system will not work during these months.
Lastly, one of the main points of the system is to be able to detect malfunctioning solar panels with the only information extracted from a given PV system being its power output. By only using the power output it is not possible for the fault detection system to know if the PV system in question grows in size or capacity; extra information would have to exist to know this. This means that it is possible for a PV system to increase in capacity while the model is trained on it having a lower capacity, which will make it hard for the model to accurately predict if there is a fault.
4 Related Work
This section highlights some studies and projects related to this project in their ways of detecting faults in PV systems. The referenced systems are similar in certain ways, like what metric is measured from the PV modules, while they differ in how they analyze the gathered data.
4.1 A Statistical Method to Find Faults
Several studies have conducted analysis of various faults occurring in photovoltaic systems. One study [34] makes use of two different statistical methods. This is done in order to build a confidence interval for each subsystem's power output in the PV system. The underlying assumption, however, is that this purely statistical analysis can only be performed if the PV system can be viewed as a system divided into several equivalent independent subsystems. In practice, this means that the power output must be readable for each subsystem (which might not be the case). This study is relevant here since, although it does not make use of meteorological data, the only physical quantity measured at the PV system which is analyzed is the power output. The difference between their approach and the approach used for this project is regarding what data is used. For this project's fault detection system there is less granularity in the information from the PV systems. This is on the other hand compensated for with additional meteorological data. How the data is processed to detect failures also differs a lot, since the study mentioned uses purely statistical methods to determine if a failure has occurred while this project uses machine learning models to predict failures.
4.2 Comparing Simulated Output with Measured Output
Another study that is interesting in relation to this project is the joint study by the European Commission which was a vital part of the PVSAT-2 project [26]. This study compared the predicted (simulated) power to the measured power from the PV system, taking meteorological data into consideration. The comparisons are done for different time spans, comparing the current day, the past 7 days as well as the past 30 days. Each of these time spans is taken into consideration when answering whether or not the measured output is in line with the simulated output. In the case of there being a significant difference in the output, a profile is created describing the failure of the PV system. By then comparing the created profile with predefined ones for the different failures, the probability for each one of them is calculated. Similar to this project, meteorological data is directly used in order to simulate/predict the current power output from the PV system, and power is likewise the only measurement. The key difference here is that the predicted power is simulated purely from a mathematical model, whilst in this project it is determined by using AI and machine learning.
Furthermore, other systems have been developed to automatically detect faults in PV systems by simulating the PV system with a mathematical model. Another example of this is the procedure for automatic fault detection presented by Silvestre, Chouder and Karatepe where they, akin to the PVSAT-2 study, take the output of the simulated PV system and compare it to the actual output of the PV system to determine whether a fault has occurred or not [27]. This is once again related to our method, with the difference being that the expected output from our PV system is not generated by a simulated PV system. Instead, our system generates an expected output based on historical data by using machine learning, as described in Section 5.
5 Method
The following section describes the different methods and techniques used in this project. It explains what programming language was used, from where the data was gathered, what machine learning model was used and why it was the best fit for this project.
5.1 Programming Language
The code base for this project is written in the programming language Python. According to Thomas Elliot at GitHub, Python was one of the most popular languages used for machine learning in 2018 [5]. Libraries generally used in connection with machine learning include numpy for matrix operations, pandas for applying the matrix operations to data sets and sklearn for applying machine learning algorithms to the data. These libraries are used in the machine learning part of this project.
The main contender to Python for machine learning is R. Since this project aims to create a system that is usable in the solar panel industry and not in isolation, it is important that it is written in a programming language that is easy to combine with other languages. Because Python is a general-purpose language it is easier to combine with other languages than R, while it also simplifies tasks such as accessing APIs [8]. Furthermore, the aforementioned libraries for Python make it easier to manipulate data in Python than in R, which is important for the handling of the data from different sources that is processed in the system.
Considering the aspects discussed above, the choice of Python as the programming language for the entire system was straightforward. It simplifies integration and data processing while also having the required support for the machine learning operations that are needed.
5.2 Machine Learning Model - Random Forest Regression
In order to determine which machine learning model would be used to predict the expected output, a multitude of models were tried, the results of which can be found in Table 1. Random Forest Regression, or RFR [14], came out as the best performing one. RFR is a more advanced version of regression trees. A regression tree is a decision tree where the output is continuous (i.e. a real number) instead of discrete (true, false or choice A, B, C etc.). Several regression trees are created by resampling the training data for each tree. Together these regression trees make up the "forest". Finally, the output of the RFR is calculated as the mean value of all trees in the forest.
As can be seen in Table 1, using RFR with no scaler gave the best R2-score. As mentioned in Section 2.4.2, a scaler can sometimes be used on the data set. This is mostly to avoid the issue of having explanatory variables of different sizes. In RFR this is not a problem, since the partitioning in the regression trees will be the same even if you scale the data, as long as the order is the same [11].
5.3 Scoring Method and Validation Technique
The scoring method chosen to evaluate the machine learning models in this project was the R2-score. In the choice between the R2 and the RMSE scoring methods, the R2-score was favored since it is a relative measurement of how well the model fits the data. In order to get a feel for how big an error is with RMSE, the RMSE needs to be compared to the size of the values in the data. Using R2 (which is bounded above by 1) the meaning of the score is always intuitive and easy to understand [33]. In practice, when it comes to finding the best machine learning model, however, either of these could be used. This is because searching for the lowest RMSE and the highest R2 score would yield the same result.
When determining what validation technique to use, the main concern, apart from actually being able to validate the model, was computation intensity. Since k-Fold repeats the process described in Section 2.4.4 k times, and LOOCV is essentially k-Fold with as large a k as possible, the latter is naturally more computation intensive. Repeating this k-Fold process for every data point in the data sets, where most of them span multiple years with data recorded every hour, would be too time-consuming.
5.4 Meteorological Data
Weather data is needed to predict the output of a PV system. Since there is no weather data included in the data from the PV systems, it had to be retrieved from somewhere else. The only service found that provides historical weather data in Sweden is the Swedish Meteorological and Hydrological Institute's (SMHI) open data. By using SMHI open data it is possible to get the meteorological data for a collection of weather stations in Sweden. This data contains parameters such as air temperature, humidity, solar irradiance, air pressure and cloud base [28]. How often each station saves its data differs from station to station, between hourly and once each day, depending on how advanced the station is. It is possible to retrieve either the last hour, the last day, the last four months or historical data that has been quality controlled. Some parameters such as air temperature have over 1000 stations located in different parts of Sweden, while solar irradiance only has about 20 [29]. To compensate for this, STRÅNG, a model which calculates solar irradiance at a specific coordinate in northern Europe (see Section 2.3), is used for the solar irradiance parameter.
6 System Structure
The system consists of three modules: data gathering, data preprocessing and fault detection (see Figure 2). Data gathering is where calls to different APIs are made and the resulting data is saved. The data from the API calls is then integrated and processed in the data preprocessing module. The fault detection module is where a decision is made whether the PV system is faulty or not.
Figure 2 Overview of the system modules and the system's process of finding faults.
The first module of the system is the Data Gathering module. This is where the system makes different API calls to retrieve meteorological data and combines it with PV system data for a given PV system. After the data is gathered and compiled into the same file, each PV system has its own file containing the PV system data and the corresponding meteorological data for each data point. The compiled file is then forwarded to the Data Preprocessing module.
The Data Preprocessing module is responsible for two key areas. Firstly, cleaning the data retrieved by the Data Gathering module; cleaning data is the process of detecting and correcting corrupt or incorrect data. Secondly, constructing and adding more features to the data set; constructing features refers to the process of converting existing features into another form. After these steps the data can be used in the Fault Detection module.
The last module of the system is the Fault Detection module. This module involves two components. One component predicts the expected output for a PV system using the machine learning model. The other component uses the expected output to detect if a PV system is faulty or not.
7 Requirements and Evaluation Methods
This section presents the evaluation methods for finding which regression model performed the best and how the fault detection system will be tested.
7.1 Regression Model Testing
To find which machine learning model is best suited for predicting the output of a PV system, a comparison between a number of different regression models was done. Shay Geller [7] published an article on Towards Data Science where he compared the results of using different machine learning models on a classification problem. For each model Geller also compared the results for a variety of different scalers. In the end it was possible to see which combination of scaler and model had the highest F1-score for the problem. By switching all classification models to regression models and changing the F1-score to the R2-score, Geller's code was used to find the best scaler-model combination. The models and scalers that were used for the comparison are listed in Appendix C.
The R2-score that is calculated for a model-scaler combination shows how good the predictions are for a single PV system. As mentioned in Section 2.4.3, the R2-score indicates how well the model can explain the variance of the output; a higher score is better. The results from one PV system might not be representative of every PV system. Therefore the test is run over multiple PV systems and the mean of the R2-scores for all PV systems is used. The model-scaler combination with the highest mean R2-score is the one that will be used. To evaluate how well the model can represent the expected output, the results in the fault detection part are used as a measurement. If the model is able to find faults, while still upholding the requirements described in Section 7.2, the model is said to be good enough.
7.2 Fault Detection Testing
To see how well the fault detection system can find faults, three tests will be conducted. In the first test the simulated fault will be a 40% decrease, in the second a 50% decrease and in the third a 60% decrease. For each test there will be 42 PV systems that will each have the specified fault simulated in the entire test set. This means that there are 42 possible faults to find. Furthermore, for each test there are two variables that can be changed to find the optimal fault detector: the threshold and the time horizon. The tests will be done using thresholds between 20% and 75% and time horizons between 3 and 21 days. More information about the threshold, time horizon and simulated fault can be found in Section 10.2.
Each test will also be run twice, where the second time will be without any simulated fault. This is because if the system finds a fault in a test without any simulated faults, this needs to be marked as a false positive. If instead no fault was found in the test without simulated faults but a fault was found in the test with simulated faults, this is marked as a true positive. If no fault was found in either the test without simulated faults or the test with simulated faults, this is marked as a false negative. More concretely, a false positive means that the system found a non-existent fault, a false negative means the system did not find an existing fault, and a true positive means the system found a real fault. These marking rules are summarized in a sketch at the end of this section.
In addition to actually finding the fault, the time it takes to find the fault is also an important aspect. To measure this, the average time it takes for the system to find a fault, when run over all PV systems, is used. Note that the time it takes for the system to find a fault is not actual run time but how many data points need to be used to find the fault.
The following requirements have been chosen to evaluate for which values of the variables the results are acceptable:
1. No false positives.
2. The average time it takes for the system to find the fault needs to be the same as the time horizon.
3. No false negatives.
The first requirement was chosen because it is not acceptable for the system to say that there is a fault when there is none, since that might lead to an attempted repair of a non-faulty panel. The second requirement was chosen because, when using a certain time horizon, it is expected that the system finds the faults in that time. If the average time exceeds the time horizon it means that some faults went unnoticed for some time, which is something that is to be avoided. The third requirement was chosen because the system should be able to find all the faults.
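The marking rules above can be summarized in a small sketch (our own code; the names are illustrative, not taken from the project):

    def mark_outcome(found_with_fault: bool, found_without_fault: bool) -> str:
        """Classify one PV system's paired test runs (with/without a simulated fault)."""
        if found_without_fault:
            return "false positive"  # a fault was "found" where none was simulated
        if found_with_fault:
            return "true positive"   # the simulated fault was found, with no false alarm
        return "false negative"      # the simulated fault was missed

    print(mark_outcome(found_with_fault=True, found_without_fault=False))  # true positive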
8 Data Gathering
An overview of the data gathering process is as follows: for each PV system, retrieve weather data from SMHI as well as solar irradiance data from STRÅNG and combine the data into a single CSV file for the PV system. Once this is complete for all PV systems, the database containing power output is accessed, and for each PV system its corresponding values are read and added to the existing CSV file.
8.1 Weather Data from SMHI
By using the SMHI Open Data API for meteorological observations, historical data can be gathered from a weather station. The API is used by calling for a specific parameter (e.g. air temperature) from a specific weather station. The immediate problem is how to decide on what station to gather the parameter data from. The method used is to calculate the distance from the PV system to the different stations. For a majority of the PV systems analyzed, either the exact coordinates are known or at the very least the zip code, which can then be used in conjunction with geocoding packages such as pgeocode to find approximate coordinates.
By using the PV system's coordinates and comparing them to the coordinates of each weather station, the nearest station can be found. However, different weather stations measure different meteorological parameters and, additionally, different stations have different time spans. Taking this into consideration, a new station is chosen for each individual parameter. The chosen station is not always the closest station, but rather the closest station which measures the sought parameter and has the correct time span. There is however one parameter which is not gathered by the SMHI API calls, namely solar irradiance, and that is where STRÅNG is used. A sketch of this station selection follows below.
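The sketch below is ours and only illustrative: the station records are hypothetical (a real list would come from the SMHI API), while the pgeocode lookup and the distance formula are real calls.

    import math

    import pgeocode

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two (lat, lon) points."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371 * math.asin(math.sqrt(a))

    # If only a zip code is known, pgeocode can approximate the PV system's coordinates.
    pv = pgeocode.Nominatim("se").query_postal_code("752 37")  # example Uppsala zip code
    pv_lat, pv_lon = pv.latitude, pv.longitude

    # Hypothetical station records, for illustration only.
    stations = [
        {"id": 1, "lat": 59.85, "lon": 17.63, "parameters": {"air temperature"}},
        {"id": 2, "lat": 59.35, "lon": 18.07, "parameters": {"air temperature", "humidity"}},
    ]

    def nearest_station(parameter):
        """Closest station that actually measures the sought parameter."""
        candidates = [s for s in stations if parameter in s["parameters"]]
        return min(candidates, key=lambda s: haversine_km(pv_lat, pv_lon, s["lat"], s["lon"]))

    print(nearest_station("air temperature")["id"])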
8.2 Solar Irradiance Data from STRÅNG
Accessing STRÅNG's data can be done with an API call containing the start date, end date, coordinates and parameter. The parameter in the STRÅNG API that is used is called CIE UV irradiance. The start and end dates are dynamically chosen based on the time span in the earlier SMHI call. The coordinates can be read directly off the PV system and used in the API call, compared to the SMHI API call where a station had to be located.
8.3 Data Format
When all the API calls are done, the data needs to be compiled into a CSV file. There will be one row of data for every hour in the time span used in the data gathering. For each row there will be columns containing the date (year/month/day/hour) represented as a Unix timestamp [20], the meteorological data (explanatory variables) and the power output (response variable).
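To illustrate the layout (with made-up column names and values, not the project's actual files):

    # Hypothetical two-row example of the per-system CSV described above.
    import io

    import pandas as pd

    csv = io.StringIO(
        "unix_time,air_temperature,humidity,solar_irradiance,power_output\n"
        "1569913200,12.3,0.81,105.0,1.42\n"
        "1569916800,12.9,0.78,230.5,2.96\n"
    )
    df = pd.read_csv(csv)
    print(df)  # one row per hour: timestamp, explanatory variables, response variable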
9 Data Preprocessing
The steps of preprocessing the CSV files are as follows: read the CSV file, remove rows with missing values, remove unnecessary explanatory variables, and construct and add some explanatory variables. An example of a CSV file used can be found in Appendix D. The following subsections discuss the steps of removing data and the way new explanatory variables are constructed and added.
9.1 Removal of Data
As mentioned in Section 2.4.2, real-world data can often be corrupt or have missing values. In the CSV files there are no corrupt values but there might be a few missing values. This problem could come from the APIs, SMHI and STRÅNG, having some missing data points. To resolve this problem, the rows with missing values are simply removed. Apart from this, the column with the date, in the format of Unix time, also needs to be removed and not used as an explanatory variable. This column represents the date as a continuous value which increases in a linear fashion; this does not correlate with the cyclical behavior of hours in a day and months in a year. For example, during certain parts of the day and certain months there are more sun hours.
9.2 Constructing and Adding Data
In the previous section an example of the cyclical behavior of hours and months was mentioned. Because these factors play a big role in solar production, they cannot be fully disregarded when analysing the data set. To add these factors back into the data sets, but with the cyclical and seasonal behavior, sine and cosine are used. Firstly, two columns for hours are generated by converting an hourly time to numerical values:

    sin(hour · 2π / 24) and cos(hour · 2π / 24),

where hour is the specific hour to be converted (a number between 0 and 23). Similarly, this is done for months:

    sin(month · 2π / 12) and cos(month · 2π / 12),

where month is the month represented as a value between 0 and 11.
Another column is then added to the data set; this column has values based on a technique called sliding window. As mentioned in Section 2.4.2, this works by adding the previous response variable as an explanatory variable on the next row. This becomes problematic for the first row, since it has no previous value. Because the value is missing, the row is deleted, similar to how it is handled in Section 9.1. The new first row uses the output of the removed row as the sliding window value. A condensed sketch of these preprocessing steps follows below.
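The following is our own condensed sketch of the steps in Sections 9.1 and 9.2, assuming a DataFrame shaped like the CSV described in Section 8.3 (the column names are stand-ins, not the project's):

    import numpy as np
    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        df = df.dropna()                                    # drop rows with missing values
        ts = pd.to_datetime(df.pop("unix_time"), unit="s")  # remove the linear Unix time column

        # Re-introduce time as cyclical explanatory variables (Section 9.2).
        df["sin_hour"] = np.sin(ts.dt.hour * 2 * np.pi / 24)
        df["cos_hour"] = np.cos(ts.dt.hour * 2 * np.pi / 24)
        df["sin_month"] = np.sin((ts.dt.month - 1) * 2 * np.pi / 12)  # months as 0-11
        df["cos_month"] = np.cos((ts.dt.month - 1) * 2 * np.pi / 12)

        # Sliding window of size one: the previous hour's output as a new column;
        # the first row has no previous value and is dropped, so the new first row
        # uses the removed row's output as its sliding window value.
        df["previous_output"] = df["power_output"].shift(1)
        return df.dropna()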
10 Fault Detection
The steps in the fault detection are as follows. An expected output is created using a machine learning model. Then faults are simulated in the real output. The system then tries to find these faults by comparing the expected output with the faulty output.
10.1 Expected Output
The expected output is calculated by using a Random Forest Regression model. The number of trees in the random forest was set to 100. During testing with lower numbers of trees, significant drops in the R2 score were observed. Increasing it to more than 100 trees did not yield any observable improvement in terms of R2 score, while negatively affecting runtime.
When creating the expected output for a PV system, 80% of the data set is split into a training set and the remaining 20% is used as the test set. These sizes were chosen since having a big training set leads to a better trained model, which gives better results. However, when increasing the size of the training set to 90%, the test set of 10% became very small for some PV systems which did not have a lot of data. A small test set might not represent the data accurately. Splitting the training and test sets 80-20 was a good balance between having as big a training set as possible whilst still having a good representation of the total data in the test set. The random forest model then trains on the training set and is used to make predictions on the test set. The predictions made on the test set will be the expected output of the PV system and are used in the part of the system that finds the faults.
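A minimal sketch of ours of this step follows; the report does not name the exact helper used for the split, and the chronological (unshuffled) split is our assumption based on the contiguous test periods shown in the figures:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X = rng.random((1000, 11))  # stand-in for the 11 explanatory variables
    y = rng.random(1000)        # stand-in for the hourly power output (kWh)

    # 80-20 split; shuffle=False keeps the test set as the chronologically last 20%.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

    model = RandomForestRegressor(n_estimators=100)  # 100 trees, as chosen in the project
    model.fit(X_train, y_train)
    expected_output = model.predict(X_test)  # compared against y_test in fault detection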
10.2 Finding Faults
Fault detection is performed by comparing the expected output with the actual output of the PV system. The difference between the expected output and the measured output, relative to the expected output, gives the relative fault in terms of percentage. The relative fault is used in a comparison with a threshold of, for example, 50%.
The fault detection does not happen momentarily, i.e. for one data point. Instead, all data points over a specified length of time, called the time horizon, are considered. Figure 3 shows a seven-day time horizon over the predicted output and a faulty output.
Figure 3 A 7-day time horizon visually represented on a plot showing the predicted output vs a faulty output with a 50% decrease
For all data points in the time horizon the error, i.e. the average difference relative to the average expected output, is considered:

    Error = avg(expected − measured) / avg(expected)
This Error is what is compared with the threshold; if it exceeds the threshold, the system will report a fault. If the threshold is not exceeded, the time horizon will move forward one hour and the error will be calculated again. In this way the time horizon will move through the entire data set, containing as many data points as the size of the time horizon at any given time. The system will report no errors if the entire data set is traversed by the time horizon and no errors exceeded the threshold. The correlation between the threshold and the time horizon is typically that smaller time horizons lead to more variance, which results in needing a higher threshold in order not to report false positives. Similarly, a large time horizon results in needing a smaller threshold, so as not to yield false negatives (i.e. missing actual faults). A sketch of this rolling comparison follows below.
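In code, the rolling comparison could look like the following sketch (our own, assuming hourly numpy arrays of equal length; the threshold is a fraction, e.g. 0.5 for 50%):

    import numpy as np

    def find_fault(expected, measured, horizon_hours, threshold):
        """Slide the time horizon one hour at a time; return the index of the first
        window whose relative error exceeds the threshold, or None if none does."""
        for start in range(len(expected) - horizon_hours + 1):
            e = expected[start:start + horizon_hours]
            m = measured[start:start + horizon_hours]
            error = (e.mean() - m.mean()) / e.mean()  # avg(expected - measured) / avg(expected)
            if error > threshold:
                return start
        return None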
10.3 Simulating Faults
As described in Section 3.1, one delimitation of this project was that, to test the system, faults had to be simulated in the data sets. To simulate the faults, the test set containing the power output data is decreased by a certain percentage during chosen intervals. The function that handles the decreasing of the data takes a list of (interval, percentage)-tuples as input arguments, where the percentage decrease is applied over the given intervals. Depending on the length of these intervals and the magnitude of the percentage decrease, the difficulty of detecting a fault will vary. A sketch of such a function is shown below.
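The exact signature is not given in the report, so the following is one plausible sketch of ours:

    import numpy as np

    def simulate_faults(power_output, faults):
        """Apply each percentage decrease over its (start, end) index interval.

        faults: a list of ((start, end), percentage) tuples, e.g. [((0, 336), 50)]
        for a 50% decrease over two weeks of hourly data."""
        faulty = np.asarray(power_output, dtype=float).copy()
        for (start, end), percentage in faults:
            faulty[start:end] *= 1 - percentage / 100  # e.g. 50 -> multiply by 0.5
        return faulty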
Figure 4 Example of how the system predicts vs the real output with a 50% simulated fault
Figure 4 showcases the predicted output versus the real output with a simulated fault over a two-week period. The difference between the two plots is what is utilized in the fault detection part of the system.
11 Evaluation Results
This section shows which model-scaler combination performed the best, together with example plots of what the predictions look like visually. The section also shows how well the fault detection system can find simulated faults.
11.1 Regression Model Results
Table 1 shows the R2-scores of each model-scaler combination. For each model, every scaler is tried in combination to see which combination performs the best. As can be seen, random forest regression outperforms every other model, with its highest score (0.9218) achieved with no scaler.
Scaler            DTR      KNN      LASSO    LR       MLP      RFR
No Scaler         0.8634   0.8139   0.8629   0.8704   0        0.9218
MaxAbsScaler      0.8708   0.9001   0.8095   0.8546   0.8843   0.9200
MinMaxScaler      0.8710   0.9043   0.8064   0.8547   0.9105   0.9200
Normalizer        -0.0015  0.8642   -0.0015  0.8527   -0.0023  -0.0015
PT Yeo-Johnson    0.8714   0.8935   0.7431   0.7511   0.9154   0.9216
QT-Normal         0.8710   0.8994   0.5948   0.5994   0.9145   0.9200
QT-Uniform        0.8705   0.8980   0.6081   0.6557   0.9196   0.9200
RobustScaler      0.8709   0.9081   0.8524   0.8547   0.9105   0.9200
StandardScaler    0.8709   0.9076   0.8463   0.8547   0.9095   0.9199

Table 1 Model-scaler results (R2-scores).
Figure 5 and Figure 6 show the predicted output versus the real output of a PV system during two different 14-day periods. The predicted output is generated by the machine learning model chosen after evaluating the model-scaler combination R2-scores, i.e. random forest regression. It is observable that the predicted output in Figure 5 more closely resembles the real output than in Figure 6, which is explained in Section 12.4.
Figure 5 Example of how well the system predicts the output
during a two-week period
Figure 6 Example of how the system predicts the output during a
two-week period
11.2 Fault Detection Results
The results of the fault detection system when using different time horizons are presented as true positives (correctly found faults), false positives ("detected" a fault when none existed), false negatives (missed faults) and the average time, in days, until the error was found, averaged over all the PV systems for that threshold. All tables show the results from the test where the simulated error was a decrease of 50%. The results from the other tests, with 40% and 60% decreases, can be found in Appendix E.
Table 2 shows that, with a time horizon of 21 days, it is possible to detect all faults with a threshold ranging from 35-45%. Decreasing the threshold further makes the system detect non-existent faults, as indicated by the one false positive at threshold 30%. Conversely, increasing the threshold above 45% makes the system miss reporting actual faults, which is demonstrated by the five false negatives at threshold 50%.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
20            37              5                0                21.000
25            39              3                0                21.000
30            41              1                0                21.000
35            42              0                0                21.000
40            42              0                0                21.000
45            42              0                0                21.275
50            37              0                5                34.597
55            19              0                23               64.540
60            3               0                39               69.764

Table 2 50% decrease with a time horizon of 21 days.
Table 3 shows that with a time horizon of 14 days it is possible for the system to detect every fault with a threshold of 45%. Setting the threshold higher than 45% has the same effect as in the case of a 21-day time horizon, where the system begins to miss actual faults. Further, it is noteworthy that setting the threshold lower than 45%, e.g. to 40% or 35%, makes the system report false positives where it did not with the 21-day time horizon.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            40              2                0                14.000
40            41              1                0                14.000
45            42              0                0                14.306
50            38              0                4                27.503
55            25              0                17               59.977
60            11              0                31               64.515

Table 3 50% decrease with a time horizon of 14 days.
The most notable result in Table 4 is that it is no longer possible for the system to detect all faults with any given threshold. This is the result of having a small time horizon set to 7 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            36              6                0                7.000
40            37              5                0                7.027
45            41              1                0                7.316
50            41              0                1                11.244
55            31              0                11               48.952
60            20              0                22               56.808

Table 4 50% decrease with a time horizon of 7 days.
Table 5 illustrates an extreme case where the time horizon is very small. Noticeably, the non-existent faults reported as faults are quite many compared to the other time horizons. Increasing the threshold can compensate for this, as can be seen for thresholds above 60%. However, for these greater thresholds, the number of unreported actual errors increases.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            15              27               0                3.000
40            15              27               0                3.014
45            20              22               0                3.321
50            28              14               0                4.394
55            29              9                4                25.616
60            24              4                14               45.889
65            24              2                16               54.703
70            26              0                16               61.191
75            13              0                29               56.224

Table 5 50% decrease with a time horizon of 3 days.
12 Results and Discussion
In this section the results of the fault detection and the regression model, along with its feature importance, will be discussed.
12.1 Regression Model
The R2-scores shown in Table 1 indicate that Random Forest Regression performs the best on the given data. Furthermore, the best R2 score achieved with RFR is when no scaler is used, but the results for all scalers are very similar. This is to be expected, since scaling the data has close to no impact on this type of model.
12.2 Feature Importance
Feature importance is a quantitative measurement of how important the feature/explanatory variable is in the ML model. A higher feature importance means that the explanatory variable contributes more to the decision in the model. For example, if the feature importance is zero, the explanatory variable plays no part in the decision making. The following table displays the feature importance for the RFR model.
Explanatory Variable   Feature Importance
Sun Hours              0.80198
Previous Output        0.14447
Air Humidity           0.01184
Air Pressure           0.00862
Cloud Coverage         0.00735
Air Temperature        0.00725
sin(hour)              0.00668
sin(month)             0.00418
cos(hour)              0.00348
Precipitation          0.00246
cos(month)             0.00171

Table 6 Feature importance of each explanatory variable in descending order.
Unsurprisingly, Sun Hours was the most dominant explanatory variable. Seeing as the power output drops to zero when there is no sun present, this comes rather naturally. Something that was not in line with our initial thinking, however, was the (un)importance of Air Temperature, which we originally thought would be one of the top-ranking explanatory variables. Another thing that was rather unexpected was that Previous Output, which is a parameter created/constructed rather than directly read off a weather station or similar, had such a major influence on the model. What this means is that the current value is actually rather dependent on what the previous value was. This concretely shows the benefit of constructing and adding additional explanatory variables to the machine learning model.
Since the explanatory variables have a one-to-one mapping to the input parameters, the RFR model is limited by which parameters are used. This means that it is possible that other meteorological parameters that we did not take into account could have had a more explanatory effect on the power output. An easy way to test if that is the case is to fetch more of the available parameters from SMHI, like thunder probability or wind direction, and train the model with the additional parameters.
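A table like the one above can be produced with scikit-learn's built-in attribute; a short sketch of ours, continuing the fitted model from the Section 10.1 sketch with stand-in feature names:

    # `model` is a fitted RandomForestRegressor; the names below are stand-ins.
    feature_names = [f"feature_{i}" for i in range(len(model.feature_importances_))]
    ranked = sorted(zip(feature_names, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, importance in ranked:
        print(f"{name:15s} {importance:.5f}")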
12.3 Precision of the Meteorological Data
It is of importance that the fault detection system is able to retrieve as precise meteorological data as possible, so that it can represent the weather surrounding a given PV system. However, when retrieving meteorological data for a given PV system, the fault detection system finds the closest SMHI weather station to the (lat, long)-coordinates of the given PV system, as described in Section 8.1, and then retrieves the data from it. In some cases this means that the weather station that is the closest might actually be far away from the PV system, which results in inaccurate data.
Furthermore, when retrieving data for solar irradiance using STRÅNG, there is an error margin of 30% that cannot be affected. Additionally, the meteorological data gathered at a specific date and time will be the current meteorological state at the given position at that exact time. This means that it does not exactly represent the weather surrounding the station for the entire hour that power output data has been collected. Consequently, if the weather is unstable during that hour, the meteorological data gathered will not have a precise correlation to the power output of the PV system. The result of these factors is that the machine learning model may train on data that does not represent reality accurately.
12.4 Expected Output - Overshooting and Undershooting
Figure 5 and Figure 6 illustrate how the system predicts the output of a given PV system during a two-week period in August-September and October, respectively. In this context, overshooting means that the system believes that the PV system should be producing more power output than it is, which can lead to false positives. Conversely, undershooting means that the system believes that the PV system should be producing less power output than it is, which can lead to false negatives. It is observable that for the first period the model has less under- and overshooting. This is most likely due to the power output being more stable and hence more predictable. The second period shows that the system sometimes overshoots with its predictions, e.g. between 2019-10-11 and 2019-10-13, while also undershooting at around 2019-10-21. This means that, based on the figures presented, we can expect the system to report some erroneous information. A reason why the expected output is overshooting and undershooting might be the error margin in the STRÅNG data, which as mentioned in Section 2.3 can be 30%. An obvious improvement would therefore be to have more accurate sun data. One way to achieve this would be to actually measure the irradiance at the location of the PV system. This would however require extra measurement equipment and is not within the scope of this project, where a software-based solution was explored.
12.5 Fault Detection - Analysis
The results presented in Section 11.2 show that it is in fact possible for the system to fulfill the requirements described in Section 7.2. When analyzing over a time horizon of 21 days (Table 7), it is possible to achieve zero false positives while the average number of days until an error is found equals the time horizon. It can also be seen that a longer time horizon allows a lower threshold while keeping the false positives close to or at zero. To further increase the likelihood of keeping false positives at zero, the threshold would need to be set relatively high, which would increase the false negatives and the average number of days until an error is found. With a time horizon of 14 days, the goal of zero false positives is achieved for one threshold, but the average number of days is slightly higher than the time horizon. Even though this does not fulfill the second requirement, it might be good enough in practice, since the average is still very close to the time horizon. With a time horizon of 3 or 7 days it is difficult for the system to detect more than 50% of the total errors while achieving zero false positives, as can be seen in Table 9 and Table 10. Along with the low detection percentage (true positives versus false negatives), the average number of days until an error is found grows very large for both of these horizons. Short time horizons therefore do not fulfill the requirements.
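For illustration, one plausible reading of the threshold-and-time-horizon rule is sketched below. The project's actual decision rule is defined earlier in the report; in particular, summing the shortfall over the whole window is an assumption made here, not a description of the implemented system.

    def fault_flag(expected_daily, actual_daily, threshold_pct, horizon_days):
        """Flag a fault if, over a sliding window of `horizon_days` days, the
        actual output falls short of the expected output by more than
        `threshold_pct` percent.

        `expected_daily` and `actual_daily` are equal-length sequences of
        daily kWh sums.
        """
        for start in range(len(actual_daily) - horizon_days + 1):
            exp = sum(expected_daily[start:start + horizon_days])
            act = sum(actual_daily[start:start + horizon_days])
            if exp > 0 and (exp - act) / exp * 100 > threshold_pct:
                return True  # sustained shortfall beyond the threshold
        return False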
These results are based only on the sample of 42 PV systems in which we simulated an error. This means that the results, where all the faults can be found with no false positives, cannot be guaranteed in practice, because there might be other PV systems with some factor affecting the output that our model cannot explain.
While we also simulated faults of 40% and 60%, those results do not supply more information than the 50% results discussed and presented in this section. Unsurprisingly, they show that a larger fault makes it easier, and a smaller fault harder, for the system to detect the fault. This can be seen in Appendix E. Furthermore, we did not have any information on how large the output decreases caused by real faults are. It is therefore possible that the percentages chosen in the testing are too large, which would mean the system is not as good as our testing indicates. If real faults are smaller, perhaps a 20% decrease, the system would need a lower threshold. This comes with the downside of a higher risk of false positives, and to compensate the time horizon would need to be longer, thereby increasing the time it takes to find faults.
13 Conclusions
The aim of the project was to create a system that can detect faults in PV systems with the use of machine learning and meteorological data. This was achieved by creating an expected output for a PV system, which is compared to the actual output. The extent of what the system can detect turned out to depend on three parameters: the size of the power decrease, the threshold and the time horizon. The results found in this report show that, given a certain decrease, the threshold and time horizon can be tweaked to yield the desired results: zero false positives, an average number of days until detection equal to the time horizon, and a good detection percentage (true positives compared to false negatives). In conclusion, it is possible to detect faults in PV systems by utilizing machine learning and meteorological data.
14 Future Work
A couple of improvements could make this project more effective at achieving the stated goals. The following are some of them, in regard to the delimitations brought up in Section 3.1.
The algorithm for finding the nearest weather station for a given meteorological parameter was described in Section 8.1. This method was a way of ensuring that the parameters needed to train a model for a given PV system can always be found. However, it poses an important question: how far away can a weather station be before a given parameter stops being representative of the weather conditions where the PV system is located? It might be the case that the system retrieves parameters that are too geographically distant to have any explanatory effect on the power output. Further studies could be made on this to create a more theoretically supported algorithm for collecting parameters, which could increase the precision of the random forest regressor model.
Furthermore, since the system is very limited in the information it has about the PV systems it tries to model, as described in Section 3.1, one model per PV system is used. If more information were added, such as size and capacity, it would be possible to train one model for all PV systems. This would probably increase the prediction accuracy, provided that the additional information is reasonably explanatory of the power output; in other words, it would need a strong enough correlation to the power output.
A Libraries
Several Python libraries were used throughout this project. Four of them stand out: requests, sklearn, pgeocode and pandas.
Requests is an easy-to-use library for sending HTTP requests. In this project that translates to API calls, all of which are performed with requests.
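As a minimal sketch, a data retrieval call with requests might look as follows. The endpoint layout, parameter id and station id are illustrative assumptions, not taken from the report.

    import requests

    # Hypothetical SMHI open data request: parameter 1 (air temperature)
    # for an example station id.
    url = ("https://opendata-download-metobs.smhi.se/api/version/1.0/"
           "parameter/1/station/98210/period/latest-months/data.json")
    response = requests.get(url)
    response.raise_for_status()               # fail loudly on HTTP errors
    observations = response.json().get("value", [])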
Sklearn is a library for machine learning, able to perform classification, clustering, regression and more. It has been used to compare 6 models with 8 different scalers to find a suitable model for predicting power output; a sketch of such a comparison is given in Appendix C.
Pgeocode is a library for geocoding, i.e. converting addresses or similar information into coordinates. It supports 83 countries, including Sweden. The library has been used to get approximate coordinates when no coordinates were available for a PV system but its zip code was.
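For illustration, a zip-code lookup with pgeocode can be done as below; the zip code is an arbitrary example.

    import pgeocode

    # Query approximate coordinates for a Swedish zip code.
    nomi = pgeocode.Nominatim("se")
    location = nomi.query_postal_code("75237")   # example Uppsala zip code
    lat, long = location.latitude, location.longitude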
Pandas is a library for data manipulation and analysis. It has been used to read CSV files into the pandas-specific DataFrame format and then manipulate the data, typically during preprocessing.
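A minimal sketch of this kind of usage, with hypothetical file and column names:

    import pandas as pd

    # Read power output data into a DataFrame; file and column names are
    # hypothetical.
    df = pd.read_csv("pv_output.csv", parse_dates=["datetime"])
    df = df.dropna(subset=["power_output"])      # drop incomplete rows
    df = df.set_index("datetime").sort_index()   # index by timestamp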
B Explanatory Variables
For training the machine learning model, the following explanatory variables were used:
• Air temperature
• Air humidity
• Precipitation
• Cloud coverage
• Air pressure
• Solar irradiance
• Previous output
• The hour of the day converted with a sine function
• The hour of the day converted with a cosine function
• The month of the year converted with a sine function
• The month of the year converted with a cosine function
All of the meteorological parameters listed above were gathered from the SMHI API, apart from solar irradiance, which was collected using STRÅNG. The sine/cosine conversion of the hour and month is sketched below.
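This conversion is a standard cyclical encoding; the sketch assumes a pandas DataFrame with a DatetimeIndex, and the column names are illustrative.

    import numpy as np
    import pandas as pd

    def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
        """Encode hour and month as sine/cosine pairs so that, for example,
        hour 23 and hour 0 end up close together in feature space."""
        hour = df.index.hour
        month = df.index.month
        df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
        df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
        df["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
        df["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
        return df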
C Models and Scalers
For the comparison of different regression models the following
were used:
• Random Forest Regression
• Linear Regression
• Decision Tree Regression
• Neural Network (MLP)
• LASSO
• K-Nearest Neighbour Regression
The following scalers were used in combination with the above models:
• PowerTransformer-Yeo-Johnson
• RobustScaler
• Standardization
• Normalization
• MinMaxScaler
• MaxAbsScaler
• QuantileTransformer-Normal
• QuantileTransformer-Uniform
All models and scalers are imported from the Python library sklearn. A sketch of how the comparison can be set up is shown below.
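This sketch pairs each listed model with each listed scaler in a pipeline. The synthetic data, the 5-fold cross-validation and the R² scoring are assumptions made for illustration; the report's own scoring and validation setup is described in its Method section.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                       PowerTransformer, QuantileTransformer,
                                       RobustScaler, StandardScaler)
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic stand-in for the real data set: 11 explanatory variables
    # (Appendix B) and an hourly power output target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 11))
    y = rng.normal(size=200)

    models = [RandomForestRegressor(), LinearRegression(),
              DecisionTreeRegressor(), MLPRegressor(max_iter=1000), Lasso(),
              KNeighborsRegressor()]
    scalers = [PowerTransformer(method="yeo-johnson"), RobustScaler(),
               StandardScaler(), Normalizer(), MinMaxScaler(), MaxAbsScaler(),
               QuantileTransformer(output_distribution="normal", n_quantiles=100),
               QuantileTransformer(output_distribution="uniform", n_quantiles=100)]

    for model in models:
        for scaler in scalers:
            pipeline = make_pipeline(scaler, model)
            scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
            print(f"{type(scaler).__name__:>20} + {type(model).__name__:<22}"
                  f" R2 = {scores.mean():.3f}")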
D Data Set
Figure 7 shows an example of a data set used when creating the expected output. Each row contains the meteorological data for one hour, together with the columns added in the preprocessing steps.
Figure 7 Example of what a data set for a PV system can look like
E Decrease Tables
Tables for the 40% and 60% decreases are presented below. There are four tables for each decrease, one for each time horizon: 21, 14, 7 and 3 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
20            37              5                0                21.000
25            39              3                0                21.000
30            41              1                0                21.000
35            42              0                0                21.281
40            37              0                5                34.548
45            20              0                22               62.471
50            10              0                32               64.150
55            3               0                39               72.569
60            0               0                42               No Errors Found

Table 7 40% decrease with a time horizon of 21 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            40              2                0                14.328
40            37              1                4                27.704
45            26              0                16               55.585
50            17              0                25               62.554
55            7               0                35               64.256
60            2               0                40               75.146

Table 8 40% decrease with a time horizon of 14 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            36              6                0                7.405
40            36              5                1                11.664
45            32              1                9                48.510
50            25              0                17               59.322
55            14              0                28               59.393
60            7               0                35               54.798

Table 9 40% decrease with a time horizon of 7 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            15              27               0                3.431
40            15              27               0                4.714
45            16              22               4                21.112
50            17              14               11               50.100
55            19              9                14               56.818
60            22              4                16               54.591
65            24              2                16               63.566
70            14              0                28               57.066
75            5               0                37               56.008

Table 10 40% decrease with a time horizon of 3 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
20            37              5                0                21.000
25            39              3                0                21.000
30            41              1                0                21.000
35            42              0                0                21.000
40            42              0                0                21.000
45            42              0                0                21.000
50            42              0                0                21.000
55            42              0                0                21.256
60            37              0                5                36.440

Table 11 60% decrease with a time horizon of 21 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            40              2                0                14.000
40            41              1                0                14.000
45            42              0                0                14.000
50            42              0                0                14.000
55            42              0                0                14.281
60            38              0                4                27.509

Table 12 60% decrease with a time horizon of 14 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            36              6                0                7.000
40            37              5                0                7.000
45            41              1                0                7.001
50            42              0                0                7.025
55            42              0                0                7.272
60            41              0                1                11.246

Table 13 60% decrease with a time horizon of 7 days.
Threshold %   True Positive   False Positive   False Negative   Average days until error found
35            15              27               0                3.000
40            15              27               0                3.000
45            20              22               0                3.000
50            28              14               0                3.161
55            33              9                0                3.186
60            38              4                0                4.332
65            34              2                6                33.659
70            27              0                15               49.892
75            26              0                16               59.502

Table 14 60% decrease with a time horizon of 3 days.