Title: A Compositional Approach For Modelling SDG7 Indicators Author: Juan Carlos Marcillo Delgado Advisors: María Isabel Ortego and Agustí Pérez Foguet Department: Civil and Environmental Engineering. Applied Mathematics and Statistics Section. University: Universitat Politècnica de Catalunya Academic year: 2017/2018 Interuniversity Master in Statistics and Operations Research UPC-UB
Dedicated to my companion in struggle, my faithful friend
who always makes me rise from the ashes:
Jacqueline Tatiana Hidrobo Morales.
Acknowledgments
I thank God for giving me the health, the strength and, above all, the light to complete this master's degree in
Statistics. Secondly, I thank the UPC and the UB for fostering a high-quality environment and for having professors
who make learning more effective.
To my family, for their constant support. To Moncerrate, who always keeps track of my progress on my theses,
to Fátima for letting me be part of her joy, to Ángel for his wise advice, and to María and Mercedes, who have
gone beyond being sisters and are like a mother to me.
To Don Guillermo and Doña Anita, for welcoming me into their home, for providing every facility within their reach
during this process and for all the wonderful moments we shared.
Special thanks to Agustí and Maribel for admitting me into their select group of thesis students, for their
scientifically grounded discussions and, above all, for helping me make the most of this stage of learning.
Abstract
The monitoring of indicators related to the energy sector has acquired renewed interest with the 2030 agenda on
Sustainable Development Goals (SDG), specifically with goal seven, which seeks to guarantee universal access to
energy. The broad-based nature of energy has promoted the use of substantive, broadly indicative and effective
metrics that capture the different dimensions of energy access. A relevant characteristic of these indicators
is that they can be expressed as proportions or can be disaggregated from a whole, i.e., they are compositions.
Indicators of this type have their own characteristics, which make traditional multivariate models unsuitable
for them. The mathematical structure and methods developed for the treatment of compositions are known as
Compositional Data Analysis (CoDa). Following this methodology, a log-ratio transformation can be chosen to bring
these indicators into the space of real numbers, where any multivariate method can then be applied. The scope of
this master's thesis (TFM) is to apply compositional models based on an isometric log-ratio transformation to
follow up time-series indicators of the energy sector in the context of SDG7. The electricity access indicator
was selected for this purpose, and the existing dichotomy between the urban and rural sectors is considered. This
dichotomy is very important because the problem of electricity access is predominantly rural. An analysis is
presented for five countries (Bangladesh, India, Kenya, Nigeria and Sudan) belonging to the areas most affected
by the problem of electricity access, namely the Sub-Saharan region and South Asia. It is concluded that CoDa
facilitates a more controlled management of the parts that make up the indicator, especially when it comes to
making inferences outside the calibration range. Three statistical methods were used: a traditional one familiar
to most practitioners (linear regression), another based on linear predictors that involve smoothing functions of
the covariates (generalized additive model), and a third based on optimization algorithms (ε-SVM).
Keywords— SDG7, Electricity access, Compositional data analysis, Trend analysis.
Resumen (Summary)
The monitoring of indicators related to the energy sector has acquired renewed interest with the 2030 agenda on
the Sustainable Development Goals (SDG), specifically with goal seven, which seeks to guarantee universal access
to energy. The broad nature of the energy sector has promoted the use of substantive, broadly indicative and
effective indicators that capture the different dimensions of energy access. A relevant characteristic of these
indicators is that they can be expressed as proportions or can be disaggregated from a whole, that is, they are
compositions. Indicators of this type have their own characteristics which, among other things, make the
implementation of traditional multivariate models unsuitable. The mathematical structure and methods developed
for the treatment of compositions are called Compositional Data Analysis (CoDa). Following this methodology, a
log-ratio transformation must be chosen that allows these indicators to be used in the space of real numbers, so
that any multivariate method can be applied. The aim of this master's thesis (TFM) is to apply compositional
models based on an isometric log-ratio transformation to follow up time-series indicators of the energy sector in
the context of SDG 7. The electricity access indicator was selected for this purpose, considering the existing
dichotomy between the urban and rural sectors. This dichotomy is very important because the problem of access to
electricity is predominantly rural. An analysis was carried out for five countries (Bangladesh, India, Kenya,
Nigeria and Sudan), which belong to the areas most affected by the problem of access to electricity, namely
Sub-Saharan Africa and South Asia. It is concluded that CoDa facilitates a more controlled management of the parts
that make up the indicator, especially when making inferences outside the calibration range. Three statistical
methods were used: a traditional one familiar to most practitioners (linear regression), another based on linear
predictors that involve smoothing functions of the covariates (generalized additive model), and another based on
optimization algorithms (ε-SVM).
Keywords: SDG 7, Access to electricity, Compositional data analysis, Trend analysis.
2.2 Contextualization of the universal energy access problem
There is a huge disparity in energy use. Roughly, the poorest three-quarters of the world's population use only
10% of the world's energy (Bazilian et al., 2010b), and generally 90% of that energy goes to heating and cooking,
with the remaining 10% going to lighting and entertainment needs (Bhattacharyya and Ohiare, 2012). Most of the
people affected by a lack of energy access are concentrated in Sub-Saharan Africa, South Asia and some other
developing countries, with India having the largest population without electricity access.
Recent metrics suggest that 1,060 million people worldwide lack access to the electricity grid, and more than
two-thirds of them are concentrated in 12 countries (Bhattacharyya, 2006). Most of this unelectrified population
resides in Sub-Saharan Africa (55%) and in developing countries of Asia (41%), with India accounting for 23% of
the people living without adequate energy access (Cozzi et al., 2017).
In addition, the precariousness of energy use should be highlighted as a factor associated with low rates of
electricity consumption. For instance, grid capacity problems are typical in countries like Bangladesh, especially
at peak load times (Groh et al., 2016). Furthermore, developing countries depend strongly on fossil fuels for
electricity generation (Magnani and Vaona, 2016), which leads, among other problems, to those related to health
and the environment.
Chapter 2. Literature Review
On the cooking fuels side, 2,500 million people in the world rely on biomass: 66% of them live in developing
countries of Asia, most notably India (31%) and China (12%), and 33% live in Sub-Saharan Africa. Together these
regions represent 99% of world biomass consumption (Cozzi et al., 2017).
Although Sub-Saharan Africa and South Asia are the regions most affected by the lack of energy access, the
dimensions of the problem differ between them. For instance, solid fuels predominate for cooking purposes in
Sub-Saharan Africa. Coal and charcoal are used more in China (and to some extent in India), while gas (LPG, biogas
and natural gas) is used more in developing countries outside Sub-Saharan Africa (Bhattacharyya and Ohiare, 2012).
On the other hand, one of the greatest challenges to achieving full electrification is the dichotomy between rural
and urban areas (Mensah et al., 2014). In fact, some authors consider energy access to be predominantly a rural
problem (Bhattacharyya and Ohiare, 2012). For instance, in Sub-Saharan Africa 23% of the rural population has
electricity access, compared with 74% of the urban population. A study conducted by Doll and Pachauri (2010)
showed that one cause of the delay in the electrification process in Sub-Saharan Africa is that large populations
live at low densities.
The lack of modern energy makes it common to adopt traditional sources in rural areas. In developing countries,
kerosene lamps are very commonly used to meet household lighting demand. Maintaining this type of energy
alternative in the long term, however, leads to stagnation in people's quality of life. For instance, households
do not obtain sufficient lighting for studying at night, which is one of the obstacles to achieving higher
educational attainment (Kanagawa and Nakata, 2008).
It is worth noting that poverty is one of the predominant characteristics of regions without access to electricity
and the major obstacle to ensuring universal energy access. In the case of India, energy provision has two
dominant characteristics: (1) a strong public sector presence and (2) the prevalence of excessive subsidies and
cross-subsidies (Bhattacharyya, 2006), an arrangement that would represent a great barrier for low-income countries.
Finally, to achieve universal energy access by 2030, the experience of countries that have obtained good results
in their energy supply process could be considered. Thailand went from 25% to 100% electricity access in less than
a decade (Pachauri et al., 2012). China has successfully developed rural energy diversification projects over the
last few years and achieved the great feat of almost 100% power supply, and grid extension has emerged as the
preferred mode of electrification in almost all successful cases (Bhattacharyya and Ohiare, 2012).
2.3 The context of the indicators for measuring universal energy access
The obstacles to widespread electricity access are largely well known. However, one of the main concerns of the
2030 agenda has been the availability of indicators to monitor this target, which is why the availability,
versatility and sources of this type of indicator have been widely discussed (Mensah et al., 2014; Hailu, 2012).
In general an indicator is a quantitative or a qualitative measure derived from a series of observed facts that can
reveal relative positions (e.g. of a country) in a given area. Indicators are useful in identifying trends and drawing
attention to particular issues. They can also be helpful in setting policy priorities and in benchmarking or monitoring
performance (Commission et al., 2008).
Ensuring a goal as complex as SDG7 by 2030 requires a robust information base that serves as a support tool for
tracking and comparing objectives (Nussbaumer et al., 2012). Moreover, the broad-based nature of energy requires
energy access measures to be substantive, broadly indicative and effective in capturing the different dimensions
of energy access (Hailu, 2012; Groh et al., 2016; Mensah et al., 2014).
Thereby, within the existing debate on energy access indicators, many researchers highlight the importance of
addressing the multidimensional nature of energy access. For instance, Nussbaumer et al. (2012), in their
discussion of uni- versus multi-dimensionality, stress that complex issues such as human development are
multidimensional in their very nature and that their assessment therefore requires a framework in which various
elements can be captured.
Despite the importance of more disaggregated indicators, the planning scenarios and forecasting methods used in
energy access are usually simple and focus primarily on one metric, namely installed generation capacity (Bazilian
et al., 2012) or, in more specialized language, electricity demand. A planning process based on a single metric
is commonly more appealing from a communication viewpoint, but it neglects other factors (e.g. population, energy
mix, efficiency) that complicate the background of the problem (Bazilian et al., 2010a).
Fortunately, with the implementation of SDG7, many efforts have been directed towards the diversification of
energy indicators and the improvement of data quality. Many of these efforts come from entities such as the World
Bank or the International Energy Agency (IEA), whose indicators have the advantage of being commonly applicable
across countries and allowing cross-country comparability.
One recent advance in assessing energy access for households, productive entities and communities along several
dimensions of access is the multi-tier framework (MTF), which is used to reflect both aggregated and disaggregated
analyses as far as possible (Groh et al., 2016).
Thus, the need for disaggregated indicators, and consequently for methods that make them easier to model, becomes
clear. Achieving this will make it possible to take advantage of the richness of information that this kind of
data provides compared with the use of uni-dimensional indicators alone.
Finally, another limiting factor for the correct monitoring of access to electricity is poor and inconsistent
national statistics, which can block cross-country analysis and undermine efforts to implement global or regional
programmes. Still, a lack of data should not be used as a justification for delaying the building of national
energy planning capability and the development of energy plans (Bazilian et al., 2012).
2.4 The compositional character of universal energy access indicators
One of the main factors limiting the use of multidimensional indicators is the complexity of handling them
correctly (Hailu, 2012). For example, if a classical statistical analysis is carried out on all the parts of a
multidimensional indicator made up of proportions subject to a unit-sum constraint, it can lead to erroneous
interpretations, because the hidden influence of that constraint is neglected (Reimann et al., 2017). Fortunately,
there have been significant advances in the handling of multidimensional indicators, especially those that can be
expressed as proportions or disaggregated from a total.
Aitchison (1986) recognized that multidimensional measurements expressed mathematically as vectors of proportions
play an important role in many disciplines and often display appreciable variability from vector to vector. He
developed a simple and appropriate statistical methodology for the investigation and interpretation of this kind
of data. This branch of statistics is known as compositional data analysis (CoDa).
Aitchison (1986) uses the term composition to refer to any vector x with non-negative elements X1, ..., Xn
representing proportions of some whole, and each element Xi is called a component. One of the contributions of
this methodology is the incorporation of the unit-sum constraint into the statistical modelling, which eliminates
distortion or doubt about inference on compositional indicators.
Another advantage of CoDa is that it makes it possible to apply classical statistical analyses (e.g. linear
regression, principal component analysis), which are not suitable for raw compositional data. This is a clear
advantage given the large number of methods available for normally distributed multivariate phenomena and their
robustness. The only prerequisite is to work on data transformed with a log-ratio approach and to back-transform
the results (Pawlowsky-Glahn and Egozcue, 2001).
Among the usual log-ratio transformations, the first to be developed was the additive log-ratio (alr), which
consists of choosing an arbitrary component as the divisor (a problem that is more conceptual than practical). To
avoid this arbitrariness, the composition is divided by its geometric mean, which gives the centered log-ratio
(clr); its disadvantage is that the clr covariance matrix is singular. Later, the recognition that compositions
can be represented in coordinates with respect to orthonormal bases (isometric log-ratio, ilr) made it possible to
avoid both the arbitrariness of the alr and the singularity of the clr (Pawlowsky-Glahn and Buccianti, 2011).
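The three transformations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-part composition is made up, the last part is taken as the alr divisor, and the ilr uses one particular orthonormal (pivot-type) basis among the many valid ones; it is not necessarily the basis used later in this thesis.

```python
import math

def alr(x):
    """Additive log-ratio: log of each part over the last part (an arbitrary divisor)."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean of all parts."""
    log_gm = sum(math.log(xi) for xi in x) / len(x)
    return [math.log(xi) - log_gm for xi in x]

def ilr(x):
    """Isometric log-ratio in a pivot-type orthonormal basis (one possible choice)."""
    logs = [math.log(xi) for xi in x]
    coords = []
    for i in range(1, len(x)):
        mean_first = sum(logs[:i]) / i  # log geometric mean of the first i parts
        coords.append(math.sqrt(i / (i + 1)) * (mean_first - logs[i]))
    return coords

# Hypothetical four-part composition (parts sum to 1); values are illustrative only.
x = [0.27, 0.35, 0.03, 0.35]

alr_x = alr(x)  # D - 1 coordinates, depends on the chosen divisor
clr_x = clr(x)  # D coefficients that always sum to zero (hence a singular covariance)
ilr_x = ilr(x)  # D - 1 orthonormal coordinates

clr_sum = sum(clr_x)
clr_norm2 = sum(v * v for v in clr_x)
ilr_norm2 = sum(v * v for v in ilr_x)
```

The clr coefficients sum to zero, which is the source of the singular covariance matrix, while the ilr coordinates have the same squared norm as the clr vector, which is what "isometric" refers to.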
Thus, CoDa has been the source of many discussions in practice owing to the enormous importance of compositional
data in the applied sciences (Pawlowsky-Glahn and Egozcue, 2001). Compositional analysis has gained ground in
research, especially in studies of the chemical composition of rocks and sediments at different depths (Flood et
al., 2016), which are compositional fields of study by nature (Reimann et al., 2017). However, one of Aitchison's
major concerns was related to household surveys, especially the study of variables such as the composition of
expenditure and consumer demand, including fuel and light consumption (Aitchison, 1986).
In research on energy access there are many compositional indicators. One is the total primary energy (TPE) used
by Parajuli et al. (2014), which is the sum of the residential, commercial, transport, agricultural and other
primary energies. The simple split between urban and rural population with access to electricity is already a
composition. The multi-tier framework (MTF) that assesses energy access for households uses fractional
measurements between tier 0 and 1 (Groh et al., 2016) and is therefore a compositional indicator.
It is worth noting that, although compositional data analysis is a practical approach to handling measures of
proportions, there is considerable reluctance to adopt it in the monitoring of energy access indicators, either
because of resistance to new theories or because the multidimensional character of energy access has only begun
to be taken into account with the 2030 agenda.
2.5 Review of compositional statistical models related to Universal Energy Access
This section reviews the literature on statistical models related to universal access to energy in which the
compositional character of this field of study has been considered. The main objective is to highlight how
compositional indicators are becoming relevant with the 2030 agenda and how the compositional data analysis
methodology, if considered, can contribute to the robustness of these models.
The first model under discussion is an econometric model proposed by Panos et al. (2016) to estimate electricity
access (% of total population) based on an ordinary least squares linear regression with the following covariates:
a) the percentage of the population living on less than $2 per day (poverty covariate), b) the population
urbanization rate (urbanization rate) and c) the average electricity per capita in the residential sector.
This is an interesting example because most of the variables are proportions (i.e. compositions). The authors,
although not working with compositional methods, are aware that the dependent variable has a compositional
structure. To deal with this peculiarity, the variable was transformed to the logit form
ln(electricity access/(1 − electricity access)), which is a valid transformation from the CoDa point of view,
since it changes the scale and considers both parts (access and no access to electricity).
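To make the link between the logit form and CoDa concrete, the sketch below computes the logit of a share and the isometric log-ratio coordinate of the two-part composition (access, no access); the two differ only by a constant factor of the square root of 2. The share value is hypothetical and the function names are illustrative.

```python
import math

def logit(p):
    """Logit of a share p in (0, 1): ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inv_logit(y):
    """Back-transform a logit value to a share in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def ilr_two_parts(p):
    """ilr coordinate of the two-part composition (p, 1 - p)."""
    return math.log(p / (1.0 - p)) / math.sqrt(2.0)

access = 0.62  # hypothetical share of population with electricity access

y = logit(access)
z = ilr_two_parts(access)
recovered = inv_logit(y)  # the back-transformation returns the original share
```

Because the two coordinates are proportional, a linear model fitted on the logit is equivalent, up to a rescaling of its coefficients, to one fitted on the two-part ilr coordinate.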
Parajuli et al. (2014) build a more complex model that mixes several compositions, using Cobb-Douglas log-linear
models to project primary energy consumption in Nepal. This model resembles the one proposed in this project
because it uses a composition as the response variable, i.e., it consists of several models that explain each
component of total primary energy (residential energy, commercial energy, energy in transport and energy in the
agricultural sector). In addition, some of its covariates include other compositions, such as the disaggregation
of GDP by sector (commercial, agricultural, industrial) and of total population (urban and rural).
If the compositional part were neglected, the model would reduce to a linear regression with estimation problems.
Approaching it from a compositional viewpoint is an important challenge because three compositions must be handled
at the same time. This is not a problem in itself for CoDa, but in this model control over the unit-sum constraint
is lost for all three compositions, which raises many doubts when making inferences because the components are
unconstrained. The use of the natural logarithm as a transformation for urban and rural population partly
mitigates this, but interpretations must be made with great caution, since when one of the parts increases the
other must decrease.
On the other hand, Magnani and Vaona (2016) estimated a panel data model in which a linear specification was
preferred to a log-linear one, so as not to constrain the elasticity of the dependent variable with respect to the
independent ones to be constant throughout the sample. The percentage of the population with access to electricity
was used as the dependent variable. Five models were estimated. In this example the authors chose to include
either the urban or the rural population as an explanatory variable, but not both at the same time.
In this case, the compositional character of the variables is omitted entirely. According to Van den Boogaart and
Tolosana-Delgado (2013), most multivariate methods developed for real-valued data give misleading results for
compositional data and can lead to spurious correlations. In this example the correlation structure between
electricity access, urban population and rural population gave negative values, contradicting the usual
interpretations of correlation and covariance, among them that independence is usually associated with zero
correlation. Moreover, compositional researchers are very critical of omitting one of the parts of the whole,
because compositional indicators can then no longer be modelled and interpreted correctly.
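The negative bias mentioned above can be illustrated with a small simulation. The sketch below draws independent positive amounts, closes each observation to unit sum and computes the covariance matrix of the resulting proportions. All values are simulated, and the seed, sample size and distribution are arbitrary assumptions made only for illustration.

```python
import random

def closure(x, k=1.0):
    """Rescale positive amounts so that the parts sum to k."""
    total = sum(x)
    return [k * xi / total for xi in x]

def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Independent positive amounts for three parts (simulated, not real data).
amounts = [[rng.lognormvariate(0.0, 0.5) for _ in range(3)] for _ in range(500)]

# Closing each row forces the parts to sum to 1, which couples them.
closed = [closure(row) for row in amounts]
columns = list(zip(*closed))

cov = [[covariance(ci, cj) for cj in columns] for ci in columns]
row_sums = [sum(row) for row in cov]
```

Each row of the covariance matrix sums to zero because the total is constant, so every part must have at least one negative covariance with another part even though the underlying amounts were generated independently.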
As highlighted in this section, a series of statistical models have been developed in connection with the 2030
agenda using well-known statistical tools such as linear regression estimated by ordinary least squares. However,
there is no evidence of models that use CoDa as a tool to help improve statistical robustness.
Besides OLS, the models used in this thesis include Generalized Additive Models (GAM) and Support Vector Machines
(SVM). There is no evidence of SVM being used with CoDa, although SVM is widely used in the energy field (Suganthi
and Samuel, 2012; Ekonomou, 2010), especially to predict energy demand from a one-dimensional approach.
Regarding GAM, a recent study by Perez-Foguet et al. (2017) on goal six of the SDG (Ensure availability and
sustainable management of water and sanitation for all), in which linear regression was also used, showed that
CoDa is a useful tool that can help improve temporal interpolations in trend models. That study serves as a
reference for the present one.
Chapter 3
Research Methodology
3.1 Unit of Analysis
This study can be replicated for most countries included in the electricity access database on the World Bank
website. For the purposes of this thesis, however, only the most representative ones were selected. Specifically,
countries were chosen from the areas with the greatest energy access problems in the world, namely the Sub-Saharan
region and South Asia, taking into account other aspects such as the existence of enough data to estimate a
statistical model. In general, most countries had information available from 1990 to 2014.
These countries are Bangladesh, India, Kenya, Nigeria and Sudan. As shown in figure 3.1.1, nine countries account
for around 63% of the world's population with electricity access problems, and the selected countries account for
around 50%. It is worth noting, however, that the models developed here can be replicated for the 212 countries in
the World Bank database, which is the source of the data.
3.2 Study Variables
3.2.1 World Bank methodology for collecting electricity access data
Data for monitoring access to electricity are collected from different sources, mostly nationally representative
household surveys (including national censuses). Survey sources include Demographic and Health Surveys (DHS),
Living Standards Measurement Surveys (LSMS), Multi-Indicator Cluster Surveys (MICS), the World Health Survey
(WHS), other nationally developed and implemented surveys, and various government agencies (for example,
ministries of energy and utilities) (WorldBank, 2017).
Given the low frequency and the regional distribution of some surveys, a number of countries have gaps in the
available data. To develop the historical evolution and starting point of electrification rates, a simple
modelling approach was adopted to fill in the missing data points around 1990, around 2000 and around 2010.
Therefore, a country can have anywhere from zero to three data points (WorldBank, 2017).

Figure 3.1.1: Countries with large populations without electricity access in 2010. Source: WorldBank (2017). The
colour palette (scale from 1.89 to 25.7) shows how the 63% of the world's population without access to electricity
is distributed across nine countries: Bangladesh, Congo, Ethiopia, India, Kenya, Myanmar, Nigeria, Sudan and
Uganda. The other 37% is in the rest of the world. Countries closer to green have a smaller population without
electricity than those closer to wine red, especially India, which concentrates 25.7% of the population without
electricity access.
For the 42 countries with no data points, the weighted regional average was used as an estimate of electrification
in each of the data periods. For the 170 countries with between one and three data points, missing data are
estimated using a model with region, country and time variables (ibid.).
The model keeps the original observation if data are available for any of the time periods. This modelling
approach allowed the estimation of electrification rates for 212 countries over these three time periods
(indicated as "Estimate"). The notation "Assumption" refers to the assumption of universal access in countries
classified as developed by the United Nations (ibid.).
3.2.2 Dependent variable
In the present study the dependent variable, or variable to be explained, is the composition of electricity access
(with access, without access) disaggregated by sector (urban, rural); in total it is a composition of four parts.
Each part is described below, with the pseudonym assigned to it in this study given in parentheses:
x1 : Urban population with electricity access (urban).
x2 : Rural population with electricity access (rural).
x3 : Urban population without electricity access (nourban).
x4 : Rural population without electricity access (norural).
The variables used by the World Bank for the construction of this indicator were:
A : Access to electricity, urban (% of urban population)
B : Access to electricity, rural (% of rural population)
C : Urban population (% of total)
D : Rural population (% of total population)
Table 3.2.1 shows how the dependent variable was created from the World Bank variables A to D:
Table 3.2.1: Formulas used for establishing the composition of dependent variable.
Component   Pseudonym   Formula
x1          urban       (A · C) / (100 · 100)
x2          rural       (B · D) / (100 · 100)
x3          nourban     C − urban
x4          norural     D − rural
For readers interested in the total population, the corresponding variable is registered in the World Bank
database under the name Population, total.
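The formulas of table 3.2.1 can be sketched as a small Python function. Two points are assumptions rather than statements from the source: the input values are hypothetical, and C and D are divided by 100 before subtracting so that the four parts are on the same scale as x1 and x2 and sum to one when C + D = 100.

```python
def electricity_composition(a, b, c, d):
    """Build the four-part composition from World Bank percentages (0 to 100).

    a: access to electricity, urban (% of urban population)
    b: access to electricity, rural (% of rural population)
    c: urban population (% of total)
    d: rural population (% of total)
    """
    urban = a * c / (100.0 * 100.0)
    rural = b * d / (100.0 * 100.0)
    # Assumption: population shares are rescaled to fractions before subtracting.
    nourban = c / 100.0 - urban
    norural = d / 100.0 - rural
    return {"urban": urban, "rural": rural, "nourban": nourban, "norural": norural}

# Hypothetical values for one country and year (not taken from the database).
parts = electricity_composition(a=90.0, b=50.0, c=30.0, d=70.0)
total = sum(parts.values())
```

With these illustrative inputs the parts are 0.27, 0.35, 0.03 and 0.35, so the composition is closed to one.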
3.2.3 Independent variables
Since the present study focuses on representing the trend of the response variable, the explanatory variable is
time. In addition, the World Bank database contains a variable called Access to electricity (% of population),
which in terms of table 3.2.1 corresponds to the sum x1 + x2.
This variable gives rise to a contrast variable z, which reflects the agreement between Access to electricity (%
of population) and the variables Access to electricity, urban (% of urban population) and Access to electricity,
rural (% of rural population). This check reveals the formation of two subseries within the components of the
dependent variable. The binary explanatory variable z is defined by equation 3.1, where:
x1 : Urban population with electricity access (urban).
x2 : Rural population with electricity access (rural).
T : Access to electricity (% of population).
\[
z =
\begin{cases}
0, & \text{if } T = x_1 + x_2 \\
1, & \text{otherwise}
\end{cases}
\tag{3.1}
\]
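A minimal sketch of equation 3.1 in Python follows. It assumes that T is expressed on the same scale as x1 + x2 (a fraction of the total population), and it replaces the exact equality with a small tolerance to absorb rounding in published figures; the tolerance value and the function name are hypothetical and are not taken from the source.

```python
def contrast_z(total_access, urban, rural, tol=1e-6):
    """Return 0 when T agrees with x1 + x2 (within tol), and 1 otherwise."""
    return 0 if abs(total_access - (urban + rural)) <= tol else 1

# Hypothetical observations: the first is consistent, the second is not.
z_consistent = contrast_z(total_access=0.62, urban=0.27, rural=0.35)
z_inconsistent = contrast_z(total_access=0.60, urban=0.27, rural=0.35)
```

In practice the tolerance would need to reflect the rounding of the published series, which is a modelling choice rather than part of equation 3.1.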
3.3 Statistical Analysis of Compositional Data
This section describes the aspects of CoDa that are relevant to this thesis. The first subsection clarifies that
compositions are portions of a total. The second, in more mathematical terms, gives a clear view of what a
composition is and introduces the reader to CoDa terminology. The third presents the three basic principles on
which CoDa is based.
Next, the vector space structure used in CoDa is introduced. Compositional observations are then shown in real
space, together with the need to transform the data, for example with an isometric log-ratio (the transformation
used in the results chapter), including a very simple method for this procedure, the SBP method. Finally, the
principle of working in coordinates is introduced, which allows any standard statistical process to be applied.
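The principle of working in coordinates can be sketched with a simplified trend model. The example below is not the thesis's four-part model: it uses a two-part composition (access, no access), made-up yearly shares and an ordinary least squares line, and compares extrapolation on the raw share with extrapolation on the ilr coordinate followed by back-transformation.

```python
import math

def ols_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def ilr2(p):
    """ilr coordinate of the two-part composition (p, 1 - p)."""
    return math.log(p / (1.0 - p)) / math.sqrt(2.0)

def ilr2_inv(z):
    """Back-transform an ilr coordinate to the share p."""
    return 1.0 / (1.0 + math.exp(-math.sqrt(2.0) * z))

# Made-up calibration data: years since the start of the series and access shares.
years = [0, 1, 2, 3, 4]
share = [0.60, 0.70, 0.80, 0.88, 0.94]
horizon = 10  # extrapolation point outside the calibration range

# Trend fitted directly on the raw share.
b0, b1 = ols_line(years, share)
raw_forecast = b0 + b1 * horizon

# Trend fitted on the ilr coordinate, then back-transformed.
c0, c1 = ols_line(years, [ilr2(p) for p in share])
coda_forecast = ilr2_inv(c0 + c1 * horizon)
```

With these illustrative numbers the raw extrapolation exceeds 1, which is not a valid share, whereas the back-transformed value always stays strictly between 0 and 1.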
This section is based on the book by Pawlowsky-Glahn et al. (2015), except where another author is explicitly
cited.
3.3.1 Compositions are portions of a total
A dataset is called compositional if it provides portions of a total. The individual parts of the composition are
called components. Each component has an amount, representing its importance within the whole. Amounts can be
measured as absolute values, in amount-type physical values like money, time, volume, mass, energy, molecules,
individuals, and events (Van den Boogaart and Tolosana-Delgado, 2013).
The sum of the amounts of all components is called the total amount or, for short, the total. Portions are the
individual amounts divided by this total amount. Depending on the unit chosen for the amounts, the actual portions
of the parts in a total can be different (Van den Boogaart and Tolosana-Delgado, 2013).
3.3.2 Basic concepts
This section discusses some fundamental concepts of CoDa, all of which help to understand our response variable. For instance, the first definition, applied to our study, clarifies that our response variable, access to electricity, is a composition of D = 4 parts (urban, rural, nourban, norural).
Definition 3.1. (D-part composition).
A (row) vector, x = [x1, x2, ..., xD], is a D-part composition when all its components are strictly positive real
numbers and carry only relative information.
The second definition clarifies that, within our study, some compositions may be multiples of others.
Definition 3.2. (Compositions as equivalence classes).
Two vectors of D positive real components $x, y \in \mathbb{R}^D_+$ ($x_i, y_i > 0$ for all $i = 1, 2, \ldots, D$) are compositionally equivalent if there exists a positive constant $\lambda \in \mathbb{R}_+$ such that $x = \lambda \cdot y$.
Once the closure operation is defined, it becomes evident that, in the present study, the four-part composition, electricity access, is a closed composition whose parts always add up to K = 1.
Definition 3.3. (Closure).
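For any vector of D strictly positive real components,
\[
z = [z_1, z_2, \ldots, z_D] \in \mathbb{R}^D_+, \quad z_i > 0 \;\; \forall i = 1, 2, \ldots, D,
\]
the closure of z to K > 0 is defined as
\[
C(z) = \left[ \frac{K \cdot z_1}{\sum_{i=1}^{D} z_i}, \frac{K \cdot z_2}{\sum_{i=1}^{D} z_i}, \ldots, \frac{K \cdot z_D}{\sum_{i=1}^{D} z_i} \right]
\]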
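The closure operation of Definition 3.3 can be sketched in a few lines of base R. The function name `closure` and the example amounts are assumptions made for illustration; only the part names (urban, rural, nourban, norural) come from the text.

```r
# Sketch of Definition 3.3: rescale a strictly positive vector so that it sums to K.
closure <- function(z, K = 1) {
  stopifnot(is.numeric(z), all(z > 0), K > 0)
  K * z / sum(z)
}

# Hypothetical amounts for the four parts of the electricity access composition
parts <- c(urban = 30, rural = 20, nourban = 10, norural = 40)

comp <- closure(parts)       # closed to K = 1
print(comp)
print(closure(2 * parts))    # a multiple of the same vector closes to the same composition
```

The second call illustrates Definition 3.2: compositionally equivalent vectors have the same closed representative.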
Additionally, the composition of each modelled country lies in a sample space, the simplex:
Definition 3.4. (Sample space).
The sample space of compositional data is the simplex,
\[
S^D = \left\{ x = [x_1, x_2, \ldots, x_D] \;\middle|\; x_i > 0, \; i = 1, 2, \ldots, D; \; \sum_{i=1}^{D} x_i = K \right\}
\]
Furthermore, if only the population with access to electricity were analysed, leaving aside the population without access, the analysis would cover only part of the proposed whole, that is, a subcomposition. This definition also makes clear that every composition can itself be regarded as a subcomposition, since the dependent variable could be disaggregated into further parts.
Definition 3.5. (Subcomposition).
Given a composition $x$ and a selection of indices $S = \{i_1, \ldots, i_S\}$, a subcomposition $x_S$, with $S$ parts, is obtained by applying the closure operation to the subvector $[x_{i_1}, x_{i_2}, \ldots, x_{i_S}]$ of $x$. The set of subscripts $S$ indicates which parts are selected in the subcomposition, not necessarily the first $S$ ones.
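A subcomposition as in Definition 3.5 can be sketched in base R as the closure of the selected parts. The helper names `closure` and `subcomposition` and the numerical values are assumptions made for illustration.

```r
# Closure to K (Definition 3.3)
closure <- function(z, K = 1) K * z / sum(z)

# Sketch of Definition 3.5: select some parts and re-close them
subcomposition <- function(x, idx, K = 1) closure(x[idx], K)

# Hypothetical closed composition of the four parts used in the text
x <- c(urban = 0.3, rural = 0.2, nourban = 0.1, norural = 0.4)

# Keep only the population with access to electricity
x_access <- subcomposition(x, c("urban", "rural"))
print(x_access)

# The ratio between the selected parts is the same before and after
print(c(original = x[["urban"]] / x[["rural"]],
        subcomp  = x_access[["urban"]] / x_access[["rural"]]))
```

The unchanged ratio is the property that subcompositional coherence relies on.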
Finally, each time the parts of an indicator are added together, an amalgamation is performed, which means that part of the richness of the indicator is lost.
Definition 3.6. (Amalgamation).
Given a composition $x \in S^D$, a selection of $a$ indices $A = \{i_1, \ldots, i_a\}$ (not necessarily the first ones), $D - a \geq 1$, and the set of remaining indices $\bar{A}$, the value
\[
x_A = \sum_{j \in A} x_j
\]
is called amalgamated part or amalgamated component. The vector $x' = [x_{\bar{A}}, x_A]$, containing the components with subscript in $\bar{A}$ grouped in $x_{\bar{A}}$ and the amalgamated component $x_A$, is called amalgamated composition, which is in $S^{D-a+1}$.
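The amalgamation of Definition 3.6 can be sketched in base R as follows. The function name `amalgamate`, the label `amalgamated` and the example values are assumptions made for illustration; `idx` is taken to be a vector of integer positions.

```r
# Sketch of Definition 3.6: sum the parts indexed by idx into one amalgamated part
# and keep the remaining parts unchanged.
amalgamate <- function(x, idx) {
  keep <- setdiff(seq_along(x), idx)     # remaining indices
  c(x[keep], amalgamated = sum(x[idx]))  # remaining parts followed by the amalgamated part
}

# Hypothetical closed composition of the four parts used in the text
x <- c(urban = 0.3, rural = 0.2, nourban = 0.1, norural = 0.4)

# Amalgamate urban and rural access into a single "with electricity" part
x_prime <- amalgamate(x, c(1, 2))
print(x_prime)
```

With D = 4 and a = 2 the result has D - a + 1 = 3 parts and still sums to 1, but the urban/rural detail can no longer be recovered from it.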
3.3.3 Principles of compositional analysis
The principles presented in this section are the basis of a robust compositional data analysis, and CoDa is a useful tool precisely because it satisfies these three fundamental principles:
Scale invariance
In the absence of information about the total (total power production or mass of sediment), it is highly reasonable
to expect analyses to yield the same results, in whichever way that total evolved. This is known as scale invariance
(Aitchison, 1986).
Permutation invariance
A function is permutation invariant if it yields equivalent results when the ordering of the parts in the composition is changed. As a small example, working with the composition [A, B, C] should give the same results as working with [B, A, C].
Subcompositional coherence
Subcompositional coherence can be practically summarized as: (i) distances between two compositions are equal
or decrease when subcompositions of the original ones are considered; (ii) scale invariance of the results is pre-
served within arbitrary subcompositions, that is, the ratios between any parts in the subcomposition are equal to the
corresponding ratios in the original composition.
3.3.4 Vector space structure
This section describes the basic operations that give the simplex a vector space structure. The symbol ⊕ is used in place of + and ⊙ in place of ·. Both operations use the closure operation C defined in Section 3.3.2:
Definition 3.7. (Perturbation).
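Perturbation of $x \in S^D$ by $y \in S^D$,
\[
x \oplus y = C[x_1 y_1, x_2 y_2, \ldots, x_D y_D] \in S^D
\]
Definition 3.8. (Powering).
Power transformation or powering of $x \in S^D$ by a constant $\alpha \in \mathbb{R}$,
\[
\alpha \odot x = C[x_1^{\alpha}, x_2^{\alpha}, \ldots, x_D^{\alpha}] \in S^D
\]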
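Perturbation and powering (Definitions 3.7 and 3.8) can be sketched in base R on top of the closure operation. The function names `closure`, `perturb` and `powering` and the example composition are assumptions made for illustration.

```r
# Closure to K (Definition 3.3)
closure <- function(z, K = 1) K * z / sum(z)

# Sketch of Definition 3.7: perturbation, the "addition" of the simplex
perturb <- function(x, y) closure(x * y)

# Sketch of Definition 3.8: powering, the "scalar multiplication" of the simplex
powering <- function(alpha, x) closure(x^alpha)

x       <- c(0.3, 0.2, 0.1, 0.4)  # hypothetical composition
neutral <- rep(1 / 4, 4)          # composition with all parts equal

print(perturb(x, neutral))          # perturbing by the neutral element returns x
print(powering(0, x))               # powering by 0 gives the neutral element
print(perturb(x, powering(-1, x)))  # x perturbed by its inverse gives the neutral element
```

These three checks mirror the roles that zero, multiplication by zero and the additive inverse play in an ordinary real vector space.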
3.3.5 Compositional observations in real space
Compositions in $S^D$ are usually expressed in terms of the canonical basis of $\mathbb{R}^D$, $\{e_1, e_2, \ldots, e_D\}$. In fact, any
Figure 4.5.2: Forecasting of electricity access in Nigeria by 2030.
Chapter 5
Conclusion
Based on the results detailed in Section 4.4, it is confirmed that, in order to model trends in SDG7 compositional indicators, it is advisable to use LM-ilr or GAM-ilr rather than standard models (including GAM or SVM). The main argument is that CoDa allows a more controlled handling of the parts that make up the indicator, especially when making inferences outside the calibration range.
The detailed analysis of the World Bank electricity access indicator confirms that the data series include, in some cases, two subseries characterized by different temporal evolution coefficients. This differentiation was possible through validation against another World Bank indicator, Electricity access (% of population total), and the two parts of the proposed indicator related to electricity access in the urban and rural sectors. This very likely reflects the fact that there were gaps in the available data during data collection and that a simple modelling approach was adopted to fill in the missing data points, as detailed in the research methodology chapter.
On the other hand, electricity access in the rural and urban sectors is handled as two separate indicators, Access to electricity, urban (% of urban population) and Access to electricity, rural (% of rural population). Managing a multidimensional indicator in this way is very difficult and neglects the sum of the whole. Therefore, the use of CoDa is recommended to improve the management of this type of indicator. CoDa has made great progress in imputation techniques and in the handling of missing values; implementing this recommendation would greatly benefit this indicator.
Based on the RMSE within the calibration range, the GAM model provides a better fit. It is worth mentioning that these differences are minimal; depending on the desired accuracy or on the intended focus of the analysis, a linear regression model with CoDa can be an excellent option, since it is very illustrative and most people are familiar with this statistical model.
Based on the RMSE outside the calibration range, for the predictions of the last six observations after the year 2008, and on inspection of the predictions to 2030, GAM or SVM can be considered for predicting the temporal trend of electricity access. The conclusion about SVM can be seen in Figure 4.3.1, where SVM is the model that predicts best. However, these differences are in some cases almost nil and, as displayed in Figure 4.5.1, GAM can lead to very stable predictions.
Additionally, considering the two subseries identified, in most cases a better estimate is achieved with GAM for the subseries that agrees with the indicator Electricity access (% of total population), i.e. class=same, which is useful when this subseries is the one of primary interest.
Finally, the present work took as its unit of analysis five countries that are representative of the electricity access problem, but the analysis is easily replicable for the 212 countries in the World Bank database: all that is needed is to choose an orthonormal basis, transform the compositional data and run the model.
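The three steps named above (choose an orthonormal basis, transform the data, run the model) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: the sign matrix `sbp` is a hypothetical sequential binary partition, the helper names `sbp_to_basis`, `ilr` and `ilr_inv` are introduced here, and the data are synthetic.

```r
closure <- function(z, K = 1) K * z / sum(z)

# Step 1: orthonormal basis built from a (hypothetical) sequential binary partition.
# Each row of sbp codes one partition: +1 and -1 mark the two groups, 0 marks parts left out.
sbp <- rbind(b1 = c(urban = 1, rural =  1, nourban = -1, norural = -1),
             b2 = c(urban = 1, rural = -1, nourban =  0, norural =  0),
             b3 = c(urban = 0, rural =  0, nourban =  1, norural = -1))

sbp_to_basis <- function(sbp) {
  t(apply(sbp, 1, function(s) {
    r <- sum(s > 0)  # number of parts in the +1 group
    q <- sum(s < 0)  # number of parts in the -1 group
    ifelse(s > 0,  sqrt(q / (r * (r + q))),
    ifelse(s < 0, -sqrt(r / (q * (r + q))), 0))
  }))
}
psi <- sbp_to_basis(sbp)  # (D - 1) x D matrix with orthonormal rows that sum to zero

# Step 2: ilr coordinates and their inverse
ilr     <- function(X, psi) log(X) %*% t(psi)
ilr_inv <- function(Z, psi) t(apply(exp(Z %*% psi), 1, closure))

# Synthetic data: ten yearly observations of the four parts (rows sum to 100)
time <- 0:9
raw  <- cbind(urban = 20 + 2 * time, rural = 10 + 3 * time,
              nourban = 15 - time,   norural = 55 - 4 * time)
X <- raw / rowSums(raw)  # closed compositions
Z <- ilr(X, psi)

# Step 3: fit a linear model per ilr coordinate, predict, and back-transform
fit      <- lm(Z ~ time)
pred     <- predict(fit, newdata = data.frame(time = 10:12))
forecast <- ilr_inv(pred, psi)
print(round(forecast, 3))
```

Because the predictions are back-transformed from ilr coordinates, every forecast row is strictly positive and sums to 1, which is the behaviour outside the calibration range that the conclusions rely on.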
Bibliography
Aitchison, J. (1986). The statistical analysis of compositional data. Chapman and Hall London. 9, 17
Barclay, H., Dattler, R., Lau, K., Abdelrhim, S., Marshall, A., and Feeney, L. (2015). Sustainable Development
Goals. International Planned Parenthood Federation. 2
Bazilian, M., Nussbaumer, P., Cabraal, A., Centurelli, R., Detchon, R., Gielen, D., Rogner, H., Howells, M.,
McMahon, H., Modi, V., et al. (2010a). Measuring energy access: Supporting a global target. Earth Institute,
Columbia University, New York. 8
Bazilian, M., Nussbaumer, P., Rogner, H.-H., Brew-Hammond, A., Foster, V., Pachauri, S., Williams, E., Howells,
M., Niyongabo, P., Musaba, L., et al. (2012). Energy access scenarios to 2030 for the power sector in sub-Saharan
Africa. Utilities Policy, 20(1):1–16. 8
Bazilian, M., Sagar, A., Detchon, R., and Yumkella, K. (2010b). More heat and light. Energy Policy, 38(10):5409–
5412. 6
Bhattacharyya, S. C. (2006). Energy access problem of the poor in India: Is rural electrification a remedy? Energy
Policy, 34(18):3387–3397. 3, 6, 7
Bhattacharyya, S. C. and Ohiare, S. (2012). The Chinese electricity access model for rural electrification: Approach,
experience and lessons for others. Energy Policy, 49:676–687. 6, 7
Birol, F. et al. (2013). World energy outlook. Paris: International Energy Agency, 23(4):329. 6
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM transactions on intelli-
gent systems and technology (TIST), 2(3):27. 27
Chaurey, A., Ranganathan, M., and Mohanty, P. (2004). Electricity access for geographically disadvantaged rural
communities—technology and policy insights. Energy policy, 32(15):1693–1705. 3
Commission, J. R. C.-E. et al. (2008). Handbook on constructing composite indicators: Methodology and user
guide. 8
Cozzi, L. et al. (2017). World energy outlook special report. France International Energy Agency (IEA). 3, 6, 7
Doll, C. N. and Pachauri, S. (2010). Estimating rural populations without access to electricity in developing
countries through night-time light satellite imagery. Energy Policy, 38(10):5661–5670. 7
Ekonomou, L. (2010). Greek long-term energy consumption prediction using artificial neural networks. Energy,
35(2):512–517. 11, 37
Everitt, B. and Skrondal, A. (2010). The Cambridge dictionary of statistics. Cambridge University Press Cam-
bridge. 26
Flood, R., Bloemsma, M., Weltje, G., Barr, I., O’Rourke, S., Turner, J., and Orford, J. (2016). Compositional data
analysis of Holocene sediments from the West Bengal Sundarbans, India: Geochemical proxies for grain-size
variability in a delta environment. Applied Geochemistry, 75:222–235. 9
Groh, S., Pachauri, S., and Narasimha, R. (2016). What are we measuring? An empirical analysis of household
electricity access metrics in rural Bangladesh. Energy for Sustainable Development, 30:21–31. 6, 8, 10
Hailu, Y. G. (2012). Measuring and monitoring energy access: Decision-support tools for policymakers in Africa.
Energy Policy, 47:56–63. 8, 9
Kanagawa, M. and Nakata, T. (2008). Assessment of access to electricity and the socio-economic impacts in rural
areas of developing countries. Energy Policy, 36(6):2016–2029. 5, 7
Lantz, B. (2015). Machine learning with R. Packt Publishing Ltd. 1, 27, 28
Lesmeister, C. (2017). Mastering machine learning with R. Packt Publishing Ltd. 28, 29
Magnani, N. and Vaona, A. (2016). Access to electricity and socio-economic characteristics: Panel data evidence
at the country level. Energy, 103:447–455. 6, 11
Mensah, G. S., Kemausuor, F., and Brew-Hammond, A. (2014). Energy access indicators and trends in Ghana.
Renewable and Sustainable Energy Reviews, 30:317–323. 5, 6, 7, 8
Nussbaumer, P., Bazilian, M., and Modi, V. (2012). Measuring energy poverty: Focusing on what matters. Renew-
able and Sustainable Energy Reviews, 16(1):231–243. 8
Onyeji, I., Bazilian, M., and Nussbaumer, P. (2012). Contextualizing electricity access in sub-Saharan Africa.
Energy for Sustainable Development, 16(4):520–527. 3
Pachauri, S., Brew-Hammond, A., Barnes, D., Bouille, D., Gitonga, S., Modi, V., Prasad, G., Rath, A., and Zerrifi,
H. (2012). Energy access for development. Cambridge University Press and IIASA. 5, 7
Panos, E., Densing, M., and Volkart, K. (2016). Access to electricity in the World Energy Council’s global energy
scenarios: An outlook for developing regions until 2030. Energy Strategy Reviews, 9:28–49. 10
Parajuli, R., Østergaard, P. A., Dalgaard, T., and Pokharel, G. R. (2014). Energy consumption projection of Nepal:
An econometric approach. Renewable Energy, 63:432–444. 10
Pawlowsky-Glahn, V. and Buccianti, A. (2011). Compositional data analysis: Theory and applications. John
Wiley & Sons. 9
Pawlowsky-Glahn, V. and Egozcue, J. J. (2001). Geometric approach to statistical analysis on the simplex. Stochas-
tic Environmental Research and Risk Assessment, 15(5):384–398. 9
Pawlowsky-Glahn, V., Egozcue, J. J., and Tolosana-Delgado, R. (2015). Modeling and analysis of compositional
data. John Wiley & Sons. 15, 22
Perez-Foguet, A., Gine-Garriga, R., and Ortego, M. I. (2017). Compositional data for global monitoring: The case
of drinking water and sanitation. Science of The Total Environment, 590:554–565. 3, 11
R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. 30
Reimann, C., Filzmoser, P., Hron, K., Kynclova, P., and Garrett, R. (2017). A new method for correlation analysis
of compositional (environmental) data–a worked example. Science of The Total Environment, 607:965–971. 9
Spalding-Fecher, R., Winkler, H., and Mwakasonda, S. (2005). Energy and the World Summit on Sustainable
Development: what next? Energy Policy, 33(1):99–112. 6
Suganthi, L. and Samuel, A. A. (2012). Energy models for demand forecasting—A review. Renewable and
sustainable energy reviews, 16(2):1223–1240. 11
UN (2015). 70/1. Transforming our world: the 2030 Agenda for Sustainable Development-A. Technical report,
RES/70/1. New York, USA: United Nations. 2
Van den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Analyzing compositional data with R, volume 122.
Springer. 2, 11, 15, 19
Weisberg, S. (2005). Applied linear regression, volume 528. John Wiley & Sons. 21, 23
Wood, S. N. (2006). Generalized additive models: an introduction with R. CRC press. x, 26
Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press. 24
WorldBank (2017). World Bank Open Data free and open access to global development data. 1, 3, 12, 13
Appendix A
Cross validation between CoDa and Raw
model outside the calibration range
This appendix displays bar plots that complement Section 4.4 on the cross validation between CoDa and raw compositions. These figures show the advantage of using CoDa instead of raw compositions outside the calibration range. The main conclusion is that, outside the calibration range, CoDa performs better than working with the raw compositions.
[Bar plot: root mean square error (y-axis) by interpolation method (x-axis: GAM, LM, SVM), grouped by transformation (CoDa, Raw).]
Figure A.0.1: Cross validation between CoDa and Raw model outside the calibration range for Bangladesh using the RMSE
[Bar plot: root mean square error (y-axis) by interpolation method (x-axis: GAM, LM, SVM), grouped by transformation (CoDa, Raw).]
Figure A.0.2: Cross validation between CoDa and Raw model outside the calibration range for India using the RMSE
[Bar plot: root mean square error (y-axis) by interpolation method (x-axis: GAM, LM, SVM), grouped by transformation (CoDa, Raw).]
Figure A.0.3: Cross validation between CoDa and Raw model outside the calibration range for Kenya using the RMSE
[Bar plot: root mean square error (y-axis) by interpolation method (x-axis: GAM, LM, SVM), grouped by transformation (CoDa, Raw).]
Figure A.0.4: Cross validation between CoDa and Raw model outside the calibration range for Sudan using the RMSE
Appendix B
Parameters used in the configuration of the
models
This appendix lists the parameters of the statistical models applied in the results chapter.
B.1 CoDa Models
B.1.1 Linear model with interactions
Table B.1.1: Configurations of the code used to estimate linear CoDa models with interactions.
Balance/country   Numeric covariate   Factor covariate   Interaction
Balance 1
  Bangladesh      time                class              time:class
  India           time                -                  -
  Kenya           poly(time,3)        class              poly(time,2):class
  Nigeria         time                class              time:class
  Sudan           time                -                  -
Balance 2
  Bangladesh      poly(time,3)        class              time:class
  India           poly(time,2)        -                  -
  Kenya           poly(time,3)        class              poly(time,2):class
  Nigeria         poly(time,2)        -                  time:class
  Sudan           poly(time,2)        -                  -
Balance 3
  Bangladesh      time                class              time:class
  India           time                -                  -
  Kenya           time                class              time:class
  Nigeria         time                class              -
  Sudan           time                -                  -
B.1.2 GAM with interactions
Table B.1.2: Configurations of the code used to estimate GAM CoDa models with interactions.
Balance/country Smooth function parameters s(time) Factor(class)
Complementary parametersbs k fx m by=class gamma method select
B.2 Models for Raw data
B.2.1 Linear configuration
Table B.2.1: Configurations of the code used to estimate linear Raw models with interactions.
Balance/country   Numeric covariate   Factor covariate   Interaction
Urban s. with electricity
  Bangladesh      time                class              time:class
  India           time                -                  -
  Kenya           time                class              time:class
  Nigeria         poly(time,2)        class              poly(time,2):class
  Sudan           poly(time,2)        -                  -
Rural s. with electricity
  Bangladesh      time                class              time:class
  India           time                -                  -
  Kenya           time                class              poly(time,2):class
  Nigeria         poly(time,2)        class              poly(time,2):class
  Sudan           poly(time,2)        -                  -
Urban s. without electricity
  Bangladesh      time                -                  -
  India           time                -                  -
  Kenya           time                class              time:class
  Nigeria         poly(time,2)        class              poly(time,2):class
  Sudan           time                -                  -
Rural s. without electricity
  Bangladesh      time                -                  -
  India           time                -                  -
  Kenya           time                class              poly(time,2):class
  Nigeria         poly(time,3)        class              poly(time,3):class
  Sudan           poly(time,2)        -                  -
B.2.2 GAM configuration
Table B.2.2: Configurations of the code used to estimate GAM Raw models with interactions.
                  Smooth function parameters s(time)                  Complementary parameters
Balance/country   bs   k    fx     m   by=class   Factor(class)   gamma       method    select
Urban s. with electricity
  Bangladesh      cs   5    TRUE   2   Yes        Yes             1.0517752   P-REML    FALSE
  India           cs   6    FALSE  3   Yes        No              0.8236297   P-REML    FALSE
  Kenya           cr   4    FALSE  3   Yes        Yes             1.251807    ML        TRUE
  Nigeria         cs   8    FALSE  2   Yes        Yes             1.0010443   REML      FALSE
  Sudan           cr   4    FALSE  2   No         No              1.967258    GACV.Cp   TRUE
Rural s. with electricity
  Bangladesh      cs   6    FALSE  3   Yes        Yes             1.199093    GACV.Cp   TRUE
  India           cs   4    FALSE  3   Yes        No              0.9344174   P-REML    TRUE
  Kenya           cs   7    FALSE  3   Yes        Yes             0.7277974   P-REML    FALSE
  Nigeria         cs   4    FALSE  3   Yes        No              1.474995    P-ML      FALSE
  Sudan           cs   14   TRUE   3   No         No              1.492431    REML      TRUE
Urban s. without electricity
  Bangladesh      cs   5    TRUE   3   Yes        Yes             1.058473    P-REML    FALSE
  India           cs   2    FALSE  3   Yes        No              0.8125455   GACV.Cp   FALSE
  Kenya           cs   5    TRUE   3   Yes        Yes             0.7114296   GCV.Cp    FALSE
  Nigeria         cs   6    TRUE   3   Yes        Yes             1.1093769   P-REML    FALSE
  Sudan           cs   15   FALSE  3   No         No              0.9578136   GACV.Cp   FALSE
Rural s. without electricity
  Bangladesh      cr   6    FALSE  3   Yes        Yes             0.6979624   REML      FALSE
  India           cs   8    FALSE  3   Yes        No              1.127840    P-REML    TRUE
  Kenya           cs   7    FALSE  3   Yes        Yes             0.6442111   REML      FALSE
  Nigeria         cs   5    FALSE  1   Yes        Yes             0.3833242   P-ML      FALSE
  Sudan           cs   15   TRUE   3   No         No              1.395701    P-REML    TRUE
B.2.3 SVM configuration
Table B.2.3: Configurations of the code used to estimate SVM Raw models with interactions.
Balance/country   Formula f(t, z)   cost        gamma         epsilon       kernel
Urban s. with electricity
  Bangladesh      time*class        670.2429    0.06072633    0.0896873     radial
  India           time*class        447.3538    0.06656046    0.0752192     radial
  Kenya           time*class        115.2261    0.006169162   0.009422912   radial
  Nigeria         time*class        959.0335    0.009250104   0.02365988    radial
  Sudan           time              676.0736    0.06562984    0.008478297   radial
Rural s. with electricity
  Bangladesh      time*class        662.8796    0.01353806    0.0238125     radial
  India           time*class        924.5399    0.06264566    0.01549738    radial
  Kenya           time*class        89.90413    0.003396583   0.02539831    radial
  Nigeria         time*class        17.1627     0.03082246    0.907276      radial
  Sudan           time              737.6026    0.04106443    0.006594902   radial
Urban s. without electricity
  Bangladesh      time*class        627.6893    0.1176374     0.01581494    radial
  India           time*class        0.0724129   0.0724129     0.1487976     radial
  Kenya           time*class        101.7273    0.009708038   0.02240919    radial
  Nigeria         time*class        29.49511    0.06702711    0.08153087    radial
  Sudan           time              332.1218    0.02526219    0.002168561   radial
Rural s. without electricity
  Bangladesh      time*class        539.0089    0.0418369     0.008133287   radial
  India           time*class        614.8501    0.06708856    0.0158196     radial
  Kenya           time*class        302.4322    0.002080184   0.02074777    radial
  Nigeria         time*class        521.5111    0.02427693    0.008321363   radial
  Sudan           time              889.0924    0.03564625    0.001016959   radial
Appendix C
R Code
This appendix displays the script for the LM, SVM and GAM models estimated for Bangladesh and the code used to reproduce the different plots shown in the Results chapter. To reproduce the analysis for another country, change the parameters of the different models (see Appendix B) and the country name. The data set of each country
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
## ######################################################################
## Exercise d)
## ======================================================================
## Figure 4.2.3: Comparison between single covariate model versus interaction model for ...
## ----------------------------------------------------------------------
## ######################################################################

# ======================================================================
# Figure 4.2.3: Comparison between single covariate model versus interaction model for ...
# ----------------------------------------------------------------------

## tidy the data for visualization
LM.CODA <- cbind(gather(CODA[CODA$Country.Name == pais, ], key = access, value = real, urban:norural),
                 fitted = gather(ilrinv, key = access, value = value, Var1:Var4)$value)
rm(ilrinv)

# ======================================================================
# LM Interaction model outside the calibration range
# ----------------------------------------------------------------------

# ======================================================================
# GAM Interaction model outside the calibration range
# ----------------------------------------------------------------------

## tidy the data for visualization
SVM.CODA <- SVM.CODA[!(SVM.CODA$time < max(CODA$time[CODA$data == "base"]) & SVM.CODA$data == "fitted"), ]
rm(ilrinv)

# ======================================================================
# SVM Interaction model outside the calibration range
# ----------------------------------------------------------------------

## ######################################################################
## Exercise f)
## ======================================================================
## 4.4. Consequences of using standard statistical techniques over
## compositional data in raw form
## ----------------------------------------------------------------------
## ######################################################################

## ======================================================================
## RAW MODELS.- LM
## ----------------------------------------------------------------------

RAW <- CODA[CODA$Country.Name == pais, -7]

lm1R <- lm(urban ~ time*class, data = RAW[RAW$data == "base", ])
lm2R <- lm(nourban ~ time*class, data = RAW[RAW$data == "base", ])
lm3R <- lm(rural ~ time, data = RAW[RAW$data == "base", ])
lm4R <- lm(norural ~ time, data = RAW[RAW$data == "base", ])

# ======================================================================
# Prediction 1.- lm() PREDICTIONS.- RAW MODELS
# ----------------------------------------------------------------------

# ======================================================================
# Figure 4.4.2: Cross validation between CoDa and Raw model inside the
# calibration range for Nigeria using the RMSE
# ----------------------------------------------------------------------

# ======================================================================
# Figure 4.4.3: Cross validation between CoDa and Raw model outside the
# calibration range for Nigeria using the RMSE
# ----------------------------------------------------------------------

hc2 <- ggplot(data = RMSE.T[RMSE.T$data == "fitted", ],
              aes(x = position, y = SE, fill = transformation)) +
  geom_bar(stat = "identity", color = "black", position = position_dodge( ), size = 1.2) +
  scale_fill_manual(values = c('#999999', '#E69F00')) +
  labs(x = 'Interpolation method', y = 'Root mean square error', fill = "Transformation") +
  theme_minimal(base_size = 18)