Hydrochemical assessment of Semarang area using multivariate statistics: A sample based dataset
Irawan Dasapta Erwin1 and Putranto Thomas Triadi2
1Faculty of Earth Sciences and Technology, Institut Teknologi Bandung, Jalan Ganesa No. 10, Bandung - 40132, Indonesia
2Faculty of Engineering, Universitas Diponegoro, Jalan Prof. H. Soedarto, SH, Tembalang, Kota Semarang - 50275, Indonesia
Correspondence to: Dasapta Erwin Irawan ([email protected] )
Abstract. This paper briefly describes the data set from our project "Hydrochemical
assessment of Semarang Groundwater Quality". All 58 samples were taken in 1992, 1993, 2003,
2006, and 2007, using well point data from several reports by the Ministry of Energy and Mineral
Resources and independent consultants. We provide 20 parameters for each sample (sample id,
coord X, coord Y, well depth, water level, water elevation, TDS, pH, EC, K, Ca, Na, Mg, Cl, SO4,
HCO3, year, ion balance, screen location, and chemical facies). The chemical composition was
tested in the Water Quality Laboratory, Universitas Diponegoro, using the mass spectrophotometer method.
The statistical treatment of the dataset (available on Zenodo, doi:10.5281/zenodo.57293) is
described as follows: (1) data preparation: converting the data to csv format and loading it into the
R environment; (2) data treatment: correlation matrix, cluster analysis using k-means and hierarchical
cluster analysis, and principal component analysis. For analysis and visualization, we used the following
R packages: ggplot2, dplyr, FactoMineR, factoextra, cluster, ggcorrplot, and
ape.
1 Introduction
This paper briefly describes the data set from our project "Hydrochemical assessment
of Semarang Groundwater Quality". The aim of the project is to understand the classification
and distribution of water quality in the Semarang area and to explain the underlying processes. This
analysis is important given the rapid development of infrastructure (Putranto and Rüde, 2016) and urban
settlement in the coastal area, and the rate of salinity encroachment (Rahmawati and Marfai, 2013). The
study is located in the Semarang area, Indonesia.
Earth Syst. Sci. Data Discuss., doi:10.5194/essd-2016-29, 2016
Manuscript under review for journal Earth Syst. Sci. Data
Published: 28 July 2016
© Author(s) 2016. CC-BY 3.0 License.
2 General description of the dataset
2.1 Samples
All 58 samples were taken in 1992, 1993, 2003, 2006, and 2007, using well
point data from several reports by the Ministry of Energy and Mineral Resources and independent
consultants. We provide 20 parameters for each sample: sample id, coord X, coord Y,
well depth, water level, water elevation, TDS, pH, EC, K, Ca, Na, Mg,
Cl, SO4, HCO3, year, ion balance, screen location, and chemical facies.
The chemical composition was tested in the Water Quality Laboratory, Universitas Diponegoro,
using the mass spectrophotometer method. The laboratory procedures followed the SNI (Indonesian National
Standard) for water quality measurement (BSN, 2012), which complies with the US-EPA standards.
The original dataset is available on Zenodo (Irawan and Putranto, 2016).
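The ion balance parameter can be cross-checked from the major ions. The following is a minimal sketch, not the procedure used to build the dataset: it assumes concentrations in mg/L, uses hypothetical column names (Na, K, Ca, Mg, Cl, SO4, HCO3), and applies the common charge-balance error formula, 100 x (sum of cations - sum of anions) / (sum of cations + sum of anions), with all sums in meq/L. The sample values are made up.

```r
# Equivalent weights (g/eq) = molar mass / |charge|, from standard atomic masses.
eq_weight <- c(Na = 22.99, K = 39.10, Ca = 40.08 / 2, Mg = 24.305 / 2,
               Cl = 35.45, SO4 = 96.06 / 2, HCO3 = 61.02)

# Charge-balance error (%) for a data frame of concentrations in mg/L.
# Column names are hypothetical and must match names(eq_weight).
ion_balance <- function(d) {
  meq <- sweep(as.matrix(d[, names(eq_weight)]), 2, eq_weight, "/")
  cations <- rowSums(meq[, c("Na", "K", "Ca", "Mg"), drop = FALSE])
  anions  <- rowSums(meq[, c("Cl", "SO4", "HCO3"), drop = FALSE])
  100 * (cations - anions) / (cations + anions)
}

# Made-up sample for illustration only (not taken from the dataset).
toy <- data.frame(Na = 46, K = 3.9, Ca = 40, Mg = 12,
                  Cl = 71, SO4 = 48, HCO3 = 122)
cbe <- ion_balance(toy)
cbe
```

A value close to zero indicates that cations and anions roughly balance; the acceptance threshold is a choice left to the analyst.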
Figure 1. The location of well point and the Stiff diagram
2.2 The value of dataset
The following list describes the value of the dataset:
– It provides the current setting of water quality as the baseline of environmental monitoring of
the area and serves as a source of groundwater quality indicator for the regional planning of
the area,
– It promotes the importance of open government dataset and enriches the library of water quality
dataset of the area,
– It sets an example of data re-use and re-analysis in hydrogeological research landscape.
3 Geographical coverage
The sampling area is the Semarang area; Semarang is the capital of Central Java Province, Java, Indonesia.
The sampling points are distributed from the southern volcanic highland to the coastal area. The area is
bounded by the coordinates (420000, 9240000) and (470000, 9220000). We plotted the data points using the
UTM-WGS84-48S projection system.
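As a quick sanity check, sample coordinates can be tested against the bounding box stated above. The sketch below is illustrative only: `x` and `y` are hypothetical names for the easting and northing columns, and the two points are made up.

```r
# Bounding box corners as stated in the text (UTM metres).
bbox <- list(xmin = 420000, xmax = 470000, ymin = 9220000, ymax = 9240000)

# TRUE where a point lies inside the box. `x` and `y` stand for the
# easting and northing columns; their real names in the csv may differ.
in_bbox <- function(x, y, b = bbox) {
  x >= b$xmin & x <= b$xmax & y >= b$ymin & y <= b$ymax
}

# Made-up coordinates for illustration (not real well locations).
pts <- data.frame(x = c(435000, 480000), y = c(9228000, 9230000))
inside <- in_bbox(pts$x, pts$y)
inside
```

Points flagged FALSE would be worth checking for transcription or projection errors before plotting.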
4 Statistical design
Hierarchical cluster analysis (HCA) and principal component analysis (PCA) are both widely
used in hydrochemical analysis (Adams et al., 2001; King et al., 2014; Ayenew et al., 2009;
Deon et al., 2015; Wilkinson, 2014; Maechler et al., 2016). We have applied both approaches
to groundwater in volcanic areas at various locations (Irawan et al., 2009; Herdianita et al., 2010).
The R implementation was based on Coghlan (2009).
4.1 Data preparation
The dataset was formatted as a csv (comma-separated values) file before being read into R (R
Core Team, 2016) for analysis with the following R packages: ggplot2 (Wickham, 2009), dplyr
(Wickham and Francois, 2016), FactoMineR (Lê et al., 2008), factoextra (Kassambara and Mundt,
2016), cluster (Maechler et al., 2016), ggcorrplot (Kassambara, 2016), and ape (Paradis et al.,
2004).
df <- as.data.frame(read.csv("data_smg.csv")) # loading as data frame
head(df) # checking header
colSums(is.na(df)) # counting NAs per column in df
df2 <- df[c(2, 5:18)] # subsetting df, exclude var with NAs
head(df2)
colSums(is.na(df2)) # counting NAs per column in df2
str(df2) # checking data type in df2
sapply(df2, is.numeric) # checking which columns of df2 are numeric
rownames(df2) <- df2$location # setting col location as row names
df2$location <- NULL # dropping the label column (assumed non-numeric) so df2 is all numeric
str(df2) # checking data type in df2
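The subsetting step above selects columns by position, which depends on the exact layout of the csv file. As a hedged alternative, the sketch below keeps every column that is numeric and free of missing values, and carries the labels as row names. The toy data frame and its column names are made up for illustration.

```r
# Toy data frame standing in for the csv (values are made up).
toy <- data.frame(location = c("A", "B", "C"),
                  TDS = c(250, 900, 410),
                  pH = c(6.8, 7.4, 7.1),
                  depth = c(40, NA, 85),
                  stringsAsFactors = FALSE)

# Keep columns that are numeric and contain no missing values.
keep <- vapply(toy, function(col) is.numeric(col) && !anyNA(col), logical(1))
num <- toy[, keep, drop = FALSE]
rownames(num) <- toy$location # labels carried as row names, not as a column
num
```

Selecting by property rather than by index keeps the later calls to cor(), kmeans(), and PCA() from failing on a text column.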
4.2 Data treatment
The dataset was treated using the following methods: correlation matrix, HCA, and PCA. The steps
and R code are described below.
4.2.1 Correlation matrix
Here we used the PerformanceAnalytics and ggcorrplot packages to build a correlation
matrix. The following is the code.
## using PerformanceAnalytics
install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
chart.Correlation(df2, histogram = TRUE, pch = 19) # visual PA

## using ggcorrplot
install.packages("ggcorrplot")
library(ggcorrplot)
correl <- round(cor(df2), 1) # rounding correl matrix
head(correl[, 1:14]) # view headers
p.mat <- cor_pmat(df2) # compute p-values
head(p.mat[, 1:14]) # view headers
ggcorrplot(correl) # making heatmap
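For readers without the plotting packages, the same two quantities (correlation coefficients and their p-values) can be computed with base R alone. The sketch below is an illustration on synthetic data, not the Semarang dataset; the variable names are borrowed from the parameter list only as labels.

```r
set.seed(1)
# Synthetic stand-in for three hydrochemical variables (not real data).
n <- 30
Cl <- rnorm(n, mean = 100, sd = 20)
toy <- data.frame(Cl = Cl,
                  Na = 0.6 * Cl + rnorm(n, sd = 5),  # built to correlate with Cl
                  pH = rnorm(n, mean = 7, sd = 0.3)) # independent of Cl

r <- cor(toy) # Pearson correlation matrix

# Matrix of p-values from pairwise cor.test(); the diagonal is left at 0.
vars <- names(toy)
p <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) p[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
  }
}
round(r, 2)
round(p, 3)
```

Reading the two matrices side by side helps separate strong correlations from ones that are not statistically distinguishable from zero at the chosen sample size.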
4.2.2 Cluster analysis (CA)
We built the CA using k-means and hierarchical clustering, implemented with R base functions and the
cluster and factoextra packages, based on the following code.
install.packages("factoextra")
# install_github("kassambara/factoextra")
install.packages("cluster")
library(cluster)
library(factoextra)

### k-means method
km2 <- kmeans(df2, 2, nstart = 25) # kmeans with 2 centers
km3 <- kmeans(df2, 3, nstart = 25) # kmeans with 3 centers
km2$cluster # extracting cluster number
km2$centers # extracting cluster means (or centers)
plotkm2 <- plot(df2,
                col = km2$cluster,
                pch = 19,
                frame = TRUE,
                main = "K-means with k = 2") # notes: need longer x axis
points(km2$centers,
       col = 1:2,
       pch = 8, cex = 3)

km3$cluster # extracting cluster number
km3$centers # extracting cluster means (or centers)
plotkm3 <- plot(df2,
                col = km3$cluster,
                pch = 19,
                frame = TRUE,
                main = "K-means with k = 3")
points(km3$centers,
       col = 1:3, # one colour per centre (three centres)
       pch = 8,
       cex = 3)

### evaluating cluster
df2 <- scale(df2)
head(df2)
fviz_nbclust(df2,
             kmeans, method = "wss") +
  geom_vline(xintercept = 3,
             linetype = 2) # determining optimal no cluster
km3.res <- kmeans(df2, 3, nstart = 25) # running kmeans with 3 clusters
print(km3.res) # print output
fviz_cluster(km3.res, data = df2) # vis output

pam.res <- pam(scale(df2), 3) # running pam cluster with 3 clusters
pam.res$medoids # extract medoids
clusplot(pam.res,
         main = "Cluster plot, k = 3",
         color = TRUE)
plot(silhouette(pam.res), col = 2:5)
fviz_silhouette(silhouette(pam.res))
clarax <- clara(df2, 3, samples = 5) # using clara method
fviz_cluster(clarax,
             stand = FALSE,
             geom = "point",
             label = TRUE,
             pointsize = 1)

### Creating dendrogram
distdf2.res <- dist(df2,
                    method = "euclidean")
hcadf2 <- hclust(distdf2.res,
                 method = "complete")
plot(hcadf2,
     hang = -1) # dendrogram vis
rect.hclust(hcadf2,
            k = 3,
            border = 2:4) # dendrogram vis with grouping

### using NbClust package to evaluate no of cluster
install.packages("NbClust") # for more precise no of cluster
library("NbClust")
resdf2.nb <- NbClust(df2,
                     distance = "euclidean",
                     min.nc = 2, max.nc = 10,
                     method = "complete",
                     index = "gap")
resdf2.nb # print the results
resdf2.nb$All.index # All gap statistic values
resdf2.nb$Best.nc # Best number of clusters
resdf2.nb$Best.partition # calculate best partition
nbdf2 <- NbClust(df2,
                 distance = "euclidean",
                 min.nc = 2,
                 max.nc = 10,
                 method = "complete",
                 index = "all")
nbdf2
fviz_nbclust(nbdf2) + theme_minimal()
# dev.off() # delete the leading '#' sign whenever
            # you want to clean the plot screen
distdf2.res <- dist(df2,
                    method = "euclidean")
hcadf2 <- hclust(distdf2.res,
                 method = "complete")
plot(hcadf2,
     hang = -1) # dendrogram vis
rect.hclust(hcadf2,
            k = 3,
            border = 2:4) # dendrogram vis with grouping

#### rotating the plot

#### using ape
# load package ape; remember to install it: install.packages('ape')
install.packages("ape")
library(ape)
plot(as.phylo(hcadf2),
     cex = 0.9,
     label.offset = 1,
     type = "unrooted")

plot(as.phylo(hcadf2),
     cex = 0.9,
     label.offset = 1)
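The elbow plot and the dendrogram above rely on graphical output. As a complement, the sketch below computes the underlying numbers with base R only: the total within-cluster sum of squares for a range of k, and the cluster membership obtained by cutting the tree. It runs on synthetic data with three built-in groups, so the values are illustrative and say nothing about the Semarang samples.

```r
set.seed(42)
# Synthetic data: three well-separated groups in two variables.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2),
             matrix(rnorm(40, mean = 12), ncol = 2))
toy <- scale(toy) # standardize so both variables weigh equally

# Total within-cluster sum of squares for k = 1..6 (the "elbow" curve).
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Hierarchical clustering, then cut the tree into three groups.
hc <- hclust(dist(toy, method = "euclidean"), method = "complete")
groups <- cutree(hc, k = 3)
table(groups)
```

The vector returned by cutree() can be attached to the sample table, which makes it possible to map cluster membership against well location or depth.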
4.2.3 Principal component analysis (PCA)
The PCA is computed with the PCA() function of the FactoMineR package and visualized using R base
graphics and the factoextra package. The following is the code.
df <- as.data.frame(read.csv("data_smg.csv")) # loading as data frame
head(df) # checking header
colSums(is.na(df)) # counting NAs per column in df
df2 <- df[c(2, 5:18)] # subsetting df, exclude var with NAs
head(df2)
colSums(is.na(df2)) # counting NAs per column in df2
str(df2) # checking data type in df2
sapply(df2, is.numeric) # checking which columns of df2 are numeric
rownames(df2) <- df2$location # setting col location as row names
df2$location <- NULL # dropping the label column (assumed non-numeric) before PCA
str(df2) # checking data type in df2

install.packages("FactoMineR")
library("FactoMineR")
library(factoextra)
res.pca <- PCA(df2, graph = FALSE)
eigenvalues <- res.pca$eig
head(eigenvalues[, 1:2])
barplot(eigenvalues[, 2], names.arg = 1:nrow(eigenvalues),
        main = "Variances",
        xlab = "Principal Components",
        ylab = "Percentage of variances",
        col = "steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eigenvalues), eigenvalues[, 2],
      type = "b", pch = 19, col = "red")

res.pca$var$contrib
fviz_pca_var(res.pca)
fviz_pca_var(res.pca, col.var = "steelblue") +
  theme_minimal()

res.pca$ind$contrib
plot(res.pca, choix = "ind")
fviz_pca_biplot(res.pca, geom = "text")
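The eigenvalues and percentages of variance used in the bar plot can also be reproduced with the base R function prcomp(). The sketch below is an illustration on synthetic data; it standardizes the variables, which matches the default behaviour of FactoMineR's PCA() (scale.unit = TRUE). The variable names are labels only.

```r
set.seed(7)
# Synthetic stand-in data (not real measurements).
n <- 40
base <- rnorm(n)
toy <- data.frame(EC  = base + rnorm(n, sd = 0.2), # EC and TDS share a common signal
                  TDS = base + rnorm(n, sd = 0.2),
                  pH  = rnorm(n))                  # unrelated variable

pca <- prcomp(toy, center = TRUE, scale. = TRUE) # PCA on standardized variables
eig <- pca$sdev^2                                # eigenvalues
pct <- 100 * eig / sum(eig)                      # percentage of variance per component
round(pct, 1)
```

With standardized variables the eigenvalues sum to the number of variables, so a component with an eigenvalue above 1 explains more variance than any single original variable.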
5 Conclusions
The present study integrates geological and hydrogeological data with statistical analysis to construct
a hydrogeological model of the aquifer system in Semarang. The statistical treatment shows a consistent
anomaly at well point 37 (University Sultan Agung 2/Unisula-2). The anomaly needs more in-depth
analysis to understand the underlying processes in the groundwater flow.
This paper is one of our early examples of a data paper in Indonesia. We hope it will encourage
more data papers in support of open science in our country.
Acknowledgements. The authors are thankful to the Department of Energy and Resources of Central Java
Province and the Geological Agency in Bandung for providing hydrogeological data. Hopefully this paper will
initiate a mass movement on open government data and data reuse in Indonesia.
References
Adams, S., Titus, R., Pietersen, K., Tredoux, G., and Harris, C.: Hydrochemical characteristics of aquifers near Sutherland in the Western Karoo, South Africa, Journal of Hydrology, 241, 91–103, 2001.
Ayenew, T., Fikre, S., Wisotzky, F., Demlie, M., and Wohnlich, S.: Hierarchical cluster analysis of hydrochemical data as a tool for assessing the evolution and dynamics of groundwater across the Ethiopian rift, International Journal of Physical Sciences, 4, 76–90, http://www.academicjournals.org/journal/IJPS/article-abstract/D64DFAE18634, 2009.
BSN: Standard tests for water sample, Tech. rep., National Board for Standards, http://sisni.bsn.go.id/index.php/sni_main/sni/detail_sni/7689, 2012.
Coghlan, A.: Little Book of R for Multivariate Analysis! — Multivariate Analysis 0.1 documentation, Wellcome Trust Sanger Institute, Cambridge, U.K., https://little-book-of-r-for-multivariate-analysis.readthedocs.io/en/latest/, 2009.
Deon, F., Förster, H.-J., Brehme, M., Wiegand, B., Scheytt, T., Moeck, I., Jaya, M., and Putriatni, D.: Geochemical/hydrochemical evaluation of the geothermal potential of the Lamongan volcanic field (Eastern Java, Indonesia), Geothermal Energy, 3, 1–21, doi:10.1186/s40517-015-0040-6, 2015.
Herdianita, N. R., Julinawati, T., and Amorita, I. E.: Hydrogeochemistry of Thermal Water from Surface Manifestation at Gunung Ciremai and Its Surrounding, Cirebon, West Java–Indonesia, in: Proceedings World Geothermal Congress 2010, http://www.geothermal-energy.org/pdf/IGAstandard/WGC/2010/1476.pdf, 2010.
Irawan, D. E. and Putranto, T. A.: Dataset: hydrochemical assessment of Semarang area, Indonesia, doi:10.5281/zenodo.57293, http://dx.doi.org/10.5281/zenodo.57293, 2016.
Irawan, D. E., Puradimaja, D. J., Notosiswoyo, S., and Soemintadiredja, P.: Hydrogeochemistry of volcanic hydrogeology based on cluster analysis of Mount Ciremai, West Java, Indonesia, Journal of Hydrology, 376, 221–234, http://www.sciencedirect.com/science/article/pii/S002216940900434X, 2009.
Kassambara, A.: ggcorrplot: Visualization of a Correlation Matrix using ’ggplot2’, https://CRAN.R-project.org/package=ggcorrplot, R package version 0.1.1, 2016.
Kassambara, A. and Mundt, F.: factoextra: Extract and Visualize the Results of Multivariate Data Analyses, https://CRAN.R-project.org/package=factoextra, R package version 1.0.3, 2016.
King, A. C., Raiber, M., and Cox, M. E.: Multivariate statistical analysis of hydrochemical data to assess alluvial aquifer–stream connectivity during drought and flood: Cressbrook Creek, southeast Queensland, Australia, Hydrogeology Journal, 22, 481–500, doi:10.1007/s10040-013-1057-1, http://link.springer.com/10.1007/s10040-013-1057-1, 2014.
Lê, S., Josse, J., and Husson, F.: FactoMineR: A Package for Multivariate Analysis, Journal of Statistical Software, 25, 1–18, doi:10.18637/jss.v025.i01, 2008.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K.: cluster: Cluster Analysis Basics and Extensions, R package version 2.0.4, 2016.
Paradis, E., Claude, J., and Strimmer, K.: APE: analyses of phylogenetics and evolution in R language, Bioinformatics, 20, 289–290, 2004.
Putranto, T. and Rüde, T.: Hydrogeological Model of an Urban City in a Coastal Area, Case study: Semarang, Indonesia, Indonesian Journal on Geoscience, 3, 17–27, https://ijog.geologi.esdm.go.id/index.php/IJOG/article/view/227, 2016.
R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/, 2016.
Rahmawati, N. and Marfai, M.: Salinity Pattern in Semarang Coastal City: An Overview, Indonesian Journal of Geosciences, 8, doi:10.17014/ijog.3.1.17-27, https://ijog.geologi.esdm.go.id/index.php/IJOG/article/view/160, 2013.
Wickham, H.: ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, http://ggplot2.org, 2009.
Wickham, H. and Francois, R.: dplyr: A Grammar of Data Manipulation, https://CRAN.R-project.org/package=dplyr, R package version 0.5.0, 2016.
Wilkinson, D. J.: Multivariate Data Analysis using R: a course notes, Tech. rep., https://www.staff.ncl.ac.uk/d.j.wilkinson/teaching/mas8381/notes14.pdf, 2014.