UNIVERSITATIS OULUENSIS ACTA C TECHNICA OULU 2011 C 403 Teemu Räsänen INTELLIGENT INFORMATION SERVICES IN ENVIRONMENTAL APPLICATIONS UNIVERSITY OF OULU, FACULTY OF TECHNOLOGY, DEPARTMENT OF PROCESS AND ENVIRONMENTAL ENGINEERING C 403 ACTA Teemu Räsänen
UNIVERSITY OF OULU P.O.B. 7500 FI-90014 UNIVERSITY OF OULU FINLAND
ACTA UNIVERSITATIS OULUENSIS
SERIES EDITORS
SCIENTIAE RERUM NATURALIUM: Senior Assistant Jorma Arhippainen
HUMANIORA: Lecturer Santeri Palviainen
TECHNICA: Professor Hannu Heusala
MEDICA: Professor Olli Vuolteenaho
SCIENTIAE RERUM SOCIALIUM: Senior Researcher Eila Estola
SCRIPTA ACADEMICA: Director Sinikka Eskelinen
OECONOMICA: Professor Jari Juga
EDITOR IN CHIEF: Professor Olli Vuolteenaho
PUBLICATIONS EDITOR: Publications Editor Kirsti Nurkkala
ISBN 978-951-42-9656-7 (Paperback)
ISBN 978-951-42-9657-4 (PDF)
ISSN 0355-3213 (Print)
ISSN 1796-2226 (Online)
TEEMU RÄSÄNEN
INTELLIGENT INFORMATION SERVICES IN ENVIRONMENTAL APPLICATIONS
Academic dissertation to be presented with the assent of the Faculty of Technology of the University of Oulu for public defence in Auditorium PR102, Linnanmaa, on 2 December 2011, at 12 noon
Supervised by
Professor Kauko Leiviskä
Professor Mikko Kolehmainen
Professor Juhani Ruuskanen
Reviewed by
Professor Kari Koskinen
Professor Tommi Kärkkäinen
ISBN 978-951-42-9656-7 (Paperback)
ISBN 978-951-42-9657-4 (PDF)
ISSN 0355-3213 (Printed)
ISSN 1796-2226 (Online)
Cover Design
Raimo Ahonen
JUVENES PRINT
TAMPERE 2011
Räsänen, Teemu, Intelligent information services in environmental applications.
University of Oulu, Faculty of Technology, Department of Process and Environmental Engineering, P.O. Box 4300, FI-90014 University of Oulu, Finland
Acta Univ. Oul. C 403, 2011
Oulu, Finland
Abstract
The amount of information available has increased due to the development of our modern digital society. This has caused an information overflow, meaning that there is a lot of data available but the meaningful information or knowledge is hidden inside the overwhelming data smog. Nevertheless, the large amount of data together with the increased capabilities of computers provides a great opportunity to learn the behaviour of different kinds of phenomena at a more detailed level.
The quality of life, well-being and a healthy living environment, for example, are fields where new information services can assist the creation of proactive decisions to avoid environmental problems caused by industrial activity, traffic, or extraordinary weather conditions. The combination of data coming from different sources such as public registers, companies’ operational information systems, online sensors and process monitoring systems provides a fruitful basis for creating new valuable information for citizens, decision makers or other end users.
The aim of this thesis is to present the concept of intelligent information services and a methodological background in order to add intelligence using computational methods for the enrichment of multidimensional data. Moreover, novel examples are presented where new significant information is created and then provided for end users. The data refining process used is called data mining and contains methods for data collection, pre-processing, modelling, visualizing and interpreting the results and sharing the new information thus created.
Information systems are a base for the creation of information services, meaning that stakeholder groups have access only to information but they do not own the whole information system that contains measurement systems, data collecting, and a technological platform. Intelligence in information services comes from the use of computationally intelligent methods in data processing, modelling and visualization. In this thesis the general concept of such services is presented and concretized using five cases that focus on environmental and industrial examples.
The results of these case studies show that the combination of different data sources provides fertile ground for developing new information services. The data mining methods used, such as clustering and predictive modelling together with effective pre-processing methods, have great potential to handle the large amount of multivariate data in this environmental context also. A self-organizing map combined with k-means clustering is useful for creating more detailed information about personal energy use. Predictive modelling using a multilayer perceptron (MLP) is well suited for estimating the number of tourists visiting a leisure centre and for finding the correspondence between pulp process characteristics and the chemicals used. These results have many indirect effects on reducing negative concerns regarding our surroundings and maintaining a healthy living environment.
The innovative use of stored data is one of the main elements in the creation of future information services. Thus, more emphasis should be placed on the development of data integration and effective data processing methods. Furthermore, it is noted that final end users, such as citizens or decision makers, should be involved in the data refining process at the very first stage. In this way, the approach is truly customer-oriented and the results fulfil the concrete need of specific end users.
Keywords: data integration, data mining, electricity, environmental informatics, intelligent information services, k-means, multilayer perceptron, pulp, self-organizing map
Räsänen, Teemu, Älykkäät informaatiopalvelut ympäristöalan sovelluksissa (Intelligent information services in environmental applications).
University of Oulu, Faculty of Technology, Department of Process and Environmental Engineering, P.O. Box 4300, FI-90014 University of Oulu
Acta Univ. Oul. C 403, 2011
Oulu
Tiivistelmä (Finnish abstract, translated)
The amount of information has grown significantly with the development of the information society. We have at our disposal a considerable amount of information in various forms, of which we can nevertheless exploit only a part. The large volume of continuously measured data and its scattered location pose challenges of their own for the utilization of information. In the information society, well-being and the preservation of a healthy living environment are regarded as more important than before. On the other hand, making the operations of companies more efficient and promoting sustainable development require continuous improvement. With the help of information technology, multidimensional measurement and register data can be used, for example, for proactive decision making that advances the goals mentioned above.
This work presents the concept of intelligent information services for the environmental field, in which the essential elements are identifying the needs of end users and solving problems with refined information. Behind intelligent information services is a unified data refining process based on data mining, in which raw data is refined into a form suitable for end users. The data refining process consists of data collection and pre-processing, modelling, visualization of information, interpretation of results, and sharing the relevant information with end-user groups. Computationally intelligent methods have been used for processing and analysing the data, which is the origin of the title of the work: intelligent information services.
The dissertation is based on five articles, which demonstrate that the data refining process works in different cases and present examples of computational methods suited to each stage of the process. The articles describe information services related to forecasting visitor numbers in a tourist area and to reducing household electricity consumption, as well as an analysis for reducing the amount of chemicals used in a pulp process. The experience and results gained from these have been generalized into the concept of an intelligent information service.
A second aim of the dissertation is to encourage organizations to use their data resources more efficiently and in more diverse ways than before, and to consider using data sources available from outside their own organization. In addition, the development of new kinds of information services and businesses would be supported by publishing as open data the data resources that have been collected with public funds and are partly held by companies.
Acknowledgements
My educational journey has followed a somewhat different route than usual and
culminates now in the completion of this thesis. First I graduated as a
Telecommunications Mechanic from Pohjois-Savo Vocational School and then
received the degree of Bachelor of Science in Environmental Engineering from
Savonia University of Applied Sciences. After these first educational milestones I
met Prof. Mikko Kolehmainen, who introduced me to the interesting field of
environmental informatics. The combination of computer science and
environmental technology struck me like a thunderbolt, and after three years I
received a Master of Science in Environmental Informatics from the University of Oulu.
I am still following the same path, but I can say that I never believed that I could
receive the degree of Doctor of Technology. Well, miracles happen all the time.
However, I was lucky because I found this multidisciplinary subject, which I
loved, and also a research “team” that made its individuals much better and
more skilful by working as a team.
This research has been carried out at the Research Group of Environmental
Informatics (University of Eastern Finland), which has been a very inspiring
working environment. It has been an honour to be part of this group and to have
the possibility to grow and learn things together with the group and my colleagues. I
also wish to express my gratitude to the personnel of the Department of Environmental
Science (UEF), especially to Ms. Marja-Leena Patronen, who has been an invaluable
help with project management.
I am deeply grateful to my supervisors, Prof. Kauko Leiviskä (University of
Oulu), Prof. Mikko Kolehmainen and Prof. Juhani Ruuskanen (University of
Eastern Finland). Mikko introduced me to the interesting area of computational
methods and data mining and sent me on this journey, whereas Kauko made me
reach the destination and finally finish the work. Juhani was always available
whenever I needed guidance concerning environmental issues or the writing
of scientific articles. The quality of this thesis improved because of the positive and
encouraging comments of the pre-examiners, Prof. Kari Koskinen (Aalto
University) and Prof. Tommi Kärkkäinen (University of Jyväskylä). Thank you
for reviewing this thesis. I would also like to thank Mr. Mike Jones from PELC
Inc. for proofreading the manuscript.
I would like to thank my co-authors M.Sc. Harri Niska, Jarkko Tiirikainen,
Teri Hiltunen, M.Sc. Dimitris Voukantsis, Prof. Kostas Karatzas, Dr. Risto
Soukka, M.Sc. Sami Kokki, Dr. Yrjö Hiltunen, Prof. Mikko Kolehmainen and
Prof. Juhani Ruuskanen. You have shared the painful process of writing articles.
It is true that without a good team, scoring goals is not possible.
This study has been financially supported by Tekes (the Finnish Funding
Agency for Technology and Innovation) and the companies that have participated in
our research projects. I would like to thank Finnet-liitto ry., DNA Finland Oy,
Tahkovuori Oy, Finnish Road Administration, Tahko Chalet Oy, City of Nilsiä,
Enfo Oy, Fortum Markets Oy, Savon Voima Verkko Oyj, Andritz Oy, Foster
Wheeler Oy, Honeywell Oy and Varenso Oy for their co-operation and financial
support. The co-operation has been invaluable in the search for an understanding of
the challenges that the real world sets for data analysis.
Preparing this thesis has been a nice journey containing moments of joy and
enthusiasm, but sometimes there were also some dark clouds shadowing my work.
The support of my dear wife Johanna and the joy that our lively children Arttu,
Eemil and Veikka have brought have been really important facilitators of this
work. I would like to thank my parents Eila and Veikko. I also thank Tero, Seppo,
Mirja and my few friends. You have all encouraged and supported me all the way
with this dream that has come true.
Lanzarote, 25th of October 2011 Teemu Räsänen
Abbreviations
API Application Program Interface
BMU Best-Matching Unit (SOM algorithm)
DB Davies-Bouldin Index
FFT Fast Fourier Transform
GSM Global System for Mobile communications
GPRS General Packet Radio Service
IA Index-of-Agreement
KDD Knowledge Discovery from Data
LCP Life Cycle Profit
LS Least Squares
MLE Maximum Lyapunov Exponent
MLP Multilayer Perceptron
MMEA Monitoring, Measuring and Environmental Assessment consortium
MySQL Open source relational database management system
NaOH Sodium Hydroxide
NLS National Land Survey of Finland
OLTP Online Transaction Processing
OGC Open Geospatial Consortium
QE Self-Organizing Map Quantisation Error
R2 The Coefficient of Determination
RMSE Root Mean Square Error
SMS Short Message Service
SOA Service Oriented Architecture
SOAP Simple Object Access Protocol
SOM Self-Organizing Map
SWE Sensor Web Enablement
WWW World-Wide Web
XML Extensible Markup Language
List of original publications
This thesis consists of an introductory part and the following five peer-reviewed
original publications.
I Räsänen T, Niska H, Hiltunen T, Tiirikainen J & Kolehmainen M (2009) Predictive system for monitoring regional visitor attendance levels in large recreational areas. Journal of Environmental Informatics 13: 45–55.
II Räsänen T, Ruuskanen J & Kolehmainen M (2008) Reducing energy consumption by using self-organizing maps to create more personalized electricity use information. Applied Energy 85: 830–840.
III Räsänen T, Voukantsis D, Niska H, Karatzas K & Kolehmainen M (2010) Data-based method for creating electricity use load profiles using large amount of customer-specific hourly measured electricity use data. Applied Energy 87: 3538–3545.
IV Räsänen T & Kolehmainen M (2009) Feature-based clustering for electricity use time series data. ICANNGA 2009. Lecture Notes in Computer Science 5495: 401–412.
V Räsänen T, Soukka R, Kokki S & Hiltunen Y (2008) Neural networks in process life cycle profit modelling. Expert Systems with Applications 35: 604–610.
Contents
Abstract
Tiivistelmä
Acknowledgements
Abbreviations
List of original publications
Contents
1 Introduction
  1.1 Background
  1.2 Information systems and services
  1.3 Computational intelligence
  1.4 Data mining in environmental applications
  1.5 Aims of the thesis
  1.6 The author’s contribution
  1.7 The structure of the thesis
2 Data integration and information sources
  2.1 Data integration
  2.2 Information sources
    2.2.1 Mobile telecommunications data
    2.2.2 Population information system
    2.2.3 Corporate operative data
    2.2.4 Weather data
    2.2.5 Other environmental data sources
3 Data pre-processing and computational methods
  3.1 Pre-processing of data
  3.2 Feature extraction
  3.3 Variable selection
  3.4 Clustering
  3.5 Predictive modelling
  3.6 Validation of models and results
    3.6.1 Cross-validation
    3.6.2 Estimating the “goodness” of clustering
    3.6.3 Expert opinions as a validation method
4 Results of the data mining tasks
  4.1 Problem identification
  4.2 Data acquisition
  4.3 Pre-processing of data
  4.4 Modelling and data analysis
  4.5 Interpretation of results
  4.6 Knowledge deployment
5 Discussion
  5.1 The concept of intelligent information services
  5.2 Innovative use of data
  5.3 Commercializing intelligent information services
6 Conclusions
References
Original publications
1 Introduction
1.1 Background
We are living in an information society where the creation, distribution, use,
integration and manipulation of information are crucial for economic, political,
and cultural activity. In this context, information technology plays a very
significant role enabling access to the available information and services, which
improves the quality of life. The information society is defined as a creative
society that is based on interaction. New technology has a
significant role, but most important is the new way of doing things (Castells &
Himanen 2002).
There is also a relationship between sustainable development and the
information society. The effective use of information is seen as vital in the
creation of a path towards sustainable development. Bringing together
environmental issues and the information society is challenging but it would
result in a new kind of understanding about the overall framework (Välimäki
2002). Within this context, the overwhelming amount of information places
demands on the integration of computer science, environmental science and new
technological approaches in order to promote sustainable development (Pillmann,
Geiger & Voigt 2006). The development of the information society has created an
environment full of potential for utilizing new technology, such as computer
science, in sustainable development (Välimäki 2002).
This modern society has been confronted with a data overload, which makes
it difficult to see the significant issues behind the data or to judge its validity. Problems
relating to the quantity and diversity of data have always been attached to
decision making and the efficient use of available information (Bawden &
Robinson 2009). Moreover, information is moving faster and becoming more
plentiful, and people, enterprises and businesses are benefiting from this change.
Unfortunately, at a certain level, the glut of information no longer adds to our
quality of life but instead begins to promote stress, confusion and even ignorance.
Furthermore, this information overload impairs the ability to see the critical
distinction between information and understanding (Shenk 2003).
The main reasons behind the information overload are the increasing rate of
new information being produced and the ease of transmission of data via the
Internet. Furthermore, there are many different channels of incoming information,
like online measuring systems, telephone, e-mail, instant messaging and Really
Simple Syndication (RSS) news feeds. As a result, there are large amounts
of complex data to handle, and in the worst case the data contain many contradictions,
inaccuracies or even missing values. In addition, the data may suffer from a
low signal-to-noise ratio. For these reasons, there is a need for
methods to compare and process the different kinds of information and find the
most significant issues from all the available data (Shenk 2003).
Another important issue in using data and creating new information
services is data availability. The technology of information systems should
provide flexible ways to access data. Moreover, data should be available for
free; at the very least, all data whose collection has been funded by public taxes
should be freely available. For example, providing public access to environmental
information is a relatively new approach to environmental management that can
improve environmental decision making and pollution control (Fugui, Bing &
Bing 2008).
Quality of life, well-being, and a healthy living environment are among the
fields where new information services are needed. Monitoring different kinds of
operational environments is another emergent field where information services
that contain ambient intelligence play a major role (Crowley et al. 2006).
Moreover, there are similar needs in the industrial sector where one of the main
goals is not only to get maximum income but also to create more eco-efficient
processes (Salmi 2007). The combination of data coming from different sources
like public registers, companies’ operational information systems, online sensors
and process monitoring systems provides a fruitful basis for creating new
valuable information for citizens, companies, decision makers or other end users.
Fig. 1. Intelligent information systems require three main elements: 1) a clear
definition of the end user’s problem, 2) data describing the phenomena behind the
problem and 3) the existing information management system.
Integrating different kinds of data sources provides a fruitful starting
point for finding out more detailed information that describes the behaviour of the
phenomena. Furthermore, it provides a good basis for creating intelligent
information services, which deliver advanced solutions for the information needs
of the end user. The main aim of these services is to provide end users with easy
access to the desired information without the burden of mastering the technology
or information systems. In Figure 1, the main elements of successful information
services are clarified as the background of this study. These are namely: 1) a clear
definition of the end user’s problem, 2) data describing the phenomena behind the
problem and 3) the existing information management system. This thesis presents
the concept of intelligent information services, the methods behind them and
concrete environmental and industrial examples of such services.
1.2 Information systems and services
Information systems deal with the planning, development, management and use
of information technology tools to help people perform all tasks related to
information processing and management. These systems make it possible to
organize data so that it has a meaning and value to the recipient (Rainer & Turban
2008). Nowadays, such systems are widely used in everyday life, industry and
business. For example, environmental information systems are concerned with the
management of data about soil, water, air and the species in the world around us.
The collection and administration of such data is an essential component of any
efficient environmental protection strategy and of any attempt to understand the
relationship between nature and industrial activities. One of the main aims of environmental
information systems is to meet a major demand by making
environmental information available and giving citizens, companies,
decision makers and other end users access to it (Günther 1998).
It is possible that an innovative use of information systems will radically
change the way a firm or authority conducts its business and even change a
firm’s services, such as when a service is available on the Web (Detlor 2010).
Thus, information systems are a base for the creation of information services. The
idea behind providing such services is that stakeholder groups have access only to
the information but do not own the whole information system. Thus they do not
have the burden of setting up or maintaining measurement systems, data
collection, or a technological platform. In other words, an information service is the
offering of a capability for generating, acquiring, storing, transforming,
processing, retrieving, utilizing, or making available information via
telecommunications, but it does not include any capability for the management,
control, or operation of a telecommunications system or the management of a
telecommunications service (Jadad 1999).
In this context, the Internet has created a new means not only for
communication but also for the access, sharing, and exchange of information
among people and machines (Jadad 1999). The Internet is a channel for providing
the information services for many kinds of applications that also enable citizens,
companies, decision makers and other end users to have an input in the decision-
making process and receive meaningful information.
1.3 Computational intelligence
Finding a needle in a haystack is really difficult, and sometimes you do not even
have a clue as to what you are looking for: a needle or something else? This is a
familiar situation when dealing with data sets. Computational methods are used to
process and analyse data, and thus they provide some help in finding the needle in the
haystack. Intelligence in information services comes from the use of
computationally intelligent methods in data processing, modelling, and
visualization. The methods used often include neural networks, fuzzy systems and
evolutionary programming (Lu et al. 2007). Furthermore, according to one
definition, a computationally intelligent method has the following
characteristics (Pal & Pal 2002):
1. considerable potential in solving real-world problems,
2. ability to learn from experience,
3. capability of self-organization and
4. the ability to adapt in response to dynamically changing conditions and
constraints.
For example, computationally intelligent methods have been applied to many e-
Services, giving benefits to online customers’ decision making, personalized
services, online searching, and data retrieval together with various web-based
support systems (Lu et al. 2007).
1.4 Data mining in environmental applications
Data mining is a part of the knowledge discovery process, which utilizes various
computational methods capable of handling a lot of data. Environmental systems
contain many interrelated components and processes, which may be biological,
physical, geological, climatic, chemical, or social. The problems related to
environmental systems exhibit a complexity that originates, for example, from
their multidisciplinarity, nonlinearity, high dimensionality, heterogeneity of data,
uncertainty, and imprecise information or cyclic behaviour. For these reasons and
due to the changing behaviour of natural systems, there is a need for data analysis,
modelling and development of decision support systems in order to understand
environmental phenomena and improve the management of the associated
complex problems (Jakeman et al. 2008). These features also occur in many
industrial systems and systems where people are involved.
Knowledge discovery from data (KDD) is a non-trivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data
(Fayyad et al. 1996). Data mining is often described as a part of KDD (see Figure
2), referring to the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful for the data owner. Data mining can be regarded as
consisting of exploratory data analysis, descriptive modelling, predictive
modelling (regression and classification), discovering patterns and rules or
retrieval by content (Hand et al. 2001).
Fig. 2. Overview of the knowledge discovery in databases (KDD) process (modified
from Fayyad et al. 1996, Pyle 1999).
In general, the knowledge discovery process is divided into six sequential,
iterative steps comprising 1) problem definition or understanding of goals, 2) data
acquisition, 3) data pre-processing and transformations, 4) data modelling, 5)
interpretation and evaluation of results and 6) knowledge deployment. An
overview of this process is presented in Figure 2. Each step is vital and should be
carried out carefully. The problem defines what kind of data has to be used and
also gives a hint of the kind of solution that is wanted. The modelling of data
makes it possible to apply the results to new data. On the other hand, data
modelling without good understanding and careful preparation of the data may
produce incorrect results. Ultimately, the whole data mining process is meaningless
if the new knowledge is not used in decision making or as a solution to the
end user’s problem (Pyle 1999, Fayyad et al. 1996).
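The six steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the function names, the toy temperature and visitor records, and the choice of a least-squares line as the model are hypothetical and are not taken from the case studies of this thesis.

```python
from math import sqrt
from statistics import mean


def define_problem():
    # Step 1: state the goal; it determines which data are needed.
    return {"input": "temperature", "target": "visitors"}


def acquire_data():
    # Step 2: hypothetical raw records; None marks a failed measurement.
    return [
        {"temperature": -5.0, "visitors": 120.0},
        {"temperature": 0.0, "visitors": 150.0},
        {"temperature": None, "visitors": 160.0},
        {"temperature": 5.0, "visitors": 180.0},
        {"temperature": 10.0, "visitors": 210.0},
    ]


def preprocess(records, problem):
    # Step 3: keep only records in which both variables were measured.
    keys = (problem["input"], problem["target"])
    return [(r[keys[0]], r[keys[1]]) for r in records
            if all(r[k] is not None for k in keys)]


def fit_model(pairs):
    # Step 4: fit a least-squares line y = a + b * x.
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    b = sxy / sxx
    return my - b * mx, b


def evaluate(model, pairs):
    # Step 5: root mean square error on the given pairs. A real study
    # would evaluate on data that were not used for fitting.
    a, b = model
    return sqrt(mean((y - (a + b * x)) ** 2 for x, y in pairs))


def deploy(model, x_new):
    # Step 6: apply the model to new data for the end user.
    a, b = model
    return a + b * x_new


problem = define_problem()
pairs = preprocess(acquire_data(), problem)
model = fit_model(pairs)
rmse = evaluate(model, pairs)
forecast = deploy(model, 2.0)
```

In practice the steps are iterated: a poor result in step 5 typically sends the analyst back to data acquisition or pre-processing rather than straight to deployment.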
Due to the complex behaviour of natural or environmentally related man-
made processes, suitable measuring, analysing and modelling methods face many
challenges (Jakeman et al. 2008). In addition to direct observation data, many other
indirect data sources can be adopted in solving environmental problems. For
example, public registers, census data, corporate operative data, and spatial
information are often available. This leads to a situation where there is a lot of
multidimensional data that has to be integrated to describe certain phenomena.
Observation data often comprises a time series measured over several years where
the primary influence on the modelling originates from the cycles of nature
and from human activity. This causes seasonality and temporal (hourly, daily or
weekly) variations, which are typical characteristics of environmental problems
that have to be taken into account in data mining tasks (Kolehmainen et al. 2000).
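One common way to make such cycles visible to a model is to encode each timestamp as sine and cosine pairs, so that 23:00 and 00:00 end up close to each other. The sketch below is a generic illustration under that assumption; the function names and the three chosen periods are hypothetical, and it is not claimed to be the encoding used in the cited work.

```python
import math
from datetime import datetime


def cyclic_time_features(ts):
    """Map a timestamp to sine/cosine pairs for daily, weekly and yearly cycles."""
    def pair(position, period):
        angle = 2.0 * math.pi * position / period
        return math.sin(angle), math.cos(angle)

    hour_sin, hour_cos = pair(ts.hour + ts.minute / 60.0, 24.0)
    week_sin, week_cos = pair(ts.weekday(), 7.0)
    # tm_yday starts from 1; 365.25 approximates the mean length of a year.
    year_sin, year_cos = pair(ts.timetuple().tm_yday - 1, 365.25)
    return {
        "hour_sin": hour_sin, "hour_cos": hour_cos,
        "week_sin": week_sin, "week_cos": week_cos,
        "year_sin": year_sin, "year_cos": year_cos,
    }


def feature_distance(a, b):
    """Euclidean distance between two feature dictionaries with the same keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


midnight = cyclic_time_features(datetime(2011, 6, 1, 0, 0))
late_evening = cyclic_time_features(datetime(2011, 6, 1, 23, 0))
noon = cyclic_time_features(datetime(2011, 6, 1, 12, 0))
```

With this encoding, the distance between `midnight` and `late_evening` is much smaller than the distance between `midnight` and `noon`, which a raw hour value from 0 to 23 would not reflect.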
Furthermore, data quality is one of the success factors in data mining
approaches. Most computational methods require complete data sets and missing
data caused by measurement failures or human errors is problematic. Additionally,
systematic errors originating from erroneous calibration, noise, uncertainty, and
outliers are also challenging (Kanevski et al. 2004). A multitude of variables often
leads to a situation where the same information is used several times and there is
a need for feature selection or pre-processing. Lagged and nonlinear
interactions between variables exist in natural processes, setting requirements for
the pre-processing and modelling methods used (Kolehmainen 2004). Moreover, the
dynamic behaviour of continuous processes also requires the handling of time-lagged
variables in order to create successful online monitoring systems (Komulainen,
Sourander & Jämsä-Jounela 2004).
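Two of the pre-processing needs mentioned above, completing a series that has missing values and building time-lagged input variables, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, linear interpolation is only one of several possible gap-filling strategies, and the choice of lags is an assumption.

```python
def fill_gaps_linear(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Gaps at the start or end of the series copy the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series contains no measured values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def make_lagged(series, lags):
    """Build model inputs from past values: row t holds series[t - lag] for each lag."""
    start = max(lags)
    inputs = [[series[t - lag] for lag in lags] for t in range(start, len(series))]
    targets = list(series[start:])
    return inputs, targets


raw = [1.0, None, 3.0, None, None, 6.0]
complete = fill_gaps_linear(raw)
inputs, targets = make_lagged(complete, lags=[1, 2])
```

Each row of `inputs` pairs the two previous observations with the current value in `targets`, which is the form a predictive model such as an MLP expects when lagged variables are used.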
1.5 Aims of the thesis
The main aim of this thesis is to study and define the concept of intelligent
information services in the environmental field and the technology behind them.
The technology developed in this thesis relies heavily on KDD, described above.
A set of methods suitable for each phase of the data processing chain described in
Figure 2 is presented in the next section.
Moreover, one aim of this thesis is to encourage researchers, people and
organizations to use their data resources more innovatively. By combining
different data sources, new benefits can be achieved. Besides, the free availability
of public data resources combined with corporate data would enable the creation
of many new information services. Data policy should promote the development
of new services instead of preventing natural progress in this sector. At the very
least, public data whose collection or measurement has been funded by public
taxes should be freely available. Therefore, this thesis encourages the innovative
use of the different kinds of available data sets.
Furthermore, the logic of intelligent services is clarified by presenting five
examples (Articles 1–5) concerning environmental and industrial applications.
Although the application areas of these examples differ, they are still good
examples of the two main aims of this thesis. Firstly, in all cases (Articles 1–5),
multiple data sets, which were originally produced for other purposes, were
integrated and refined to give more valuable information. Secondly, the data
mining methods applied are not restricted to handling problems in only a specific
field. On the contrary, the methods are generic and used widely in cases where a
vast amount of data is available.
1.6 The author’s contribution
This thesis consists of the scientific publications that are the result of several
research projects carried out at the Research Group of Environmental Informatics,
University of Eastern Finland (formerly the University of Kuopio) during 2005–
2010. The author was the coordinator and carried out most of the research work
and writing concerning Articles 1–5. The author’s contribution in each publication
is explained below.
In Article 1, the author created the data processing core and carried out
computations and writing while Harri Niska, Teri Hiltunen and Mikko
Kolehmainen gave assistance and guidance with numerical calculations using
MLP and validated the models. Jarkko Tiirikainen constructed the database and a
server-based web service.
Article 2 on the reduction in electricity use was prepared mainly by the author.
Mikko Kolehmainen assisted in the design of data mining tasks and validation
procedures.
The creation of data-based electricity load curves was carried out jointly by
the author and Dimitris Voukantsis (see Article 3). Mikko Kolehmainen provided
important information about utilizing load curves in the energy company’s
network information system. Harri Niska calculated the regression models for the
customer-specific temperature compensation indices. Kostas Karatzas took part in
the writing process and in the validation of the modelling results.
Article 4 focused on the clustering of time series data and was prepared
mainly by the author. Mikko Kolehmainen gave assistance and guidance with the
numerical calculations and Juhani Ruuskanen took part in the writing process.
In Article 5, the author’s role focused on applying MLP to solve the
correspondence between process variables. Risto Soukka, Sami Kokki and Yrjö
Hiltunen provided some important information about the pulp making process and
process life-cycle profit models.
1.7 The structure of the thesis
Chapter 1 of this thesis presents an introduction to the study, the aim of the
studies and the structure of the thesis. Chapter 2 clarifies the idea behind data
integration and presents the data and the information sources used in this study.
Chapter 3 introduces the main computational methods used in this thesis. Next,
the main results achieved are presented in Chapter 4. Discussions and ideas for
commercializing intelligent information services are presented in Chapter 5.
Finally, conclusions and suggestions for future work are presented in Chapter 6.
2 Data integration and information sources
2.1 Data integration
Data integration means combining data residing at different sources, and
providing the user with a unified view of these data. The problem of designing
data integration systems is of utmost importance in real world applications
(Lenzerini 2002). Data integration is essential in large enterprises or organizations
owning a multitude of data sources, and for progress in large-scale scientific projects
where multiple researchers are producing data sets independently. Moreover, data
integration is needed for better cooperation between government agencies, each
with their own data sources. The World Wide Web is also a good example where
structured data sources provide the possibility for efficient searching (Halevy,
Rajaraman & Ordille 2006).
A data warehouse is a subject-oriented system environment, which is
constructed to fulfil the needs of data integration. In general, a data warehouse is
an informational environment that provides an integrated and total view of the
enterprise, for example, by making current and historical information easily
available for strategic decision making. The main idea is to make use of the large
volumes of existing data and to transform it into forms suitable for providing
strategic information (Ponniah 2010).
Data integration includes data pre-processing operations, which should
improve the usefulness of the data. Data warehousing is an environment, which is
a blend of many technologies, providing facilities for data pre-processing, data
analysis and decision support. Furthermore, it should be flexible, interactive and
completely user-driven. The environment should provide the capability to
discover answers to complex questions. It is important that business or application
requirements, and not technology, drive the creation of a data warehouse. The
main aims of data integration and the use of data warehouses are (Ponniah 2010):
– to gather all the data from operational systems
– to include relevant data from outside
– to integrate all the data from various sources
– to remove inconsistencies and transform the data
– to store the data in formats suitable for easy access for decision making and
data analysis.
Typically, data warehousing includes a blend of technologies which all work
together in order to provide strategic information for every need separately. Many
companies have their own approaches, focusing on extracting the data, cleaning it
up, and making it available for analysis. Technology is used for actions like data
acquisition, data management, data modelling and metadata management that are
typical parts of a data warehouse (Ponniah 2010, Mattison 1996, Anahory &
Murray 1997). Furthermore, there are many architectures for data warehouses.
Figure 3 presents the centralized architecture of a data warehouse that was also
used in the information services prototypes (for example Article 1) of this thesis.
In this architecture, data is first gathered in organization-specific online
transaction processing (OLTP) systems which handle transaction-oriented
applications like data entry and retrieval transaction processing. After that, the data is
pre-processed and stored in the data warehouse for further refinement. Interfaces
between different parts of the system should be created using the public standards
available. This would improve the efficient use of data.
Fig. 3. Centralized architecture of a data warehouse.
In this thesis, data integration had a major role because multiple data sources were
used in each case. The research work described in Articles 1–5 always started
with the tasks of data integration. After data acquisition, data was typically stored
in a relational database or an integrated file for modelling was constructed. In a
continuous application, data was pre-processed and stored in a MySQL relational
database, which was flexible and suitable for recurring queries and data
processing.
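The integration steps listed earlier in this section (gather, integrate, remove inconsistencies, store for easy access) can be sketched with a small relational example. The sketch uses Python's built-in `sqlite3` module as a stand-in for the MySQL database mentioned above; the table layout, the two source formats and the variable names are hypothetical and serve only to illustrate the idea.

```python
import sqlite3
from datetime import datetime


def normalise_ts(text: str, fmt: str) -> str:
    """Parse a source-specific time stamp and return it in ISO 8601 form."""
    return datetime.strptime(text, fmt).isoformat(timespec="minutes")


def load_sources(conn, weather_rows, water_rows):
    """Clean two differently formatted sources and store them in one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation ("
        " ts TEXT NOT NULL, variable TEXT NOT NULL, value REAL NOT NULL,"
        " PRIMARY KEY (ts, variable))"
    )
    cleaned = []
    for ts, temp_c in weather_rows:      # assumed format: '2009-03-01 14:00', deg C
        if temp_c is None:
            continue                     # drop missing measurements
        cleaned.append((normalise_ts(ts, "%Y-%m-%d %H:%M"), "temperature_c", float(temp_c)))
    for ts, litres in water_rows:        # assumed format: '01.03.2009 14:00', litres
        if litres is None or litres < 0:
            continue                     # drop missing or impossible values
        cleaned.append((normalise_ts(ts, "%d.%m.%Y %H:%M"), "water_m3", litres / 1000.0))
    with conn:
        # A later row for the same time stamp and variable replaces the earlier one.
        conn.executemany("INSERT OR REPLACE INTO observation VALUES (?, ?, ?)", cleaned)


def integrated_view(conn):
    """Return (ts, temperature_c, water_m3) rows for time stamps present in both sources."""
    return conn.execute(
        "SELECT t.ts, t.value, w.value FROM observation t"
        " JOIN observation w ON w.ts = t.ts"
        " WHERE t.variable = 'temperature_c' AND w.variable = 'water_m3'"
        " ORDER BY t.ts"
    ).fetchall()
```

Storing all sources in one long table keyed by time stamp and variable keeps recurring queries simple, at the cost of a join whenever several variables are needed side by side.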
2.2 Information sources
The following sections present the main information sources used in the
examples of this thesis and also some other useful data providers for
environmental applications.
2.2.1 Mobile telecommunications data
The number of mobile network subscriptions in Finland is 6.9 million, which
equals 130 subscriptions per 100 citizens (Statistics Finland, 2009). The mobile
telecommunications network is constructed so that a mobile phone is always
connected to some cell in an area covered by each base transceiver station (Drane
et al. 1998). Furthermore, most people carry mobile phones with them
when moving across the country. These facts give an opportunity to use the
mobile operators’ information systems in order to find out how many mobile
phones there are in a certain cell (area) in the desired time period, producing a
dynamic estimate of the number of people in a region.
Telecommunications data is normally used for customer billing and network
management purposes. Moreover, GSM positioning is a method used to
determine the geographical location of an individual mobile phone user (Drane et
al. 1998). Instead of locating individual users, anonymous cell-based information
about mobile phones is suitable for estimating the number of people in a certain
region at a given time.
Recently, mobile telecommunications data has been applied in this manner to
predict regional tourist numbers, to understand human mobility (González et al.
2008) and to support traffic management (Messelodi et al. 2009). In Article 1, mobile
telecommunications data were collected from seven base transceiver stations
(BTS), which were a part of the mobile (GSM) network of Finnet Ltd., which has
a market share of approximately 20% of the Finnish mobile phone sector. The
area covered by each base transceiver station is called a cell (Drane et al. 1998)
and these seven cells cover most of the area of a tourist centre.
The collected raw data contains date, time, area code, cell identification code,
and type for every telecommunications event, i.e. all outgoing and incoming calls,
general packet radio service data transfers (GPRS) and short message services
(SMS). This raw data is transformed into four continuous time series by
counting the events that occur in each hour. In Article 1 the time resolution of
each time series is therefore one hour.
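The transformation from raw events to hourly time series can be sketched as follows. The sketch assumes that the four series correspond to the four event types named above (outgoing calls, incoming calls, GPRS transfers and SMS) and that events are summed over all cells; the type codes in `EVENT_TYPES`, the record layout and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical type codes; the real codes in the operator's data are not given in the text.
EVENT_TYPES = ("call_out", "call_in", "gprs", "sms")


def hourly_counts(events, start: datetime, end: datetime) -> dict:
    """Count events per hour and type between start (inclusive) and end (exclusive).

    Each event is a tuple (date 'YYYY-MM-DD', time 'HH:MM:SS', area_code,
    cell_id, event_type). Hours without events get an explicit zero so that
    every series is continuous.
    """
    counts = Counter()
    for date, time, _area_code, _cell_id, event_type in events:
        ts = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        hour = ts.replace(minute=0, second=0)  # truncate to the start of the hour
        counts[(hour, event_type)] += 1

    series = {etype: [] for etype in EVENT_TYPES}
    hour = start
    while hour < end:
        for etype in EVENT_TYPES:
            series[etype].append(counts[(hour, etype)])
        hour += timedelta(hours=1)
    return series
```

Filling empty hours with zeros matters here, because a missing hour and an hour without any events would otherwise look the same to the later modelling steps.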
2.2.2 Population information system
A census is the procedure of systematically acquiring and recording information
about the members of a given population, agriculture, business and industry. Most
countries have specific organizations to carry out procedures for collecting,
storing, maintaining and sharing census data. Such data is commonly
used for research, business marketing, and planning as well as a basis for sampling
surveys.
The Finnish Population Information System is a national register containing
basic information about Finnish citizens and foreign citizens residing permanently
in Finland. Additionally, the system contains information about buildings,
construction projects, residences, and real estate. The National Population
Register Centre maintains the system and it is used in Finnish information
services and management, including public administration, elections, taxation,
judicial administration, research and statistics. Moreover, private organizations
can also have access to information which has been registered since the 1530s
(Population Register Centre 2009).
The population data is personal, including name, personal identity code,
address, citizenship and native language, family relations and dates of birth and
death. Building data includes the building code, location, owner, area, facilities
such as heating type and network connections, intended use and year of
construction. Real estate data includes the real estate unit identifier, owner’s name
and address, and buildings located on the property (Population Register Centre
2009).
The same kind of census information is available in many countries. The U.S.
Census Bureau provides national census data concerning, for example, people,
households, business, industry and also geography and maps (U.S. Census Bureau
2009). The U.K. Office for National Statistics maintains census data for England
and Wales. The collected census information allows local government, health
authorities and many other organizations to target their resources more effectively
and to plan housing, education, health, and transport services (U.K. Office for
National Statistics 2009).
In Article 2 the census data, especially building characteristics, provided by
the Finnish Population Information System are used to cluster buildings in order
to group electricity users. The aim is to provide more personal information to
electricity users, allowing comparisons between one’s own and other similar users’
electricity use.
2.2.3 Corporate operative data
In many companies, information systems are used to manage, control and report
daily operations like sales, logistics, process control, production and
environmental monitoring. Moreover, basic activities like customer relationships
and financial management or access control are handled using information
systems. These activities generate large quantities of data, giving solutions for
each individual need of the company. However, integrating this operative
information with other data sources could deliver new ideas for developing
company operations. In some cases, even financial benefits would be possible by
selling the data to other service providers. For example, mobile
telecommunications data has been applied for predicting tourist numbers (Article
1), understanding human mobility (González et al. 2008), and traffic management
(Messelodi et al. 2009). These applications are possible if the telecommunications
operator provides the needed data to a third party service provider. Furthermore,
other examples are presented in Article 1 where daily fresh water consumption is
used as a measurement describing the number of people in a tourist centre and in
Article 2 where electricity use data is provided in order to change customers’
electricity use behaviour.
Industrial plants and factories also have large databases full of data describing
the behaviour of their own processes. Besides process monitoring (Uraikul et al.
2007), continuous observations may target the plant surroundings, producing
information about the state of the environment (Ackerman & Sundquist 2008).
Emissions to air, water, and soil are typical examples of this kind of observation,
which originates from private use but is also of great interest to other actors
such as the authorities. In Article 5, observation data from a pulp bleaching
process is used to determine the missing or unknown correlations between process
variables in order to create a more complete process life-cycle profit model. The
life-cycle profit model (LCP) is a tool that can be used to analyse the main
development tasks in order to obtain a more efficient process. In other words, the
main purpose of the plant’s LCP model is to recognize the development
possibilities, which can be achieved by process changes allocated to different
stages of the process.
In Article 5, the raw process data originates from the pulp mill process
control and measuring system. The data management system of the mill contains
detailed information about emissions, end products, and raw material demand,
which would be valuable for authorities, suppliers, product developers and other
interest groups.
2.2.4 Weather data
The weather plays a significant role in environmental phenomena and affects
natural processes in many ways. Observed weather data and forecasts are
examples of information that is needed in most applications. A national weather
observation network has been established in most countries. Moreover, there are
many privately owned weather stations that are not connected to each other, but in
many cases they are available for the use of local services.
Weather forecasts and climate statistics based on weather observations, and
the observations themselves are all essential in the production of a wide selection
of services and in atmospheric and environmental research activities. The Finnish
Meteorological Institute provides frequent observations for over 500 different
locations around the country. The users of this information are private citizens, or
those who operate in the fields of aviation, travel, agriculture, forestry, energy
production, or maintenance services (FMI 2009).
The Finnish Meteorological Institute (FMI) provides weather data on
request. Observed data from surface weather stations is available dating back to
1959. Older observation sheets are stored in the central archives of the FMI. The
most common observation parameters are temperature, humidity, visibility, wind
speed, air pressure, precipitation, past and present weather, clouds, state of the
ground, and snow depth. Furthermore, FMI provides information about sunshine,
ultraviolet radiation, lightning, electromagnetic radiation and it also has several
weather cameras spread all over the country (FMI 2009).
2.2.5 Other environmental data sources
Public and private organizations maintain many other data sources which are
useful in environmental applications and information services. These data sources
have been used in the author’s previous research (Nuortio et al. 2004, Räsänen
2004) but not in Articles 1–5. In spite of this, these useful data sources are worth
introducing briefly as potential data sources for new environmental information
services.
The Grid Database contains coordinate-based statistical data calculated by
map grid. The available grid sizes are 250 m × 250 m and 1 km × 1 km, covering
the whole of Finland. The database contains data by selected key variables
describing the population structure, education, main type of activity and income,
households' stage in life and income, as well as buildings and workplaces. The
Grid Database is widely used for example in planning, research (Tainio et al.
2009) and market analysis (Koistinen & Väliniemi 2007). The Grid Database is
updated annually with the latest statistical data. The numerical database is
provided in dBase format and spatial information either in MapInfo (*.TAB) or
ArcView (*.SHP Shapefile) format. The coordinate system used is the uniform
coordinate system (KKJ3) (Statistics Finland 2008).
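Linking point data, such as building coordinates, to grid statistics requires a mapping from a coordinate pair to a grid cell. The sketch below assumes metric easting and northing values and identifies a cell by its south-west corner. The actual cell identifier scheme of the Grid Database is not described here, so the function `grid_cell` and its return value are purely illustrative.

```python
import math
from collections import Counter


def grid_cell(easting_m: float, northing_m: float, cell_size_m: int = 250):
    """Return the south-west corner (in metres) of the grid cell containing a point."""
    col = math.floor(easting_m / cell_size_m)
    row = math.floor(northing_m / cell_size_m)
    return col * cell_size_m, row * cell_size_m


def count_points_per_cell(points, cell_size_m: int = 250) -> Counter:
    """Aggregate (easting, northing) points into counts per grid cell."""
    return Counter(grid_cell(e, n, cell_size_m) for e, n in points)
```

The same function covers both grid sizes mentioned above by passing `cell_size_m=250` or `cell_size_m=1000`.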
The environmental emission database (VAHTI) maintained by the Finnish
environmental administration functions as a tool for the 13 regional environment
centres in their work on processing and monitoring permits. The system contains
information on the environmental permits of clients and on the wastes generated,
discharges into water and emissions to air. In the future, the system will also
include information on noise emissions. In 2003 the database contained
information concerning 31,000 clients. The system is used by the environment
centres and by other interested parties, with 800 active users. The VAHTI system
is a tool for environmental administration but the data is also available for other
interested parties who need information (Finnish Environment Institute 2010a).
The Environmental Information System (HERTTA) consists of subsystems
containing information on, for example, monitoring of water quantity and quality,
environmental protection, biological diversity, land use and environmental
loading. The data system is also maintained by the Finnish environmental
administration and has the following subsystems (Finnish Environment Institute
2010b):
– Air Emission Data system
– Data bank of Environmental Properties of Chemicals (EnviChem)
– Database of Threatened Species
– Forms for monitoring local detailed plans
– Groundwater Database
– Hydrology and Water Resources Management Data system
– Information System for Monitoring Land Use Planning
– Information System for Monitoring the Living Environment
– Lake register
– Phytoplankton Database
– State of Finland's Surface Waters
The Geological Survey of Finland generates information on the Earth’s crust and
its natural resources and hosts information systems providing data to the public.
These systems are utilized mainly by government agencies and exploration
companies in geological mapping, environmental studies and urban planning. Part
of the material is chargeable by law and part is available free of charge. Data support
activities concern, for example, geophysics, geochemistry, bedrock, minerals and
surface geology (Geological Survey of Finland 2010).
Environmental problems are typically spatially related and spatial data plays
a significant role in planning and decision making. The National Land Survey
(NLS) of Finland produces and provides information on and services in real estate,
topography and the environment for the needs of citizens, other customers and the
community at large. The NLS is responsible for Finland’s cadastral system and
general mapping assignments. It also promotes the shared use of geographic
information.
3 Data pre-processing and computational methods
3.1 Pre-processing of data
The connection between the expected phenomenon and measurements is not
always clear. The measurements have to be converted and organized into data sets
which describe the real-world problem in question. This means that
some adjustments, alterations and reformatting have to be applied to the data sets
to prepare them for further operations (Pyle 1999). Data pre-processing describes
any type of processing performed on raw data to prepare it for another processing
procedure, such as modelling or statistical analysis. In other words, this
preliminary data mining practice, pre-processing, transforms the data into a
format that will be more easily and effectively processed for the purpose of the
user. Typically data pre-processing includes procedures such as (Han & Kamber
These models are implemented in a variety of ways or several models are
combined to build the final Internet business strategy (Rappa 2010).
Business models like information brokerage, infomediary and utility are
suitable for the intelligent information services presented in this thesis. Brokers
are market-makers, bringing buyers and sellers together and facilitating
transactions. Brokers provide information services for business-to-business (B2B),
business-to-consumer (B2C), or consumer-to-consumer (C2C) markets. An
information broker collects data and refines it into valuable information. Usually
a broker charges a fee or commission through subscription or on a pay-per-use
basis. Consultancy services, such as special data analyses, are sometimes
combined with information services. In this case the revenues come from
consultancy fees (Rappa 2010).
[Figure: Problem identification; Data acquisition and storing; Data processing; Modeling and analyses; Knowledge deployment and information sharing. Label: Core expertise of service provider.]
The infomediary business model deals with data about consumers and their
consumption habits, which is valuable information, especially when that
information is carefully analysed and used, for example, for targeted marketing
campaigns. Furthermore, this information is useful in the utility sector where
added value is achieved by cost savings from improved planning of operative
systems and actions. Independently collected data about producers and their
products are useful to consumers when considering a purchase. Some firms
function as infomediaries (information intermediaries) assisting buyers and/or
sellers (Rappa 2010).
6 Conclusions
The innovative use of stored data is one of the main elements in the creation of
future information services. Thus, there should be more emphasis on the
development of effective data processing methods. Furthermore, it was noticed
that final end users, such as citizens or decision makers, should be taken into
account in the data refining process at a very early stage. In this way the approach
is truly customer-oriented and results in the fulfilment of the concrete needs of the
end users. The collaboration of actors in the environmental field and the new
technology gives a solid basis for new environmental information services. There
is an opportunity for at least three types of services, namely 1) continuous
monitoring services, 2) data analysing services and 3) the creation of intelligent
software components to perform the core parts of data processing platforms.
Research work during this thesis has raised many new questions and ideas for
future work, and the following list contains some of the main recommendations:
– Data is the origin of information services. More effort should be directed
towards the free availability of public data and overall accessibility.
– Data providers should enter into co-operation more in order to see new
possibilities to use costly collected data.
– Open and standardized software approaches are needed for the technological
solutions used as service platforms.
– New approaches and additional value can be achieved by combining different
computational and data processing methods.
– The end user is interested only in finding a solution to his problem. Services
should provide only the required information to the end user and other
functionalities or technology should be hidden. The service operator is
responsible for taking care of the technological platform and data
maintenance.
Fortunately, the recommendations above are already seeing some concrete
improvements, at least in the Finnish environmental sector. The collaboration
between public organizations and companies has started. One example is the
creation of the strategic centre for science, technology and innovation of the
Finnish energy and environment cluster called CLEEN Ltd. This consortium has
started, for example, a research program focusing mainly on measuring,
monitoring and environmental assessment. The aim of the MMEA research
consortium is to develop new tools, standards and methods for environmental
measurement, monitoring and decision support for the national and international
markets. In general, the overall purpose is to promote new environmental data-
based applications and services to improve the energy and material efficiency of
infrastructures and industrial processes (Cleen Ltd. 2011). Furthermore,
international progress has also been made in this field. The European
Environment Agency has created a shared environmental information system
(SEIS), which serves policy-makers and provides reliable and increasingly real-
time information for determining the most appropriate course of action (EEA
2011).
When looking at overall progress from the technology point of view, one new
example is SensorML, which has been approved as an Open Geospatial Consortium
standard, providing models and XML encoding for the description of sensors and
measurement processes. The standard helps to describe a wide range of dynamic
and stationary platforms and both in-situ and remote sensors. This standard also
enables the development of plug-n-play sensors, simulations, and processes that
are seamlessly added to decision support systems (Open Geospatial Consortium
2011).
The examples above illustrate the progress being made in the field of
environmental measurement and monitoring, but these are just the beginning. The
service business in this sector is still quite small and plenty of possibilities are
waiting for competent and progressive companies or organizations. More benefits
can be discovered by courageous collaboration between organizations, by
exploiting computational methods for data processing and by the use of the latest
technology.
References
Ackerman KV & Sundquist ET (2008) Comparison of Two U.S. Power-Plant Carbon Dioxide Emissions Data Sets. Environmental Science & Technology 42: 5688–5693.
Anahory S & Murray D (1997) Data Warehousing in the Real World: A Practical Guide for Building Decision Support Systems. Boston, Addison-Wesley.
Baek J, Geehyuk L, Wonbae P & Byoung-Ju Y (2004) Accelerometer Signal Processing for User Activity Detection. KES 2004 LNAI 3215: 610–617.
Bawden D & Robinson L (2009) The dark side of information: overload, anxiety and other paradoxes and pathologies. Journal of Information Science 35: 180–191.
Box GEP, Jenkins GM & Reinsel GM (1970) Time Series Analysis – Forecasting and Control. John Wiley & Sons.
Bracke MBM, Edwards SA, Engel B, Buist WG & Algers B (2008) Expert opinions as ‘validation’ of risk assessment applied to calf welfare. Acta Veterinaria Scandinavica 50: 1–12.
Brooks RA (1991) Intelligence without representation. Artificial Intelligence 47: 139–159.
Castells M & Himanen P (2002) The Information Society and the Welfare State: The Finnish Model. Oxford University Press.
Cawley GC & Talbot NLC (2003) Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition 36: 2585–2592.
Cios K, Pedrycz W, Swiniarski R & Kurgan L (2007) Data Mining – A Knowledge Discovery Approach. Springer.
Chesbrough H (2007) Business model innovation: it’s not just about technology anymore. Strategy & Leadership 35: 12–17.
Cleen Ltd (2011) Website. URI: http://www.cleen.fi/home/. Cited 2011/5/22.
Commission of the European Communities (2009) Re-use of Public Sector Information – Review of Directive 2003/98/EC.
Crowley JL, Reignier P & Coutaz J (2006) Context Aware Services. In: Aarts EHL & Encarnacao JL (eds) True Visions: The Emergence of Ambient Intelligence. Berlin Heidelberg, Springer-Verlag: 233–246.
Crowther PS & Robert JC (2005) A Method for Optimal Division of Data Sets for Use in Neural Networks. Proceedings of the Knowledge-based intelligent information and engineering systems: 9th international conference, KES 2005, LNAI 3684, Melbourne, Australia: 1–67.
Davies D & Bouldin D (1979) A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 2: 224–227.
Detlor B (2010) Information management. International Journal of Information Management 30: 103–108.
Drane C, Macnaughtan M & Scott C (1998) Positioning GSM Telephones. IEEE Communications Magazine 36: 46–54.
EEA (2011) Website of Shared Environmental Information System (SEIS). URI: http://www.eea.europa.eu/about-us/what/shared-environmental-information-system. Cited 2011/5/25.
Ellis CN, Drake LA, Prendergast MM, Abramovits W, Boguniewicz M, Daniel CR, Lebwohl M, Stevens SR, Whitaker-Worth DL & Tong KB (2003) Validation of Expert Opinion in Identifying Comorbidities Associated with Atopic Dermatitis/Eczema. PharmacoEconomics 21: 875–883.
Fayyad U, Piatetsky-Shapiro G & Smyth P (1996) From data mining to knowledge discovery in databases. AI Magazine 17: 37–54.
Finnish Environment Institute (2010a) Website for The Compliance Monitoring Data system – VAHTI. URI: http://www.ymparisto.fi/default.asp?contentid=142451&lan=fi&clan=en. Cited 2010/4/27.
Finnish Environment Institute (2010b) Website for data systems. URI: http://www.ymparisto.fi/default.asp?contentid=347060&lan=EN. Cited 2010/4/27.
FMI – Finnish Meteorological Institute (2009) Website for weather stations in Finland. URI: http://www.fmi.fi/weather/stations.html. Cited 2009/11/3.
Fränti P & Virmajoki O (2006) Iterative shrinking method for clustering problems. Pattern Recognition 39: 761–765.
Fugui L., Bing Xiong & Bing Xu (2008) Improving public access to environmental information in China. Journal of Environmental Management 88: 1649–1656.
Garthwaite PH & Dickey JM (1996) Quantifying and using expert opinion for variable-selection problems in regression. Chemometrics and Intelligent Laboratory Systems 35: 1–26.
Geological Survey of Finland (2010) Website of Geological Information and Publications. URI: http://en.gtk.fi/Geoinfo/. Cited 2010/4/27.
George EI (2000) The Variable Selection Problem. Journal of the American Statistical Association 95: 1304–1308.
González MC, Hidalgo CA & Barabási A-L (2008) Understanding individual human mobility patterns. Nature 453: 779–782.
Guyon I & Elisseeff A (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3: 1157–1182.
Günther O (1998) Environmental Information Systems. Berlin Heidelberg, Springer-Verlag.
Halkidi M, Batistakis Y & Vazirgiannis M (2001) On Clustering Validation Techniques. Journal of Intelligent Information Systems 17: 107–145.
Han J & Kamber M (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Hand D, Mannila H & Smyth P (2001) Principles of Data Mining. Cambridge MA, MIT Press.
Halevy A, Rajaraman A & Ordille J (2006) Data integration: the teenage years. Proceedings of the 32nd international conference on very large data bases, VLDB’06.
Haykin S (2002) Neural Networks: A Comprehensive Foundation. 2nd ed. New Jersey, Prentice-Hall.
Hawkins DM (2004) The Problem of Overfitting. Journal of Chemical Information and Modeling 44: 1–12.
Hey T, De Roure D & Trefethen AE (2006) e-Infrastructure and e-Science. In: Aarts EHL & Encarnacao JL (eds) True Visions: The Emergence of Ambient Intelligence. Berlin Heidelberg, Springer-Verlag: 211–231.
Hyötyläinen M & Möller K (2007) Service packaging: key to successful provisioning of ICT business solutions. Journal of Services Marketing 21: 304–312.
Jadad AR (1999) Promoting partnerships: challenges for the internet age. British Medical Journal 319: 761–764.
Jain AK, Murty MN & Flynn PJ (1999) Data Clustering: A Review. ACM Computing Surveys 31: 264–323.
Jakeman AJ, Voinov A, Rizzoli AE & Chen S (2008) Environmental Modeling and Software (Developments in Integrated Environmental Assessment): State of the Art and New Perspectives. Elsevier.
Johnson CJ & Gillingham MP (2004) Mapping uncertainty: sensitivity of wildlife habitat ratings to expert opinion. Journal of Applied Ecology 41: 1032–1041.
Kafentsis K, Mentzas G, Apostolou D & Georgolios P (2004) Knowledge marketplaces: strategic issues and business models. Journal of Knowledge Management 8: 130–146.
Kalin M (2009) Java Web Services: Up and Running. O’Reilly.
Kanevski M, Parkin R, Pozdnukhov A, Timonin V, Maignan M, Demyanov V & Canu S (2004) Environmental data mining and modeling based on machine learning algorithms and geostatistics. Environmental Modelling & Software 19: 845–855.
Karol R & Nelson B (2007) New Product Development for Dummies. Wiley Publishing.
Kohonen T (1997) Self-organizing maps. 2nd ed. Berlin, Springer-Verlag.
Koistinen K & Väliniemi J (2007) Are grocery stores easily accessible? National Consumer Research Centre publications 4/2007.
Kolehmainen M (2004) Data exploration with self-organizing maps in environmental informatics and bioinformatics. Ph.D. Thesis, Kuopio University Publications C. Natural and Environmental Sciences 167.
Kolehmainen M, Martikainen H, Hiltunen T & Ruuskanen J (2000) Forecasting air quality parameters using hybrid neural network modelling. Environmental Monitoring and Assessment 65: 277–286.
Komulainen T, Sourander M & Jämsä-Jounela SL (2004) An online application of dynamic PLS to a dearomatization process. Computers and Chemical Engineering 28: 2611–2619.
Kovács F, Legány C & Babos A (2005) Cluster Validity Measurement Techniques. Proceedings of the 6th International Symposium of Hungarian Researchers on Computational Intelligence. Budapest.
Kubinyi H (1996) Evolutionary variable selection in regression and PLS analyses. Journal of Chemometrics 10: 119–133.
Lenzerini M (2002) Data Integration – A Theoretical Perspective. Proceedings of the Symposium on Principles of Database Systems (PODS): 233–246.
Liao W (2005) Clustering of Time Series Data - A Survey. Pattern Recognition 38: 1857–1874.
Lu J, Ruan D & Zhang G (2008) E-Service Intelligence – Methodologies, Technologies and Applications. Studies in Computational Intelligence 37. Berlin Heidelberg, Springer-Verlag.
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. Berkeley, University of California Press: 281–297.
Masters T (1995) Neural, Novel & Hybrid Algorithms for Time Series Prediction. New York, John Wiley & Sons.
Mattison R (1996) Data Warehousing: Strategies, Technologies and Techniques. McGraw-Hill.
Mentzas G, Kafentsis K & Georgolios P (2007) Knowledge services on the semantic web – Developing infrastructures for trading knowledge services using semantic web. Communications of the ACM 50: 53–58.
Messelodi S, Modena CM, Zanin M, Natale FGB, Granelli F, Betterle E & Guarise A (2009) Intelligent extended floating car data collection. Expert Systems with Applications 36: 4213–4227.
Niska H, Heikkinen M & Kolehmainen M (2006) Genetic algorithms and sensitivity analysis applied to select inputs of a multi-layer perceptron for the prediction of air pollutant time series. Proceedings of the 7th International Conference of Intelligent Data Engineering and Automated Learning. IDEAL 2006. Burgos, Spain, Springer Verlag. Lecture Notes in Computer Science: 224–231.
Nuortio T, Dool van den G, Hiltunen T, Matikka V, Räsänen T & Kolehmainen M (2004) Data and Information Utilisation in Waste Management Systems. In: Popov V, Itoh H, Brebbia CA & Kungolos S (eds) Waste management 2004. Waste management and the Environment II. Second International Conference on Waste Management and the Environment. Rhodes, Greece. WITpress.
Open Geospatial Consortium (2011) Website of Sensor Model Language (SensorML). URI: http://www.opengeospatial.org/standards/sensorml. Cited 2011/5/25.
Open Geospatial Consortium (2007) OGC White Paper – Sensor Web Enablement: Overview And High Level Architecture.
Pal NR & Pal S (2002) Computational intelligence for pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence 16: 773–779.
Pillmann W, Geiger W & Voigt K (2006) Survey of environmental informatics in Europe. Environmental Modelling & Software 21: 1519–1527.
Ponniah P (2010) Data Warehousing Fundamentals for IT Professionals. New Jersey, John Wiley & Sons.
Population Register Centre (2009) Population Information System – Web Site. URI: http://www.vaestorekisterikeskus.fi/vrk/home.nsf/www/populationinformationsystem. Cited 2009/10/14.
Pyle D (1999) Data Preparation for Data Mining. San Francisco, Morgan Kaufmann.
Rainer K & Turban E (2009) Introduction to information systems: supporting and transforming business. 2nd ed. John Wiley & Sons.
Rappa M (2010) Managing the Digital Enterprise. URI: http://digitalenterprise.org/index.html. Cited 2010/12/28.
Ravi N, Dandekar N, Mysore P & Littman ML (2005) Activity Recognition from Accelerometer Data. The Twentieth National Conference on Artificial Intelligence AAAI2005. Stanford, American Association for Artificial Intelligence.
Räsänen T (2004) Development of waste management information systems and reporting; graphical reporting, spatial visualising and data mining. MSc thesis, University of Oulu, Faculty of technology, Department of process and environmental engineering.
Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C-18: 401–409.
Salmi O (2007) Eco-efficiency and industrial symbiosis – a counterfactual analysis of a mining community. Journal of Cleaner Production 15: 1696–1705.
Shenk D (2003) Information overload, Concept of. Encyclopedia of International Media and Communications 2.
Snee R (1977) Validation of regression models: Methods and Examples. Technometrics 19: 415–428.
Sprott JC (2003) Chaos and Time-Series Analysis. Oxford University Press.
Statistics Finland (2008) Grid Database – product description. URI: http://www.stat.fi/tup/ruututietokanta/griddatabase2008.pdf. Cited 2009/10/14.
Statistics Finland (2008) Telecommunications: Subscriber lines and subscriptions. URI:
Tainio M, Sofiev M, Hujo M, Tuomisto JT, Loh M, Jantunen MJ, Karppinen A, Kangas L, Karvosenoja N, Kupiainen K, Porvari P & Kukkonen J (2009) Evaluation of the European population intake fractions for European and Finnish anthropogenic primary fine particulate matter emissions. Atmospheric Environment 43: 3052–3059.
U.K. Office for National Statistics (2009) Census Web Site. URI: http://www.ons.gov.uk/census/index.html. Cited 2009/10/14.
Uraikul V, Chan CW & Tontiwachwuthikul P (2007) Artificial intelligence for monitoring and supervisory control of process systems. Engineering Applications of Artificial Intelligence 20: 115–131.
U.S. Census Bureau (2009) URI: http://www.census.gov/. Cited 2009/10/14.
Usländer T, Jacques P, Simonis I & Watson K (2010) Designing environmental software applications based upon an open sensor service architecture. Environmental Modelling & Software 25: 977–987.
Valera M & Velastin SA (2005) Intelligent distributed surveillance systems: a review. IEE Proceedings – Vision, Image, and Signal Processing 152: 192–204.
Välimäki J (2002) The information society and the use of sustainable development indicators. Futura 2: 69–75.
Veryzer RW (1998) Discontinuous Innovations and the New Product Development Process. Journal of Product Innovation Management 15: 304–321.
Vesanto J & Alhoniemi E (2000) Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks 11: 586–600.
W3C (2010) Website of Web of Services. URI: http://www.w3.org/standards/webofservices/. Cited 2010/5/7.
Wang X, Smith K & Hyndman R (2006) Characteristic-Based Clustering for Time Series Data. Data Mining and Knowledge Discovery 13: 335–364.
Willmott C (1982) Some comments on the evaluation of model performance. Bulletin of the American Meteorological Society 63: 1309–1313.
Xu R & Wunsch D (2005) Survey of Clustering Algorithms. IEEE Transactions on Neural Networks 16: 645–678.
Original publications
I Räsänen T, Niska H, Hiltunen T, Tiirikainen J & Kolehmainen M (2009) Predictive system for monitoring regional visitor attendance levels in large recreational areas. Journal of Environmental Informatics 13: 45–55.
II Räsänen T, Ruuskanen J & Kolehmainen M (2008) Reducing energy consumption by using self-organizing maps to create more personalized electricity use information. Applied Energy 85: 830–840.
III Räsänen T, Voukantsis D, Niska H, Karatzas K & Kolehmainen M (2010) Data-based method for creating electricity use load profiles using large amount of customer-specific hourly measured electricity use data. Applied Energy 87: 3538–3545.
IV Räsänen T & Kolehmainen M (2009) Feature-based clustering for electricity use time series data. ICANNGA 2009. Lecture Notes in Computer Science 5495: 401–412.
V Räsänen T, Soukka R, Kokki S & Hiltunen Y (2008) Neural networks in process life cycle profit modelling. Expert Systems with Applications 35: 604–610.
Reprinted with permission from International Society for Environmental Information Sciences (I) and Elsevier (II, III, IV, V).
Original publications are not included in the electronic version of the dissertation.