Analyzing Time Interval Data
Introducing an Information System for Time Interval Data Analysis
Philipp Meisen
Aachen, Germany
ISBN 978-3-658-15727-2 ISBN 978-3-658-15728-9 (eBook) DOI 10.1007/978-3-658-15728-9
Library of Congress Control Number: 2016952631
Springer Vieweg © Springer Fachmedien Wiesbaden GmbH 2016. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer Vieweg imprint is published by Springer Nature. The registered company is Springer Fachmedien Wiesbaden GmbH. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany.
D82 (Diss. RWTH Aachen University, 2015)
For Edison and Isaac

Acknowledgments
First of all, I want to thank all the people who helped me make this
work possible. Especially, I want to mention Sabina Jeschke for her
supervision and advice, my managing director, friend, and brother Tobias Meisen
for sharing his knowledge and experience and pushing me whenever
needed, my co-worker and friend Christian Kohlschein for listening, having
endless discussions, and reviewing my work, Angelika Reimer for creating
the illustrations, and Diane Wittman for helping me format the book.
I also want to give some special thanks and dedications to the people
who have followed me my whole life like my own shadow: my elder brother Holger,
who helped me whenever I was in doubt, my already mentioned twin
brother Tobias for all the “Schokostreuselbrötchen” and discussions, my
parents for making all this possible by having, loving, and supporting me,
and also my dearest friends Tummel, Hoomer, Christian, Diane, and Marco
for every talk, time-out, and drink we had. Thank you all, for being there for
me whenever needed.
Last but not least, I want to express my deepest gratitude to my wife
Deborah for her support whenever it was needed. Without her this work
would never have been possible.
Philipp
Abstract
Time interval data is data that associates information with a specific time
range (i.e., a time window) defined by a start and an end time point. Thus,
time intervals are a generalization of time points, i.e., each time point is a
time interval whose start and end time point coincide. Nowadays, huge
sets of time interval data are collected in various situations, e.g., personnel
deployment, equipment usage, process control, or process management.
Common systems are not capable of analyzing these amounts of time interval
data. Questions like “How many resources were utilized on Mondays in
an annual average?” or “Which days overlap with the planning and which
run diametrically opposed?” cannot be answered using modern systems, or
require extensive data integration processes.
In this thesis, a model to analyze time interval data (TIDAMODEL) is
introduced. Based on this model, a query language (TIDAQL) is defined,
which can be utilized to answer complex questions such as those presented
above. Furthermore, a similarity measure based on different
types of distance measures (TIDADISTANCE) is presented. This similarity
measure enables users to search for similar situations within a time interval
database. The different solutions are combined to design and realize the
central result of the thesis, i.e., an information system to analyze time
interval data (TIDAIS). The introduced system utilizes different bitmap-based
indexes, which enable it to handle huge amounts of data.
The results of the evaluation show that the presented implementation
fulfills the requirements formulated by different stakeholders. In addition, it
outperforms state-of-the-art solutions (e.g., solutions based on the Oracle
database management system, icCube, or TimeDB).
Zusammenfassung
Time interval data is data recorded within a time window, i.e., between
a start and an end time point, and represents a generalization of time
point data. Nowadays, large amounts of time interval data are collected
ever more frequently in areas such as personnel deployment, equipment
usage, process control, or planning. The evaluation of such data poses
great challenges to common analysis systems. Questions like “How many
resources were needed in manufacturing on Mondays, distributed over the
day, in an annual average?” or “Which days are most accurate with respect
to the planning and which run diametrically?” can mostly not be modeled
at all with modern systems, or can only be answered by means of lengthy
integration processes.
This thesis first introduces a modeling approach based on discrete time
axes (TIDAMODEL). Based on this model, a query language (TIDAQL) is
defined, which enables answering complex questions such as those
indicated above. Besides answering questions, searching for similar
situations is an important capability of information systems. To enable
such similarity search, a similarity measure (TIDADISTANCE) is presented.
These individual results are used to design and realize the central result
of the thesis, an information system for the analysis of time interval data
(TIDAIS). The presented system is based on bitmaps, which enable the
evaluation of large amounts of time interval data. The evaluation results
show that the presented system outperforms other solutions (e.g., solutions
based on icCube, TimeDB, or modern database management systems
such as Oracle) with respect to evaluation performance.
Table of Contents
Acknowledgments V
Abstract VII
Zusammenfassung IX
Table of Contents XI
List of Abbreviations XV
List of Figures XIX
List of Tables XXV
List of Listings XXVII
List of Definitions XXXI
1 Introduction and Motivation 1
2 Time Interval Data Analysis 7
2.1 Time 7
2.1.1 Time Intervals 7
2.1.2 Time Interval Data Aggregation 10
2.1.3 Temporal Models 14
2.1.4 Temporal Operators 20
2.1.5 Temporal Concepts 22
2.1.6 Special Characteristics of Time 23
2.2 Features of Time Interval Data Analysis Information System 29
2.2.1 Analytical Capabilities 30
2.2.2 Time Interval Data Analysis Process 35
2.2.3 User Interface, Visualization, and User Interactions 42
2.3 Summary 43
3 State of the Art 45
3.1 Analytical Information Systems 45
3.2 Analyzing Time Interval Data: Different Approaches 46
3.2.1 On-Line Analytical Processing 47
3.2.2 Temporal Pattern Mining & Association Rule Mining 52
3.2.3 Visual Analytics 54
3.3 Performance Improvements 56
3.3.1 Indexing Time Interval Data 56
3.3.2 Aggregating Time Interval Data 60
3.3.3 Caching Time Interval Data 61
3.4 Analytical Query Languages for Temporal Data 62
3.5 Similarity of Time Interval Data 67
3.6 Summary 70
4 TIDAMODEL: Modeling Time Interval Data 73
4.1 Time Axis 73
4.2 Descriptors 76
4.3 Time Interval Database 80
4.4 Dimensional Modeling 82
4.5 Summary 87
5 TIDAQL: Querying for Time Interval Data 91
5.1 Data Control Language 92
5.2 Data Definition Language 95
5.3 Data Manipulation Language 96
5.3.1 Insert, Delete, & Update Statements 97
5.3.2 Get & Alive Statements 99
5.3.3 Select Statements 100
5.4 Summary 108
6 TIDADISTANCE: Similarity of Time Interval Data 111
6.1 Temporal Order Distance 113
6.2 Temporal Relational Distance 115
6.3 Temporal Measure Distance 117
6.4 Temporal Similarity Measure 118
7 TIDAIS: An Information System for Time Interval Data 121
7.1 System’s Architecture, Components, and Implementation 121
7.1.1 Data Repository 125
7.1.2 Cache & Storage 127
7.2 Configuration 129
7.2.1 Model Configuration 130
7.2.2 System Configuration 145
7.3 Data Structures & Algorithms 149
7.3.1 Model Handling 150
7.3.2 Indexes 156
7.3.3 Caching & Storage 165
7.3.4 Aggregation Techniques 167
7.3.5 Distance Calculation 171
7.4 User Interfaces 176
7.5 Summary 178
8 Results & Evaluation 181
8.1 Requirements & Features 181
8.2 Performance 187
8.2.1 High Performance Collections 188
8.2.2 Load Performance 189
8.2.3 Selection Performance 190
8.2.4 Distance Performance 196
8.2.5 Proprietary Solutions vs. TIDAIS 197
8.3 Summary 201
9 Summary and Outlook 203
Appendix 205
Pipelined Table Functions (PL/SQL Oracle) 205
A Complete Sample Model-Configuration-File 206
A Complete Sample Configuration-File 211
Detailed Overview of the Runtime Performance 215
3-NN of the Temporal Relational Similarity 217
Bibliography 219
List of Abbreviations
AD Active Directory
AIS Analytical Information System
AJAX Asynchronous JavaScript and XML
ANSI American National Standards Institute
ANTLR Another Tool for Language Recognition
API Application Programming Interface
ARTEMIS Assessing coRrespondence of Temporal Events Measure for
Interval Sequences
BI Business Intelligence
CET Central European Time (time zone)
CPU Central Processing Unit
CSS Cascading Style Sheets
CSV Comma Separated Value
DBMS Database Management System
DCL Data Control Language
DDL Data Definition Language
DML Data Manipulation Language
DSS Decision Support System
DST Daylight Saving Time
DTW Dynamic Time Warping
DW Data Warehouse
JDBC Java Database Connectivity
JMS Java Message Service
JSON JavaScript Object Notation
GB Gigabyte
GIS Geographic Information System
GPU Graphics Processing Unit
GTA General Temporal Aggregation
GUI Graphical User Interface
HCC Hybrid Columnar Compression
HOLAP Hybrid OLAP
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IBSM Interval Based Sequence Matching
ISO International Organization for Standardization
ITA Instant Temporal Aggregation
k-NN k-nearest neighbors
LDAP Lightweight Directory Access Protocol
LRU Least Recently Used (cache algorithms)
MB Megabyte
MDX Multidimensional Expressions
MOLAP Multidimensional OLAP
MRU Most Recently Used (cache algorithms)
MWTA Moving-Window Temporal Aggregation
NoSQL Not Only SQL
OLAM On-Line Analytical Mining
OLAP On-Line Analytical Processing
PDT Pacific Daylight Time (time zone)
PL/SQL Procedural Language/Structured Query Language
POJO Plain Old Java Object
ROLAP Relational OLAP
RR Random Replacement (cache algorithms)
RQ Research Question
SQL Structured Query Language
STA Span Temporal Aggregation
SVG Scalable Vector Graphics
TAT Two-step Aggregation Technique
TIDA Time Interval Data Analysis
UI User Interface
UTC Coordinated Universal Time (time zone)
XML Extensible Markup Language
XSD XML Schema Definition
XSLT Extensible Stylesheet Language Transformation
List of Figures
Figure 2.1 Apple falling from tree, example of a time interval and as-
sociated information observed, measured or calculated
during the process of an apple falling from a tree. 8
Figure 2.2 Machine performance, example of a time interval and as-
sociated information observed, measured, or calculated
during the execution of a task by a machine. 9
Figure 2.3 Example of ITA and MWTA (temporal aggregation forms
creating constant intervals). 12
Figure 2.4 Example of STA and TAT (temporal aggregation forms
creating constant intervals). 13
Figure 2.5 Overview of the different aspects of a temporal model. 15
Figure 2.6 The fall property using a discrete (left) and continuous
(right) temporal model. Within the discrete chart, the
diamonds mark the value of the property and the
triangles illustrate the indivisible delta between the
previous and the current time point. 16
Figure 2.7 The item property using a discrete (left) and continuous
(right) temporal model. Within the discrete chart, the
diamonds mark the value of the item property and the
triangles illustrate the indivisible delta between the
previous and the current time point. 17
Figure 2.8 Example of a mapping between data of a circular
temporal model to a linear temporal model. 19
Figure 2.9 Selection of a time window from an unbounded
temporal model to be presented and analyzable in
a bounded temporal model. 20
Figure 2.10 Overview of Allen’s (1983) temporal operators. 20
Figure 2.11 Illustration of the ambiguousness of Allen’s (1983)
temporal operators. 21
Figure 2.12 Examples of commonly used temporal concepts. 22
Figure 2.13 Example of the impact of different time zones within the
scope of temporal analytics. 24
Figure 2.14 Illustration exemplifying the error of calculating
statistical values, e.g., the amount of intervals per hour. 25
Figure 2.15 Overview of selected features defined in the category
descriptive analytics in the context of time interval data
analysis (cf. Table 2.1). 33
Figure 2.16 The data science process following Schutt, O'Neil
(2014). 36
Figure 2.17 The result of the workshops regarding the time interval
data analysis process. 38
Figure 3.1 Examples of the different types of hierarchies
(non-strict, non-covering, and non-onto). 48
Figure 3.2 Two examples of the summarizability problem. 49
Figure 3.3 Illustration of a scenario covered by I-OLAP as presented
by Koncilia et al. (2014). 51
Figure 3.4 Examples of the visualization techniques Cluster
Viewer (van Wijk, van Selow 1999) and GROOVE
(Lammarsch et al. 2009). 55
Figure 3.5 Example of a bitmap-index containing three bitmaps,
one for each possible value (i.e., red, green, and
yellow) of the color-property. 58
Figure 3.6 Illustration of the question to be answered by the
query: "How many resources are needed within
each hour of the first of January 2015?" 63
Figure 3.7 Comparison of the result of the query from a system
supporting non-strict relationships (right) and one that
does not (left). 64
Figure 3.8 The ARTEMIS distance calculated for two interval-sets
S and T. 68
Figure 3.9 The DTW distance calculated for two interval-sets S
and T. 69
Figure 3.10 Example of the IBSM distance calculated for two
interval-sets S and T. 70
Figure 4.1 Illustration of a time-axis = (time,minute). The
incoming data, i.e., timestamps (in milliseconds)
between 2000-01-01 00:00:00.000 and 2099-12-31
23:59:59.999 from the time zone CET, are mapped
to values 1-10 representing minutes. 76
Figure 4.2 Example of a descriptor dlang = (lang, lang, lang), which
uses an identity function to map the set of languages,
i.e., the descriptive values, to the descriptor values. 80
Figure 4.3 An example of a time interval database = (data, time,
team, department). The database contains tasks
performed by teams (a team consists of several team
members) and for the specified department. 82
Figure 4.4 Example of two descriptor hierarchies. The one on the
left is based on the descriptor values specified by country
and the one on the right is based on city. The example
shows a non-strict (left) and a non-covering hierarchy
(right). Both hierarchies are valid regarding the
definition of descriptor hierarchies. 84
Figure 4.5 Example of implicit information recognized for the
timestamp 2000-01-06 13:00 CET and the validity of
the information when rolling up a hierarchy. 85
Figure 4.6 Example of implicit information recognized for the
timestamp 2000-01-06 13:00 CET and the validity of
the information when rolling up a hierarchy. 86
Figure 4.7 Illustration of the TIDAMODEL showing all defined ele-
ments. 88
Figure 5.1 Illustration of the provided temporal operators and
their corresponding temporal relation. 103
Figure 5.2 Sample dimension showing one of two hierarchies with
three levels. 106
Figure 5.3 Usage of the query language features ON and
GROUP BY to enable roll-up and drill-down operations. 109
Figure 6.1 Overview of the different similarity types,
presenting an equality example for each type of
measure. 112
Figure 6.2 Illustration of two different matching strategies, i.e.,
weekday and order match. 113
Figure 6.3 Example of assignments of relations to time points
using Allen's (1983) relations. 116
Figure 7.1 The architecture of the information system showing
the high-level components. 122
Figure 7.2 Detailed architecture of the data repository component. 126
Figure 7.3 Illustration of the subcomponents of the main
component Cache & Storage. 128
Figure 7.4 The complete package of the DbDataRetriever
extension used to load data from a database. 133
Figure 7.5 Illustration of the first three levels (from bottom to top)
of the hierarchy defined in Listing 7.7. 139
Figure 7.6 Illustration of the hierarchy defined in Listing 7.8. 140
Figure 7.7 Three different time axis configurations and an
illustration of the internal representation as array. 151
Figure 7.8 Illustration of the algorithm used to map descriptive
values, e.g., [flu, cold] to the descriptor values flu and
cold. 154
Figure 7.9 Example of a result of the processing of a raw data
record. 155
Figure 7.10 Illustration of the index structure (HashMap) used by
the descriptors index (cf. Goodrich, Tamassia (2006)). 157
Figure 7.11 The different tasks (filtering, partitioning, and
aggregating) to be performed to handle an analytical
query. 158
Figure 7.12 The data descriptor index, using by default a HashMap
and a high performance collection (Trove) to index
bitmaps. 160
Figure 7.13 Example of the structure of the fact descriptor index,
associating facts with descriptor values. 161
Figure 7.14 An example database with data related indexes. 163
Figure 7.15 Illustration of the group bitmap calculation, in the case
of the usage of a dimension’s level within the group by
expression. 165
Figure 7.16 The four resulting bitmaps for the different chronons
and groups. 168
Figure 7.17 Illustration of TAT and STA. 171
Figure 7.18 Illustration of the abort criterion for the temporal order
and measure distance. 173
Figure 7.19 Illustration of the algorithm used to determine the
relations between intervals. 174
Figure 7.20 Overview of the user console of the implemented UI:
top-left shows the login screen, top-right is a screenshot
of the model management, middle-left is a picture of the
data management, middle-right illustrates the user man-
agement, and the screenshots on the bottom show the
time series visualization (left) and the Gantt-chart
(right). 177
Figure 8.1 The results of the tests regarding the high performance
collections for int and long data types. 188
Figure 8.2 The results of the load performance tests. 190
Figure 8.3 The results of the selection tests for the different
queries shown in Table 8.3. 195
Figure 8.4 Illustration of the performance tests regarding the
distance calculation, as well as the results of the
temporal order and measure similarity; a visualization
of the relational similarity can be found in the appendix. 197
Figure 8.5 Performance results of the queries used to answer the
questions shown in Table 8.4. 201
List of Tables
Table 2.1 Overview of the features requested in the category de-
scriptive analytics. 31
Table 2.2 Overview of the features requested in the category
predictive analytics. 34
Table 2.3 Overview of the features requested in the category
prescriptive analytics. 35
Table 2.4 List of requested features for the information system
considering data collection. 39
Table 2.5 List of requested features for the information system
considering data integration & cleansing. 40
Table 2.6 The features required to support the application of
models and analytical algorithms. 42
Table 2.7 Overview of the features requested for the UI,
visualization, and user interaction. 42
Table 5.1 Overview of the seven criteria used as basis for design
decisions regarding a query language. 91
Table 6.1 Overview of the time points calculation for a specific
relation. 116
Table 7.1 Results of the default temporal mapping algorithm,
assuming the top time axis definition of Figure 7.7. 152
Table 7.2 Examples of different group-bitmaps created for
specific GROUP BY expressions based on the
example database shown in Figure 7.14. 164
Table 7.3 List of algorithms used to calculate the different
aggregated values. 169
Table 8.1 Overview of the different features requested, the
realization of the feature, as well as comments of the
users (if available), and the degree of realization. 182
Table 8.2 List of algorithms used to calculate the different
aggregated values. 187
Table 8.3 Overview of the different tests performed to
validate the runtime performance. 193
Table 8.4 List of tests performed in the category "Proprietary
Solutions vs. TIDAIS". 200
List of Listings
Listing 3.1 MDX statement used to answer the question regarding
the needed resources. 63
Listing 3.2 ATSQL2 statement used to answer the question
regarding the needed resources. 65
Listing 3.3 SQL statement used to answer the question regarding
the needed resources. The presented solution is based
on additional PL/SQL functions and data types which are
shown in the appendix (cf. Pipelined Table Functions
(PL/SQL Oracle)). 66
Listing 3.4 The TIDAQL statement used to answer the question
regarding the needed resources. 67
Listing 5.1 Syntax of statements using the ADD command of the
DCL to add a user or a role. 93
Listing 5.2 Syntax of statements of the DCL, used to drop a user
or a role. 93
Listing 5.3 Syntax of the statements using the commands
MODIFY, GRANT, and REVOKE. 94
Listing 5.4 Syntax of statements for the commands ASSIGN and
REMOVE, used to modify the roles assigned to a user. 94
Listing 5.5 Syntax of statements using the LOAD, UNLOAD, and
DROP commands of the DDL. 95
Listing 5.6 Syntax of statements using the INSERT command
of the DML. 97
Listing 5.7 Syntax of the statement to enable or disable bulk load
for a model. 99
Listing 5.8 Syntax of the statement to delete a specified record
from a model. 99
Listing 5.9 Syntax of statements using the UPDATE command
of the DML. 99
Listing 5.10 Syntax of statements using the GET command of the
DML. 100
Listing 5.11 Syntax of the select statement to retrieve time series
of a specified time window. 101
Listing 5.12 Syntax of the select statement to retrieve time
interval records from the information system. 102
Listing 5.13 Syntax of the select statement to retrieve analytical
results from the information system. 104
Listing 7.1 The skeleton of a model-configuration-file of the
information system. 130
Listing 7.2 Configuration of a data retriever within a model. 131
Listing 7.3 Configuration of a dataset and the structure of the set. 132
Listing 7.4 XSLT template used to create the bean used by the
DbDataRetriever to define the query. 133
Listing 7.5 An excerpt of a configuration defining three descriptors
and descriptor values for one of the descriptors. 135
Listing 7.6 An example of a configuration of the time axis. 136
Listing 7.7 A sample definition of a time hierarchy within the
time dimension. 138
Listing 7.8 A sample definition of a hierarchy of the descriptor
WORKAREA. 140
Listing 7.9 A pre-processor configuration using the
ScriptPreProcessor. 141
Listing 7.10 A configuration specifying three sample schedules. 142
Listing 7.11 Example of a configuration of caches for all entities
of the system. 143
Listing 7.12 An example configuration of the default IndexFactory,
specifying the implementations used to index specific
data types. 144
Listing 7.13 The skeleton of a configuration-file of the information
system. 145
Listing 7.14 A sample configuration of the Authentication &
Authorization component. 146
Listing 7.15 Example of the system configuration of the Service
Handler component. 147
Listing 7.16 Example of the system configuration of the Query
Parser & Processor component. 147
Listing 7.17 Example of the system configuration to add an
additional template. 148
Listing 7.18 The pairing function used to determine a unique
identifier for a pair of intervals. 175
Listing 8.1 The naïve algorithm. 191
Listing 8.2 The IntTreeB algorithm. 192
List of Definitions
Definition 1 TIDAMODEL 73
Definition 2 Valid time points, chronon, and data time points 73
Definition 3 Temporal mapping function 74
Definition 4 Granularity 75
Definition 5 Time axis 75
Definition 6 Descriptive attribute and descriptive value 76
Definition 7 Set of and descriptor value 77
Definition 8 Descriptive mapping function 78
Definition 9 Fact function (value-invariant, record-invariant,
record-variant) 79
Definition 10 Descriptor 79
Definition 11 Time interval 80
Definition 12 Time interval dataset and time interval record 81
Definition 13 Time interval database 81
Definition 14 Descriptor dimension, hierarchies, levels, and members 83
Definition 15 Time dimension, hierarchies, levels, and members 87
Definition 16 Dimensions 87
Definition 17 Temporal Order Distance 114
Definition 18 Temporal Relational Distance 117
Definition 19 Temporal Measure Distance 117
Definition 20 Temporal Similarity Measure 118
1 Introduction and Motivation
The process of analyzing data has attracted increasing attention in recent
years. Data analysis techniques are used to recommend articles to users,
predict the outcome of elections, or understand causes. Over the
last years, discussions with industrial partners and feedback from several
companies showed that the analysis of time interval data created
various problems across different domains. Thus, the focus of this
book is on an information system capable of analyzing a specific,
content-independent type of data: time interval data1.
To understand the issues arising when using available, proprietary systems,
and to understand the requirements posed by analysts regarding an
information system to analyze time interval data, several workshops with
analysts from different domains were held over the last years. The participating
users dealt with time interval data on a daily basis in different
domains, e.g., aviation (e.g., KLM, Delta Airlines, Lufthansa, Bologna Airport,
or Düsseldorf Airport), logistics (e.g., DHL, FedEx, or Dnata), call centers,
and hospitals (e.g., the university hospitals of Aachen, Bonn, or Düsseldorf),
as well as linguists (e.g., experts from RWTH University, the Centre for
Research and Innovation in Translation and Translation Technology in Denmark,
or the VU Amsterdam University) and production workers (e.g., Audi,
Continental, or Porsche). The results of these workshops indicate that there
is a need for an information system to analyze time interval data and
that the main reasons why available systems are not suitable are:
– unsupported handling of temporal aspects (e.g., time zones, temporal
relations, or daylight saving time),
– performance issues (e.g., analyzing millions of intervals or using a lowest granularity of seconds),
1 source-code: https://github.com/pmeisen, binary-version: http://tida.meisen.net
© Springer Fachmedien Wiesbaden GmbH 2016. P. Meisen, Analyzing Time Interval Data, DOI 10.1007/978-3-658-15728-9_1
– limitations of available modeling capabilities (e.g., unsupported many-to-many relations, unavailable aggregation functions like median, or complex measures),
– unsustainable and expensive data integration processes (e.g., creating
enormous amounts of redundant data or discretizing the intervals), and
– faulty results (e.g., incorrect aggregation outcomes).
Over the past years and decades, several disciplines like data mining
(Moerchen 2009; Laxman, Sastry 2006), artificial intelligence (Allen 1983),
music (Bergeron, Conklin 2011), medicine (Combi et al. 2007; Aigner et al.
2012), finance (Arroyo et al. 2010), ergonomics (Boonstra-Hörwein et al.
2011), or cognitive science (Berendt 1996) have presented general or
application-specific techniques or methods dealing with time interval data2. In
simple terms, a time interval is given by two time points on an underlying
time axis, i.e., [t1, t2] with t1 ≤ t2. Time interval data is recorded, collected, or
generated in various situations and industrial fields, e.g., workload retrieved
from the records of man-hours, tasks planned in a project, actions executed
during a process, or event intervals noticed during an observation.
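The interval notion just introduced can be sketched as a tiny data structure; the class and method names below are illustrative choices, not part of the TIDAMODEL:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A closed interval [start, end] on a discrete time axis, with start <= end."""
    start: int
    end: int

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not be after end")

    def is_time_point(self) -> bool:
        # A time point is the special case [t, t].
        return self.start == self.end

    def overlaps(self, other: "TimeInterval") -> bool:
        # Two closed intervals overlap iff they share at least one time point.
        return self.start <= other.end and other.start <= self.end


task = TimeInterval(3, 7)
print(task.overlaps(TimeInterval(7, 9)))   # True: they share time point 7
print(TimeInterval(5, 5).is_time_point())  # True: a time point as an interval
```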
In general, analyzing is defined as "a careful study of something to learn
about its parts, what they do, and how they are related to each other" (Merriam-Webster
2015). Current research concerning the analysis of time intervals
addresses specific problems like pattern3 or association rule mining
(Winarko, Roddick 2007; Papapetrou et al. 2009; Sadasivam, Duraiswamy
2013), comparison (Kostakis et al. 2011; Kotsifakos et al. 2013), visualization
and interaction (Aigner et al. 2011; Heuer, Jr., Pherson 2014), modeling
(Koncilia et al. 2014; Meisen et al. 2014), or pre-processing (Kimball,
Ross 2002). Some of the techniques or methods consider the fact of handling
time interval data (instead of just interval data) to motivate the usage
of a temporal semantic (e.g., Allen’s scheme (Allen 1983)), which is important
so that terms like coincidence or synchronicity are well-defined.
Others use statistics like aggregated facts (e.g., yearly population, average
monthly temperatures, or yearly energy consumption per industrial sector)
from temporal data to enable a comparison between different days, months,
or years to measure quality (e.g., using key performance indicators).
2 Some literature also refers to time intervals as temporal intervals, event-intervals, interval-based events, time segments, time ranges, time periods, interval-based data, tasks, or activities.
3 In some literature a pattern of time interval sequences is defined as an arrangement.
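As a hedged illustration of such a temporal semantic, the following sketch classifies a few of Allen's (1983) relations between two closed intervals; the helper name and the restriction to a subset of the thirteen relations (the inverses are omitted) are choices made here for brevity, not the thesis's formulation.

```python
def allen_relation(a: tuple[int, int], b: tuple[int, int]) -> str:
    """Classify a subset of Allen's (1983) relations between two closed
    intervals a = [a1, a2] and b = [b1, b2]; inverse relations are omitted."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"    # a ends strictly before b starts
    if a2 == b1:
        return "meets"     # a ends exactly where b starts
    if a1 == b1 and a2 == b2:
        return "equals"
    if b1 < a1 and a2 < b2:
        return "during"    # a lies strictly inside b
    if a1 < b1 and b1 < a2 < b2:
        return "overlaps"  # a starts first, they share a part, b ends last
    return "other"         # starts, finishes, and inverse relations not handled


print(allen_relation((1, 3), (3, 5)))  # meets
print(allen_relation((1, 4), (2, 6)))  # overlaps
print(allen_relation((2, 3), (1, 5)))  # during
```

With such well-defined relations, terms like coincidence or synchronicity can be expressed precisely instead of being left to each system's interpretation.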
A holistic solution, like an information system, addressing the problem
of analyzing time intervals has to consider aspects of modeling and persistence,
visualization and interaction, comparison, aggregation, and mining.
An analyst must be able to, e.g., visualize and compare results, select
specific intervals, or find typical matches and discrepancies. The system
has to handle the time interval data in such a way that a first result (e.g., in
the form of a trend or projection) is calculated quickly and can thereby be
modified early by the analyst if needed. Furthermore, time aspects which may be
considered irrelevant, or are simply not recognized by context-free generic
algorithms, must be taken into account. Such aspects could be, e.g., holidays,
time zones, or daylight saving time, but also vacation periods, leap years,
calendar weeks which do not fit neatly into months or years, or the usage
of a financial instead of a calendar year. In summary, an information
system has to enable the analyst to get answers to questions
and to point out items of possible interest that arise across the whole analysis
process. Depending on the context of the analysis, it also has to support
the generation of generic representations (e.g., detecting patterns) or the
comparison of a set of time intervals using a specified distance measure
(e.g., complex search).
More specifically, the following research questions (RQ) are the focus
of this book:
1. Which features must be supported by an information system to enable
time interval data analysis?
2. Which aspects must be covered by a time interval data analysis model
and how can it be defined?
3. How can a query language be formulated for the purpose of analyzing time interval data, i.e., to select, filter, aggregate, generalize, or specialize it?
4. Which indexing techniques can be used to process user queries and
how should data be cached, as well as persisted?
5. What similarity measure can be used to compare time interval datasets, enabling the search for similar subsets?
6. How should the architecture of an information system for time interval
data analysis be realized, how should the system be configured, and
which interfaces have to be provided to support the analyzing process?
The questions, which arose during the study, implementation, and realization of the introduced information system, are used as a guideline. Each question will be addressed and answered within this book. The book is structured as follows: Chapter 2 describes the term of time interval data analysis by introducing several characteristics of and terminologies used in the context of time (cf. section 2.1). In addition, the chapter presents requirements for and the derived features of an information system demanded by analysts dealing with time interval data on a regular basis. Furthermore, these requirements and features are used to identify different research areas important to be examined in the context of time interval data analysis (cf. section 2.2). Chapter 3 reflects the state of the art of the identified research areas, i.e., proven architectures used for information systems in the context of data analysis (cf. section 3.1), different approaches applied in data analysis (cf. section 3.2), indexing and aggregation of time interval data (cf. section 3.3), as well as similarity and comparison of sets (cf. section 3.5). Chapters 4, 5, 6, and 7 present the aspects relevant to create an information system to analyze time interval data. These aspects are: the defined model TIDAMODEL, the query language TIDAQL, the similarity measure TIDADISTANCE, and selected parts (e.g., the architecture) of the realized information system TIDAIS. Each chapter is divided into multiple sections, discussing the important characteristics and results of the chapter's topic. Moreover, the different research questions mentioned previously in this chapter are answered. The presented solutions are evaluated and discussed in chapter 8: the solution is evaluated regarding the defined set of features (cf. section 8.1) and the performance of different implementations (cf. section 8.2), and compared to commercial solutions (e.g., database management systems (DBMS) or business intelligence (BI) solutions). The book concludes with an outlook in chapter 9.
2 Time Interval Data Analysis
This chapter is structured as follows: Section 2.1 introduces terms and temporal aspects relevant to be considered when analyzing time interval data. In section 2.2, the different features required by an information system are discussed. The introduced terms, temporal aspects, and presented features are results of several workshops with users from different domains (e.g., service providers like ground-handlers, airlines, call centers, and hospitals, as well as linguists and production workers) and are aligned with an extended literature research. The chapter is completed with a summary in section 2.3.
2.1 Time
When referring to time within the context of information systems and analytics, it is necessary to utilize a temporal framework. A temporal framework defines how time is represented (i.e., temporal models, section 2.1.3), how time can be used (i.e., temporal operators, section 2.1.4), and which semantic is applied (i.e., temporal concepts, section 2.1.5). In addition, constraints and limitations are implicitly defined within a temporal framework, i.e., circumstances which cannot be formalized are assumed to be invalid. In order to motivate a temporal framework in the context of time interval data analysis, section 2.1.1 introduces the term time interval informally (a formal definition is given in section 4.3) and section 2.1.2 presents the aggregation of time intervals, which is the predominant operator in the field of data analysis (cf. section 2.2.1 and section 7.3.4). Lastly, special characteristics of time like leap years, daylight saving, or time zones are discussed in section 2.1.6.
2.1.1 Time Intervals
A time interval can be specified by two endpoints (e.g., tstart and tend, with tstart ≤ tend). Generally, the interval's endpoints can be included or excluded, denoting the former by square and the latter by round brackets. As an example, the denotation [10:00, 12:12) is used to specify all time points between 10:00 (included) and 12:12 (excluded). In real life, time intervals are used to express the validity of, e.g., an observation, a state, or a more complex situation over a period of time:
– The red apple with a weight of 250.00g was falling from the tree between 09:45:12 and 09:45:57.
– The accused was out on bail from the first of January 2015 until the
fifth.
– The machine only produced 16 items between 09:00 and 12:28, even
though it could have produced 25.
– The translator typed the word ‘treasure’ and looked up the word
‘Schatzinsel’ within two minutes.
Looking at these sentences reveals some peculiarities to be considered when working with time intervals. For example, it may be impossible to tell if the endpoints are in- or excluded, or if they are absolute (e.g., 01/01/2015) or relative (e.g., "within two minutes"). In addition, the granularity used to express an endpoint may differ (e.g., 09:00 uses a minute granularity, whereas the granularity of 09:45:12 is seconds). Furthermore, the examples indicate that the information provided to describe an interval can vary (e.g., "red apple" as categorization vs. "16 items" as fact). Figure 2.1 illustrates a first example of a time interval and different types of associated information.
Figure 2.1 Apple falling from tree, example of a time interval and associated information observed, measured or calculated during the process of an apple falling from a tree.
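The endpoint notation introduced above can be made concrete with a small sketch (a hypothetical `Interval` class, assuming an integer time axis of minutes since midnight; none of these names appear in this book):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A time interval over an integer time axis, e.g., minutes since midnight."""
    start: int
    end: int
    start_included: bool = True
    end_included: bool = True

    def contains(self, t: int) -> bool:
        after_start = t >= self.start if self.start_included else t > self.start
        before_end = t <= self.end if self.end_included else t < self.end
        return after_start and before_end

# [10:00, 12:12) expressed in minutes since midnight: 600 included, 732 excluded
iv = Interval(600, 732, start_included=True, end_included=False)
print(iv.contains(600))  # True  (10:00 is included)
print(iv.contains(732))  # False (12:12 is excluded)
```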
The example shown in Figure 2.1 illustrates an observation which started at 09:45:12 and ended at 09:45:57 (i.e., a time interval of [09:45:12, 09:45:57]). During (or after) the observation, the properties color, class, weight, fall, and duration were measured. Without providing a formal classification at this point (cf. section 4.2), it is noticeable that properties may have to be handled differently from a semantic and analytical point of view. For example, the property color can be of interest when filtering, whereas the property class may be useful to determine a price, which can be important when aggregating. Other interesting properties are those which are not constant within the interval, e.g., the property fall. The presented value of 1.00 m is only valid for time points t ≥ tend. For time points tstart < t < tend, the property's value can be calculated using the formula fall = ½ · g · (t – tstart)², and for t ≤ tstart the value is 0.00 m.
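The piecewise definition of the fall property can be sketched as follows (a minimal illustration; the function name is hypothetical, and the clamp to 1.00 m mirrors the example's statement that this value holds for t ≥ tend):

```python
G = 9.81  # gravitational acceleration in m/s^2

def fall_distance(t: float, t_start: float, t_end: float, final_value: float = 1.00) -> float:
    """Piecewise definition of the non-constant 'fall' property (in metres)."""
    if t <= t_start:
        return 0.0          # before the interval the apple has not fallen
    if t >= t_end:
        return final_value  # after the interval the value stays at 1.00 m
    return 0.5 * G * (t - t_start) ** 2

# apple example, with t in seconds relative to the observation start
print(fall_distance(-5.0, 0.0, 45.0))  # 0.0
print(fall_distance(60.0, 0.0, 45.0))  # 1.0
```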
Another example is shown in Figure 2.2. The example illustrates tasks
(i.e., time intervals) performed by a machine. Such an example can
typically be found in production environments.
Figure 2.2 Machine performance, example of a time interval and associated information observed, measured, or calculated during the execution of a task by a machine.
Compared to the previously discussed apple falling from tree example, the time interval of the machine performance example uses a minute granularity, i.e., [09:00, 12:30]. The example defines four properties associated to the time interval: machine, items, maximal capacity, and needed resources. The items property is not constant (i.e., the value changes during the interval), whereas the maximal capacity property may be assumed to be constant (e.g., when filtering) or not (e.g., when used to calculate the utilization of the machine over time). In addition, the needed resources property is of special interest regarding aggregation. As introduced further in section 3.2.1 and discussed in more detail in section 7.3.4, this property can lead to summarizability problems if not aggregated correctly (Lenz, Shoshani 1997; Song et al. 2001; Mazón et al. 2008). The reason lies in the indivisibility of the value, i.e., the value is 4 for every time point of the interval, but it is still 4 even if several time points of the interval are selected (i.e., summarizability is not given).
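The summarizability issue can be illustrated in a few lines (a hypothetical sketch using hour chronons 9 to 12; the property value 4 is taken from the example):

```python
# 'needed resources' is 4 for every chronon of the interval (indivisible value)
interval_chronons = [9, 10, 11, 12]
needed_resources_per_chronon = {t: 4 for t in interval_chronons}

# naively summing over the selected chronons over-counts the indivisible value
naive_sum = sum(needed_resources_per_chronon[t] for t in interval_chronons)
print(naive_sum)  # 16 -- wrong, the machine never needs more than 4 resources

# a summarizable aggregation for such a property uses, e.g., max instead of sum
correct = max(needed_resources_per_chronon[t] for t in interval_chronons)
print(correct)  # 4
```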
Within the next sections, the introduced examples are used to exemplify time interval data aggregation, as well as to motivate the usage and exemplify the impact of temporal models, concepts, and operators.
2.1.2 Time Interval Data Aggregation
Data aggregation is the predominant operation in the field of data analysis (Zhang et al. 2008). Aggregating time interval data is more difficult than the aggregation of time point data. The reasons lie above all in the intricate semantic (cf. section 2.1.4), e.g., an interval typically expresses the validity of a fact or description over a period of time. When aggregating intervals within a specified time window, several questions have to be answered, e.g., "Should the time window be partitioned?" (e.g., using a time window of a year, it may be needed to aggregate data by day) or "What is the semantic meaning of the aggregation and does it fulfill the expectation?" (e.g., is count a useful aggregation to determine the needed resources within a time window). In literature, different forms of temporal aggregation are introduced in the field of temporal databases and data analysis, i.e., Instant Temporal Aggregation (ITA), Moving-Window Temporal Aggregation (MWTA), Span Temporal Aggregation (STA), General Temporal Aggregation (GTA) (Böhlen et al. 2008), as well as the Two-step Aggregation Technique (TAT) (Meisen et al. 2015b).
When aggregating time interval data, the set of intervals to be grouped is defined by the values of specified properties (e.g., the color of the apple in the apple falling from tree example (cf. Figure 2.1)) and, in addition, by a temporal grouping criterion (e.g., month, day, or hour) used to partition the time axis. Depending on the form of temporal aggregation, the returned result of a query might contain so-called constant intervals (ITA, MWTA, and GTA) or fixed partitions (STA, TAT, and GTA). A constant interval is an interval in which the aggregated value4 is constant, i.e., consecutive time partitions are coalesced. Conversely, a fixed partition is defined by the specification of the aggregation (e.g., group by month) and the result contains a value for each partition (e.g., each month).
Figure 2.3 illustrates the ITA and MWTA forms, both returning constant intervals. In the figure, the intervals are grouped by the machine property, i.e., two groups are identified: furnace and impeller. Furthermore, the time axis is on month granularity, and the example counts the amount of machines per month. As mentioned, ITA and MWTA both create constant intervals. Thus, in the case of ITA the result contains, e.g., the constant interval [3, 5] for the value 2. MWTA, on the other hand, uses a defined time window [t – w, t – w'] for each instance t of the defined temporal grouping and determines the set of intervals to be grouped. Thus, the example illustrated in Figure 2.3 calculates the aggregated values for the impeller group, and the different time windows are, e.g.: count([1, 2]) = 1, count([2, 3]) = 2, count([3, 4]) = 2, …, count([11, 12]) = 1, and count([12, 12]) = 0. The created constant values are shown in the table of the figure.
4 Some implementations consider lineage information, i.e., the implementation validates if
the resulting aggregated value is based on the same time intervals (cf. Böhlen et al. 2008).
Figure 2.3 Example of ITA and MWTA (temporal aggregation forms creating constant intervals).
In general, ITA uses the defined temporal grouping criterion to determine the set of intervals for a specific group. On the other hand, MWTA uses a defined time window [t – w, t – w'] for each instance t of the defined temporal grouping and determines the set of intervals to be grouped. Thus, using MWTA with w = 0 and w' = 0 leads to the same results as ITA provides. Empty groups are typically not included within the result (e.g., cf. Figure 2.3: (impeller; 0; [12, 12]) and (furnace; 0; [12, 12]) are not included; Snodgrass (1995), Böhlen et al. (2000)).
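A minimal sketch of ITA producing constant intervals might look as follows (the function, the interval data, and the month axis are hypothetical illustrations, not the book's implementation):

```python
from itertools import groupby

def ita_count(intervals, t_min, t_max):
    """Instant Temporal Aggregation (count) over a discrete, bounded time axis.

    Counts, for each time point, the intervals [s, e] (both endpoints included)
    overlapping it, and coalesces consecutive equal counts into constant intervals.
    """
    counts = [(t, sum(1 for s, e in intervals if s <= t <= e))
              for t in range(t_min, t_max + 1)]
    result = []
    for value, run in groupby(counts, key=lambda tc: tc[1]):
        run = list(run)
        result.append((value, (run[0][0], run[-1][0])))  # (count, constant interval)
    return result

# hypothetical group of three intervals on a month axis 1..12
print(ita_count([(1, 4), (3, 5), (7, 12)], 1, 12))
# [(1, (1, 2)), (2, (3, 4)), (1, (5, 5)), (0, (6, 6)), (1, (7, 12))]
```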
In contrast to ITA or MWTA, the application of STA or TAT leads to fixed partitions. Consequently, the result contains one aggregated value for each instance of the temporal grouping specified, if at least one time interval overlaps with the instance. It depends on the chosen implementation whether the result contains empty groups or not. Meisen et al. (2015b) present a bitmap-based implementation for TAT which ensures that the result contains all empty groups. Regarding STA, empty groups are not included, referring to Snodgrass (1995) and Böhlen et al. (2000). Figure 2.4 illustrates STA and TAT. As exemplified, STA determines the set of intervals for each instance within the specified temporal grouping criterion (i.e., instance [1, 6] overlaps with two intervals, whereas [7, 12] overlaps with three). The same result could be achieved using TAT with a count operator. Within the example shown in Figure 2.4, TAT applies the max-count operator. Thus, the aggregated value of count is determined for each instance of the lowest granularity of the underlying time axis (i.e., for each chronon, cf. section 2.1.3). Next, the results of each month are aggregated using the maximum operator (i.e., max). Therefore, the result for [7, 12] is 2 (i.e., max({2, 2, 2, 2, 2, 1})) instead of, as with STA, 3.
Figure 2.4 Example of STA and TAT (temporal aggregation forms creating fixed partitions).
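The difference between STA and TAT's max-count operator can be sketched like this (hypothetical interval data chosen so that the numbers mirror the narrative above; this is not the bitmap-based implementation of Meisen et al.):

```python
def sta_count(intervals, partitions):
    """Span Temporal Aggregation: count the intervals overlapping each fixed partition."""
    return [((ps, pe), sum(1 for s, e in intervals if s <= pe and e >= ps))
            for ps, pe in partitions]

def tat_max_count(intervals, partitions):
    """Two-step Aggregation Technique: count per chronon, then aggregate each
    fixed partition with the maximum (the 'max-count' operator)."""
    result = []
    for ps, pe in partitions:
        per_chronon = [sum(1 for s, e in intervals if s <= t <= e)
                       for t in range(ps, pe + 1)]
        result.append(((ps, pe), max(per_chronon)))
    return result

# hypothetical intervals on a month axis, two half-year partitions
intervals = [(1, 4), (3, 6), (7, 9), (7, 11), (10, 12)]
partitions = [(1, 6), (7, 12)]
print(sta_count(intervals, partitions))      # [((1, 6), 2), ((7, 12), 3)]
print(tat_max_count(intervals, partitions))  # [((1, 6), 2), ((7, 12), 2)]
```

As in the figure's narrative, three intervals overlap the partition [7, 12], so STA counts 3, whereas the per-chronon maximum used by max-count never exceeds 2.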
The earlier mentioned, but so far not further discussed, GTA is a generalized framework for temporal aggregation accommodating ITA, MWTA, and STA, as well as partly TAT. Generally, the framework allows the specification of any kind of partition over the time axis. In addition, it is possible to define mapping functions in order to manipulate the instances of the partition. The framework covers TAT only partly because it only allows the definition of one aggregation function. Nevertheless, considering GTA, several challenges have not been solved. In addition, GTA is a theoretical definition which "offers a uniform way of expressing concisely the various forms of temporal aggregation" and "does not imply an efficient implementation" (Böhlen et al. 2008).
Temporal aggregations are discussed several times within this book: section 2.2.1 introduces features which are required regarding temporal aggregation, and section 3.2.1 discusses the usage of temporal aggregators, as well as summarizability problems. Chapter 5 introduces a query language supporting the usage of temporal aggregations.
2.1.3 Temporal Models
In literature about time, various temporal models have been proposed to represent physical time. Generally, it can be stated that physical time can be modeled as discrete, dense, or continuous (Dyreson et al. 1994; Hudry 2004). In addition, literature introduces further aspects, namely linear, branching, or circular temporal models, as well as bounded or unbounded temporal models (Frühwirth 1996). Within this section, the different aspects of a model are introduced and discussed in matters of time interval data analysis. Also, the usage of a discrete, linear, bounded temporal model in the context of time interval data analyses is motivated. Figure 2.5 depicts the different temporal models which are introduced in detail in this section.
Figure 2.5 Overview of the different aspects of a temporal model.
Discrete, Dense, and Continuous Temporal Models
A discrete time implies that a point in time can be represented by an integer (i.e., time is isomorphic to the natural numbers). If a dense or continuous temporal model is used, it infers that another time point exists between any two 'unequal' time points (i.e., time is isomorphic to the rational or real numbers)5. To understand the impact of the decision of which temporal model to use, it is necessary to understand the main differences between the models in the context of analyzing time interval data. Because of the isomorphic behavior of dense and continuous temporal models and the fields of application concerning dense temporal models (i.e., mainly model checking), the following discussion considers the usage of a discrete or continuous temporal model, whereby dense temporal models are – regarding the argumentation – 'covered' by the latter.
5 As stated by Hudry (2004), a dense temporal model is isomorphic to the rational numbers, whereas a continuous temporal model is isomorphic to the real numbers. In the context of analytics this differentiation is not important and is therefore not further mentioned.
To illustrate the differences between the temporal models, the apple falling from tree example (cf. Figure 2.1) is used. Applying a discrete temporal model to the example would let the apple 'fall in steps', i.e., at each discrete time point the apple would have a different falling distance, i.e., a different value of the fall property. The model would not clarify the apple's position 'in between' two directly successive time points, because in a discrete temporal model nothing exists between two directly following time points. Thus, within a discrete temporal model the falling distance of the apple would be specified for each discrete time point of the interval (e.g., at tend the apple's falling distance is 1.00 m). Furthermore, it would be possible to calculate an indivisible delta, which would be specified by the absolute value of the difference of the falling distances of two directly successive time points. Using a continuous temporal model, the falling distance would be specified for every moment t (using ½ · g · (t – tstart)²). A delta between two time points can still be calculated, but within such a model the delta is not indivisible. Figure 2.6 illustrates the falling distance in a discrete and continuous temporal model and shows the indivisible delta calculated for the discrete case (triangles).
Figure 2.6 The fall property using a discrete (left) and continuous (right) temporal model. Within the discrete chart, the diamonds mark the value of the property and the triangles illustrate the indivisible delta between the previous and the current time point.
Regarding the apple falling from tree example, it may be intuitive to say that the information available when using the continuous temporal model is more precise. Nevertheless, looking at the machine performance example and the items property, this intuition may be different. Figure 2.7 shows the results recorded from an employee who checked the amount of created items every 15 minutes, using both the discrete and the continuous temporal model.
Figure 2.7: The item property using a discrete (left) and continuous (right) temporal model. Within the discrete chart, the diamonds mark the value of the item property and the triangles illustrate the indivisible delta between the previous and the current time point.
In this example, the information provided by the continuous model is too precise. Depending on the used function (e.g., if interpolation is used) it may even be invalid6. From an analytical point of view, one may argue that: 'as long as the granularity of a discrete time-axis is selected correctly, the discrete temporal model is at least as good as the continuous one'. In addition, it has to be considered that data is typically collected by sensors (using a discrete sampling rate). Thus, the measured data is discrete and the use of a continuous model is unnecessary. It should also be mentioned that a continuous property (e.g., a value based on a mathematical function) can be easily transformed into a discrete property using discretization techniques (Liu et al. 2002). Another aspect that should be considered when reaching a decision regarding a temporal model is the context. State of the art indicates that analyses dealing with temporal data are mostly based on discrete temporal models (cf. section 3.2).
6 Figure 2.7 allows for the conclusion that the value at t = ½ · (t1 + t2) is 0.5. Such an invalid value can be avoided by using a piecewise-defined continuous function. Nevertheless, from a domain-specific point of view, the correctness of the value is still not guaranteed because the employee did not check the amount at every time point.
As a result of these conclusions, the temporal model used within this book is discrete. Thus, the time axis consists of a finite number of chronons (i.e., "a nondecomposable [indivisible, remark of author] time interval of some fixed, minimal duration" (Dyreson et al. 1994, p. 55)).
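The mapping of time points onto such a chronon axis can be sketched as follows (the one-minute chronon, the origin, and the function name are assumptions for illustration):

```python
from datetime import datetime, timezone

CHRONON_SECONDS = 60  # assumed minimal granularity: one chronon = one minute

def to_chronon(ts: datetime, origin: datetime) -> int:
    """Map a time point onto a discrete, bounded time axis of fixed-duration
    chronons, counted from the axis origin (the smallest time point)."""
    return int((ts - origin).total_seconds()) // CHRONON_SECONDS

origin = datetime(2015, 1, 1, tzinfo=timezone.utc)
t = datetime(2015, 1, 1, 9, 45, 12, tzinfo=timezone.utc)
print(to_chronon(t, origin))  # 585 -- 9 h 45 min after the origin, seconds truncated
```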
Linear, Branching, and Circular Temporal Models
Another aspect of temporal models addresses the future. Within a linear temporal model only one future is assumed, whereas a branching temporal model allows the existence of one or multiple futures (paths). Moreover, a circular temporal model defines the future to be recurring. In the majority of cases regarding temporal data analysis, a linear temporal model is used. This is plausible because of the temporal concepts and operators mostly used within the field. If a branching or circular temporal model is utilized, simple concepts like before or after may be difficult to apply7. Thus, within this book a linear temporal model is assumed.
It should be mentioned that most data based on a circular temporal model can be pre-processed to fit a linear temporal model. If, e.g., data is retrieved from a simulation which is based on a circular temporal model, it is necessary to 'roll out' the circular time, i.e., map time intervals of the circular time to time intervals of the linear time, as indicated in Figure 2.8. The figure depicts a circular temporal model of a week and data generated in five iterations. The applied mapping links each circular week (i.e., each week of each iteration) to a week of the linear time.
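Such a roll-out can be sketched in a few lines (assuming, for illustration, a circular week modeled in hour chronons; the names are hypothetical):

```python
WEEK_CHRONONS = 7 * 24  # assumed: a circular week modeled in hour chronons 0..167

def roll_out(iteration: int, circular_start: int, circular_end: int):
    """Map an interval of a circular week model to the linear time axis,
    placing iteration i into the i-th linear week."""
    offset = iteration * WEEK_CHRONONS
    return (offset + circular_start, offset + circular_end)

# the same circular interval observed in five iterations lands in five linear weeks
print([roll_out(i, 10, 20) for i in range(5)])
# [(10, 20), (178, 188), (346, 356), (514, 524), (682, 692)]
```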
7 For discussions within other research areas the interested reader is referred to Alur, Henzinger (1992), Frühwirth (1996), Hudry (2004), and Ossimitz, Mrotzek (2008).
Figure 2.8: Example of a mapping between data of a circular temporal model to a linear temporal model.
Bounded and Unbounded Temporal Models
The discussion about bounded or unbounded temporal models is, in the context of data analysis, more or less philosophical. A bounded temporal model is a model which has a defined start (i.e., a smallest time point) and a defined end (i.e., a greatest time point). Within an unbounded temporal model, infinite time points are allowed, i.e., the interval [01.01.2015 09:00, ∞] is infinite considering its end. If data from an unbounded temporal model should be analyzed, it implies that there is no beginning or ending of time, i.e., there is always an earlier or later time point. Analyzing data within such a model would mean that unlimited data is available (i.e., defined by a discrete or continuous function); if not, the limited data can be analyzed within a bounded temporal model by using the minimal and maximal time point of the limited data as boundaries. Nevertheless, unlimited data which is, e.g., defined by a recursively defined discrete function, could be analyzed within a time window which defines the boundaries used for the bounded temporal model (as illustrated in Figure 2.9).
Figure 2.9: Selection of a time window from an unbounded temporal model to be presented and analyzable in a bounded temporal model.
Taking into consideration the above-mentioned findings, a bounded
temporal model is used within this book.
2.1.4 Temporal Operators
A temporal operator for time intervals expresses the relation between, typically but not exclusively, two intervals. Within the last decades, several temporal operators were defined (cf. Moerchen (2009) for an extensive overview). In the majority of cases, the temporal operators of Allen (1983) are used. The primary reason for this is that the list of 13 defined operators is complete regarding possible combinations. Figure 2.10 depicts the defined operators.
Figure 2.10: Overview of Allen’s (1983) temporal operators.
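For illustration, the seven base relations (before, meets, overlaps, starts, during, finishes, and equals; the remaining six are their inverses) can be sketched for closed, point-based intervals (a hypothetical helper, not from the book):

```python
def allen_relation(a, b):
    """Determine the Allen (1983) relation between two closed intervals
    a = (a_s, a_e) and b = (b_s, b_e); only the seven base relations are
    named, the six inverse relations are reported as 'inverse'."""
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if a_e == b_s:
        return "meets"
    if (a_s, a_e) == (b_s, b_e):
        return "equals"
    if a_s == b_s and a_e < b_e:
        return "starts"
    if a_e == b_e and a_s > b_s:
        return "finishes"
    if a_s > b_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_s < a_e < b_e:
        return "overlaps"
    return "inverse"  # b relates to a by one of the base relations above

print(allen_relation((1, 3), (5, 8)))  # before
print(allen_relation((1, 5), (5, 8)))  # meets
print(allen_relation((1, 6), (5, 8)))  # overlaps
```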
Nevertheless, Moerchen (2009) states that Allen's operators are not robust considering small changes and are ambiguous regarding one's intuition. The first point can be ignored if exact boundaries are requested. The latter point, however, refers to the problem that the size of overlaps or gaps is not taken into account by Allen's relations. Figure 2.11 illustrates the concerns mentioned by Moerchen. The relation between the intervals A and B is considered to be equal to the one between C and D (both overlap). The same problem can be observed by looking at the relation between the intervals E and F and the relation between G and H, which are both considered to be equal.
Figure 2.11: Illustration of the ambiguousness of Allen's (1983) temporal operators.
As already mentioned, several other temporal operators were published over the last decades. These other approaches mainly focus on overcoming the problems of Allen's definition regarding robustness and ambiguousness. Some try to achieve that by adding additional relations (e.g., Roddick, Mooney (2005), who define a total of 49 relations, of which nine are different types of overlaps), others split intervals to generate partial relations (cf. Moerchen (2006a); Moerchen, Fradkin (2010); Peter, Höppner (2010)). Despite the doubts mentioned by Moerchen, this book uses the temporal operators of Allen, if not stated differently. If needed, additional precautions are introduced to overcome the mentioned problems (e.g., the distance measure used to find similar time interval datasets introduced in chapter 6 utilizes the coverage ratio or spacing).
2.1.5 Temporal Concepts
Temporal concepts are used to define semantic categories for arrangements of temporal operators (Moerchen 2009). Several temporal concepts like past, present, or future, as well as order (i.e., before or after), duration, concurrency, coincidence, or synchronicity are commonly known and often used in natural language (cf. Moerchen (2006b), Kranjec, Chatterjee (2010)). In the context of time interval data analysis, and especially in the field of knowledge discovery (i.e., data mining) or, even more specific, in the field of temporal pattern mining, temporal concepts are often used to explain or classify patterns found within a time interval dataset. For example, the frequent occurrence of five periodically arranged time intervals may indicate an interesting observation. Nevertheless, searching for interesting and infrequent patterns may also be of interest, regarding coincidences or abnormal situations. A detailed discussion regarding temporal pattern mining as a part of time interval data analysis is provided in sections 2.2.1 and 3.2.2. However, within this book commonly known temporal concepts, as exemplarily depicted in Figure 2.12, are used to express temporal arrangements of temporal operators.
Figure 2.12: Examples of commonly used temporal concepts.
2.1.6 Special Characteristics of Time
In this section, several characteristics of time are introduced which have to be handled with special care with regard to time interval data analysis. Depending on the context of the analysis, some characteristics may be irrelevant. Thus, it is advisable to validate the impact of the characteristics within each analytical context. The introduced characteristics are: time zones, special days (like weekends, holidays, or vacation periods), leap seconds, leap years, absolute and relative time, as well as the general complexity of the time dimension.
Time Zones and the Coordinated Universal Time (UTC)
The world is divided into several time zones, each defined by the specification of an offset from the Coordinated Universal Time (UTC). When analyzing temporal data, the time zone information is of great importance to ensure the validity of the analytical results (cf. Kimball, Ross 2002, p. 240; Carmel 1999; Espinosa et al. 2007). Figure 2.13 illustrates an example which exemplifies the importance. The figure shows time interval data recorded within three time zones (i.e., UTC+1, UTC-8, and UTC-5). The example implies that data collected in the time zones UTC+1 and UTC-8 represent tasks performed at different airports. The interval shown within the UTC-5 time zone indicates an event having significant impact (e.g., 9/11, a stock market crash, or the moon landing). Analyzing the pictured scenario without taking the time zones into consideration is possible and valid, e.g., if the dataset of one airport is analyzed separately from the other. To compare the work performance between the two airports (e.g., in the morning), it is necessary to analyze the time interval dataset using local times, ignoring any time zone information. If, on the other hand, the goal of the analysis is to determine the impact of the event which occurred within the UTC-5 time zone, it is necessary to perform the analysis using a normalized time (e.g., UTC).
Figure 2.13: Example of the impact of different time zones within the scope of temporal analytics.
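The two analysis goals translate into two different time handlings, sketched here with hypothetical task data (the dates, offsets, and variable names are illustrative only):

```python
from datetime import datetime, timezone, timedelta

# hypothetical task start times recorded as local wall-clock time per airport
tz_plus1 = timezone(timedelta(hours=1))    # UTC+1
tz_minus8 = timezone(timedelta(hours=-8))  # UTC-8

task_a = datetime(2015, 3, 2, 9, 0, tzinfo=tz_plus1)
task_b = datetime(2015, 3, 2, 9, 0, tzinfo=tz_minus8)

# comparing 'the morning' across airports: use the local wall-clock time
print(task_a.hour == task_b.hour)  # True -- both start at 09:00 local time

# relating both tasks to one global event: normalize to UTC first
print(task_a.astimezone(timezone.utc))  # 2015-03-02 08:00:00+00:00
print(task_b.astimezone(timezone.utc))  # 2015-03-02 17:00:00+00:00
```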
In order to meet the requirements, it is necessary for an information system and the underlying data model to understand the difference between normalized and local time, as well as the concept of time zones. The impact of time zones is addressed in sections 4.1 (regarding the modeling of the time axis), 4.4 (with regard to different dimensional modeling), and 7.2.1 (concerning the implementation).
Daylight Saving Time (Summer Time)
Changing the time during summer to increase the duration of daylight into the evening is a common practice in several countries. Nowadays, there are ongoing discussions about whether this practice is still meaningful, and a minority of countries decided to abandon daylight saving time (DST). Nevertheless, from an analytical point of view, DST is a difficulty which has to be considered and managed (cf. Celko 2006, pp. 26–27). The main issues while dealing with temporal data and DST occur during two days a year (i.e., one when the time must be adjusted back one hour, the other when it is moved forward). These days have 23 or 25 hours, which makes it difficult to compare them to any others. The problem can be exemplified by assuming a company utilizing an app to measure the employees' performed tasks during a day. Analyzing the average amount of performed tasks within an hour may lead to false results and therefore to erroneous decisions. Figure 2.14 illustrates the problem regarding DST and statistical values. Calculating the amount of time intervals between 03:00:00 and 04:00:00 results in 1 for the default (DEF), 2 for the forward (DST), and 0 for the backward case (DST).
Figure 2.14: Illustration exemplifying the error of calculating statistical values, e.g., the number of intervals per hour.
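The 23- and 25-hour days can be reproduced with standard library means (a sketch using Python's zoneinfo; the Europe/Berlin zone and the 2015 transition dates are merely illustrative):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")

def elapsed_hours(year, month, day):
    """Real (elapsed) duration of a local calendar day in hours."""
    start = datetime(year, month, day, tzinfo=berlin)
    end = start + timedelta(days=1)  # wall-clock "+1 day": next local midnight
    # Convert to UTC before subtracting so the DST shift is accounted for.
    return (end.astimezone(timezone.utc)
            - start.astimezone(timezone.utc)).total_seconds() / 3600

assert elapsed_hours(2015, 3, 28) == 24.0   # default day (DEF)
assert elapsed_hours(2015, 3, 29) == 23.0   # clocks set forward: hour skipped
assert elapsed_hours(2015, 10, 25) == 25.0  # clocks set back: hour repeated
```

Any hourly statistic computed naively over such a day compares 23 or 25 real hours against the usual 24.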
In general, several other statistical measures (depending on the con-
text) may be affected by DST, e.g., in the context of work time management:
the daily performance, workload, or throughput. In addition, similarity
measures (e.g., for searching for similar days) which do not consider DST
may provide incorrect matches. A further discussion on how to analyze
days with DST is presented in chapter 6 and section 7.3.4.
Weekends, Holidays, Vacation Periods and Special Days
Depending on the context of the analysis, weekends, holidays, vacation
periods, and context-specific special days may be of importance to
understand specific observations, patterns, or anomalies. As already mentioned
in the case of time zones, an event like a holiday or the beginning or ending
of a vacation period can have a significant impact. For example, a travel
agency's number of customers, and therefore the number and duration of
consultations, may increase during such periods. Analyzing the workload
without considering vacation periods may lead to invalid conclusions.
Patterns searched for across days may differ meaningfully between holidays,
weekends, and working days.
Supporting different types of days8 is an important feature when analyzing
time interval data (cf. Kimball, Ross 2002, pp. 38–41). The need for or
importance of this additional information in the context of time interval data
analysis may depend on the location the data is recorded at (e.g., a municipal
holiday) and/or the goal of the analysis (e.g., 9/11 may be an important
date for cause studies, cf. Figure 2.13). Some ideas on how to
handle this additional information are discussed in chapter 9.
Leap Seconds
Leap seconds are applied to UTC to keep it close to mean solar time;
without them, UTC would drift away (Whibberley et al. 2011).
Thus, a leap second is inserted whenever the International Earth Rotation
8 An aspect not discussed further is the detection of special days within a specific domain using, e.g., cluster or classification analysis. For further information, the reader may consider Grabbe et al. (2014), which applies clustering techniques to find related days based on weather information, and Christie (2003), which uses classification techniques to identify days with outlying performance, so-called major event days.
and Reference Systems Service (IERS) decides to apply one. In the
majority of cases, leap seconds are not relevant for analysis. However, Google
states in its blog post "Time, technology and leaping seconds" that "having
accurate time is critical to everything we do at Google". Furthermore,
Pascoe states that "keeping replicas of data up to date, correctly reporting
the order of searches and clicks, and determining which data-affecting
operation came last are all examples of why accurate time is crucial to our
products and to our ability to keep your data safe" (Pascoe 2011). To
achieve that, Google introduced the concept of the leap smear. The idea
behind a leap smear is to spread the additional (or removed) second over a
specific time window (e.g., the last minute before midnight) instead of
repeating or skipping a second. It was mainly introduced so that developers
and engineers can rely on the system time without considering leap seconds
at all. Common operating systems and programming languages do not
support leap seconds, i.e., the clock or internal counter neither displays nor
handles them. Instead, the second is added by counting the last second of
the minute for which the leap second is scheduled twice.
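The arithmetic behind a leap smear can be illustrated as follows (a simplified linear smear over a hypothetical 1000-second window; Google's actual implementation differs, e.g., in the length and placement of the window):

```python
def smeared_fraction(seconds_to_midnight: float, window: float = 1000.0) -> float:
    """Fraction of the leap second already absorbed by a linear smear.

    Instead of counting 23:59:60, each second within the smear window is
    stretched slightly, so the extra second accumulates gradually.
    """
    if seconds_to_midnight >= window:
        return 0.0
    return (window - seconds_to_midnight) / window

assert smeared_fraction(2000.0) == 0.0  # smear not yet started
assert smeared_fraction(500.0) == 0.5   # halfway: half a second absorbed
assert smeared_fraction(0.0) == 1.0     # at midnight the full second is absorbed
```

Within the window, the smeared clock deviates from UTC by at most one second, which bounds the error of any second-level statistic.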
In summary, leap seconds may influence the results of temporal analytics.
This is the case if the selected time granularity is in the range of seconds
or below and the operating system handles leap seconds by counting the
last second twice. If the leap smear concept is applied or specialized time
protocols (e.g., the Precision Time Protocol) are used, leap seconds should
not lead to any problems, although statistical calculations may still be off
by up to one second. Within this book, the handling of leap seconds in
association with the introduced information system is discussed in section 4.1.
Leap Years
The Gregorian calendar differentiates between common years and leap
years. The former has 365 days, whereas the latter has 366 days (adding
the 29th of February, namely the leap day). Depending on the level of
aggregation used when analyzing temporal data, the existence of a leap day
within a year may or may not invalidate the results. For instance, statistical
measures aggregated at the year level (e.g., sum or count) are not comparable
between a leap year and a common year. One solution to this problem is to
use relative values (e.g., the mean or median) or to compare at a valid level
(e.g., by comparing sorted sets ignoring the additional day). In this book,
the handling of leap years is discussed in chapter 6 and section 7.3.4.
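The effect on year-level aggregates can be illustrated as follows (a sketch with fabricated, uniform daily counts; yearly_summary is a hypothetical helper, not part of the introduced system):

```python
import calendar
from statistics import mean

def yearly_summary(daily_counts, year):
    """Absolute (sum) and relative (mean) aggregates at the year level."""
    assert len(daily_counts) == (366 if calendar.isleap(year) else 365)
    return {"sum": sum(daily_counts), "mean": mean(daily_counts)}

# Fabricated, uniform daily task counts for a common and a leap year.
common = yearly_summary([10] * 365, 2015)
leap = yearly_summary([10] * 366, 2016)

assert leap["sum"] != common["sum"]    # absolute totals are not comparable
assert leap["mean"] == common["mean"]  # relative measures remain comparable
```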
Absolute vs. Relative Time
Time dependent data can be collected in an absolute or a relative manner.
In general, an absolute time interval consists of two time points each spec-
ified by date, time, and time zone9. Contrary to this, a relative time interval
consists of two time points, each typically specified by an integer or a
floating point number. Relative time interval data is thus mostly found in
scenarios in which the absolute time is irrelevant, e.g., when comparing time
interval data collected from several process runs, each starting at a
normalized moment in time such as 0. Most research in the field of data
mining assumes relative time interval data for its pattern mining algorithms.
Nevertheless, in the context of on-line analytical processing (OLAP) and
mining (OLAM), both of which consider the existence of dimensions,
absolute time interval datasets are mostly used. Thus, an information system
has to be capable of handling relative as well as absolute time interval data
(cf. section 4.1).
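The two variants can be captured by a single record type (a minimal sketch; the Interval class is a hypothetical illustration, not the model introduced in chapter 4):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union

@dataclass
class Interval:
    """A time interval; endpoints are absolute (datetime) or relative (float)."""
    start: Union[datetime, float]
    end: Union[datetime, float]

    @property
    def is_absolute(self) -> bool:
        return isinstance(self.start, datetime)

# Absolute: a task recorded with wall-clock timestamps.
absolute = Interval(datetime(2015, 1, 12, 8, 0), datetime(2015, 1, 12, 9, 30))
# Relative: the same task measured in minutes from a normalized start (t = 0).
relative = Interval(0.0, 90.0)

assert absolute.is_absolute and not relative.is_absolute
```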
Complexity of Time Dimension
The time dimension is an important and probably the most frequently used
dimension within multidimensional models (cf. Kimball, Ross 2002, pp. 38–
41). Considering OLAP and temporal data, aggregating data along the time
9 The time zone information is often omitted because the system’s local time zone is expected to be implicitly used.
dimension is one of the predominant operations (Agarwal et al. 1996;
Chaudhuri, Dayal 1997; Zhang et al. 2001), e.g., analyzing the different
months, detecting anomalies, and understanding their causes by looking
at the days of a month. In the field of temporal pattern mining, the different
levels of the time dimension are often used to specify time dependent filters
or ranges, e.g., to detect frequent patterns occurring on Mondays. Using
the time dimension in the context of analytics reveals several problems.
One of the problems to deal with is the fact that a calendar week fits
neatly into neither a month nor a year. Thus, a time hierarchy like day →
calendar week → month → year entails summarizability and comparison
problems (Hutchison et al. 2006; Mansmann, Scholl 2006; Mazón et al.
2008). Solving this problem, or at least revealing it to the querying user, is
an important aspect of ensuring the correct usage of provided results. In
section 3.2.1, several solutions on a conceptual or logical level are presented.
In section 4.4, the modeling of the time dimension for an information system
for time interval data analysis is introduced and the handling of the
mentioned problem is discussed further.
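The mismatch can be demonstrated with the ISO calendar (a short Python sketch using the standard library):

```python
from datetime import date

# ISO week 53 of 2015 contains days of two different calendar years.
d1 = date(2015, 12, 31)  # a Thursday
d2 = date(2016, 1, 1)    # a Friday
assert tuple(d1.isocalendar())[:2] == (2015, 53)
assert tuple(d2.isocalendar())[:2] == (2015, 53)
assert d1.year != d2.year  # same calendar week, different years
```

A roll-up day → calendar week → year therefore assigns these days differently than a roll-up day → month → year, which is precisely the summarizability problem mentioned above.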
Another problem when dealing with the time dimension is the already
mentioned variety of additional information attached to a member. A day
may be, e.g., a global or municipal holiday, a memorial day, or a special
event like tax day or 9/11 (cf. Weekends, Holidays, Vacation Periods and
Special Days). Considering the time dimension, such additional information
may be used to define special hierarchies (e.g., days may be rolled-up to
a level containing members like none, municipal, national, and international
holiday). Special time hierarchies are discussed in section 4.4.
2.2 Features of Time Interval Data Analysis Information System
As noted in the introduction of this chapter, several workshops with
analysts from different domains were organized to address the issues
occurring when analyzing time interval data. The first workshop "Business
Intelligence: How do you use your temporal data?" was held with 64
international companies (mainly from the aviation industry, logistics providers,
and ground-handling service providers) during the "Inform Users Conference
2012". Additional workshops were organized during the following years,
aiming to reveal further insights, understand specific problems (e.g., those
occurring when using proprietary software products), or specify requirements
(e.g., regarding the query language or special visualizations). The number of
participants varied according to the purpose of the workshop and was
distributed among a number of different sectors, i.e., aviation, logistics,
ground-handling, call centers, hospitals, temporary employment, and
linguistics. Altogether, more than 20 workshops, organized as expert
discussions (i.e., between three and six experts from one or several companies),
as business user workshops (i.e., up to 10 managers and experts invited to
discuss expected results), or as part of a users’ conference (i.e., more than
20 experts), were held between 2012 and 2015.
The following sections present features derived from the results of the
workshops and complemented by an extended literature review. The
features are categorized into analytical features (section 2.2.1), features
defined along a time interval data analysis process (section 2.2.2), and
features associated with the user interface (UI) of an information system
for time interval data analysis (section 2.2.3). These features can also be
understood as functional requirements. Non-functional requirements (e.g.,
regarding performance or robustness) are not discussed in detail; instead,
relevant non-functional requirements are discussed and motivated implicitly
within the different sections and used to motivate specific implementation
strategies (i.e., authorization and authentication in section 5.1, indexing in
section 7.3.2, and caching in section 7.3.3).
2.2.1 Analytical Capabilities
In the field of data analysis, a distinction is made between different
analytical techniques. In general, techniques are categorized into descriptive
("What has happened"), predictive ("What could happen"), and prescriptive
("What should happen") analytics (IBM Corporation 2013). One goal of
the workshops was to determine which techniques must be supported and
how that support may be realized, by specifying desired features. The
results indicate that, regarding the analysis of time interval data, a demand
for all three categories exists. Nevertheless, none of the categories is
currently covered satisfactorily by any available information system, and
the importance differs between the three categories.
Descriptive Analytics
The results of the workshops indicate that the need for descriptive analytics
is very high. Experts stated that "understanding the current situation and
past observations", as well as "being able to determine causes for
anomalies", are important first tasks. The feature requests assigned to the
category of descriptive analytics are listed in Table 2.1.
Table 2.1: Overview of the features requested in the category descriptive analytics.
DA-01 (critical): As an analyst, I want to aggregate the time interval data along the time axis, using different aggregation methods (must: SUM, COUNT, MAX, MIN, MEAN; should: MEDIAN; can: MODE). The aggregation must be correct considering summarizability.

DA-02 (high): As an analyst, I want to be able to use temporal aggregation methods along the time axis (must: COUNT STARTED, COUNT FINISHED).

DA-03 (high): As an analyst, I want to be able to retrieve the raw time interval data within a specified time window (i.e., by using a query language). In addition, it should be possible to specify the temporal operator defining the relation between the interval to be retrieved and the time window (e.g., retrieve all intervals equal to the specified time window).

DA-04 (critical): As an analyst, I want to roll up and drill down the time dimension. The levels of the different time hierarchies should support the definition of buckets for lower granularities (i.e., minutes and seconds).

DA-05 (critical): As an analyst, I want to specify dimensions for the different properties associated with the time interval. Furthermore, I want to use these dimensions to generalize or specialize the result.

DA-06 (medium): As an analyst, I want to analyze data from different time zones. More specifically, I want to be able to analyze data from different time zones using local time zones, as well as a generalized time zone like UTC.

DA-07 (medium): As an analyst, I want to be able to compare, e.g., hours, days, or weeks. In addition, I should be capable of searching for similar situations by selecting a template, e.g., an hour, day, or week.

DA-08 (critical): As an analyst, I want the system to provide a query language to retrieve analytical results (i.e., time series, mining results).
Figure 2.15 exemplifies selected features, i.e., DA-01 (aggregate),
DA-03 (select records), DA-04 (roll-up & drill-down of the time dimension),
and DA-05 (roll-up to department & drill-down to work area). The raw
intervals (top left, DA-03) are aggregated by applying a count aggregation
at the lowest granularity (top middle, DA-01). The roll-up and drill-down
operations are applied in the illustrations in the lower part of the figure
(DA-04). The realization of these features is addressed in the context of
modeling the time axis (cf. section 4.1) and dimensional modeling (cf.
section 4.4). In addition, solutions for overcoming the summarizability
problems occurring while realizing these features10 are presented in section 7.3.4.
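The count aggregation at a given granularity can be sketched as follows (an illustrative Python implementation over fabricated intervals; count_per_hour is a hypothetical helper, not the system's implementation):

```python
from datetime import datetime, timedelta

# Fabricated raw intervals (cf. DA-03): two tasks on the same day.
intervals = [
    (datetime(2015, 1, 12, 8, 15), datetime(2015, 1, 12, 10, 30)),
    (datetime(2015, 1, 12, 9, 0),  datetime(2015, 1, 12, 9, 45)),
]

def count_per_hour(intervals, day_start, hours=24):
    """COUNT aggregation: number of intervals overlapping each hour granule."""
    counts = []
    for h in range(hours):
        lo = day_start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        counts.append(sum(1 for s, e in intervals if s < hi and e > lo))
    return counts

series = count_per_hour(intervals, datetime(2015, 1, 12))
assert series[8] == 1   # 08:00-09:00: only the first task runs
assert series[9] == 2   # 09:00-10:00: both tasks overlap this hour
assert series[10] == 1  # 10:00-11:00: the first task is still running
```

A single interval contributes to several granules, i.e., a many-to-many relationship between intervals and granules, which is the source of the summarizability problems mentioned above.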
10 The problems occur when using available proprietary software (cf. Mazón et al. (2008)) or algorithms presented in the field of temporal databases (cf. section 2.1.2). Lately, several proprietary tools like icCube, Microsoft Analysis Services, or IBM Cognos have presented features to support many-to-many relationships (cf. Russo, Ferrari (2011)). Nevertheless, as discussed in section 3.2.1, these solutions cannot be applied satisfactorily in the context of time interval data.

Figure 2.15: Overview of selected features defined in the category descriptive analytics in the context of time interval data analysis (cf. Table 2.1).

At this point, the features DA-02, DA-06, DA-07, and DA-08 are not
presented in the figure. A detailed introduction to these features is given in
the relevant sections, which introduce a concrete solution, several examples,
as well as modeling, definition, and implementation aspects, i.e., section
7.3.4 (DA-02), section 4.4 (DA-06), chapter 6 (DA-07), and section
5.3.3 (DA-08).
Predictive Analytics
In the case of predictive analytics, the workshops have shown that the
need is not rated as high as for descriptive analytics. One of the reasons
stated by experts is the assumption that without appropriate descriptive
analysis tools, features regarding predictive or prescriptive analysis are
difficult to formulate. Another reason indicated by experts may be the
availability of appropriate proprietary software. For example, in the case of
workforce management, several software products are available, e.g., for
creating rule-based rosters or simulating defined scenarios. The issues
arising when using these tools are the definition of the rule sets or the
scenario’s parameters. To formulate such a rule set or determine the
parameters, a better understanding of current and past situations is required,
which supports the necessity of descriptive analytics. Nevertheless, some
aspects of predictive analytics were classified as meaningful; they are
summarized in Table 2.2.
Table 2.2: Overview of the features requested in the category predictive analytics.
PD-01 (medium): As a manager/supervisor, I want to be able to observe specified measures and be alerted if a defined threshold may be reached in the near future.

PD-02 (low): As an analyst, I want to be able to find patterns or rules within a time interval dataset. Thus, it is necessary to specify the scope of the mining (e.g., just Mondays or holidays). In addition, it is of interest to validate whether a pattern found within Mondays can also be found within other sets, e.g., Tuesdays, weekdays, or the days of July.
Prescriptive Analytics
The aim of prescriptive analytics is to optimize upcoming situations by
knowing what should ideally happen and rating different outcomes. The
arguments mentioned in the case of predictive analytics apply in the case
of prescriptive analytics as well. There are several tools used by data
scientists enabling prescriptive analytics; however, accessing time interval
data with them is quite difficult. Thus, an information system, as introduced
in this book, is needed to provide easy access and to help analyze data in
a descriptive way prior to any prescriptive analysis. Regarding the results
of the workshops, the requests expressed in the field of predictive analytics
overlap with the ones of prescriptive analytics. Table 2.3 shows a concise
summary of the mostly openly formulated feature requests.
Table 2.3: Overview of the features requested in the category prescriptive analytics.
PR-01 (low): As a manager, I want the system to be able to predict upcoming situations (e.g., staff shortages) and provide solutions to the responsible dispatcher.

PR-02 (low): As an analyst, I want the system to be usable with other tools useful for prescriptive analytics (e.g., R11, Apache Spark12, or Watson Analytics13).
2.2.2 Time Interval Data Analysis Process
Another purpose of the workshops was the determination of a generalized
process for time interval data analysis, applicable to an information system.
11 http://www.oracle.com/technetwork/database/database-technologies/r
12 https://spark.apache.org
13 http://www.ibm.com/analytics/watson-analytics
In general, the process of data analysis14, also known as the data science
process, is defined by several iterative phases (Schutt, O'Neil 2014, pp. 41–
44). Figure 2.16 depicts this process.
Figure 2.16: The data science process following Schutt, O'Neil (2014).
The process starts with the "Raw Data Collection" step, which is
followed by the "Processed Data" step. Typically, data integration techniques
are used by an analyst to process the data into an organized form ready
for analysis. Nevertheless, the organized data may contain missing
information, invalid entries, or duplicates. Thus, a clean dataset is derived
during the second step by applying, e.g., data enrichment, outlier detection,
or plausibility check techniques. In order to obtain a clean dataset or
understand the data, it may be necessary to use exploratory data analysis
(EDA) techniques, which reveal further insights and clarify the validity.
Having a clean dataset and understanding it enables the analyst to detect,
e.g., relationships, patterns, or causalities ("Apply Models & Algorithms").
Models may be generated and applied during this step to simplify the
analysis. During the last steps, i.e., "Data Product" and "Communicate,
Visualize, Report", the results created (e.g., a model, a rule, or a cause) and
the insights gained are used by a data product (i.e., an application) to create
14 The process is comparable to the knowledge discovery in databases (KDD) process (Fayyad et al. 1996) or the more general visual analytics process (Keim 2010, pp. 10–11).
(automated) results (e.g., recommendations) or are presented to a decision
maker.
The data science process aims to encapsulate the tasks performed by
an analyst when analyzing any kind of data and is thus applicable to time
interval data analytics. Nevertheless, from an information system point of
view, the process is too generic and broad. Discussions during the different
workshops have shown that, from an analyst's point of view, several steps
should be redefined or narrowed. In addition, it was pointed out that an
information system may have to perform tasks automatically on each single
time interval data record pushed into the system (cf. feature request PD-01).
Figure 2.17 illustrates the time interval data analysis process based
on the results of the workshops. The figure differentiates between steps
which should be supported by an information system (colored boxes) and
steps performed by other systems, an analyst, or a user (white boxes).
Supporting describes the ability of the information system to perform the step
automatically (e.g., based on configuration or modeling). In contrast to the
data science process, the depicted time interval data process describes
the steps from an information system or data point of view instead of from
the perspective of an analyst. The analyst uses the information system to
query, interact with, or understand the time interval dataset and additionally
configures and models the system (which is a cross-sectional task and
therefore not illustrated).
Figure 2.17: The result of the workshops regarding the time interval data analysis process.
The process starts with the collection of time interval data from an
available and configured source. The collection might be a recurring task (i.e.,
load the data whenever new data is available) or a one-off task (i.e., load
the data once into the system to analyze the set). The information system
processes the incoming data using defined data integration techniques
(step: "Processed Data"). Within the next step, the processed data is
cleaned and a valid dataset is obtained (step: "Clean Dataset"). At this point,
the analyst is capable of interacting with the system, e.g., by firing queries
or using a provided UI, to perform hypothesis testing, validation, or
monitoring (step: "Retrieve, Visualize"). In addition, the analyst might retrieve
and visualize results created by defined exploratory data analysis tasks,
data mining algorithms, or machine learning concepts (step: "Apply
Algorithms & Models"). Depending on the configuration of the information
system, the defined algorithms and models are applied automatically to
determine whether an alert has to be generated (step: "Data Observer") or
to report results to a decision maker (step: "Communicate, Visualize, Report").
In the following, the requested features for the steps "Raw Time Interval
Dataset" (Data Linkage & Collection), "Processed Data" and "Clean Dataset"
(Data Integration & Cleansing), and "Apply Algorithms & Models"
(Application of Models & Algorithms) are introduced and discussed. Features
demanded in the context of visualization and interaction (i.e., the steps
"Retrieve, Visualize" and "Communicate, Visualize, Report") are presented
in section 2.2.3. Requirements considering the "Data Observer" step are
covered in section 2.2.1 (cf. Predictive and Prescriptive Analytics).
Data Linkage & Collection
An information system for time interval data analysis has to provide
interfaces enabling the loading of data into the system. During the first
development phases and workshops, several different ways of loading data
into the system were discussed. Furthermore, scalability and data integrity
were important topics when discussing data collection. Table 2.4 subsumes
the requested features.
Table 2.4: List of requested features for the information system considering data collection.
DC-01 (high): As a system provider, I want the system to support different data sources, e.g., databases (i.e., relational DBMS), files (i.e., CSV or XML), and streams (i.e., JSON). If a source is not supported, a simple application programming interface (API) must be available to enable me to add unsupported data sources.

DC-02 (critical): As an analyst, I want the provision of a Java Database Connectivity (JDBC) driver and a query language which allows the insertion and deletion of data. In addition, bulk loading operations should be supported.

DC-03 (high): As a system provider, I want to be able to specify pre-aggregates to be calculated by the system, to increase query performance.
Although the requested features are mostly self-explanatory, it should
be mentioned that their realization is presented and discussed further in
section 7.2.1 (DC-01), section 5.3.1 (DC-02), and section 7.3.4 (DC-03).
Data Integration & Cleansing
Whenever data is loaded into the information system, it is important that
the data is integrated and cleaned, so that invalid entries are detected,
missing data is enriched, and the internally needed data structure is
applied. The discussions considering data integration and cleansing were
diverse, especially regarding the question: "Which data integration
techniques must be provided by the system, and at which point should
dedicated data integration tools be applied as pre-processors?" Table 2.5
shows the results of the discussions and the additional feature requests
defined within the workshops.
Table 2.5: List of requested features for the information system considering data integration & cleansing.
DI-01 (critical): As an analyst, I want the system to be capable of handling complex data structures, in particular many-to-many relationships (cf. Kimball, Ross (2002), Mazón et al. (2008)).

DI-02 (high): As an analyst, I want to be able to validate the descriptive values (properties) associated with the time interval. Validation must ensure that the value is not empty (i.e., mark a property as required), that the value is allowed to be used (i.e., by providing a white-list), or how a new value is handled (i.e., add it, use null, or fail).

DI-03 (high): As an analyst, I want to be able to define how undefined intervals (i.e., intervals which have no start, no end, or neither defined) are handled. Typically, I should be able to pick one of the following strategies: use the time axis boundaries, use the (other) specified value (i.e., create a time point), or fail.

DI-04 (medium): As an analyst, I want to be able to write scripts applied to the raw data prior to any processing or cleansing. Thus, I am able to manipulate the incoming data without pre-processing it using integration tools.
The feature requests DI-02 and DI-03 cover strategies that are important
and often applied in the context of time interval data analysis. The specified
strategies are used to ensure data quality (by plausibility checks) or to offer
the possibility of enriching missing values. DI-04 is requested as a last
resort, i.e., the information system should offer a scripting interface useful
for implementing integration or cleansing techniques. This interface enables
an analyst to apply techniques prior to using additional data integration
tools. In addition, the interface might even be used to trigger a more complex
integration process defined with a proprietary integration tool (cf. Meisen
et al. (2012)).
The requirement formulated by feature request DI-01 addresses the
already mentioned summarizability problem, which occurs when using
many-to-many relationships and is introduced in detail in section 3.2.1.
Regarding the model introduced in chapter 4, the feature request DI-02
is partly covered by so-called mapping functions (cf. sections 4.1 and 4.2).
In addition, the final implementation provides additional strategies to fulfill
the request (cf. section 7.2.1).
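The DI-03 strategies can be sketched as follows (an illustrative Python function; the axis boundaries and the resolve helper are hypothetical, not the system's implementation):

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical time axis boundaries of the modeled system.
AXIS_START = datetime(2015, 1, 1)
AXIS_END = datetime(2015, 12, 31, 23, 59, 59)

def resolve(start: Optional[datetime], end: Optional[datetime],
            strategy: str) -> Tuple[datetime, datetime]:
    """Resolve an undefined (open) interval using one of the DI-03 strategies."""
    if strategy == "boundaries":  # use the time axis boundaries
        return (start or AXIS_START, end or AXIS_END)
    if strategy == "other":       # use the other value, i.e., create a time point
        if start is None and end is None:
            raise ValueError("both endpoints undefined")
        return (start or end, end or start)
    if strategy == "fail":
        if start is None or end is None:
            raise ValueError("undefined endpoint")
        return (start, end)
    raise ValueError(f"unknown strategy: {strategy}")

t = datetime(2015, 6, 1, 12, 0)
assert resolve(None, t, "boundaries") == (AXIS_START, t)
assert resolve(None, t, "other") == (t, t)  # collapsed to a time point
```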
Application of Models & Algorithms
The requested capabilities of the information system considering
descriptive, predictive, and prescriptive analytics are listed in section 2.2.1. In
addition, this section specifies architectural requirements to be met by the
system to support these analytical capabilities. The requested features are
listed in Table 2.6 and the implementation is introduced in section 7.1.
Table 2.6: The features required to support the application of models and analytical algorithms.
MA-01 (medium): As an analyst, I want to be able to apply models or algorithms to the data stream, i.e., I want to determine problems, generate alerts, report anomalies, or classify the current data.

MA-02 (medium): As an analyst, I want to be able to schedule analyses (e.g., daily) using the currently available data. Depending on the result of the analysis, I want to trigger an action (e.g., send an email).
2.2.3 User Interface, Visualization, and User Interactions
An important criterion regarding the user acceptance of a system is its
interface. The UI may be graphical (e.g., showing a graph) or a query
language. In general, the user needs capabilities to interact with the system,
so that a request can be specified or an alert understood. Table 2.7
shows the features relevant for the information system. Features dealing
with specific visualizations15 are not listed, because the development of
specific visualizations is not within the scope of this book. Nevertheless, the
interested reader is referred to section 3.2.3, which introduces current
state-of-the-art visualizations regarding time interval data and time series.
Ideas considering the usage of visual analytics techniques in the context of
time interval data analysis are discussed in section 7.4.
Table 2.7: Overview of the features requested for the UI, visualization, and user interaction.
VIS-01 (high): As an analyst, I want to be able to retrieve data from the information system using a JDBC driver to visualize the results, e.g., using a third-party business intelligence tool, a visualization, or another analytical framework. Thus, I implicitly request a query language useful to retrieve data as needed.

VIS-02 (medium): As an analyst, I want to be able to subscribe to the system’s alerts and analytical results. The system must publish the requested information to any subscribed instance.

VIS-03 (critical): As a system provider, I want to have a UI for user management (i.e., delete or add users, define roles, grant or revoke a permission).

VIS-04 (high): As an analyst, I want to have a minimal graphical user interface (GUI) useful for requesting and visualizing results (e.g., a time series, resulting datasets, or a Gantt chart).

VIS-05 (high): As a web developer, I want the system to provide web-friendly services, i.e., requesting and receiving data through a JSON interface.

15 E.g., a specific request for a line chart was to show the involved time intervals in a tooltip when hovering over a value.
2.3 Summary
Within this chapter, several important terms in the context of time in-
terval data analysis were introduced. In addition, features related to an in-
formation system supporting analytical tasks were presented. These fea-
tures are motivated by temporal aspects and characteristics of time
(e.g., temporal models, leap years, or time zones) and subsume the
results of several workshops and an extended literature review. Some
subordinate features mentioned during the workshops, such as
specific requirements regarding particular statements of the query language,
are not listed. Nevertheless, these feature requests are stated in the
upcoming chapters where relevant.
This chapter also provides the answer to the first RQ: "Which features
must be supported by an information system to enable time interval data
analysis?" An information system has to support the time characteristics,
as well as provide the specified features in a performant way. An evaluation
regarding the fulfillment of the features is presented in section 8.1. In addi-
tion, these features provide the basis for the other research questions. A
model for time interval data analysis (as mentioned in RQ2) is needed as
formal framework for such an information system. The need for a query
language (as addressed by RQ3) is explicitly or implicitly mentioned in sev-
eral features (e.g., DA-01, DA-02, DA-03, DA-08, PR-02, DC-02, or VIS-
01). The performance of an analytical information system is, even if not
explicitly mentioned, of importance and is the core issue of RQ4. The simi-
larity among different sets of time interval data is requested by feature DA-
07 and is the topic of RQ5. The architecture and configuration of an infor-
mation system are aspects to consider when realizing such a system. In
addition, the needed interfaces (e.g., JDBC, JSON, or visualization) for
time interval data and the results of analyses are addressed by, e.g., DC-01,
DC-02, VIS-01, VIS-04, and VIS-05. RQ6 subsumes the mentioned aspects
regarding the architecture, configuration, and interfaces.
3 State of the Art
Time interval data has been a focus of research over the past years and
decades. Several aspects dealing with (time) interval data have been ad-
dressed and are introduced in this chapter. As motivated in chapter 2, the
following research areas are of interest when implementing an information
system useful to analyze time interval data: concepts applied when creat-
ing analytical information systems (section 3.1), different approaches re-
garding the analysis of time interval data (section 3.2), query languages
used to answer analytical questions (section 3.4), and similarity measures
(section 3.5). In addition, the so far only peripherally mentioned perfor-
mance improvements (section 3.3) are an important research area regarding
the performance of query processing.
3.1 Analytical Information Systems
The term analytical information systems (AIS) is used in general as a "de-
scriptor for a broad set of information systems that assist managers in per-
forming analyses" (Power 2001), which is often used in conjunction with BI,
Decision Support Systems (DSS), Data Warehouses (DW), or OLAP (Stroh
et al. 2011; Teiken 2012, p. 7). In general, "analytics software encompasses
three main technologies: (1) database management, (2) mathematical and
statistical analysis and models, and (3) data visualization and display"
(Power 2012).
In science, the term AIS is used in different areas, e.g., in the field of
spatial data processing (e.g., Goodchild (1987) or Paramonov et al.
(2013)), regarding solutions for specific domains like power supply or
budget planning (e.g., Kamaev et al. (2014) or Rego et al. (2015)), or, gen-
erally, as already mentioned, as a synonym for DSS, BI, DW, or OLAP. Thus,
an AIS for a specific type of data is only considered in the field of spatial data
and geographic information systems (GIS). The architectures presented in
the different domain-specific or BI related solutions are based on several
components like databases, integration tools, a meta layer, data ware-
houses, and an application (Teiken 2012, pp. 8–15). A holistic solution en-
capsulating these different components to analyze specific data is not pre-
sented.
3.2 Analyzing Time Interval Data: Different Approaches
Within the field of data analysis several technologies, techniques, and
methodologies have been introduced. From an algorithmic point of view the
developed solutions can be categorized into statistical analysis (i.e., de-
fined by Dodge, Marriott (2006) as "the study of the collection, analysis,
interpretation, presentation and organization of data"), data mining (i.e.,
defined by Fayyad et al. (1996) as "a step in the KDD process that consists
of applying data analysis and discovery algorithms that produce a particu-
lar enumeration of patterns"), machine learning (i.e., defined by Arthur
Samuel in 1959 as " [the] field of study that gives computers the ability to
learn without being explicitly programmed"), and visual analytics (i.e., de-
fined by Thomas, Cook (2005, p. 4) as "the science of analytical reasoning
facilitated by interactive visual interfaces"). Within the context of AIS and
time interval data analysis, the following research topics are of special in-
terest16, i.e., OLAP (section 3.2.1) useful to perform hypothesis testing,
temporal pattern and association rule mining (section 3.2.2) suitable to find
patterns, and visual analytics (section 3.2.3) appropriate to enable the user
to visualize data and discover new insights by using innovative interaction
techniques. Other topics like, e. g., clustering, supervised learning, or re-
gression, known from machine learning or data mining, are not further dis-
cussed nor introduced17.
16 The fields were selected according to the formulated feature requests listed in section 2.2.
17 The information system provides an interface to apply models or algorithms as requested by feature MA-01 and MA-02 (cf. section 2.2.2). Thus, the algorithms or models are not in the focus and are assumed to be applied. Nevertheless, the information system may be used to create models or algorithms by providing data and deeper understandings.
3.2.1 On-Line Analytical Processing
For several years, business intelligence and analytical tools have been
used by managers and business analysts, inter alia, for data-driven deci-
sion support on an operational, tactical, and strategic level. An important
technology used within this field is OLAP, used especially for hypothesis
testing. OLAP enables the user to interact with the stored data by querying
for answers. This is achieved by selecting dimensions, applying different
operations to selections (e.g. roll-up, drill-down, or drill-across), or compar-
ing results. The heart of every OLAP system is a multidimensional data
model (MDM), which defines the different dimensions, hierarchies, levels,
and members (Codd et al. 1993). Recent research dealing with OLAP is
focused on: summarizability problems (Lenz, Shoshani 1997; Mazón et al.
2008, 2009; Niemi et al. 2014) and MDM (Kimball, Ross 2002; Chui et al.
2010; Koncilia et al. 2014; Meisen et al. 2014). In addition, different solu-
tions for specific scenarios were presented, e.g., in the context of big data,
Wang, Ye (2014) introduce an in-memory cluster computing environment
based on a key-value index, Mendoza et al. (2015) present new textual
measures useful to handle unstructured textual information with OLAP, and
Cuzzocrea (2011) proposes a framework to be used to estimate the result
of OLAP queries in uncertain and imprecise data. In the following, the most
relevant developments for the context of AIS and time interval data are pre-
sented, i.e., research addressing summarizability problems and MDM.
Summarizability Problems
In the field of OLAP, researchers discuss the importance of summarizabil-
ity, which "refers to the possibility of accurately computing aggregate val-
ues with a coarser level of detail from values with a finer level of detail"
(Mazón et al. 2008), and the problems occurring when violating it. In addi-
tion, summarizability is a necessary pre-condition for performance optimi-
zation using pre-aggregation techniques (Pedersen et al. 1999). The sum-
marizability problem addresses the issue of violating summarizability,
which is always the case if non-strict hierarchies are used within the multi-
dimensional model. Furthermore, summarizability problems may occur if
non-covering or non-onto hierarchies are defined, depending on the tech-
nique used to support this type of hierarchy within the logical model (cf.
Spaccapietra et al. 2009, p. 73). Figure 3.1 illustrates the different types of
hierarchies.
Figure 3.1: Examples of the different types of hierarchies (non-strict, non-cover-ing, and non-onto).
In general, the summarizability problem denotes the multiplication of a
fact if the fact is associated with multiple members of a higher level (as illus-
trated in Figure 3.1). The problem also arises if a member refers to several
members on a higher level. In both cases, the fact is multiplied within the
aggregation on the higher level. Considering time interval data, the problem
of many-to-many relationships is always present, because a fact of the in-
terval is associated with multiple members of the time dimension (i.e., all time
points the interval covers). Figure 3.2 shows two examples of the summa-
rizability problem. On the left side, the number of patients (fact) is associ-
ated with one or multiple diagnoses (cf. Pedersen (2000), Song et al.
(2001)). When selecting all patients, a non-aware system would return a
number of 29 patients (5 cancer, 12 stroke, and 12 cancer). On the right
side, an example of a time interval is illustrated. In that case, the resources
(fact) associated with the interval are counted multiple times, i.e., once for each
chronon covered by the interval.
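The right-hand example can be illustrated with a minimal sketch (hypothetical data; each interval carries a resources fact of 1, and a summarizability-unaware roll-up counts that fact once per covered chronon of the time dimension):

```python
# Sketch: how a summarizability-unaware aggregation multiplies facts.
# Each interval carries a fact (resources = 1) and covers several chronons.
intervals = [
    {"start": 0, "end": 3, "resources": 1},  # covers chronons 0..3
    {"start": 2, "end": 5, "resources": 1},  # covers chronons 2..5
]

# Naive roll-up: the fact is summed once per covered chronon, so it is
# multiplied by the interval's length.
naive_total = sum(
    iv["resources"] for iv in intervals for _ in range(iv["start"], iv["end"] + 1)
)

# Summarizability-aware roll-up: each interval's fact is counted exactly once.
correct_total = sum(iv["resources"] for iv in intervals)

print(naive_total, correct_total)  # 8 2
```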
Figure 3.2: Two examples of the summarizability problem.
Lately, several proprietary tools like icCube, Microsoft Analysis Ser-
vices, or IBM Cognos implemented the support for non-strict hierarchies
(Russo, Ferrari 2011). As mentioned by Meisen et al. (2014), the presented
implementations are not sufficient when using time interval data. Reasons
are:
– insufficient tooling support (i.e., inadequate lowest granularity and poor
query performance),
– expensive data integration processes (i.e., enormous redundant data
creation, costly discretization of intervals, and unmaintainable configu-
rations),
– non user-friendly query language (i.e., complex language structure
and unsupported temporal semantics), as well as
– inapplicable requirements (i.e., unsupported context specific aggrega-
tions and unsatisfying linkage between intervals and aggregated val-
ues).
Thus, some OLAP applications can interpret non-strict hierarchies and
overcome summarizability problems. However, in the context of time inter-
val data, these solutions are not applicable.
Multidimensional Models
An MDM defines the dimensions, hierarchies, levels, members, and facts
within data. Such a model enables the use of operations like roll-up, drill-
down, slicing, or dicing and facilitates rapid data access using relational
databases (ROLAP), multidimensional array structures (MOLAP), or a hy-
brid implementation (HOLAP). Typically, data integration techniques are
needed to map the raw data to a specified MDM. In addition, further meth-
ods, e.g., data cleansing, data enrichment, or aggregation, are applied
within the integration process to ensure data validity, completeness, and
quality (White 2005). In the field of OLAP, several systems capable of an-
alyzing sequences of data have been introduced over the last years. Chui
et al. (2010) introduced S-OLAP for analyzing sequence data. Liu et al.
(2011) analyzed event sequences using hierarchical patterns, enabling
OLAP on data streams of time point events. Bębel et al. (2012) presented
an OLAP like system enabling time point-based sequential data to be an-
alyzed. Nevertheless, these systems and their models neither support time
intervals, nor temporal operators. Recently, Koncilia et al. (2014) and
Meisen et al. (2014) presented a MDM focusing on time interval data anal-
ysis. Both claim to be the first to present such a model.
Koncilia et al. (2014) presented a system named I-OLAP, claiming to be
the first to propose a model for processing interval data. An interval is de-
fined as the gap between two events18. Furthermore, the introduced meta-
model consists of events, dimensions, hierarchies, members, intervals, se-
quences of intervals, and so-called I-Cubes. A definition of which types of
hierarchies are supported is not presented. Thus, the support of non-strict
hierarchies and how these would be handled is unclear. In addition, Kon-
cilia et al. assume that the intervals of a specific event-type (e.g., apple
falling) for a set of specific properties (e.g., color and weight) are non-over-
lapping and consecutive (i.e., form a non-overlapping sequence of inter-
vals). This assumption is valid in the specific case of event sequences.
Nevertheless, in the more general case of time interval datasets, the as-
sumption of Koncilia et al. is not valid. E.g., assuming a work-area with
18 A more detailed definition of the term event is presented in section 3.2.2.
several workers performing several tasks in parallel19 is one of many pos-
sible scenarios in which the assumption does not hold true.
To support the specific handling of facts and measures, Koncilia et al.
introduce two types of functions, i.e., compute value functions and fact cre-
ating functions, which are used to determine the measure between two consecu-
tive events (i.e., e1 and e2, with e1.t < e2.t, so that there is no other e with
e1.t < e.t < e2.t) for all chronons t which fulfill e1.t < t < e2.t. In addition, two
different aggregation techniques are presented, time point aggregation, as
well as aggregation along time. The former is used to calculate the aggre-
gated value for a specified time point (i.e., chronon) and the latter is used
to determine the aggregated value for a specified time range (cf. TAT intro-
duced in section 2.1.2).
Figure 3.3 illustrates an example supported by I-OLAP. The example
shows several values measured by a temperature sensor (i.e., 3, 4, 2, 5,
and 1; shown as dots). To determine the intervals between the gaps of the
events, the already mentioned compute value functions and fact creating
functions are applied. In the example shown in Figure 3.3 the average func-
tion is used to determine the value for each chronon (i.e.,
(e1.value + e2.value) · 0.5).
Figure 3.3: Illustration of a scenario covered by I-OLAP as presented by Koncilia et al. (2014).
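The computation behind Figure 3.3 can be sketched as follows (the event times are hypothetical; the average function (e1.value + e2.value) · 0.5 follows the example above):

```python
# Sketch of I-OLAP's compute value function idea: the value for every
# chronon between two consecutive events is derived from both events.
events = [(0, 3), (4, 4), (6, 2), (9, 5), (12, 1)]  # (chronon, measured value)

def interval_values(events):
    """Return a value per chronon for each gap between consecutive events."""
    values = {}
    for (t1, v1), (t2, v2) in zip(events, events[1:]):
        for t in range(t1 + 1, t2):        # chronons t with e1.t < t < e2.t
            values[t] = (v1 + v2) * 0.5    # average of the two events
    return values

print(interval_values(events))
```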
19 Analyzing several time interval data from service providers showed that even one worker performs several tasks simultaneously (e.g., check-in and customer service).
In summary, it can be stated that the model defined in the context of I-
OLAP:
– supports the TAT aggregation technique,
– can be used to define measures computed from events and intervals,
– is limited regarding the supported data, i.e., only sets of intervals over
sequential data are considered,
– does not define which types of hierarchies are supported (cf. sec-
tion 3.2.1: Summarizability Problems),
– does not introduce the handling of temporal aspects (cf. DA-02, DA-03,
DA-06, or DA-07),
– may not be capable of supporting all required aggregation methods, e.g.,
MEAN or MEDIAN (cf. DA-01), and
– cannot be applied to larger datasets in a performant way, i.e., the pre-
sented ideas and remarks suggest that the runtime is at least polyno-
mial in the number of intervals.
At the same time as the presentation of I-OLAP, Meisen et al. (2014)
presented the TIDAMODEL. The introduced model covers all types of hierar-
chies (i.e., non-strict, non-covering, and non-onto). In addition, a perfor-
mant implementation capable of overcoming summarizability problems is
outlined and further specified in Meisen et al. (2015b). In chapter 4 of this
book, the TIDAMODEL is introduced and discussed in detail. In addition, sev-
eral new aspects not addressed by Meisen et al. (2015b) are introduced
and aligned against the requests mentioned in section 2.2.
3.2.2 Temporal Pattern Mining & Association Rule Mining
Research in the field of data mining and in the context of time interval da-
tasets mainly focuses on temporal pattern mining and association rule mining
(Moerchen 2009; Papapetrou et al. 2009). The different mining algorithms
presented over the last years differ in the representation of time interval
data (i.e., the model used to represent the time intervals), the type of pat-
terns searched for (i.e., frequent, closed, or complete-set patterns, cf. Hu
et al. (2010), Chen et al. (2011)), the performance (i.e., number of data-
base scans needed), or constraints (i.e., applying specific constraints to
the patterns to find, cf. Laxman et al. (2007), Peter, Höppner (2010)). In
addition, other topics like clustering (Guyet, Quiniou 2008; Fricker et al.
2011), classification (Batal et al. 2011), or predictions are of interest to re-
search.
In this book, the primary focus is on the application of the algorithms
presented in the context of mining time interval datasets (cf. section 2.2).
Thus, the information system must be capable of providing the time intervals
in a way such that the mining algorithms can be applied. All mining techniques
regarding temporal sequential pattern mining or temporal association rule
mining in the context of time interval data are based on a definition provided
by Papapetrou et al. (2005). Papapetrou et al. were among the first to intro-
duce the problem of "discovering frequent arrangements of temporal inter-
vals". The problem stated by Papapetrou et al. is based on so-called e-se-
quences. An e-sequence is a (temporally) ordered set of events, whereby
an event is defined by a start value, an end value, as well as a label. In
addition, an e-sequence database is defined as a set of e-sequences. The
definition of an event given by Papapetrou et al. is close to the definition of
an interval outlined in section 2.1.1 and the formal definition presented in
section 4.3. In addition to the definition of Papapetrou et al., the definition
presented in this book allows the categorization of an event by multiple
properties20 (i.e., labels), as well as the assignment of facts (i.e., values
which can be aggregated).
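The definitions above can be sketched in a few lines (the names and example data are illustrative, not taken from the original paper):

```python
# Sketch of an e-sequence following Papapetrou et al.'s definition: a
# temporally ordered set of events, each with a start, an end, and a label.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    start: int
    end: int
    label: str

def make_e_sequence(events):
    """An e-sequence orders its events temporally (by start, then end)."""
    return sorted(events, key=lambda e: (e.start, e.end))

# An e-sequence database is defined as a set of e-sequences.
db = [
    make_e_sequence([Event(5, 9, "B"), Event(1, 4, "A")]),
    make_e_sequence([Event(2, 6, "A"), Event(3, 8, "C")]),
]
print(db[0][0].label)  # A
```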
In summary, the model of time interval data commonly used in
the field of pattern or association rule mining does not recognize any di-
mensional aspects. Nevertheless, regarding the increasing usage of di-
mensional information within the field of pattern mining - often referred to
20 One may argue that the support of a single label is sufficient. In the context of pattern min-
ing, multiple labels might be transformed to a concatenated single label. However, applying dimensional information within the mining process is not possible. Thus, the differentiation is mentioned at this point.
as on-line analytical mining (OLAM, cf. Han et al. (1999)) - it will only be a
matter of time until algorithms take hierarchies into account when search-
ing for patterns within time interval datasets.
3.2.3 Visual Analytics
The term visual analytics was coined by Wong, Thomas (2004).
In general, visual analytics has the purpose of analytical reasoning by us-
ing interactive visual interfaces (Thomas, Cook 2005). To create a good
interactive visual interface, Shneiderman (1996) stated that "a useful start-
ing point for designing advanced GUIs is the Visual Information-Seeking
Mantra: overview first, zoom and filter, then details on demand". In addition,
Shneiderman stated that a good visualization is task dependent. Thus, the
key task of an information system is to provide aggregated information in
real-time and requested filtered data on demand. To achieve that, a flexible
and performant data structure is necessary (cf. section 3.3). To enable the
creation of task dependent visual interfaces, it is also necessary that the
information system offers an interface to request and receive data (cf. VIS-
01, VIS-05). Several proprietary software tools are commonly used to cre-
ate such interfaces, e.g., Tableau©21, Google Fusion Tables22, or Datawrap-
per23. Nevertheless, several publications introduce new visualization tech-
niques in the field of time interval data analysis, so far unsupported by any
proprietary software.
Aigner et al. (2007) give an overview of the variety of techniques pre-
sented over the last years that are useful to visualize time-oriented data, in-
cluding time interval data. One of the techniques presented in the context
of time interval data is the Cluster Viewer introduced by van Wijk, van Se-
low (1999). The visualization shows a combined representation of daily pat-
terns and clusters, whereby patterns are shown as graphs and clusters are
shown on a calendar. Lammarsch et al. (2009) introduced an interactive
21 http://www.tableau.com/
22 https://support.google.com/fusiontables/
23 https://datawrapper.de/
visual method incorporating the structures of time within a pixel-based vis-
ualization called GROOVE (granular overview overlay). The visualization
enables the users to gain new insights into different temporal patterns by
interactively changing the order of granularities while keeping the same set
of granularities. Figure 3.4 shows examples of the two visualization tech-
niques.
Figure 3.4: Examples of the visualization techniques Cluster Viewer (van Wijk, van Selow 1999) and GROOVE (Lammarsch et al. 2009).
Regarding the handling of time-oriented data within the context of visual
analytics, Rind et al. (2013) developed a software library called
TimeBench24. The library provides data structures and algorithms to handle
time-oriented data in the context of visual analytics. TimeBench is available
as open-source project and the underlying data model is based on a dis-
crete, linear, bounded temporal model (cf. section 2.1.3). Furthermore, the
implementation utilizes relational data tables and time-specific indexing
structures to increase performance. As mentioned by the authors, it is "de-
signed mainly for developing research prototypes". Considering the perfor-
mance, the publication mentions runtime tests with up to 5,115 temporal
objects. Thus, the library has not been tested using larger datasets (i.e.,
several million temporal objects as the real life dataset used in section 8.2).
24 http://www.timebench.org
In general, different techniques (e.g., binned aggregations, statistical sum-
maries, or sampling) are used to realize real-time visualization of large da-
tasets (Liu et al. 2013). To apply these techniques, pre-aggregates are cal-
culated and held in memory. Thus, the possibility of calculating and provid-
ing pre-aggregates may be an important feature when applying visual an-
alytics on large datasets (cf. DC-03).
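The idea of binned aggregation can be sketched as follows (bin width and data are illustrative):

```python
# Sketch of binned aggregation (cf. Liu et al. 2013): a large dataset is
# pre-aggregated into fixed-width bins, so a visualization only has to
# render the small set of pre-aggregates instead of every raw point.
def binned_counts(timestamps, bin_width):
    """Pre-aggregate raw timestamps into per-bin counts."""
    bins = {}
    for t in timestamps:
        bins[t // bin_width] = bins.get(t // bin_width, 0) + 1
    return bins

# 10,000 raw points collapse into a handful of drawable bins.
data = [i % 97 for i in range(10_000)]
print(binned_counts(data, bin_width=25))
```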
3.3 Performance Improvements
The performance of an implementation is typically improved by optimiza-
tion, i.e., using enhanced, faster, and specialized algorithms. In the case of
an information system useful for time interval data analysis, the algorithmic
part is one optimization criterion. However, the system - as an information pro-
vider - has to ensure that the requested data is provided as fast as possible
(cf. section 3.2.3). Thus, special data structures, i.e., indexes, have to be imple-
mented to ensure fast data retrieval. In addition, the aggregation of data
is one of the pre-dominant operations used in the context of data analysis
(cf. section 2.1.2). Therefore, increasing the performance of aggregate
computation or pre-computing frequently used aggregates are
other possibilities to increase performance. Finally, caching strategies can
be applied to increase the performance.
In the following sections, the current state of the art regarding the men-
tioned capabilities available to increase the system’s performance is intro-
duced. In section 3.3.1, different indexing techniques used within the context
of temporal data are introduced. In section 3.3.2, ideas on how to increase
aggregation performance are presented, and in section 3.3.3, different cach-
ing strategies are discussed.
3.3.1 Indexing Time Interval Data
In general, an index is a data structure used to increase the query perfor-
mance when retrieving data from a dataset (or a database). Typically, the
increased performance for the retrieval decreases the performance when
inserting or updating data. The reason is the additional effort needed to
insert or update the index (i.e., the data structure) based on the added or
modified data. Depending on the type of data (e.g., primitive, strings, ob-
jects, key-value pairs, documents, spatial, temporal, or multimedia), the
storage type (i.e., main memory, secondary storage, clustered, or distrib-
uted), as well as the type of usage (e.g., mostly data retrieval vs. excessive
data updates/inserts) numerous data structures and handling strategies
(e.g., query optimization, pre-aggregates, or join-indexes) were presented
over the last decades (DeWitt et al. 1984; Chan, Ioannidis 1998; Gui et al.
2011; Garcia-Molina et al. 2014, pp. 333-360; 607-688). Regarding the field
of time intervals, several indexes were introduced enhancing the perfor-
mance when retrieving data using specific temporal operators. In general,
the different types of indexes can be categorized as tree-based or bitmap-
based.
Tree-Based Indexes
The IntervalTree (Edelsbrunner, Maurer 1981; Kriegel et al. 2001; Enderle
et al. 2004) is a tree-based data structure, which is optimized for overlap-
queries (i.e., which of the stored intervals overlap with a given interval). Never-
theless, the tree is capable of supporting all 13 temporal operators (Kriegel et
al. 2001). The relational implementation (Enderle et al. 2004) is based on
two B+-tree indexes (Bayer, McCreight 1972) and processes queries ap-
plying two steps. In a first step, the interval query is translated into several
range queries. Combining these queries into a single valid SQL query, which
is processed by the underlying DBMS, is the second, final step.
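The overlap-query idea can be sketched with a minimal in-memory interval tree; this is a simplified illustration of the data structure, not the relational implementation by Enderle et al.:

```python
# Minimal sketch of an interval tree answering overlap-queries: each
# node stores the intervals containing its center point; intervals left
# and right of the center go into subtrees.
class IntervalTree:
    def __init__(self, intervals):
        self.root = self._build(sorted(intervals))

    def _build(self, ivs):
        if not ivs:
            return None
        center = ivs[len(ivs) // 2][0]
        left = [iv for iv in ivs if iv[1] < center]
        right = [iv for iv in ivs if iv[0] > center]
        mid = [iv for iv in ivs if iv[0] <= center <= iv[1]]
        return {"center": center, "mid": mid,
                "left": self._build(left), "right": self._build(right)}

    def overlapping(self, lo, hi, node="root"):
        """Return all stored intervals that overlap [lo, hi]."""
        node = self.root if node == "root" else node
        if node is None:
            return []
        hits = [iv for iv in node["mid"] if iv[0] <= hi and lo <= iv[1]]
        if lo < node["center"]:
            hits += self.overlapping(lo, hi, node["left"])
        if hi > node["center"]:
            hits += self.overlapping(lo, hi, node["right"])
        return hits

tree = IntervalTree([(1, 4), (3, 8), (10, 12)])
print(sorted(tree.overlapping(4, 9)))  # [(1, 4), (3, 8)]
```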
Another data structure introduced to store interval data is the Seg-
mentTree (Bentley 1977). The structure is based on a segmentation of the
underlying time axis (i.e., a partition of the time axis induced by the distinct
values of the intervals’ endpoints). Each node of the binary tree is a union
of its children. In general, the tree is optimized to perform contain-queries
(i.e., which of these intervals contain a given time point). Several optimiza-
tions for, e.g., higher dimensions or other temporal operators were pre-
sented during the last years (Berg et al. 2008; Dignös et al. 2014).
Bitmap-Based Indexes
In addition to the tree-based indexes, different bitmap-based indexes were
introduced within the field of data analysis and the area of DW, as well as
DSS. A bitmap is an array-like data structure containing 0s and 1s. In general,
a 1 indicates that the entity associated with the position in the array is an
element of the set. A bitmap-index uses this feature, creating a bitmap for
each possible value of a property of an entity. Figure 3.5 illustrates a bit-
map-index for a color-property having three possible values: red, green, or
yellow. The bitmap-index indicates that the apple associated with position 3
(zero-based) is red.
Figure 3.5: Example of a bitmap-index containing three bitmaps, one for each possible value (i.e., red, green, and yellow) of the color-property.
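The index of Figure 3.5 can be sketched as follows (plain Python integers stand in for the bitmaps; the apple colors are example data):

```python
# Sketch of a bitmap-index: one bitmap per possible value of the color
# property; a set bit at position i means the apple at position i has
# that color.
colors = ["green", "yellow", "green", "red", "yellow"]  # apples 0..4

# Build one bitmap (a plain int used as a bit array) per value.
index = {}
for pos, value in enumerate(colors):
    index[value] = index.get(value, 0) | (1 << pos)

def positions(bitmap):
    """Decode a bitmap into the zero-based positions of its set bits."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

print(positions(index["red"]))  # [3]  (the apple at position 3 is red)
```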
Several different bitmap implementations have been published over re-
cent years, differing regarding the compression or encoding schemes. The
selection of the right schemes is crucial, considering the performance
gained and storage needed. Important criteria to select the best compres-
sion and encoding scheme are the queries to be expected (Chan, Ioannidis
1999), the order of data (Lemire, Kaser 2011), and the complexity consid-
ering the logic operations used within queries (Kaser, Lemire 2014).
Wu et al. (2009) implemented FastBit, a software tool used to query
scientific data efficiently using bitmap indices, out-performing popular com-
mercial DBMS in selected scenarios by a factor higher than ten. In addition,
several compression schemes based on run-length encoding (RLE) were
introduced, i.e., PLWAH (Deliège, Pedersen 2010), CONCISE (Colantonio,
Di Pietro 2010), EWAH (Lemire et al. 2010), and PWAH (van Schaik, Moor
2011). Recently, Chambi et al. (2015) presented a compression scheme
named Roaring based on packed arrays for compression instead of RLE.
Several evaluations indicate that Roaring can increase the performance by
a factor of 25 (Chambi et al. 2015; Meisen et al. 2015b).
Considering encoding schemes, Chan, Ioannidis (1999) introduced four
encoding schemes25: the equality, range, interval, and membership encoding
scheme. The schemes define the constraints to be evaluated, i.e., equality:
propvalue = v, range: propvalue ≤ v, interval: v1 ≤ propvalue ≤ v2, and membership:
propvalue ∈ {v1, …, vn}. The presented encoding schemes are meant to be
used with discrete point data and are not directly applicable to time interval
data. In addition, Stockinger et al. (2004) developed evaluation strategies
to optimize the usage of bitmap-indexes for floating numbers by utilizing
binned bitmaps.
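The difference between equality and range encoding can be sketched with hypothetical data:

```python
# Sketch of equality vs. range encoding (cf. Chan, Ioannidis 1999): with
# equality encoding, the bitmap for v marks rows where prop == v; with
# range encoding, it marks rows where prop <= v, so a constraint
# prop <= c needs only a single bitmap.
values = [2, 0, 3, 1, 2]   # property value per row (example data)
domain = range(4)          # possible values 0..3

equality = {v: [1 if x == v else 0 for x in values] for v in domain}
range_enc = {v: [1 if x <= v else 0 for x in values] for v in domain}

# Query "prop <= 1": equality encoding must OR two bitmaps,
# range encoding reads a single bitmap directly.
le1_equality = [a | b for a, b in zip(equality[0], equality[1])]
le1_range = range_enc[1]
print(le1_equality, le1_range)  # [0, 1, 0, 1, 0] twice
```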
Considering temporal data, Roh et al. (2012) introduced an efficient bit-
map-based index for time-based interval sequences. The index aims to in-
crease the performance of similarity searches. Roh et al. assume that an
interval sequence consists of non-overlapping and consecutive events, e.g.,
phone calls handled by an operator. As mentioned and argued in section
3.2.1, this assumption is generally not valid. The first bitmap-based index
for time interval data was proposed by Meisen et al. (2015b). The index is
based on an array-like structure partitioning the time axis into its chronons
utilizing compressed bitmaps (Lemire, Kaser 2011) for each partition. The
index is presented and explained in detail in section 7.3.2.
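A strongly simplified sketch of the underlying idea (plain integers stand in for the compressed bitmaps of the actual implementation; the intervals are hypothetical):

```python
# Sketch of a chronon-partitioned bitmap index for time intervals: one
# bitmap per chronon marks which intervals cover it, so a
# contain-query is a single bitmap lookup.
intervals = [(0, 2), (1, 4), (3, 5)]   # (start, end), inclusive
axis = range(6)                        # chronons 0..5

index = {t: 0 for t in axis}
for iv_id, (start, end) in enumerate(intervals):
    for t in range(start, end + 1):
        index[t] |= 1 << iv_id         # set the bit of the covering interval

def covering(t):
    """Contain-query: ids of all intervals covering chronon t."""
    return [i for i in range(len(intervals)) if index[t] >> i & 1]

print(covering(1), covering(3))  # [0, 1] [1, 2]
```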
25 The encoding scheme is the definition determining which of the bits are set to 1 in each
bitmap of an index.
3.3.2 Aggregating Time Interval Data
Aggregating data is one of the pre-dominant operations used in data anal-
ysis. To speed-up the execution of queries, techniques such as pre-com-
puting aggregates (Pedersen et al. 1999) or materialized views (Gupta,
Mumick 1999) have been proposed. In this section, techniques to increase
aggregation performance are introduced. The different aggregation tech-
niques (i.e., ITA, MWTA, STA, and TAT) are introduced in section 2.1.2.
In the field of temporal databases, Kline, Snodgrass (1995) presented
a data structure called the AggregationTree, useful to store temporal aggre-
gates along pre-defined levels of the time dimension. Over the past years,
different enhancements for the different forms of temporal aggregations (cf.
section 2.1.2) were presented (Zhang et al. 2001; Zimányi 2006; Zhang et
al. 2008; Gordevicius et al. 2012). Furthermore, other data structures like
the balanced tree (Moon et al. 2003), the SB-Tree (Yang, Widom 2003),
or the multi-version SB-Tree (Zhang et al. 2001; Tao et al. 2004) were intro-
duced. Nevertheless, these solutions typically focus on one aggregation op-
erator (e.g., SUM(A)), do not support complex expressions (e.g.,
MAX(SUM(A + B))), cannot handle multiple filter criteria (e.g., aggregating
all red apples), or do not consider data gaps (e.g., missing values cannot
be handled). Böhlen et al. (2006) presented a tree-based implementation
for a temporal multi-dimensional aggregation technique (TMDA). The de-
fined TMDA operator supports ITA and MWTA aggregations, as well as dif-
ferent aggregation operators. Nevertheless, MODE or MEDIAN, along with
complex expressions are not supported. In addition, the presented imple-
mentation does not clarify how filter criteria are recognized.
Regarding the usage of bitmap-indexes, several publications introduce the capability to speed up aggregation queries (Kaser, Lemire 2014). In addition, the result of an aggregation using bitmap indexes can easily be kept in memory and reused when applying further operations to the result, e.g., a drill-down (Abdelouarit et al. 2013). Recently, Meisen et al. (2015b) introduced a bitmap-based implementation for TAT (cf. section 2.1.2). The implementation utilizes the bitmaps used for indexing, together with the logical and aggregation operators available for bitmaps (i.e., AND, OR, XOR, NOT, and COUNT). The algorithm distinguishes between three strategies depending on the property to be aggregated. A detailed explanation is presented in section 7.3.4.
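The interplay of the logical and aggregation operators can be sketched for a simple filtered COUNT. This is an assumed, simplified illustration of the general mechanics (not the cited implementation): Python integers serve as bitmaps, record i corresponds to bit i, and AND/COUNT map to `&` and counting the set bits.

```python
def count_per_chronon(time_bitmaps, filter_bitmap):
    # AND each chronon bitmap with the filter bitmap, then COUNT the set bits
    return [bin(b & filter_bitmap).count("1") for b in time_bitmaps]

# chronon bitmaps: which of the records 0..3 are active in each chronon
time_bitmaps = [0b0011, 0b0111, 0b1110]
# filter bitmap: records 0, 1, and 2 match the descriptive filter criteria
filter_bitmap = 0b0111

counts = count_per_chronon(time_bitmaps, filter_bitmap)
print(counts)       # [2, 3, 2]
print(max(counts))  # 3, i.e., a two-step MAX(COUNT(...)) aggregation
```

The two-step combination shown in the last line corresponds to the MAX(COUNT(...)) style of aggregation used throughout the running example of this chapter.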
3.3.3 Caching Time Interval Data
Caching data can drastically increase the performance of an information system. An important criterion is the frequent usage of the same data (e.g., the same query or the same data entity). In addition, incremental calculations (i.e., reusing a previous result) can be boosted by the utilization of a cache. Research focuses on different aspects of caching, i.e., types of caches (e.g., CPU, GPU, or main memory; cf. Handy (1998)), cache algorithms (e.g., random replacement (RR), least recently used (LRU), or most recently used (MRU); cf. Al-Zoubi et al. (2004)), or cache handling (i.e., coherence, coloring, or virtualization; cf. Hashemi et al. (1997), Sorin et al. (2011)).
In the field of information systems, the focus considering caching is on the utilized cache algorithm. As already mentioned, several different algorithms were introduced over the last decades, defining which elements to discard when the cache is full and new ones should be added. In general, the most frequently used algorithms are the ones already mentioned, i.e., RR, LRU, and MRU. LRU and MRU are both algorithms which need a statistic to be maintained and updated whenever an item is retrieved from or discarded from the cache. In contrast, the RR algorithm does not need any additional implementation effort when being utilized (Zhou 2010).
In research, a specific caching strategy for time interval data has not been investigated, and such a strategy is also not discussed in this book. Instead, different caching implementations to ensure a fast retrieval from secondary memory are discussed (cf. section 7.1.2), an extendable framework is introduced (cf. section 7.3.3), and the use of cache algorithms is discussed (cf. section 8.2.2).
3.4 Analytical Query Languages for Temporal Data
A query language is generally utilized to retrieve data from, manipulate
data of, or define the schema of data contained in a dataset. In addition,
some statements defined within the query language may be used for au-
thorization purposes (e.g., grant access to a specific type of data) or or-
ganizational tasks (e.g., start and stop a transaction or use a bulk load).
Regarding temporal datasets, several query languages were defined within
the context of temporal databases, e.g., IXSQL (Lorentzos, Mitsopoulos
1997), ATSQL2 (Böhlen et al. 1995; Guo et al. 2010), SQL/TP (Toman
2000), or TSQL2 (Snodgrass 1995). More general query languages, like the multidimensional expressions (MDX) defined for OLAP or the structured query language (SQL) used in the context of relational databases, are often used by analysts to solve analytical issues (Spofford 2006; Chamberlin, Boyce 1976). Recently, a formal language for time interval data analysis
named TIDAQL was introduced by Meisen et al. (2015a). The language
TIDAQL is introduced and discussed in detail in chapter 5.
In the following, several statements are presented, each retrieving the needed resources for specific work-areas (i.e., the work-areas of the department GH) and task types within each hour of a specific day (i.e., the first of January 2015). The statements are formulated using different languages, i.e., MDX, ATSQL226, SQL, and TIDAQL. In addition, the issues arising when using these types of query languages to analyze time interval data are explained27. The used database and the question to be answered are illustrated in Figure 3.6. The figure shows the intervals of the database for the specified day (already filtered by the specified work-area for clarity) and the expected answer, i.e., the needed resources for each hour of the day for each work-area and task-type group (these are GH.Cleaning, long;
26 ATSQL2 is a query language supported by the only currently available temporal database system (TimeDB, http://www.timeconsult.com/Software/Software.html).
27 The processing performance is not considered an issue in this chapter. A detailed evaluation of the processing performance of different systems using different languages is presented in section 8.2.5.
GH.Cleaning, average; and GH.Cleaning, short).
Figure 3.6: Illustration of the question to be answered by the query: "How many resources are needed within each hour of the first of January 2015?"
The MDX statement used to retrieve the data from a cube having a
TIME, ORGA (i.e., WORKAREA), and TASK (i.e., TASKTYPE) dimension
defined, as well as a simple count measure is shown in Listing 3.1.
Listing 3.1: MDX statement used to answer the question regarding the needed resources.
WITH
MEMBER [MEASURES].[NEED] AS
MAX(DESCENDANTS([TIME].[RASTER].CurrentMember, , LEAVES),
[MEASURES].[COUNT]), FORMAT_STRING = '#.##'
SELECT
CROSSJOIN(FILTER([ORGA].[UNIT].[WORKAREA].Members,
INSTR([ORGA].[UNIT].CurrentMember.UniqueName,
'[ORGA].[UNIT].[All].&[GH.') > 0),
{[TASK].[DUR].[TYPE]}) ON COLUMNS,
CROSSJOIN([TIME].[RASTER].[DAY].Children, [MEASURES].[NEED]) ON ROWS
FROM [GH_DATA]
The first part (i.e., WITH MEMBER) of the statement defines the measure
used to calculate the maximum of all count-values for all leaves of the cur-
rent member. The second part (i.e., SELECT) specifies the dimensions to
be selected in the result, which are the filtered work-area (i.e.,
[ORGA].[UNIT].[WORKAREA].Members) and the task’s type (i.e.,
[TASK].[DUR].[TYPE]) on the columns, as well as the hours (i.e.,
[TIME].[RASTER].[DAY].Children) on the rows. Besides the obvious com-
plexity of the statement, the following issues regarding the query language
should be considered:
– the query is not intuitive (regarding, e.g., the order or the combination
of members), i.e., only an expert may be capable of understanding and
formalizing it,
– the name-based filter has to be applied using a special FILTER function
instead of being defined within the WHERE part, and
– the calculation of the measure is not intuitive and error-prone, i.e., the
selection of the children of the lowest granularity.
In addition, the result may be incorrect if summarizability problems occur,
which is the case if the used tool does not support non-strict relationships.
Figure 3.7 illustrates the incorrect (left side) and the correct result (right
side), using the sample dataset.
Figure 3.7: Comparison of the result of the query from a system supporting non-strict relationships (right) and one that does not (left).
The ATSQL2 language was defined in the field of temporal databases as an extension of SQL. The syntax distinguishes between temporal and standard statement modifiers. The language itself supports neither dimensional aspects nor two-step aggregations. Thus, it is difficult to realize the mentioned query. In addition, the only available tool (i.e., TimeDB) does not support all language features, e.g.:
– the supported aggregation forms are limited to ITA and MWTA (i.e., constant intervals),
– LIKE expressions cannot be used as filter criteria,
– ORDER BY is not applicable, and
– multiple filter criteria for the same attribute are not considered.
Nevertheless, Listing 3.2 shows the ATSQL2 statement determining the intermediate count-results for each minute and combining the intermediate results of an hour using MAX.
Listing 3.2: ATSQL2 statement used to answer the question regarding the needed resources.
NONSEQUENCED VALIDTIME
PERIOD [DATE 2015/1/1~00:00:00-DATE 2015/1/1~01:00:00)
SELECT WORKAREA, TASKTYPE, MAX(VALUE)
FROM (
VALIDTIME PERIOD [DATE 2015/1/1~00:00:00‐DATE 2015/1/1~00:01:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
VALIDTIME PERIOD [DATE 2015/1/1~00:59:00‐DATE 2015/1/1~01:00:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
) HOUR_01 GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
NONSEQUENCED VALIDTIME
PERIOD [DATE 2015/1/1~23:00:00‐DATE 2015/1/2~00:00:00)
SELECT WORKAREA, TASKTYPE, MAX(VALUE)
FROM (
VALIDTIME PERIOD [DATE 2015/1/1~23:00:00‐DATE 2015/1/1~23:01:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
VALIDTIME PERIOD [DATE 2015/1/1~23:59:00-DATE 2015/1/2~00:00:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
) HOUR_24 GROUP BY WORKAREA, TASKTYPE
The ATSQL2 query is not flexible regarding the selected dimensional level and the time-window. In addition, writing such a query manually is significantly difficult because of the number of statements to be united (i.e., one for each chronon). Nevertheless, the query could easily be generated programmatically using a loop (i.e., iterating over the chronons and grouping them by the selected dimensional level).
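Such a generator can be sketched as follows: one inner SELECT per chronon (minute), grouped into one outer MAX query per hour. The quoted syntax fragments follow Listing 3.2; the generator itself is an assumption for illustration, not part of the cited work.

```python
from datetime import datetime, timedelta

def ts(d):
    # ATSQL2-style timestamp literal, e.g., DATE 2015/1/1~00:00:00
    return f"DATE {d.year}/{d.month}/{d.day}~{d:%H:%M:%S}"

INNER = ("SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA\n"
         "WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE")

def hour_query(day, hour):
    start = day + timedelta(hours=hour)
    minutes = []
    for minute in range(60):  # one statement per chronon
        s = start + timedelta(minutes=minute)
        minutes.append(
            f"VALIDTIME PERIOD [{ts(s)}-{ts(s + timedelta(minutes=1))})\n{INNER}")
    return (f"NONSEQUENCED VALIDTIME\n"
            f"PERIOD [{ts(start)}-{ts(start + timedelta(hours=1))})\n"
            f"SELECT WORKAREA, TASKTYPE, MAX(VALUE)\nFROM (\n"
            + "\nUNION\n".join(minutes)
            + f"\n) HOUR_{hour + 1:02d} GROUP BY WORKAREA, TASKTYPE")

day = datetime(2015, 1, 1)
query = "\nUNION\n".join(hour_query(day, h) for h in range(24))
print(query.count("UNION"))  # 24*59 inner + 23 outer = 1439
```

The size of the generated statement (1439 UNIONs for a single day at minute granularity) illustrates why such queries are impractical to write, and hard to maintain, by hand.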
The next statement presented utilizes SQL to retrieve an answer regarding the needed resources. Listing 3.3 shows the statement, which is based on additional PL/SQL functions and data types (cf. appendix: Pipelined Table Functions (PL/SQL Oracle)). The statement creates a virtual table (i.e., TABLE(F_DATES([…]))) containing all the chronons within a specific time-window. These chronons are combined with the descriptive values (i.e., WORKAREA and TASKTYPE) using cross joins. The resulting table is joined with the actual interval data, which is finally grouped in two steps (i.e., first counting and then determining the maximum). The query itself has to be substantially adapted whenever the descriptive values change (i.e., instead of looking for work-areas and task types). In summary, such a statement may be formalized by an expert to retrieve some insights (as mentioned, the performance is not considered at this point).
Listing 3.3: SQL statement used to answer the question regarding the needed resources. The presented solution is based on additional PL/SQL functions and data types which are shown in the appendix (cf. Pipelined Table Functions (PL/SQL Oracle)).
SELECT
"DATA"."HOUR" "HOUR", "DATA"."WORKAREA" "WORKAREA",
"DATA"."TASKTYPE" "TASKTYPE", MAX("DATA"."COUNT") "NEED"
FROM
(SELECT
META."START" "DATE", META."HOUR" "HOUR", META.WORKAREA "WORKAREA",
META.TASKTYPE "TASKTYPE", COUNT(1) "COUNT"
FROM
(SELECT
WORKAREAS.WORKAREA "WORKAREA", TASKTYPES.TASKTYPE "TASKTYPE",
DATES.start_date "START", DATES.end_date "END",
TO_DATE(TO_CHAR(DATES.start_date, 'yyyy‐MM‐dd hh24'),
'yyyy‐MM‐dd hh24') "HOUR"
FROM
(SELECT DISTINCT WORKAREA FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%') WORKAREAS,
(SELECT DISTINCT TASKTYPE FROM GH_DATA) TASKTYPES,
TABLE(F_DATES(
TO_DATE('2015‐01‐01', 'yyyy‐MM‐dd'),
TO_DATE('2015‐01‐02', 'yyyy‐MM‐dd'))
) DATES
) META LEFT OUTER JOIN GH_DATA INTERVALS ON
META."START" < INTERVALS."END" AND
META."END" > INTERVALS."START" AND
META.WORKAREA = INTERVALS.WORKAREA AND
META.TASKTYPE = INTERVALS.TASKTYPE
GROUP BY META."START", META."HOUR", META.WORKAREA, META.TASKTYPE
) "DATA"
GROUP BY "DATA"."HOUR", "DATA".WORKAREA, "DATA".TASKTYPE
ORDER BY "DATA"."HOUR", "DATA".WORKAREA, "DATA".TASKTYPE
Last but not least, the query using TIDAQL is formalized in Listing 3.4.
As mentioned, the language itself is presented in detail in chapter 5 and is
illustrated here for the sake of completeness.
Listing 3.4: The TIDAQL statement used to answer the question regarding the needed resources.
SELECT TIMESERIES OF
MAX(COUNT(TASKTYPE)) AS "NEED" ON TIME.RASTER.HOUR
FROM GH_DATA IN [2015‐01‐01, 2015‐01‐02)
GROUP BY WORKAREA, TASKTYPE INCLUDE {('GH.*')}
3.5 Similarity of Time Interval Data
DA-07 formulates the requirement that an analyst has to be able to find similar situations within the provided dataset. To implement and fulfill the requested feature, it is necessary to define what similarity means. Regarding sets of temporal interval data, three similarity measures have been defined: (1) an implementation based on relations among the intervals named ARTEMIS (Kostakis et al. 2011), (2) an approach based on dynamic time-warping (DTW) (Kostakis et al. 2011), and (3) IBSM (Kotsifakos et al. 2013), a similarity measure based on the number of so-called active intervals. In the following, the three measures are introduced.
The similarity of ARTEMIS is defined on the basis of Allen's interval relations (cf. section 2.1.4). ARTEMIS calculates the distance between two sets of determined event-interval relations using the Hungarian algorithm (Kuhn 1955), i.e., the minimal assignment costs are defined as the distance. To speed up the distance calculation, Kostakis et al. introduce a lower bound for ARTEMIS, useful when searching for, e.g., the k-nearest neighbors (k-NN). Figure 3.8 illustrates the calculation of the ARTEMIS distance.
Figure 3.8: The ARTEMIS distance calculated for two interval-sets S and T.
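The assignment step at the core of the ARTEMIS distance can be sketched as follows. A brute-force search over all permutations stands in for the Hungarian algorithm (Kuhn 1955), and the cost matrix between the relation sets of the two interval sets is illustrative, not taken from the publication.

```python
from itertools import permutations

def minimal_assignment_cost(cost):
    # cost[i][j]: cost of matching element i of set S to element j of set T;
    # the distance is the cost of a minimal one-to-one assignment
    n = len(cost)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        best = min(best, total)
    return best

# illustrative pairwise costs between the event-interval relations of S and T
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(minimal_assignment_cost(cost))  # 5
```

The Hungarian algorithm computes the same minimum in O(n^3) instead of O(n!), which is what makes ARTEMIS practicable for larger interval sets.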
In addition, Kostakis et al. present a distance measure based on DTW (cf. Keogh, Ratanamahatana (2005)). This measure relies on a sequence of vectors created for an interval set. Each vector is derived from the start and end values of the intervals, i.e., the vector contains a 1 if the interval covers the chronon and 0 otherwise. Each interval has a specific pre-defined position within the vector, and a vector is created for each chronon at which a state change of an interval occurs (i.e., an interval starts or ends). The distance of two vector sequences is calculated using the vector-based DTW distance. Figure 3.9 exemplifies the calculation of the DTW distance for two interval sets. The figure shows the determined vector sequences and the mapping using the technique known as DTW.
Figure 3.9: The DTW distance calculated for two interval-sets S and T.
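The alignment of two such vector sequences can be sketched with the classical DTW recurrence. The sequences and the per-vector distance (Manhattan) are illustrative assumptions; the cited work defines its own vector construction.

```python
def dtw(seq_a, seq_b, dist):
    # classical DTW: d[i][j] = dist(a_i, b_j) + min of the three predecessors
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# 0/1 vectors per state change: one component per interval of the set
s = [(1, 0), (1, 1), (0, 1)]
t = [(1, 0), (0, 1)]
print(dtw(s, t, manhattan))  # 1.0
```

DTW thus tolerates sequences of different lengths, which matches the fact that the two interval sets may exhibit different numbers of state changes.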
In 2013, Kotsifakos et al. presented IBSM (i.e., Interval-Based Sequence Matching). A set of intervals is represented by a matrix, which contains the number of active intervals of a specific label for each chronon of the discrete time axis. The distance between two sets is defined as the Euclidean distance between the two matrices. Figure 3.10 illustrates the calculation of the IBSM distance and the created matrices.
Figure 3.10: Example of the IBSM distance calculated for two interval-sets S and T.
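The construction of the matrices and the Euclidean comparison can be sketched as follows; the labels, intervals, and time axis length are illustrative assumptions.

```python
from math import sqrt

def ibsm_matrix(intervals, labels, n_chronons):
    # intervals: list of (label, start, end) with closed endpoints;
    # one row per label, one column per chronon, counting active intervals
    m = {lab: [0] * n_chronons for lab in labels}
    for lab, start, end in intervals:
        for chronon in range(start, end + 1):
            m[lab][chronon] += 1
    return m

def ibsm_distance(s, t, labels, n_chronons):
    ms = ibsm_matrix(s, labels, n_chronons)
    mt = ibsm_matrix(t, labels, n_chronons)
    # Euclidean distance over all cells of the two equally sized matrices
    return sqrt(sum((ms[lab][c] - mt[lab][c]) ** 2
                    for lab in labels for c in range(n_chronons)))

labels = ["A", "B"]
s = [("A", 0, 2), ("B", 1, 3)]
t = [("A", 0, 3), ("B", 2, 3)]
print(ibsm_distance(s, t, labels, 4))  # sqrt(2) ≈ 1.414
```

Because the matrices are compared cell by cell, IBSM is sensitive to the duration of the intervals per label, in contrast to the purely relation-based view of ARTEMIS.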
The results of the publications suggest that the application context is important when deciding which similarity measure to use. ARTEMIS uses the relations as an indicator for similarity, whereas the DTW vector-based approach compares the intervals point by point and out of their context. IBSM explicitly considers the duration of the intervals for comparison and implicitly their relations. Nevertheless, each implementation by itself may be insufficient for a specific application context. In chapter 6, a combined, bitmap-based similarity measure is introduced, which allows the user to weigh the importance of the different factors, i.e., relation, duration, or group. The latter factor is not explicitly mentioned in any of the presented implementations, i.e., a label is assumed to be either equal or not. Nevertheless, regarding similarity it may be an important criterion to define how similar a label is, e.g., by using dimensional information.
3.6 Summary
In this chapter, the state of the art regarding analytical information systems was presented, covering the different approaches applied when analyzing data (i.e., OLAP, pattern and association rule mining, as well as visual analytics), performance improvements (i.e., indexes, aggregation techniques, and caches), query languages used to analyze time interval data, and similarity measures.
The chapter forms the basis for the answers to the research questions presented in chapter 1. In addition, it reveals the gaps regarding a holistic solution to analyze time interval data and, implicitly, the steps needed to close the identified gaps. On the one hand, the requirements to apply the different approaches available when analyzing data in general must be supported by the information system, i.e., data must be retrieved fast and be available in the needed form, summarizability must be guaranteed, and generalizations, as well as specializations, must be selectable. On the other hand, performance improvements must be holistically applicable, and the system must provide a domain-specific query language, so that queries are simply defined and easily understood. In the following chapters, these gaps are closed and a holistic solution in the form of an information system useful to analyze time interval data is introduced. The following chapter deals with the basis to achieve this goal: a formal model of time interval data.
4 TIDAMODEL: Modeling Time Interval Data
This chapter presents the answer to RQ2 "Which aspects must be covered by a time interval data analysis model and how can it be defined?". This is achieved by defining a model based on the terms time interval, time interval record, time interval dataset, descriptive value, descriptor, time axis, dimensions, descriptor hierarchy, and time hierarchy. These different terms are categorized by the different elements of the tuple defining a TIDAMODEL.
Definition 1: TIDAMODEL
A TIDAMODEL is a 4-tuple (DB, D, TA, DIM) containing the time interval database DB, the descriptors D, the time axis TA, and the dimensions DIM.
In the following sections, the time axis (section 4.1), the descriptors (section 4.2), the time interval database (section 4.3), and the dimensions (section 4.4) are defined. The definitions are motivated by the introduced features requested for an analytical information system useful for time interval data and the different aspects introduced in chapters 2 and 3. The definitions follow the model defined in Meisen et al. (2014).
4.1 Time Axis
As motivated in section 2.1.3, a discrete, linear, bounded temporal model is assumed for the context of time interval data analysis. Thus, the terms valid time points, chronon, and data time points are defined as follows:
Definition 2: Valid time points, chronon, and data time points
The valid time points Ttime are a finite, totally ordered set with the relation ≤. A time point t ∈ Ttime is called a chronon28. In addition, the data time points
28 The presented definition of a chronon is consistent with the definition of Dyreson et al. (1994, p. 55).
Tin are defined as the set of possible values representing time information within the raw data. A single data time point is typically denoted by tin ∈ Tin.
The definition of Tin could give the impression that an unbounded or continuous temporal model29 is valid. This impression is correct regarding the raw data. Nevertheless, the definition of Ttime ensures that the data available for the analysis are bounded and discrete (i.e., the set of valid time points is defined to be finite). Based on the definitions of Ttime and Tin, the term temporal mapping function is defined as follows:
Definition 3: Temporal mapping function
A temporal mapping function μtime is a function that relates each data time point tin ∈ Tin to a chronon t ∈ Ttime, i.e., μtime: Tin → Ttime.
It should be mentioned that the implementation presented in section 7.3.1 always uses the UTC time zone on the lowest granularity and supports other time zones by modeling an additional level within the dimensional model (cf. sections 4.4 and 7.2.1). Thus, the valid time points are assumed by the system to be UTC-based time points. Time points of other time zones are mapped internally. The presented definition of a temporal mapping function enables the realization of the feature requested as DA-06. In addition, the existence of a mapping function is closely related to the feature request DI-03.
Prior to providing a formal definition of the term time axis, the term granularity has to be defined. The granularity is important information to realize dimensional modeling (cf. section 4.4), as well as the features DA-01 and DA-04. Without a granularity, the system cannot provide the correct calculations required for aggregations. In addition, a roll-up to a higher level is difficult to validate without knowing anything about the lowest granularity of the system.
29 As argued in section 2.1.3, the usage of a continuous temporal model is, from an analytical point of view, not reasonable.
Definition 4: Granularity
The granularity tgrain is a unit of time. The information system has to pro-
vide a list of valid and supported units. In general, the following units
have to be supported: second, minute, hour, day, week, month, and
year.
The definition of a time axis is the basis for several feature requests and
further definitions presented in this chapter. As mentioned already, the fea-
ture requests DA-01, DA-04, DA-06, and DI-03 and the presented solutions
are closely related to the time axis definition. Thus, based on the definitions
presented in this section, the term time axis is defined as follows.
Definition 5: Time axis
A time axis TA is a 2-tuple (μtime, tgrain) containing the temporal mapping function μtime used to relate the incoming data time points to the valid chronons. In addition, the granularity tgrain specifies the unit of time of the chronons.
Figure 4.1 illustrates an example of a time axis definition. The figure shows a discrete, linear, bounded time axis containing values between 0 and 9 (cf. the definition of Ttime). In the example, a data time point is a timestamp (in milliseconds) between 2000-01-01 00:00:00.000 and 2099-12-31 23:59:59.999 of the CET time zone. The defined mapping function maps each data time point, i.e., timestamp, to a value between 0 and 9. More precisely, the timestamp is mapped to the "ones place" of the minutes of the timestamp, e.g., 2000-01-01 10:56:12.432 CET is mapped to 6.
Figure 4.1: Illustration of a time axis TA = (μtime, minute). The incoming data, i.e., timestamps (in milliseconds) between 2000-01-01 00:00:00.000 and 2099-12-31 23:59:59.999 from the time zone CET, are mapped to values 0-9 representing minutes.
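The temporal mapping function of this example can be sketched in a few lines; the function name is illustrative, and time zone handling is omitted for brevity.

```python
from datetime import datetime

def map_time(t_in: datetime) -> int:
    # map the timestamp to the "ones place" of its minutes,
    # i.e., to a chronon between 0 and 9
    return t_in.minute % 10

# e.g., 2000-01-01 10:56:12.432 -> minute 56 -> chronon 6
print(map_time(datetime(2000, 1, 1, 10, 56, 12, 432000)))  # 6
```

Any other many-to-one mapping onto the finite set of chronons would fit the definition equally well; the "ones place" mapping merely keeps the example small.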
4.2 Descriptors
As stated in the informal definition of a time interval (cf. section 2.1.1), properties are used to associate descriptive information with a time interval, e.g., to describe what was observed during that time. In this section, the term descriptor is defined, which is based on the definitions of the terms descriptive attribute, descriptive value, descriptor values, descriptive mapping function, and fact function. In general, a descriptor is used to describe a state, an observation, a statement, or a measurement being valid within the time interval. Such a description can be defined by a simple data type (i.e., a string, a number, an integer, or a logical value). Nevertheless, the incoming data may contain complex structures (e.g., arrays, lists, or objects) associating multiple values of the same property with an interval (e.g., for a task performed, several qualifications, like speaking English, having a driver's license, or not being pregnant, may be needed). The following definitions of a descriptive attribute and a descriptive value cover these points.
Definition 6: Descriptive attribute and descriptive value
A descriptive attribute is a property defined by a label, naming the property, and a set of possible values allowed for the attribute. In general, a not further specified descriptive attribute is denoted by Ai, whereby a named descriptive attribute is referred to by using its label, e.g., the descriptive attribute gender is denoted by Agender = {male, female}. A value of a descriptive attribute is called a descriptive value of the attribute, i.e., ain ∈ Ai.
From an analytical point of view, possibly complex structures have to be mapped to (multiple) simple data types (cf. feature request DI-01 and section 3.2.1), so that the analytical information system is capable of answering queries correctly. For example, assume a descriptive attribute qualification, defined as the power set of all possible qualifications, i.e., Aqualification = ℙ({cleaning, fueling, check-in, English, French, German}), and a task requiring the qualifications specified by the descriptive value {cleaning, English}. If the user queries for all tasks requiring the qualification cleaning, the system is not capable of replying correctly without understanding that the descriptive value is described by a set. Thus, the following formal definition of descriptor values is presented.
Definition 7: Set of descriptor values and descriptor value
Vi denotes the set of descriptor values of the descriptive attribute Ai. As in the case of descriptive attributes, a labeled set of descriptor values is denoted by the specified label, e.g., Vgender. A descriptor value v ∈ Vi is an atomic entity, i.e., a comparable30 and not divisible data type or structure. In addition, the value has to be referable by a unique name, i.e., useful as a unique identifier.
To bring descriptors and descriptor values together, a mapping function is necessary. A descriptive mapping function is defined in the context of a descriptive attribute Ai. It is used to map a descriptive value ain ∈ Ai to a subset of the defined descriptor values. The formal definition is as follows:
30 At least comparable regarding equality, i.e., an equality relation exists.
Definition 8: Descriptive mapping function
A descriptive mapping function μi of a descriptive attribute Ai and the set of descriptor values Vi is defined as μi: Ai → ℙ(Vi). A descriptive mapping function of a labeled descriptive attribute (e.g., Agender) is denoted by using the label as annotation (e.g., μgender).
As motivated, the function maps a single descriptive value to a subset of descriptor values. This enables the system to support many-to-many relationships, as requested by feature DI-01. In addition, the feature request DI-02 is covered by the existence of a mapping function, which can also be used for validation, transformation, or cleansing purposes.
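The qualification example above can be sketched as follows: a set-valued descriptive value is mapped to the subset of atomic descriptor values it contains, so a query for a single qualification can be answered correctly. The record names and data are illustrative assumptions.

```python
def map_qualification(descriptive_value):
    # descriptive mapping function: the complex structure (a set) is
    # split into its atomic descriptor values
    return set(descriptive_value)

# illustrative raw records: task -> set-valued descriptive value
records = {
    "task1": frozenset({"cleaning", "English"}),
    "task2": frozenset({"fueling"}),
}

def tasks_requiring(qualification):
    # correct answering is possible because the mapping function exposes
    # the atomic descriptor values of each record
    return sorted(task for task, value in records.items()
                  if qualification in map_qualification(value))

print(tasks_requiring("cleaning"))  # ['task1']
```

Without the mapping step, a naive comparison of the whole set against the queried value would return no result, which is exactly the problem described above.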
To enable data aggregation along a descriptor value (or a specified subset of descriptor values), it is necessary to associate a numeric value with a specific descriptor value. For example, assuming a descriptor value squad ∈ VgroupSize, one would expect that the value 8 is aggregated for each data element being described as a squad31. On the other hand, a descriptor value v ∈ VpersonnelNr = {00001, …, 99999} would best be related to a constant fact of 1, e.g., to sum up the number of resources needed. Last but not least, assume a descriptive attribute Atemp ≙ ℝ, the descriptor values Vtemp = {high, middle, low}, and the descriptive mapping function μtemp: Atemp → ℙ(Vtemp) defined by μtemp(v) = {low} for v < 30, μtemp(v) = {middle} for 30 ≤ v < 60, and μtemp(v) = {high} otherwise. An aggregation based on temp, e.g., MEAN(temp), should aggregate the raw values, i.e., the descriptive values.
Thus, when aggregating data, the grouped data is combined based on a defined aggregation function and an attribute specifying the values to be aggregated. Therefore, a fact function is introduced, which is used to specify a fact value for a specific descriptive or descriptor value. Based on the previous example, three different types of fact functions are introduced: value-invariant, record-invariant, and record-variant. The implementation regarding the aggregation of time interval data using these different fact functions is presented in section 7.3.4.
31 The typical group size of a squad is considered to be 8.
Definition 9: Fact function (value-invariant, record-invariant, record-variant)
A fact function fi is a function defined for a descriptive attribute Ai. A value-invariant fact function relates every descriptor value v ∈ Vi to a constant number, i.e., fi(v) = n, with n ∈ ℝ. A record-invariant fact function relates each descriptor value v ∈ Vi to a specific number, i.e., fi: Vi → ℝ. Finally, a record-variant fact function is defined by fi: (Vi, Ai) → ℝ. The latter relates a 2-tuple, containing the descriptor value and the descriptive value, to a fact.
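The three types can be sketched with the examples given above; all concrete values (group sizes, skill levels) are illustrative assumptions.

```python
def fact_value_invariant(descriptor_value):
    # value-invariant: every descriptor value maps to the same constant,
    # e.g., counting resources via personnelNr
    return 1

GROUP_SIZES = {"squad": 8, "platoon": 30}  # illustrative sizes

def fact_record_invariant(descriptor_value):
    # record-invariant: the fact depends on the descriptor value only,
    # e.g., the groupSize example
    return GROUP_SIZES[descriptor_value]

def fact_record_variant(descriptor_value, descriptive_value):
    # record-variant: the fact additionally depends on the record's raw
    # descriptive value, e.g., the skill level stored with each language
    return dict(descriptive_value)[descriptor_value]

print(fact_value_invariant("00042"))   # 1
print(fact_record_invariant("squad"))  # 8
print(fact_record_variant(
    "French", (("German", 1.0), ("English", 0.9), ("French", 0.2))))  # 0.2
```

The distinction matters for the implementation in section 7.3.4, since only the record-variant case has to consult the raw record during aggregation.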
Based on the definition of a descriptive mapping function μi, a set of descriptor values Vi, and a fact function fi, the term descriptor is defined as follows:
Definition 10: Descriptor
A descriptor di is a 2-tuple (μi, fi) containing the descriptive mapping function μi used to relate elements of the descriptive attribute Ai, i.e., descriptive values, to elements of the descriptor values Vi. In addition, the tuple contains the fact function fi, which is used to relate a descriptor value to a number. Furthermore, D is defined as the set of all descriptors of the model.
Figure 4.2 illustrates a descriptor dlang. The descriptor describes languages spoken by persons and maps each language to the constant fact 1, using a value-invariant fact function. The descriptive mapping function used in the example is the identity function, i.e., it maps each element of Alang to itself. Thus, regarding the example, Vlang = Alang. Modifying the example by assuming that Alang contains sets of 2-tuples defining the language spoken and a skill-level, i.e., {(German, 1.0), (English, 0.9), (French, 0.2)}, exemplifies the need for a record-variant fact function. Questions like "What was the minimal skill-level of the French-speaking persons during 10:00 – 11:00?" could be answered. Regarding the latter example, it is necessary to modify the mapping function as well, so that a set of tuples is mapped to a set of languages, e.g., {(German, 1.0), (English, 0.9), (French, 0.2)} would be mapped to {German, English, French}.
Figure 4.2: Example of a descriptor dlang = (μlang, flang), which uses an identity function to map the set of languages, i.e., the descriptive values, to the descriptor values.
4.3 Time Interval Database
This section aims to define the structure and modeling of the time interval data handled by the information system. To achieve this, the term time interval is formally introduced, following the definition presented in section 2.1.1.
Definition 11: Time interval
Based on the definition of a time axis TA = (μtime, tgrain), a closed time interval is a subset of Ttime denoted by [tstart, tend] and defined as [tstart, tend] = { t | t ∈ Ttime, tstart ≤ t ≤ tend }. In addition, an open time interval is denoted by (tstart, tend), and half-open intervals are denoted by [tstart, tend) or (tstart, tend].
It should be stated, that any half-open or open interval can be, because of
the discrete time axis, transformed to a closed interval by excluding the
open endpoint(s), i.e., (tx, tx+n) ≡ [tx+1, tx+n-1], [tx, tx+n) ≡ [tx, tx+n-1], and
(tx, tx+n] ≡ [tx+1, tx+n]. Thus, when generally using the term time interval, a
closed time interval is assumed.
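A minimal sketch of this normalization, modeling chronons as integers (an assumption made for illustration only):

```python
# Sketch: normalizing open/half-open intervals to closed intervals on a
# discrete time axis, as stated in Definition 11. Chronons are modeled
# as integers; one step corresponds to the granularity tgrain.

def to_closed(start, end, open_start=False, open_end=False):
    """Return the equivalent closed interval [start', end']."""
    if open_start:
        start += 1  # exclude the open start endpoint
    if open_end:
        end -= 1    # exclude the open end endpoint
    return (start, end)

# (tx, tx+n)  ->  [tx+1, tx+n-1]
assert to_closed(3, 8, open_start=True, open_end=True) == (4, 7)
# [tx, tx+n)  ->  [tx, tx+n-1]
assert to_closed(3, 8, open_end=True) == (3, 7)
# (tx, tx+n]  ->  [tx+1, tx+n]
assert to_closed(3, 8, open_start=True) == (4, 8)
```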
As mentioned in previous sections, the time interval alone is of no rele-
vance for analytical purposes. An important asset is the descriptive infor-
mation revealing what was observed, measured, stated, or collected. Thus,
a data model combining the temporal with the descriptive information is
needed. Therefore, a time interval dataset is introduced to define the
structure of the data.
Definition 12: Time interval dataset and time interval record
A time interval dataset Ddata is defined as a subset of Ttime × Ttime × A1 ×
… × An, with the data time points Ttime and the different descriptive at-
tributes Ai. An ordered tuple r ∈ Ddata is called a time interval record. The
objects of a time interval record are denoted by (rstart, rend, r1, …, rn). In
addition, the objects rstart and rend form a valid time interval [rstart, rend].
Based on the definition of a dataset, the definition of a time interval da-
tabase can be formulated.
Definition 13: Time interval database
A time interval database is a tuple (Ddata, Ttime, A1, …, An), containing
the time interval dataset Ddata, the data time points Ttime, and the descriptive
attributes Ai. Thus, a time interval database contains all data added to
the information system, as well as the possible values of the different
descriptive attributes and data time points.
Figure 4.3 shows an example database. Each time interval record of the
dataset stands for a task performed by a team for a department. The pos-
sible descriptive values are specified by the respective descriptive attrib-
utes Ateam and Adepartment. Furthermore, the possible incoming data time
points are defined to be of second granularity and within the year 2010,
cf. Ttime.
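The example database can be sketched with plain tuples; the concrete records below are illustrative and not taken from Figure 4.3:

```python
# Sketch: a time interval database as a set of records
# (start, end, team, department); values are illustrative.
from datetime import datetime

data = {
    (datetime(2010, 3, 1, 9, 0, 0), datetime(2010, 3, 1, 11, 30, 0),
     "teamA", "maintenance"),
    (datetime(2010, 3, 1, 10, 15, 0), datetime(2010, 3, 1, 12, 0, 0),
     "teamB", "logistics"),
}

# The possible data time points: second granularity within the year 2010.
def in_time_axis(t):
    return datetime(2010, 1, 1) <= t <= datetime(2010, 12, 31, 23, 59, 59)

# Every record must form a valid interval within the time axis.
assert all(in_time_axis(s) and in_time_axis(e) and s <= e
           for s, e, _, _ in data)
```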
Figure 4.3: An example of a time interval database (Ddata, Ttime, Ateam, Adepartment). The database contains tasks performed by teams (a team consists of several team members) for the specified department.
4.4 Dimensional Modeling
Regarding the dimensional modeling introduced by Codd et al. (1993), a
dimension consists of hierarchies, which contain different levels, which
themselves are defined by their members. In addition, the different relations
(i.e., generalization or specialization) are specified. Several publications
stated that it is important to avoid summarizability problems when modeling
a dimension (Lenz, Shoshani 1997; Mazón et al. 2008, 2009, 2011; Niemi
et al. 2014). Nevertheless, many-to-many relationships between members
of different levels exist in real-life scenarios. Thus, the conceptual model
should not treat a many-to-many relationship as a problem. Instead,
such problems have to be solved on a lower level of modeling (i.e., within
the logical or physical model by adding intermediate levels, bridging tables,
or denormalization, cf. Song et al. (2001)). However, the solution presented
in section 7.3.2 avoids any summarizability issues and ensures correct aggregation when rolling up or drilling down.
In this section, a dimensional model for descriptors, as well as the time
axis is defined. The time dimension is thereby regarded as an exceptional
case, because of the special characteristics of time (cf. section 2.1.6). First,
a descriptor’s dimension is defined following Meisen et al. (2014).
Definition 14: Descriptor dimension, hierarchies, levels, and members
A descriptor dimension Δi of a descriptor di = (δi, φi) is a non-empty finite
set of descriptor hierarchies, i.e., Δi = { h1, …, hm }, whereby a descriptor
hierarchy hk is defined as a 3-tuple (V, G, L) satisfying the following
statements:
– V denotes the set of members, and the descriptor values Di are a subset
of V. The members not being a descriptor value are denoted by V' := V \ Di.
– G is a directed acyclic graph G := (V, E) with edges E ⊆ V × V denoting
the relations among the members of the hierarchy. rG denotes the one
member v ∈ V' satisfying ∃!v ∈ V' : deg+(v) = 0. Additionally, G satisfies
∃v ∈ Di : deg–(v) = 0 and ∀v ∈ V' : deg–(v) > 0. These assumptions
ensure that exactly one sink (a.k.a. root) exists, that this root is
reachable from every member, and that every source (a.k.a. leaf) is
a descriptor's value, i.e., is an element of Di.
– L specifies the hierarchy's levels and is defined as a partially ordered
partition of V with binary relation ≼G and {rG} ∈ L. Additionally, L sat-
isfies:
∀l1, l2 ∈ L, l1 ≺G l2 :
(∀n1 ∈ l1, n2 ∈ l2 : max-dist(rG, n1) > dist(rG, n2)
⋀ ∃n1 ∈ l1 ∃n2 ∈ l2 : dist(n1, n2) ≠ ∞
⋀ ∀n2 ∈ l2 ∄n1 ∈ l1 : dist(n2, n1) ≠ ∞)
This assumption guarantees that a descendant of a level (according
to the partial order ≺G) increases the distance to the root and that
at least one node of a level has a path to a precedent level.
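The sink and source constraints of the definition can be checked with a short sketch (reachability is omitted; all names are illustrative):

```python
# Sketch: checking two structural constraints of Definition 14 for a
# descriptor hierarchy given as a directed graph whose edges point
# towards the root. Reachability of the root is not verified here.

def is_valid_hierarchy(nodes, edges, descriptor_values):
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    sinks = [n for n in nodes if out_deg[n] == 0]
    sources = [n for n in nodes if in_deg[n] == 0]
    # exactly one sink (the root) and every source is a descriptor value
    return len(sinks) == 1 and all(s in descriptor_values for s in sources)

values = {"Germany", "France"}
nodes = values | {"Europe"}
edges = {("Germany", "Europe"), ("France", "Europe")}
assert is_valid_hierarchy(nodes, edges, values)
# an isolated extra member creates a second sink and violates the definition
assert not is_valid_hierarchy(nodes | {"x"}, edges, values)
```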
Figure 4.4 shows two descriptor hierarchies. Each is defined for a dif-
ferent descriptor, i.e., the one on the left is defined for a descriptor having
countries as descriptor values, whereby the one on the right is defined for
cities. Both hierarchies are valid according to the definition provided, i.e.,
only one sink exists, the leaves are elements of the descriptor values, and
each member of a level has a successor decreasing or keeping the dis-
tance to the sink.
Figure 4.4: Example of two descriptor hierarchies. The one on the left is based on the descriptor values specified by country and the one on the right is based on city. The example shows a non-strict (left) and a non-covering hierarchy (right). Both hierarchies are valid regarding the definition of descriptor hierarchies.
Next, a dimensional model for the time axis is introduced. As already
mentioned, the dimensional modeling of time is considered to be an ex-
ceptional case, because of the special characteristics of time. A chronon of
the time axis may contain additional information implicitly recognized. In
addition, when moving up a hierarchy, this implicit information may become
invalid. Figure 4.5 illustrates the implicitly recognized information and the
validity of information when rolling up the hierarchy, e.g., 2000-01-06 is a
regional holiday, which does not apply on the month level for the member
January. When defining a hierarchy for the time dimension, the implicitly
recognized information may be taken into account, e.g., by specifying a
holiday level.
Figure 4.5: Example of implicit information recognized for the timestamp 2000-01-06 13:00 CET and the validity of the information when rolling up a hierarchy.
In addition, it must be possible to define the time zone32 a hierarchy ap-
plies to (cf. DA-06). When analyzing data across different time zones, it is
necessary to analyze data from a time zone perspective, as well as a
global, i.e., UTC, perspective (cf. section 2.1.6). Furthermore, it should be
mentioned that the implicitly recognized information may differ depending
on the time zone (cf. Figure 4.5, January is not a month of winter in every
time zone). Figure 4.6 illustrates three hierarchies and the different infor-
mation depending on the time zone. The time axis is based on the UTC,
whereby two of the three hierarchies use a different time zone, i.e., PDT
and CET. Thus, the value of "part of day" changes according to the time
zone. This observation also applies to the "type of day" value, which is set
to "school holiday" for the specified region "Poland, CET".
32 In addition, the region may be important information as well. However, the region has no impact on the time. Thus, it can be recognized by labeling the hierarchy, e.g., hGermany, CET.
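How such implicit information depends on the time zone can be sketched as follows; fixed offsets (CET = UTC+1, PDT = UTC-7) are assumed for brevity instead of proper time-zone rules:

```python
# Sketch: deriving a time-zone-dependent "part of day" for a UTC
# chronon, as illustrated in Figure 4.6. Fixed offsets are an
# assumption for the example; real hierarchies would apply full
# time-zone rules (including daylight saving time).
from datetime import datetime, timedelta

def part_of_day(utc, offset_hours):
    local = utc + timedelta(hours=offset_hours)
    if 6 <= local.hour < 12:
        return "morning"
    if 12 <= local.hour < 18:
        return "afternoon"
    return "night"

chronon = datetime(2000, 1, 6, 12, 0)          # 12:00 UTC
assert part_of_day(chronon, 1) == "afternoon"  # 13:00 in UTC+1 (CET)
assert part_of_day(chronon, -7) == "night"     # 05:00 in UTC-7
```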
Figure 4.6: Example of three different hierarchies for a time-axis. The values of the shown hierarchies differ, based on the time zone selected and the region utilized.
Definition 15: Time dimension, hierarchies, levels, and members
A time dimension Δtime of a time axis (Ttime, tgrain) is a non-empty set of
time hierarchies, i.e., Δtime = { h1, …, hm }, whereby a time hierarchy hk is
defined as a 3-tuple (N, T, L) satisfying the following statements:
– N denotes the members of the hierarchy. The chronons of the
time axis are a subset of N, i.e., Ttime ⊂ N.
– T is a rooted plane tree T := (N, E) with edges E ⊆ N × N, defining the
relations among the members of the hierarchy. In addition, the depth of
all leaves is equal and denoted by Tdepth. Furthermore, the set of all
nodes of depth k is denoted by Nk and, to be consistent33, T is directed
towards the root. The leaves of the tree, specified by NTdepth, are the
chronons of the time axis, i.e., NTdepth ≡ Ttime.
– L is a totally ordered partition of N, i.e., L := { Nk | 0 ≤ k ≤ Tdepth }, and
defines the levels of a time hierarchy. The relation is denoted by ≺T
and defined as NTdepth ≺T … ≺T N1 ≺T N0. In addition, a total order for
each set Nk with 0 ≤ k < Tdepth is assumed, and for NTdepth the total order
defined for Ttime is applied.
33 A hierarchy of a descriptor dimension is also directed towards the root.
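Deriving the levels Nk as the sets of nodes of equal depth can be sketched as follows (the parent mapping and node names are illustrative):

```python
# Sketch: computing the levels Nk of a time hierarchy from a rooted
# tree, following Definition 15. The tree is given as a child-to-parent
# mapping; the root has no parent.

def levels(parent):
    def depth(n):
        return 0 if parent[n] is None else 1 + depth(parent[n])
    result = {}
    for n in parent:
        result.setdefault(depth(n), set()).add(n)
    return result

parent = {"2000": None, "Jan": "2000", "Feb": "2000",
          "2000-01-06": "Jan", "2000-02-01": "Feb"}
lvls = levels(parent)
assert lvls[0] == {"2000"}                      # N0: the root
assert lvls[1] == {"Jan", "Feb"}                # N1: months
assert lvls[2] == {"2000-01-06", "2000-02-01"}  # NTdepth: chronons
```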
The presented definition does not imply an explicit declaration of a time
zone. Nevertheless, the definition supports multiple hierarchies, e.g., one
hierarchy for each time zone needed. Supporting multiple hierarchies also
makes it possible to define different hierarchies for the same time zone but
different regions (e.g., it is possible to define a hierarchy explicitly for the
region "Bavaria, Germany, CET" and another one for the region "Hesse,
Germany, CET"). Figure 4.6 outlines three time hierarchies defined for the
UTC, PDT, and CET time zones. The hierarchies of the UTC and PDT time
zones are equal, except for the additional level needed to map the UTC
chronons to the PDT time zone. Within the example, the hierarchy defined
for the CET time zone uses a different structure (i.e., after the mapping to
the time zone a "type of day" level is utilized).
Definition 16: Dimensions
The dimensions Δ are defined as the set containing all descriptor di-
mensions (i.e., a maximum of one dimension per defined descriptor)
and a maximum of one time dimension, e.g., Δ = { Δtime, Δ1, …, Δn }.
4.5 Summary
To summarize, this chapter presented the TIDAMODEL, which is the answer
to RQ2 "Which aspects must be covered by a time interval data analysis
model and how can it be defined?". The model is based on four aspects, i.e.,
– the time interval database: defining data pushed into the system,
– the time axis: modeling the discrete, linear, bounded temporal model,
– the descriptors: specifying the attributes (properties) describing the ob-
served, measured, or stated information, and
– the dimensions: defining the dimensional model for descriptors and the
time axis.
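The four aspects can be gathered in one structure; the following sketch uses illustrative field names, not the names of the actual implementation:

```python
# Sketch: the four aspects of the TIDAMODEL gathered in one structure.
from dataclasses import dataclass, field

@dataclass
class TimeAxis:
    start: int          # first chronon of the bounded axis
    end: int            # last chronon of the bounded axis
    granularity: str    # e.g., "second"

@dataclass
class Descriptor:
    identifier: str
    mapping: object     # descriptive mapping function
    fact: object        # fact function

@dataclass
class TidaModel:
    time_axis: TimeAxis
    descriptors: list
    dimensions: dict = field(default_factory=dict)
    database: set = field(default_factory=set)

# A model with one descriptor using a value-invariant fact function.
model = TidaModel(TimeAxis(0, 59, "second"),
                  [Descriptor("lang", lambda v: {v}, lambda v: 1)])
assert model.descriptors[0].fact("German") == 1
```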
Figure 4.7 depicts the TIDAMODEL and its different elements. Besides
the mentioned elements, the figure illustrates a time interval data record
with n descriptors, one time dimension Δtime, and one descriptor dimension
Δ1.
Figure 4.7: Illustration of the TIDAMODEL showing all defined elements.
As already mentioned, the presented model is motivated by the features
listed in section 2.2, the characteristics of time (cf. section 2.1), and the
literature research regarding time interval data analysis (cf. chapter 3). Be-
low, several feature requests are enumerated and their impact, relating to
the definitions, is explained:
– DA-01 influenced the definition of the time axis, i.e., the definition of
chronons, the provision of a total order, and the mapping function.
– DA-03 was considered when specifying the time interval database, i.e.,
raw records have to be available, as well as the time axis, i.e., regard-
ing the support of temporal operators.
– DA-04 motivated the definition of the dimensional model, in particular
the modeling of the time dimension.
– DA-05 explains the need for descriptor dimensions.
– DA-06 was extensively discussed in this section. The support of multi-
ple hierarchies and the understanding of time zones are important as-
pects for the implementation of the model (cf. section 7.3.1).
– DC-03 did not have an immediate impact. Nevertheless, the dimensional
model defined was reviewed regarding the fulfillment of this require-
ment, i.e., whether pre-aggregates may be applied.
– DI-01 forces the descriptive mapping function to relate descriptive val-
ues to a set of descriptor values.
– DI-02 motivated the introduction of a descriptive mapping function.
However, the implementation provides additional strategies to define
default behaviors (cf. section 7.2.1).
– DI-03 was recognized within the time axis definition, i.e., to support
such strategies, the time axis must provide the needed information,
e.g., its boundaries must be known and intervals must be verifiable.
– DI-04 is partially covered by the existence of mapping functions. Nev-
ertheless, as introduced in section 7.2.1 additional solutions are avail-
able.
5 TIDAQL: Querying for Time Interval Data
A query language allows the user to access data of the information system,
e.g., for further processing, visualization, backups, or to test a hypothesis
by additional analysis. In any case, the acceptance of a query language
depends on several design criteria. Snodgrass (1995, pp. 282–284) intro-
duced six measures useful to make appropriate design decisions when
specifying a language: expressive power, consistency, clarity, minimality,
orthogonality, and independence. In addition, Catarci, Santucci (1995)
added the criterion ease-of-use. Table 5.1 lists the criteria and gives a short
description.
Table 5.1: Overview of the seven criteria used as basis for design decisions re-garding a query language.
– expressive power: The language must be suitable for its intended application and should not "impose undesirable restrictions".
– consistency: The syntax should be "internally consistent" and systematically extendable. In addition, it should be inspired by standards.
– clarity: The syntax should "clearly reflect the semantics" and facilitate "formulating and understanding queries".
– minimality: The syntax should only add "as few as possible new reserved words".
– orthogonality: The reasonable numbers in a design are zero, one, and infinity. Thus, "it should be possible to freely combine query language constructs that are semantically independent".
– independence: Each function should be "accomplished in only one way".
– ease-of-use: The query language should be "closer to the user view of the reality". It should be "attractive and graspable". In addition, it should fit the user's knowledge and expectation.
Besides the features requested regarding a query language (cf.
DA-01 – 05, DA-08, PD-02, DC-02), the criteria of Catarci, Santucci, and
Snodgrass are used as a guideline. In the sections of this chapter, the time
interval data analysis query language (TIDAQL) is described. Meisen et al.
(2015a) outlined selected features of the language, which are introduced
in this chapter in detail. Furthermore, additional language elements, like
analytical results, are presented.
Following the SQL language, the statements of the language are cate-
gorized in three groups: data control language (DCL), data definition lan-
guage (DDL), and data manipulation language (DML). The chapter is di-
vided according to this classification, i.e. DCL is introduced in section 5.1,
the DDL is described in section 5.2, and the DML is presented in section
5.3.
5.1 Data Control Language
Today, every system available within a network needs authorization and
authentication mechanisms to ensure the correct and intended usage of the
system. The DCL is used to control the access to the available data. Addi-
tionally, it is used to define which statements a specific user or a user group
is allowed to execute. As mentioned in section 2.2, specific features
considering the security aspects of the system were not listed. However,
during the workshops several requirements were specifically formulated.
With regard to the DCL, two important aspects were mentioned: (1) the
existence of security mechanisms, e.g., granting and revoking permissions,
supporting roles, or deleting users; (2) the permissions must be grantable
for a specific model or on a general level, e.g., a user group should not be
able to add intervals to a specific model, but should generally be capable
of selecting data. Applying the design criteria mentioned, the presented DCL is close
to the one known from SQL. Thus, the commands: ADD, DROP, MODIFY,
GRANT, REVOKE, ASSIGN, and REMOVE are defined within the lan-
guage.
To add a user or a role to the system an ADD command is provided.
The syntax of statements using the command is shown in Listing 5.1. When
adding a user, a name and a password must be declared. In addition, per-
missions can be granted and roles can be assigned to the created user. A
role is added by providing a name and, if needed, a comma separated list
of permissions. The language does not define the syntax of a permission,
i.e., any string is allowed. Nevertheless, a concrete implementation may
validate if the assigned permission is known and specify what kind of per-
missions are allowed (e.g., wildcards may be supported to grant all permis-
sions of a specific model to a user: 'MODEL.myModel.*' 34).
Listing 5.1: Syntax of statements using the ADD command of the DCL to add a user or a role.
ADD USER 'name' WITH PASSWORD 'password'
[WITH PERMISSIONS 'permission1' [, 'permission2', ...]]
[WITH ROLES 'role1' [, 'role2', ...]]
ADD ROLE 'name' [WITH PERMISSIONS 'permission1' [, 'permission2', ...]]
It may be necessary to drop a created user or role. In that case, the
DROP command can be utilized. The syntax of statements is given in List-
ing 5.2. In general, a user or a role should be droppable at any time. It
depends on the processing whether a logged-in user can be dropped or
whether the session has to be closed prior to the deletion. The same applies
to a role, which might be assigned to a logged-in user.
Listing 5.2: Syntax of statements of the DCL, used to drop a user or a role.
DROP [ROLE|USER] 'name'
The modification of a role or a user is limited to specific values, i.e., the
name of a role or a user cannot be modified. Thus, the only value that can
be modified within the DCL is the user's password.
34 myModel is an example for a unique identifier of a model loaded into the system (cf. section 7.2.1).
One may argue that
granting or revoking a permission from a user or role is also a modification.
However, granting and revoking of permissions are processes that are
logically separated from the modification of an entity's attributes. Thus,
the DCL introduces different commands to revoke and grant permissions,
namely REVOKE and GRANT. Listing 5.3 shows the syntax of statements
for all three commands, useful to modify a user’s password and grant or
revoke a permission from a user or a role.
Listing 5.3: Syntax of the statements using the commands MODIFY, GRANT, and REVOKE.
MODIFY USER 'name' SET PASSWORD = 'name'
GRANT 'permission1' [, 'permission2', ...] TO [ROLE|USER] 'name'
REVOKE 'permission1' [, 'permission2', ...] FROM [ROLE|USER] 'name'
The last commands of the DCL introduced are used to assign and re-
move roles from a user. When creating a user it is possible to assign spe-
cific roles to the user. However, so far it is not possible to assign new roles
to or remove a role from a user. Therefore, the commands ASSIGN and
REMOVE are presented in Listing 5.4. The syntax shows that the words
ROLE or ROLES are allowed. Tests have shown that inexperienced
users tend to use the keyword ROLES instead of ROLE when they assign
or revoke multiple roles at once. Regarding the ease-of-use criterion, both
keywords are valid according to the defined syntax35.
Listing 5.4: Syntax of statements for the commands ASSIGN and REMOVE, used to modify the roles assigned to a user.
ASSIGN [ROLE|ROLES] 'role1' [, 'role2', ...] TO USER 'name'
REMOVE [ROLE|ROLES] 'role1' [, 'role2', ...] FROM USER 'name'
35 The syntax also shows that the statement ASSIGN ROLE 'role1', 'role2' TO USER 'philipp' is valid. From a system perspective it does not matter if the statement is grammatically correct.
As mentioned at the beginning of this section and briefly discussed in
the context of the ADD command, one of the requests specified the kind of
permissions needed, namely that permissions must be grantable on a global
or a model-specific level. Even if not specified by the syntax, two different
types of permissions were implemented: GLOBAL.<permission> and
MODEL.<model>.<permission>. The first one is used to grant a permission
on a global level, e.g., the retrieval of data is generally allowed. The latter is
used to grant a permission for the specified model, e.g.,
MODEL.myModel.MODIFY would allow the user to modify the one and
only model with the name myModel.
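A permission check covering both types can be sketched as follows; the helper name is_permitted is hypothetical, and the wildcard handling follows the 'MODEL.myModel.*' example mentioned above:

```python
# Sketch: resolving the two implemented permission types,
# GLOBAL.<permission> and MODEL.<model>.<permission>, including the
# model-level wildcard. The helper name is illustrative.

def is_permitted(granted, model, permission):
    return (f"GLOBAL.{permission}" in granted
            or f"MODEL.{model}.{permission}" in granted
            or f"MODEL.{model}.*" in granted)

granted = {"GLOBAL.SELECT", "MODEL.myModel.*"}
assert is_permitted(granted, "otherModel", "SELECT")  # global grant
assert is_permitted(granted, "myModel", "MODIFY")     # model wildcard
assert not is_permitted(granted, "otherModel", "MODIFY")
```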
5.2 Data Definition Language
The DDL is used for defining the TIDAMODELs made available by the information
system. A model is defined by its database, time axis, descriptors, and di-
mensions (cf. chapter 4). Instead of defining statements to create or modify
each of these entities, the DDL provides three commands: LOAD,
UNLOAD, and DROP. The first command is used to load a specific
model by providing a definition-file, whereby the latter two are used to un-
load or delete a model. The UNLOAD command is used to remove the
model from memory, i.e., the model is not available anymore, but can be
loaded if needed. In contrast, the DROP command removes all data belonging
to the model. Listing 5.5 shows the syntax of statements using the com-
mands.
Listing 5.5: Syntax of statements using the LOAD, UNLOAD, and DROP com-mands of the DDL.
LOAD [modelId|"modelId"|FROM 'location']
[SET autoload = [true|false] [, force = [true|false]]]
UNLOAD [modelId|"modelId"]
DROP MODEL [modelId|"modelId"]
As mentioned, the LOAD command can be used to load a model into
the system by providing the location of a model-definition-file. In all other
cases, i.e., when providing a model-identifier like LOAD "myModel", the
model must be known to the system, i.e., must have been loaded from a
location before. Irrespective of whether or not the model was loaded from
a location or the system, additional properties can be set. These properties
are autoload (i.e., specifying if the system should load the model on start-
up) or force (i.e. specifies that the model has to be loaded from the location,
independent if another model with the same identifier exists already).
To utilize the statements to unload or drop a model from the information
system, a model-identifier has to be declared. When a model is actually un-
loaded or dropped depends on the implementation. For example, it may happen
that a manipulation query is running while another user fires a drop query
for the same model. Depending on the implementation, the drop may be
performed and an exception thrown, or the drop may be delayed until
all operations dealing with the model are handled. An implementation re-
garding these issues, as well as a definition of the model-definition-file, is
presented in section 6.2.
5.3 Data Manipulation Language
A DML is used to insert, update, or select data from the database. Even if
the selection of data does not manipulate the persisted data directly, raw
data is manipulated (e.g., aggregated) during the processing. In this sec-
tion, the defined statements are divided into three groups. The first group con-
tains statements used to manipulate raw data, i.e., utilizing the INSERT,
DELETE, and UPDATE commands (cf. 5.3.1). The second group, which en-
closes the GET and ALIVE commands, defines statements useful to retrieve
metadata, e.g., the defined models or dimensions, as well as the sys-
tem's health (cf. 5.3.2). In section 5.3.3, statements utilizing the SELECT
command are introduced, useful to retrieve aggregated data along the de-
fined dimensions, raw data, as well as analytical results. The latter was
added to the DML to apply analytical functions, like data mining algorithms,
to selected groups of datasets (cf. section 2.2.1 and 2.2.2). Several feature
requests regarding the DML were already mentioned in section 2.2. In ad-
dition, the following subordinate features were requested: (1) the language
should provide a construct to enable a type of bulk load to increase insert
performance, (2) the language should support a construct to receive meta-
information from the system like the actual version, available users, or
loaded models, and (3) the syntax of the query language should support
intervals defined as open, e.g., (0, 5), closed, e.g., [0, 5], or half-open, e.g.,
(0, 5].
5.3.1 Insert, Delete, & Update Statements
In an analytical information system, the insertion of data is the most fre-
quently used statement to manipulate the raw data of the database. In gen-
eral, delete statements are performed much less frequently and update
statements are rare. The reasons are clear: data is added to the system
whenever the interval is closed and the associated descriptive values are
known. Adding incomplete or uncertain time interval data to the system
would affect the quality of the analysis. Nevertheless, it occurs that added
data is classified as noise, e.g., by applying clustering algorithms, and
therefore has to be deleted. In addition, users may be able to update infor-
mation, which was assumed to be complete, within a source system. Thus,
these updates must be reflected within the information system.
Listing 5.6 illustrates the syntax of statements using the INSERT com-
mand. The statement specifies the identifier of the model, the structure of
the data to be added, and the values.
Listing 5.6: Syntax of statements using the INSERT command of the DML.
INSERT INTO [modelId|"modelId"] (id1 [, id2, ...])
VALUES (value1 [, value2, ...]) [,(value1 [, value2, ...]), ...]
The structure is defined by the identifiers of the descriptors, as well as the
reserved words [START] and [END], which specify the position of the tem-
poral start and end value (i.e., the interval). It is also possible to add a
minus (i.e., -) to specify the interval as open, e.g., [START-] or [END-]. An
example of a statement using the INSERT command exemplifies the men-
tioned aspects:
INSERT INTO myAppleObservations
(COLOR, CLASS, [START], WEIGHT, [END‐], FALL, DURATION)
VALUES ('red', '2', 09:45:12, '220', 09:45:48, '1.00', '0.45') .
The statement adds the time interval data used in the apple falling from
tree example (cf. section 2.1.1) into a model, which is loaded into the sys-
tem and named myAppleObservations36. It is noticeable that the temporal
information provided within the list of values does not use any apostrophe.
A temporal value is generally not marked and can be a date-time (the syn-
tax allows several different formats, i.e., ANSI INCITS 30-1997 (R2008),
NIST FIPS PUB 4-2, ISO 8601, and some non-standardized) or integer
value. The handling of integer values is defined by the time axis, i.e., the
semantic meaning of the number (cf. section 7.3.1). In the example, the
interval is defined as half-open, i.e., [START, END). Thus, the system has
to interpret the temporal information 09:45:48 as 09:45:47 (assuming that
a second granularity is defined).
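The interpretation of a half-open end value at second granularity can be sketched as follows (the helper name is illustrative):

```python
# Sketch: interpreting a half-open end value [START, END) at second
# granularity, as in the example where 09:45:48 is interpreted as
# 09:45:47.
from datetime import datetime, timedelta

def closed_end(end, open_end, granularity=timedelta(seconds=1)):
    return end - granularity if open_end else end

end = datetime(2000, 1, 1, 9, 45, 48)
assert closed_end(end, open_end=True).strftime("%H:%M:%S") == "09:45:47"
assert closed_end(end, open_end=False).strftime("%H:%M:%S") == "09:45:48"
```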
To add, e.g., several thousand time interval data records into the sys-
tem, a bulk load can be enabled. If the bulk load is enabled, the system
only updates indexes or persists data when needed, e.g., because it is run-
ning low on memory, until the bulk load is finished. Listing 5.7 shows the
syntax of the statement to enable (i.e., bulkload = true) and disable (i.e.,
bulkload = false) the bulk load.
36 The name of a model is specified within the configuration file of a model (cf. section 7.2.1).
Listing 5.7: Syntax of the statement to enable or disable bulk load for a model.
MODIFY MODEL [modelId|"modelId"] SET bulkload = [true|false]
The deletion of time interval data records added to the system is per-
formed using statements utilizing the DELETE command. The syntax of
such statements is illustrated in Listing 5.8. As shown, the declaration of a
record identifier is necessary. The deletion of records by filter criteria (e.g.,
as known from SQL) is not supported. As mentioned, the deletion of a rec-
ord is decided on a record level. Thus, the record identifier is known, e.g.,
by selection or from a result of an analysis.
Listing 5.8: Syntax of the statement to delete a specified record from a model.
DELETE recordId FROM [modelId|"modelId"]
Updating a time interval record is, like a delete statement, based on the
record's identifier. Within an update statement, all information can be mod-
ified with the exception of the record's identifier. The syntax of a statement
using the UPDATE command is illustrated in Listing 5.9. Unlike an insert
statement, an update statement can only include a single record. Thus, the
syntax only supports one value list.
Listing 5.9: Syntax of statements using the UPDATE command of the DML.
UPDATE recordId FROM [modelId|"modelId"] SET (id1 [, id2, ...])
VALUES (value1 [, value2, ...])
5.3.2 Get & Alive Statements
For an information system used in a productive environment, some addi-
tional non-data-related information must be available. On the one hand,
this information may be provided by an API (e.g., via a web interface using
JSON or libraries); on the other hand, the user may want to use the infor-
mation within a report, a dashboard, or any other proprietary tool using a
database connection. To support the latter, the GET and ALIVE commands
are added to the DML. Some may argue that such commands are not part
of a DML. However, read-only queries are often considered to be part of
the DML.
Listing 5.10 shows the available syntax for statements based on the
GET command. The language supports five different types of meta-infor-
mation to be retrieved. GET VERSION is used to retrieve the version of the
information system, GET MODELS provides a set of records containing the
available models, GET USERS returns a list of all users together with the
assigned permissions and roles, GET ROLES lists the roles and assigned
permissions, and GET PERMISSIONS responds with a set of all permis-
sions defined for the information system.
Listing 5.10: Syntax of statements using the GET command of the DML.
GET [VERSION|MODELS|USERS|ROLES|PERMISSIONS]
In addition, the availability of the system is of importance, e.g., to mon-
itor the service's health. To provide a quick possibility to check the system's
health, the ALIVE command is added to the DML. The system replies to an
alive statement with an empty set. If the system's health is critical, the sys-
tem will not reply at all or will throw an exception, which would lead to an
exception on the client side.
5.3.3 Select Statements
Most of the requested features mentioned regarding the analytical capabil-
ities of the information system deal with select statements, e.g.,
several aggregation methods must be available (cf. DA-01, DA-02), the raw
time interval data records must be retrievable (cf. DA-03), dimensional op-
erations like roll-up and drill-down must be provided (cf. DA-04, DA-05),
time zones must be supported (cf. DA-06, DA-07), and analytical results must
be creatable (cf. DA-08). To satisfy in particular the ease-of-use, consistency,
and clarity criteria, the select statements are grouped into three types: time
series, records (i.e., raw data), and analytical results.
Select Time Series
Listing 5.11 outlines the syntax of a statement to retrieve time series from the system within a specified time window. The query determines a time series for each group and measure specified. In addition, it is possible to retrieve a transposed time series, which is necessary for some third-party tools or libraries, e.g., the JFreeChart37 library expects transposed time series and is used by several Java based reporting and business intelligence tools38. Also, the statement specifies the model to retrieve the data from, as well as the interval. An interval can thereby be defined using open, closed, or half-open notation. Depending on the time axis, the values of the intervals' endpoints must be integers or date-time values, e.g.:
[5, 10], [13.10.1981, 08.04.2005), (2014/10/05 09:58:00, 2014/10/09 16:12:00) .
Listing 5.11: Syntax of the select statement to retrieve time series of a specified time window.
SELECT [TRANSPOSE(TIMESERIES)|TIMESERIES]
OF measureExpr1 [AS "alias1"] [, measureExpr2 [AS "alias2"], ...]
[ON timeDimensionalExpr] FROM [modelId|"modelId"] IN interval
[WHERE logicalExpr] [GROUP BY groupExpr]
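To illustrate the syntax of Listing 5.11, the following is a hypothetical query; the model id sales, the measure expression SUM(VALUE), and the time dimensional expression TIME.DEF.DAY are illustrative assumptions (only WORLD.GEO.COUNTRY is taken from the sample dimension used later):

```
SELECT TIMESERIES OF SUM(VALUE) AS "daily total"
ON TIME.DEF.DAY
FROM sales IN [01.01.2015, 31.01.2015]
GROUP BY WORLD.GEO.COUNTRY
```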
The syntax of the statement to select time series is based on several expressions not explained so far: the measure expression (i.e., measureExpr), the dimensional expression (i.e., dimensionalExpr, timeDimensionalExpr, or descDimensionalExpr), the logical expression (i.e., logicalExpr), and the group expression (i.e., groupExpr). Prior to introducing these different expressions, the syntax of the statements to select time interval records and analytical results is introduced.
Select Records
Selecting records from the system is an important feature for analytical purposes (e.g., data mining algorithms), as well as for explanation, e.g., to help the analyst understand the result of an aggregation by presenting the involved records. Listing 5.12 shows the syntax of a statement to select records from the information system. Instead of retrieving the raw records, it is also possible to count them or to retrieve only their identifiers.
37 http://www.jfree.org/jfreechart/
38 E.g., Pentaho (pentaho.com), JasperSoft (jaspersoft.com), or YellowFin (yellowfinbi.com).
Listing 5.12: Syntax of the select statement to retrieve time interval records from the information system.
SELECT [RECORDS|COUNT(RECORDS)|IDS(RECORDS)]
FROM [modelId|"modelId"]
[EQUALTO|BEFORE|AFTER|MEETING|DURING|CONTAINING|STARTINGWITH|
FINISHINGWITH|OVERLAPPING|WITHIN] interval
[WHERE [logicalExpr|idExpr]] [LIMIT int[, int]]
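Analogously, a hypothetical select records statement following Listing 5.12 could read as follows; the model id sales and the descriptor LOCATION are illustrative assumptions:

```
SELECT IDS(RECORDS) FROM sales
WITHIN [01.01.2015, 31.01.2015]
WHERE LOCATION = 'Aachen*'
LIMIT 0, 10
```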
The syntax introduces ten temporal operators: EQUALTO, BEFORE, AFTER, MEETING, DURING, CONTAINING, STARTINGWITH, FINISHINGWITH, OVERLAPPING, and WITHIN. The interested reader may notice that Allen introduced thirteen temporal relationships (cf. section 2.1.4). When using a temporal relationship within a query, the user defines one of the intervals used for comparison. Thus, the inverse relationships (i.e., the inverses of meets, overlaps, starts, and finishes) were removed, because they are not needed; instead, the user can modify the self-defined interval. Furthermore, the WITHIN operator is added to retrieve all intervals having at least one common chronon with the time window. Figure 5.1 depicts the available operators and the relations covered. In addition, an example is provided illustrating the intervals fulfilling the query.
Figure 5.1: Illustration of the provided temporal operators and their corresponding temporal relation.
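The semantics of a few of these operators on a discrete time axis with closed integer intervals can be sketched as follows. The class is illustrative and not the system's implementation; in particular, the discrete readings of MEETING (end exactly one chronon before the window) and BEFORE (at least one chronon gap) are assumptions:

```java
// Sketch: selected temporal operators on closed integer intervals
// [s, e], evaluated against a query window [ws, we]. Method names
// mirror the query keywords; the implementation is illustrative.
public final class TemporalOps {
    // EQUALTO: interval and window cover exactly the same chronons
    public static boolean equalTo(long s, long e, long ws, long we) {
        return s == ws && e == we;
    }
    // BEFORE (assumed Allen-style): interval ends with a gap before the window
    public static boolean before(long s, long e, long ws, long we) {
        return e + 1 < ws;
    }
    // MEETING (assumed discrete reading): interval ends one chronon before the window
    public static boolean meeting(long s, long e, long ws, long we) {
        return e + 1 == ws;
    }
    // DURING: interval lies strictly inside the window
    public static boolean during(long s, long e, long ws, long we) {
        return s > ws && e < we;
    }
    // WITHIN: interval shares at least one common chronon with the window
    public static boolean within(long s, long e, long ws, long we) {
        return s <= we && e >= ws;
    }
}
```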
Regarding the utilized expressions, the select records statement uses a
logical expression or an identifier expression to filter the received set of
records. Within the next subsection, the statements to select analytical re-
sults are presented. Thereafter, the different expressions are introduced
and discussed in detail.
Select Analytical Results
Analytical results can be queried by using the ANALYTICALRESULT keyword within a SELECT statement. An analysis is defined within the information system, i.e., by providing a script or an implementation. The system fires the specified select time series or select records statements and streams the results to the specified algorithm. In addition, parameters may be defined to configure the algorithm. Listing 5.13 illustrates the syntax of the select analysis statement. The algorithm is referred to by name (cf. section 7.2.2) or directly by specifying the fully qualified class.
Listing 5.13: Syntax of the select statement to retrieve analytical results from the information system.
SELECT ANALYTICALRESULT OF /statement1/ [, /statement2/, ...]
USING ['algorithm'|'class']
[SET param1 = 'value1' [, param2 = 'value2', ...]]
In the following, the different expressions are defined and examples are presented, starting with the measure expressions used in the statement to select time series.
Measure Expressions
Measure expressions are based on facts provided and associated with the descriptors of the model (cf. section 4.4). An expression is defined by descriptors, mathematical operators, and aggregation operators, e.g.:
SUM(DESC1 * (DESC2 / DESC3)) + MIN(DESC4) .
In general, the aggregation operator is not specified within the syntax of a measure expression. The reason is extensibility regarding new operators. The implementation presented in section 7.3.4 supports the definition of new operators programmatically. These operators can directly be used within the query language without any additional effort. In addition, a measure expression can also be applied for a specific dimensional level. To support the TAT aggregation technique presented in section 2.1.2, a second aggregation operator can be specified, if and only if a dimensional expression is specified within the query39, e.g.:
MAX(SUM(DESC1 * (DESC2 / DESC3)) + MIN(DESC4)) + MIN(COUNT(DESC1)) .
The select time series statement supports the STA and TAT aggregation techniques for measures, using levels to specify the partition of the time axis. A time series cannot apply any aggregation of equal results along the time axis, as done by ITA or MWTA; if it did, the result of the query would not be a time series, i.e., it would not have a value calculated for each time point of the time window. However, ITA and MWTA can be calculated in linear time by iterating over the sorted values (e.g., by using an analytical function introduced later in this section).
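The two-level (TAT-style) aggregation described above can be sketched as follows: the inner operator is applied per partition of the time axis, the outer operator across the partition results. The class, the fixed partition size, and the choice of SUM/MAX are illustrative assumptions, not the system's implementation:

```java
// Sketch: two-level aggregation, e.g., MAX(SUM(...)) over a partition
// of the time axis (partition size and operators are illustrative).
public final class TwoLevelAggregation {
    public static double maxOfSums(double[] values, int partitionSize) {
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i += partitionSize) {
            double sum = 0;
            for (int j = i; j < Math.min(i + partitionSize, values.length); j++) {
                sum += values[j]; // inner aggregation: SUM per partition
            }
            max = Math.max(max, sum); // outer aggregation: MAX across partitions
        }
        return max;
    }
}
```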
Dimensional Expressions
In general, a dimensional expression is used to refer to a defined level of a dimension (cf. section 4.4). A user utilizes a dimensional expression to roll up (generalize) or drill down (specify) the different levels of a hierarchy. Depending on the type of dimension expected, the expression can be specified to be a time or a descriptor dimensional expression. Independent of the type is the syntax of such an expression, which is exemplified as follows:
DimensionIdentifier.HierarchyIdentifier.LevelIdentifier .
The expression consists of three parts, each referring to a specified part of the dimension using a unique identifier40. Figure 5.2 shows a sample dimension named "World" that is identified by WORLD. The illustrated dimension has two hierarchies, of which only one is shown, namely the hierarchy Geographic location, identified by GEO. The hierarchy GEO has three levels, i.e., World (identified by *), Country (identified by COUNTRY), and City (identified by CITY). Each of the defined levels has at least one member.
39 The TAT expects the specification of a partition of the time axis, cf. Figure 2.4.
40 Unique according to its context, i.e., the dimension's identifier is unique among all dimensions, the hierarchy's identifier is unique among all hierarchies of the specified dimension, and the level's identifier is unique among all levels of the hierarchy.
Figure 5.2: Sample dimension showing one of two hierarchies with three levels.
Following the presented syntax of a dimensional expression, an expression
to select, e.g., the level named Country, would be:
WORLD.GEO.COUNTRY .
Logical Expressions
A logical expression is used within a select statement to filter the time interval data records retrieved. The query language supports the following logical connectives: AND, OR, and NOT. In addition, the system supports the equal operator and the usage of parentheses to formalize complex logical expressions. Furthermore, to specify multiple values, wildcards are supported by the equal operator, e.g.:
NOT(DESC1 = 'A*' OR DESC2 = 'LESS') AND DESC3 = 'VALID' .
The example shows a logical expression filtering data by the descriptor values of the specified descriptors. In addition, it is possible to use dimensional expressions as filter criteria. In that case, the information system selects all time interval records which have a member on the specified level with the specified value, e.g., assuming the dimension shown in Figure 5.2, the following logical expression filters all intervals associated to the USA:
WORLD.GEO.COUNTRY = 'COUNTRY_USA' .
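One possible way to evaluate the equal operator with '*' wildcards is to translate the pattern into a regular expression. The helper class below is hypothetical and only illustrates the idea; it is not the system's parser:

```java
import java.util.regex.Pattern;

// Sketch: evaluating an equal operator with '*' wildcards, as used in
// logical expressions like DESC1 = 'A*' (illustrative translation).
public final class WildcardEqual {
    public static boolean matches(String value, String pattern) {
        // quote the literal parts, turn each '*' into '.*'
        StringBuilder regex = new StringBuilder();
        for (String part : pattern.split("\\*", -1)) {
            if (regex.length() > 0) regex.append(".*");
            regex.append(Pattern.quote(part));
        }
        return value.matches(regex.toString());
    }
}
```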
Identifier Expressions
Using a logical expression to filter data does not provide any possibility to select records by their identifier. To enable the user to do so, identifier expressions are introduced. An identifier expression specifies a list of identifiers that should be returned, e.g.:
[ID] = 1, 5, 7, 12 .
Group Expressions
Group expressions are used to specify the groups of data to be aggregated. A group expression can be based on several descriptors or a level of a dimension. It is also possible to specify several criteria to form a group, e.g., assuming a model with two descriptors temp = {high, middle, low} and gender = {male, female}, the following group expression would generate six groups, namely (male, high), (male, middle), (male, low), (female, high), (female, middle), and (female, low):
GENDER, TEMP .
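The generation of the groups as the cross product of the specified value sets can be sketched as follows; the helper class is illustrative, not the system's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a group expression over several descriptors generates the
// cross product of the descriptors' value sets (illustrative only).
public final class GroupGenerator {
    public static List<List<String>> groups(List<List<String>> valueSets) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start with the empty group
        for (List<String> values : valueSets) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String v : values) {
                    List<String> g = new ArrayList<>(prefix);
                    g.add(v);
                    next.add(g);
                }
            }
            result = next;
        }
        return result;
    }
}
```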
As already mentioned, it is also possible to use a level of a dimension as grouping criterion. For example, assume a third descriptor city = {Aachen, Cologne, Jacksonville, San Francisco} within our model and the dimension depicted in Figure 5.2. The following group expression generates ten groups ((Germany, male), (Germany, female), (USA, male), (USA, female), (Vatican City State, male), (Vatican City State, female), (Unknown, male), (Unknown, female), (France, male), and (France, female)):
WORLD.GEO.COUNTRY, GENDER .
A group expression generates all groups, independent of whether data is associated to the group or not. To include or exclude specific groups, a group expression utilizes the include and exclude keywords, e.g.:
WORLD.GEO.COUNTRY, GENDER
include {('Germany', 'male')} exclude {('*', 'male')} .
The above example would select six groups, excluding all the groups containing male, but, because of the higher priority, including the group ('Germany', 'male'). The higher priority of include is chosen for usability reasons. Users who were asked stated that a specified include is typically more specific than a specified exclude, i.e., when both keywords are used, the include keyword defines the values which should still be included, even if the exclude keyword states otherwise.
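The include-over-exclude priority can be sketched as a filter that keeps a group if it matches an include pattern, or if it matches no exclude pattern; '*' matches any value. The class is illustrative, not the system's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: include/exclude filtering of generated groups, with include
// taking priority over exclude; '*' matches any value (illustrative).
public final class GroupFilter {
    static boolean matches(List<String> group, List<String> pattern) {
        for (int i = 0; i < group.size(); i++) {
            if (!"*".equals(pattern.get(i)) && !pattern.get(i).equals(group.get(i))) {
                return false;
            }
        }
        return true;
    }
    public static List<List<String>> filter(List<List<String>> groups,
            List<List<String>> includes, List<List<String>> excludes) {
        return groups.stream()
            .filter(g -> includes.stream().anyMatch(p -> matches(g, p))
                      || excludes.stream().noneMatch(p -> matches(g, p)))
            .collect(Collectors.toList());
    }
}
```

Applied to the ten groups of the example above with include {('Germany', 'male')} and exclude {('*', 'male')}, the filter keeps the five female groups plus ('Germany', 'male').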
5.4 Summary
In this chapter, the TIDAQL was presented. The following overview lists the feature requests involving or addressing aspects relevant for the query language. As shown and argued, the query language covers the desired features.
– DA-01 and DA-02 influenced the definition of the query language regarding the aggregation operators. The query language supports the application of any kind of aggregation operator; thus, from a language perspective the requirement is fulfilled. The processing of aggregation operators is introduced in section 7.3.4.
– DA-03 requests the existence of a mechanism to retrieve raw time interval data. Thus, the selection of records is added to the DML. The temporal operators were introduced and explained in detail (cf. Figure 5.1).
– DA-04 and DA-05 formalize requirements regarding OLAP operators (i.e., roll-up and drill-down). As introduced, the selection of time series supports the usage of dimensions and therefore roll-up and drill-down operations. Figure 5.3 illustrates the operations for a time dimension (from the lowest granularity (minutes) to hours) and a descriptor dimension (from work-area to an organization type).
Figure 5.3: Usage of the query language features ON and GROUP BY to enable roll-up and drill-down operations.
– DA-08 and PD-02 require the definition of a SELECT command to re-
trieve time series, as well as analytical results. Thus, the part of the
DML covering these requests is based on the defined requirements.
– DC-02 requests the existence of INSERT and DELETE commands.
Both commands are introduced and part of the language (cf. section
5.3.1).
In addition, the introduced language follows the guidelines of Snodgrass and of Catarci and Santucci regarding the mentioned design criteria: expressive power (e.g., covering the requested features), consistency (e.g., following the SQL standard, which is well known by most analysts; using the same keywords across different statements), clarity (e.g., all statements can be easily understood even by non-experts41), minimality (e.g., most of the keywords are well known from SQL; additional keywords increase readability and therefore the ease-of-use and clarity of the language), orthogonality, independence, and ease-of-use (e.g., adding synonyms for specific tokens like ROLES, or FILTER BY instead of WHERE).
TIDAQL is the answer to the third RQ "How can a query language for the purpose of analyzing time interval data […] be formulated". The presented language is, as mentioned, designed to fulfill the formulated features of analysts working with time interval data on a daily basis. Nevertheless, further features will arise in the future, and the presented language has to adapt to these new requirements. In section 8.1, the fulfillment of the different features is evaluated and user comments regarding enhancements are shown.
41 The feedback of the inexperienced users during the development of the language was very positive regarding the readability.
6 TIDADISTANCE: Similarity of Time Interval Data
The similarity between time interval datasets, or e-sequences as named by Kostakis et al. (2011) and Kotsifakos et al. (2013), is a domain-specific measure. Thus, a flexible distance measure is needed to determine the similarity between two sets of time interval data. So far, three similarity measures have been introduced, i.e., DTW and ARTEMIS (Kostakis et al. 2011), as well as IBSM (Kotsifakos et al. 2013). As described in section 3.5, these measures differ regarding the produced results. However, which of these three techniques is the most accurate regarding similarity is context-dependent, even if Kotsifakos et al. (2013) describe IBSM as the more precise technique42. In general, three different types of similarity can be distinguished: order similarity, measure similarity, and relational similarity. ARTEMIS is a similarity measure fitting into the category of relational similarity, whereas IBSM and DTW are measures categorized as order similarity. More specifically, the order similarity is a special case of measure similarity, using count as measure (both DTW and IBSM utilize count as measure). However, for some domains the order similarity may be useful as a base similarity needed to implicitly include, e.g., gaps between intervals. Figure 6.1 illustrates the different types and an example of equal datasets, i.e., the similarity is 100 % or, in other words, the distance between the sets is 0.
Regarding an information system, the examples depicted in Figure 6.1 motivate the need for a context dependent configuration of a similarity measure. In this chapter, a similarity measure combining order, measure, and relational similarity is introduced. The user is capable of weighting the influence of the different similarities, depending on the context. In section 7.3.5, the bitmap-based implementation is explained, which, as shown in section 8.2.4, outperforms DTW, ARTEMIS, and IBSM. In the following sections, the different types of similarities are defined by introducing a distance
42 Which is the case comparing IBSM and DTW. The DTW implementation has several "false hits" because of the possibility to warp. Nevertheless, comparing IBSM with ARTEMIS is difficult, because the algorithms compare different aspects of time interval datasets; thus, it is like comparing apples and oranges.
measure for each type, i.e., temporal order distance in section 6.1, tem-
poral relational distance in section 6.2, and temporal measure distance in
section 6.3. In section 6.4, the similarity measure used to combine the dif-
ferent distances is defined.
Figure 6.1: Overview of the different types of similarity, presenting an equality example for each type of measure.
6.1 Temporal Order Distance
The temporal order similarity ensures that the intervals are ordered similarly according to the temporal order. Equally labeled intervals which meet each other are, regarding the temporal order, considered to be equal to one interval covering the same time span (cf. Figure 6.1). Thus, the number of intervals is not considered to be a criterion for similarity. Instead, the number of occurrences of equally labeled intervals at a specific time point is used to determine similarity. To compare the different amounts at a specific time point, it is important to define which time points are matched. Regarding temporal data, this is mostly dataset, or more precisely time axis, dependent. A possible strategy is to compare the time points with the same offset, i.e., the first amount of the first dataset is compared to the first amount of the second dataset, the second with the second, and so on. However, other strategies may be better suited, like starting the comparison on the first Monday. Therefore, the definition must include a function to match time points. Figure 6.2 illustrates two different matching strategies, which may be utilized depending on the domain. The weekday match is used to match the first weekday (e.g., 2015-01-01 was a Thursday) to the first equal weekday (e.g., 2015-02-05 was the first Thursday in February 2015).
Figure 6.2: Illustration of two different matching strategies, i.e., weekday and order match.
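Assuming a daily time axis represented by java.time.LocalDate, the two matching strategies can be sketched as follows; the class and method names are illustrative, and an unmatchable time point is mapped to null:

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch of two matching strategies for daily time axes: order match
// (same offset) and weekday match (first equal weekday), illustrative.
public final class TimePointMatcher {
    // order match: the i-th day of S is compared with the i-th day of T
    public static LocalDate orderMatch(LocalDate sStart, LocalDate tStart,
            LocalDate tEnd, LocalDate s) {
        LocalDate t = tStart.plusDays(ChronoUnit.DAYS.between(sStart, s));
        return t.isAfter(tEnd) ? null : t; // null: unmatchable time point
    }
    // weekday match: align the first day of S with the first day of T
    // having the same weekday, then match by offset
    public static LocalDate weekdayMatch(LocalDate sStart, LocalDate tStart,
            LocalDate tEnd, LocalDate s) {
        DayOfWeek dow = sStart.getDayOfWeek();
        LocalDate anchor = tStart;
        while (anchor.getDayOfWeek() != dow) anchor = anchor.plusDays(1);
        LocalDate t = anchor.plusDays(ChronoUnit.DAYS.between(sStart, s));
        return t.isAfter(tEnd) ? null : t;
    }
}
```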
In addition, the handling of unmatchable time points has to be defined (e.g., when comparing daily values from January with values from June, the 31st value cannot be matched with any time point from June). Several strategies may be considered, e.g.:
– comparing the unmatchable time point with 0,
– ignoring the unmatchable time point entirely (i.e., using a distance of 0),
– resizing the series using, e.g., bilinear interpolation (cf. IBSM), or
– using a special technique as matching strategy (cf. DTW).
However, regarding temporal data, bilinear interpolation or the usage of a special technique like DTW is typically a bad choice. In general, when comparing, e.g., months on a daily basis, it makes sense to ignore unmatchable time points and consider only matching time points. Based on this explanation, the definition of the temporal order distance is presented.
Definition 17: Temporal Order Distance
Let S and T be two sets of time intervals. Furthermore, let $\mathbb{T}_S$ and $\mathbb{T}_T$ be the totally ordered sets of time points for each set and let L be the set of all labels (i.e., groups) defined. In addition, the function $match\colon \mathbb{T}_S \to \mathbb{T}_T \cup \{null\}$ is defined as the function used to map a time point of S to a time point of T, or to null if the time point cannot be mapped. Let the function $count\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{N}_0$ be the function used to count the intervals with a specific label at a specific time point. The distance $\mathrm{TODist}$ between S and T is defined as
$$\mathrm{TODist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} to(l, t)$$
with
$$to(l, t) := \left|\, count(l, S, t) - count(l, T, match(t)) \,\right| .$$
The definition covers the need for the possibility to specify a matching function (i.e., the match function), as well as a possibility to define how to handle unmatched time points (i.e., the count function). The match function also covers the usage of an interpolation function. The DTW-based distance presented by Kostakis et al. (2011) is not covered by this definition. Nevertheless, the result of applying DTW within the context of temporal order is questionable, and IBSM showed that a fixed time point based approach achieves better results.
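A sketch of Definition 17 for integer time points is given below, representing each dataset by per-label count arrays and the match function by an index mapping (a negative index marks an unmatchable time point, which is ignored here). The class and the data representation are illustrative assumptions:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.IntUnaryOperator;

// Sketch of Definition 17: sums, over all labels and time points of S,
// the absolute difference of per-label interval counts at matched time
// points (illustrative representation, not the system's implementation).
public final class TemporalOrderDistance {
    // counts.get(l)[t] = number of intervals with label l at time point t;
    // match maps a time point of S to one of T, or -1 if unmatchable
    public static int distance(Map<String, int[]> s, Map<String, int[]> t,
            IntUnaryOperator match) {
        Set<String> labels = new HashSet<>(s.keySet());
        labels.addAll(t.keySet());
        int lenS = s.values().iterator().next().length; // assumes s is non-empty
        int dist = 0;
        for (String l : labels) {
            for (int i = 0; i < lenS; i++) {
                int j = match.applyAsInt(i);
                if (j < 0) continue; // ignore unmatchable time points
                int cs = s.containsKey(l) ? s.get(l)[i] : 0;
                int ct = t.containsKey(l) ? t.get(l)[j] : 0;
                dist += Math.abs(cs - ct);
            }
        }
        return dist;
    }
}
```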
6.2 Temporal Relational Distance
The list of possible temporal relations between two intervals is presented in section 2.1.4. As mentioned, several definitions of relations exist. Therefore, the definition of a distance measure should not prescribe a specific set of temporal relations. Nevertheless, a specific set of temporal relations and the possibility to determine a unique relation between two intervals has to be selected to apply the distance measure. The algorithm to calculate the distance determines the relations of a provided dataset and compares them with the relations of the second dataset. In the case of ARTEMIS, the Hungarian algorithm is applied to match the different relations between the intervals of the two sets. The definition presented in this section utilizes the temporal order given by the time axis to define how a set of relations is matched with another. A relation is thereby associated to time points. This ensures that the distance is comparable to the other time point based distances introduced in this chapter. Thus, a vector of the count of all relations can be determined for each time point. Figure 6.3 shows an example of assignments of relations to time points.
Figure 6.3: Example of assignments of relations to time points using Allen's (1983) relations.
The figure exemplifies that a relation is associated to specific time points, e.g., the overlaps relation between A (4) and A (2) is associated to the time points covered by [1, 4]. In addition, to avoid redundancy, only one of the paired relations is recognized, e.g., instead of using the relations ends and ends-by, only the relation ends is considered (cf. section 2.1.4, Figure 2.10). Table 6.1 shows the formulas used to calculate the time points covered by a relation.
Table 6.1: Overview of the time points calculation for a specific relation (A := [a1, a2], B := [b1, b2] with A rel B).

relation rel              covered time points
overlaps                  [b1, a2]
begins                    [b1, b2]
includes                  [b1, b2]
ends directly before      [a2, b1]
ends                      [b1, b2]
equal                     [b1, b2]
before                    [a2 + 1, b1 – 1]
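The time points covered by a relation, following the pairs listed in Table 6.1, can be sketched as follows (closed integer intervals; the class and the returned {from, to} representation are illustrative assumptions):

```java
// Sketch: time points covered by a relation A rel B with A = [a1, a2]
// and B = [b1, b2], following Table 6.1 (illustrative only).
public final class RelationTimePoints {
    public static long[] covered(String rel, long a1, long a2, long b1, long b2) {
        switch (rel) {
            case "overlaps":             return new long[]{b1, a2};
            case "begins":
            case "includes":
            case "ends":
            case "equal":                return new long[]{b1, b2};
            case "ends directly before": return new long[]{a2, b1};
            case "before":               return new long[]{a2 + 1, b1 - 1};
            default: throw new IllegalArgumentException("unknown relation: " + rel);
        }
    }
}
```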
As mentioned in section 6.1, the support of matching strategies, as well as of unmatchable time points, should be covered by the distance. Thus, the temporal relational distance is defined as follows.
Definition 18: Temporal Relational Distance
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. Furthermore, let the function $rel_{type}\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{N}_0$ be the function used to count the relations of a specific type (i.e., overlaps, begins, includes, ends directly before, or equal) with a specific label at a specific time point. The distance $\mathrm{TRDist}$ between S and T is defined as
$$\mathrm{TRDist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} tr(l, t)$$
with
$$tr(l, t) := \sum_{type} \left|\, rel_{type}(l, S, t) - rel_{type}(l, T, match(t)) \,\right| .$$
6.3 Temporal Measure Distance
The measure distance between two sets of intervals is determined by calculating the distance between each measure for each time point of a group. Thus, the challenges mentioned in section 6.1, regarding the matching of time points as well as the handling of unmatchable time points, also apply to this measure.
Definition 19: Temporal Measure Distance
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. In addition, let the function $measure\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{R}$ be the function used to determine the measure of the intervals with a specific label at a specific time point. The distance $\mathrm{TMDist}$ between S and T is defined as
$$\mathrm{TMDist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} tm(l, t)$$
with
$$tm(l, t) := \left|\, measure(l, S, t) - measure(l, T, match(t)) \,\right| .$$
The definition of the temporal measure distance shows that it is a generalized version of the temporal order distance. However, as argued earlier, using the count function as measure implicitly adds several temporal aspects to the distance. In addition, the existence of a measure distance allows the comparison of specific, e.g., business-related, measures (e.g., finding a day with the same use of resources).
6.4 Temporal Similarity Measure
All presented distance measures support the usage of a matching function and the definition of unmatchable time points. Nevertheless, to combine the different distance measures into a single similarity measure (cf. DA-07), it is necessary that the different values are normalized. Thus, each distance calculated for a specific label at a specific time point is normalized using the maximal achievable distance.
Definition 20: Temporal Similarity Measure
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. In addition, let $max_{to}$, $max_{tr}$, and $max_{tm}$ be defined as the maximal distance possible for a specific label and time point, i.e.,
$$max_{to}(l, t) := \max\bigl( count(l, S, t),\; count(l, T, match(t)) \bigr),$$
$$max_{tr}(l, t) := \max\Bigl( \sum_{type} rel_{type}(l, S, t),\; \sum_{type} rel_{type}(l, T, match(t)) \Bigr), \text{ and}$$
$$max_{tm}(l, t) := \max\bigl( measure(l, S, t),\; measure(l, T, match(t)) \bigr).$$
Based on the maximal distance, the similarity is defined as
$$sim := 1 - \frac{\displaystyle \sum_{l \in L,\; t \in \mathbb{T}_S} \left( w_{to}\,\frac{to(l, t)}{max_{to}(l, t)} + w_{tr}\,\frac{tr(l, t)}{max_{tr}(l, t)} + w_{tm}\,\frac{tm(l, t)}{max_{tm}(l, t)} \right)}{\text{amount of matched time points} \cdot \text{amount of labels}}$$
with $w_{to}$, $w_{tr}$, and $w_{tm}$ being the weighting factors, with $w_{to} + w_{tr} + w_{tm} = 1$.
For simplicity, the division by zero (i.e., when the maximal distance is zero) is not handled within the formula. Nevertheless, if the maximal distance is zero, the quotient is assumed to be zero, i.e., the distances are assumed to be equal. A similarity of 1 means that the results are equal (i.e., a similarity of 100 %), whereas a similarity of 0 indicates that the sets are as different as possible (i.e., a similarity of 0 %).
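Definition 20 can be sketched for a single label with pre-computed per-time-point distances and maxima as follows; the class and the array-based representation are illustrative assumptions, and a maximal distance of zero contributes zero, as stated above:

```java
// Sketch of Definition 20: weighted combination of normalized order,
// relational, and measure distances (illustrative representation).
public final class TemporalSimilarity {
    static double norm(double dist, double max) {
        return max == 0 ? 0 : dist / max; // zero maximum: contribution of zero
    }
    // arrays hold per-time-point distances and maxima for one label;
    // the weights must sum to 1
    public static double similarity(double[] to, double[] maxTo,
            double[] tr, double[] maxTr, double[] tm, double[] maxTm,
            double wTo, double wTr, double wTm, int labels) {
        double sum = 0;
        for (int t = 0; t < to.length; t++) {
            sum += wTo * norm(to[t], maxTo[t])
                 + wTr * norm(tr[t], maxTr[t])
                 + wTm * norm(tm[t], maxTm[t]);
        }
        return 1.0 - sum / (to.length * labels);
    }
}
```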
The temporal similarity measure is the answer to RQ5 "What similarity measure can be used to compare time interval datasets, enabling the search for similar subsets". The presented solution covers three different aspects of similarity: temporal order, temporal relation, and temporal measure. The importance of each aspect can be weighted by factors, depending on the use case. In addition, it enables the analyst to use matching functions and unmatchable time points to specify which time points are relevant for similarity.
7 TIDAIS: An Information System for Time Interval Data
In this chapter, an information system to analyze time interval data is pre-
sented. The system realizes the previously introduced TIDAMODEL, TIDAQL,
and TIDADISTANCE. The heart of the system is a bitmap-based data struc-
ture, which ensures a high performance when filtering and aggregating.
The chapter is structured as follows: First, the architecture of the system is presented and motivated along the requested features, as well as the already presented requirements arising from the definitions. In section 7.2, an XML configuration for a model and the system is introduced. The following section, i.e., section 7.3, presents selected challenges regarding the implementation of the system's components. In section 7.4, a prototype of a web-based GUI is shown. The chapter concludes with a summary of the presented results.
7.1 System’s Architecture, Components, and Implementation
The system’s architecture is depicted in Figure 7.1. The figure illustrates
the components and interfaces of the information system. Furthermore, the
provided services of the components are shown and the connections be-
tween consumers and services are illustrated.
The different components are motivated by the different features and requirements defined within the previous chapters. First of all, a JDBC and an HTTP interface providing the data of the system are requested (cf. VIS-01 and VIS-05). In addition, a default GUI should be available to perform monitoring tasks (e.g., checking the system health), administrative tasks (e.g., creating users or roles), and to visualize results (cf. CIS-03 and VIS-04). Another request deals with the possibility to subscribe to events triggered by the system; thus, a scheduler and an event manager must be available (cf. VIS-02, PD-01, and MA-02). To support a query language (cf. chapter 5), the system needs to parse and process the queries. In addition, an authentication and authorization instance is needed, ensuring the correct access
to and controlled usage of the system. Another needed component is responsible for pushing data into the system; more specifically, a data retriever is needed that loads the generated data into the system. The heart of the system is a data repository and a model manager. The former is needed to handle data internally which is pushed into the system (e.g., pre-processing (cf. DI-02, DI-03, and DI-04), event generation, applying aggregation operators, analyses (cf. MA-02 and PD-02), or indexing), whereas the latter manages the models (e.g., validation, loading, unloading, and deletion).
Figure 7.1: The architecture of the information system showing the high-level components.
In the following, the components which are realized using available open-source or proprietary libraries are listed and explained, and the used implementation is mentioned. Afterwards, i.e., within the subsections, components that are challenging to realize are introduced and described in detail.
– Authentication & Authorization: The component validates any access to the system. Thus, the most important tasks are user management (i.e., managing users and roles, defining permissions), session management (i.e., providing an HTTP interface which communicates across several connections forces the usage of sessions), and validation (i.e., who is accessing and which permissions are given). The implementation is based on the Apache Shiro43 framework, which "is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management" (Apache Shiro Group 2015). Apache Shiro supports authentication using pluggable data sources, e.g., Lightweight Directory Access Protocol (LDAP), JDBC, or Active Directory (AD). The information system integrates the framework through an API, so that a replacement can be performed transparently for the rest of the system.
– Data Retriever: The data retriever component is used to pull (e.g., by
polling or any wake-up) or push data into the system. In general, the
implementation offers an API to add pull or push data retrievers to the
system. Three base implementations of the API are implemented: read
data from file (i.e., CSV), retrieve data from a database (i.e., using a
SQL query), and load data directly from the configuration (cf. section
7.2.1). The implementation to retrieve data from a database is based
on the HikariCP44 connection pool manager, which is reported to be
one of the fastest connection pools available (cf. Brett Wooldridge
(2015)).
– Scheduler & Event Manager: To enable the system to perform sched-
uled tasks and trigger notification on certain events, the scheduler and
event manager component is added. The scheduler utilizes the Quartz
Scheduler45 and offers the planned creation of services based on the
43 http://shiro.apache.org 44 https://github.com/brettwooldridge/HikariCP 45 http://quartz-scheduler.org/
124 7 TidaIS: An Information System for Time Interval Data
available data. In addition, the event manager is a simple publish-sub-
scribe implementation using the default Java libraries (e.g., thread ex-
ecutor pools). The information system provides an API to integrate other
event managers or schedulers. Thus, the use of, e.g., a Java Message
Service (JMS) based approach could easily be realized.
– Service Handler: Providing services to the outer world is an important
aspect of the system. The service handler component is responsible
for the provided services, i.e., starting, stopping, handling requests, and
providing the results. Because of the features requested, the default
implementation provides two services: (1) an HTTP service handling
data requests (e.g., using asynchronous JavaScript and XML (AJAX))
and (2) a JDBC service capable of handling requests using the available
JDBC driver. The HTTP service is based on the Apache HTTPCompo-
nents46 library, using the HttpCore component of the library to handle
HTTP requests. In addition, a minimal, fast, lightweight, and simple
JSON library, namely minimal-json47, is used to wrap the results when
responding. JDBC requests are, after authentication and authorization,
forwarded to the parser and processor of the query language. Thus,
further implementations are not needed.
– TIDAQL Parser & Processor: The language introduced in chapter 5 is
parsed and processed by this component. The parser of the language
was created using ANTLR448, a tool to create parsers based on a spec-
ified grammar. The processing utilizes the data repository to, e.g., re-
trieve aggregated data, or results of analyses. Thus, the processing is
not further introduced in the context of the language. Instead, the dif-
ferent aspects to create a result are presented while explaining the
data repository in detail (cf. section 7.1.1).
– TIDAMODEL Manager & Loader: The model manager and loader are
responsible for providing the definitions of a model, e.g., the descriptors,
46 https://hc.apache.org 47 https://github.com/ralfstx/minimal-json 48 http://www.antlr.org
the integration processes, and concrete implementations, as well as to
manage the availability of a model. These different responsibilities are
introduced in more detail in section 7.3.1. Nevertheless, from an imple-
mentation point of view the component is realized by handling the dif-
ferent objects representing a model. The creation and assembling of
these objects is done using the Spring framework49. Specifically, a con-
figuration following the definition presented in section 7.2 is trans-
formed into a bean configuration and loaded using a default bean-fac-
tory provided by the Spring framework.
– TIDAUI: The GUI is shown in the figure as an external component, i.e.,
not part of the TIDAIS. In general, the GUI utilizes the provided HTTP
interface to retrieve data from and interact with the system. Nevertheless, the in-
formation system is completely separated from the GUI and another
implementation could be utilized without changing the information sys-
tem. The GUI is presented in detail in section 7.4.
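The publish-subscribe approach of the Scheduler & Event Manager component listed above can be sketched as follows. This is a minimal, synchronous Java sketch; the class and method names are illustrative only, as the actual implementation additionally dispatches asynchronously via thread executor pools.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal publish-subscribe sketch: handlers subscribe to named events
// (e.g., a core event like "core:query") and are notified on publish.
public class EventManager {
    private final Map<String, List<Consumer<Object>>> subscribers =
            new ConcurrentHashMap<>();

    public void subscribe(String event, Consumer<Object> handler) {
        subscribers.computeIfAbsent(event,
                e -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String event, Object payload) {
        subscribers.getOrDefault(event, List.of())
                   .forEach(h -> h.accept(payload));
    }
}
```

Replacing such a simple dispatcher with, e.g., a JMS-based implementation only requires another implementation behind the same API, which is exactly the extensibility the component offers.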
As mentioned earlier, within the next subsections, the not yet discussed
components are introduced, i.e., the Data Repository, as well as the Cache
& Storage component. These components are presented in more detail,
because their architecture is more complex (i.e., several subcomponents
are needed and open-source or proprietary solutions are not generally
available).
7.1.1 Data Repository
The data repository is the component responsible for all data related tasks
like pre-processing, aggregation, or analyses. In addition, the internally
used data representation, as well as the index structure is managed and
utilized. The component consists of the following subcomponents: pre-pro-
cessor, aggregator, analyses manager, TIDADISTANCE calculator, and the
index structure. Figure 7.2 illustrates the components and the connections
between them.
49 http://projects.spring.io/spring-framework/
Figure 7.2: Detailed architecture of the data repository component.
For reasons of clarity, the figure shows only the connections regarding the
external interfaces update, get, retrieve and modify. The external interfaces
of the Scheduler and Event Manager (i.e., inform and assign) are con-
nected with every component capable of being observed (i.e., firing events).
In addition, the retrieve interface is used by all components, which need to
retrieve model information (i.e., the Analyses Manager, Aggregator, and
Pre-Processor). Each of these components is explained in the following:
– Pre-Processor: The pre-processor component is utilized whenever
data is loaded into the system. It is capable of accessing any available
data, so that complex integration processes can be realized. In addi-
tion, default cleansing steps, as requested by DI-02 and DI-03, are ap-
plied (cf. section 7.2.1). Finally, the mapping functions, as defined by
the model, are used to create a processed time interval data record.
The implementation is outlined in section 7.2.1.
– Aggregator: The aggregator component is responsible for providing ag-
gregation techniques (as mentioned and argued in section 5.3.3 the
supported techniques are STA and TAT). The component has to evalu-
ate the type of aggregation (i.e., the type of the aggregation depends
on the fact function of the descriptor), retrieve the needed data using
the index, and calculate the result. The algorithms used to determine
the result of an aggregation are presented in section 7.3.4.
– Analyses Manager: The main responsibility of the component is the
retrieval of results created through data analysis techniques. The man-
ager registers and instantiates the algorithms implemented against an
API provided by the system and defined by a model or the system’s
configuration. Whenever an analytical result is requested, the manager
checks the availability of the specified algorithm and triggers the exe-
cution. An analysis can be performed asynchronously and even on dif-
ferent machines. The implementation of the manager is not presented
any further, because it is mainly based on available core Java libraries,
i.e., collections, reflection, thread executor pools, and JMS.
– TIDADISTANCE Calculator: The component represents a concrete imple-
mentation of the distance introduced in chapter 6. The component is
developed against the analysis API of the system and the reference
implementation of it. The implementation is presented in detail in sec-
tion 7.3.5.
– Index Structure: The core of the data repository is the index structure.
The component ensures fast data retrieval. The different parts of the
implementation are presented in section 7.3.2.
7.1.2 Cache & Storage
The Cache & Storage component is responsible for storing different entities
(e.g., a bitmap or a fact descriptor; cf. section 7.3.2) of the information sys-
tem. Figure 7.3 depicts the different subcomponents of the component, i.e.,
Cache, Storage Layer, and Usage Statistic Manager.
Figure 7.3: Illustration of the subcomponents of the main component Cache & Storage.
In the following the different components and their responsibilities are
introduced:
– The Storage Layer to be used differs based on the usage (i.e., type of
operations performed) and the type of data (e.g., complex objects,
plain old Java objects (POJOs)). It is generally not possible to select a
"best" storage. Thus, the system provides an API to implement any
storage, e.g., SQL databases, NoSQL databases, or other persistency
layers.
– The Cache is used to increase the retrieval performance from the stor-
age by caching the retrieved entities in memory. In section 3.3.3 several
caching algorithms are listed. The "best" algorithm to be used depends
on several factors, e.g., the amount of entities, the size of the available
memory, or the storage type. Thus, the component has to be flexible
regarding the used cache implementation and algorithm. Several open-
source caching libraries and frameworks are widely used, e.g.,
ehCache50 or OSCache51. The component’s data structure, API, and
used design patterns are presented in section 7.3.3.
50 http://ehcache.org 51 https://java.net/projects/oscache
– The Usage Statistic Manager is an optional component. It may be nec-
essary to provide the cache with a usage statistic so that the algorithm
can decide which entities to remove from memory. In general, the
maintenance of this statistic decreases the performance of the system.
The performance of the different cache algorithms is briefly discussed
in section 8.2.2.
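As a sketch of one commonly used caching algorithm (LRU, one of the candidates among those listed in section 3.3.3), the following hypothetical Java snippet builds a least-recently-used cache on top of LinkedHashMap's access-order mode. It is illustrative only and not the system's actual cache class, whose data structure and API are presented in section 7.3.3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache sketch: LinkedHashMap in access-order mode keeps the least
// recently used entry as the eldest, which is evicted when the capacity
// is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity; // maximum number of entities kept in memory

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true: order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entity
    }
}
```

The "best" algorithm still depends on the factors named above (number of entities, available memory, storage type), which is why the component keeps the implementation exchangeable.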
7.2 Configuration
The configuration of the information system can be separated into two differ-
ent parts. The first part deals with the configuration of the used compo-
nents. As mentioned in the previous section, it is important to ensure that
specific components of the system can be extended (e.g., add another
analysis algorithm), replaced (e.g., use a different authentication and au-
thorization framework), or modified regarding the behavior (e.g., change
the caching algorithm used by a cache). The second part addresses the
configuration of a model. A model is formally specified by a 4-tuple
as defined in chapter 4 and loaded using a load statement (cf. section 5.2)
containing the location of a model-definition-file. Such a model-definition-
file must cover the formal definition and, in addition, override system spe-
cific settings, e.g., it may be necessary to utilize a specific indexing algo-
rithm for a model.
In this chapter both parts of the configuration are introduced and exam-
ples are given using excerpts of configurations. In section 7.2.1 the model
configuration is shown and in section 7.2.2 the system configuration is pre-
sented. The order in which the different configurations are presented is
inverse to the inheritance hierarchy (i.e., the model configuration overrides
the system configuration), because it is easier to motivate several
configurable settings from a model perspective first.
7.2.1 Model Configuration
A model is defined using XML52. The root element of each model definition
is the model tag as shown in Listing 7.1. In addition, an identifier for the
model has to be specified using the id attribute. The identifier is used to
refer to the model, e.g., when requesting data using a select statement.
Optionally, a readable name for the model can be provided.
Listing 7.1: The skeleton of a model-configuration-file of the information system.
<?xml version="1.0" encoding="UTF‐8" standalone="no"?>
<model id="myFirstModelId" name="My first Model"
xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
<!‐‐ Model Definition ‐‐>
</model>
The definition of a model is based on a time interval database, de-
scriptors, a time-axis, and dimensions. Within a model-configuration-file all
these items may be specified. In addition, several other components can
be configured, e.g., the Pre-Processor, the Cache & Storage, and the Index
Structure. In the following subsections, the configuration settings dealing
with different aspects of a model are presented. Afterwards, the additional
configurable settings regarding components are explained.
Defining a Time Interval Database
The definition of a TIDAMODEL includes the definition of the source, i.e., the
database from which raw data is retrieved (cf. section 4.3). In general, it is
important that the system supports several possibilities to load data into
the system. In order to provide the user with the greatest possible flexibility
and ensure usability for an inexperienced user, a source for the time inter-
val database, a so-called data retriever, can be utilized. Asking the users
about the commonly used sources revealed that time interval data is typically
stored in operational databases or CSV files. In addition, users mentioned
52 A complete model-configuration-file can be found in the appendix.
that for training purposes the definition of records within a model would be
desirable. Therefore, by default, the system provides a FixedStructureDa-
taRetriever, a DbDataRetriever, and a CsvDataRetriever. Furthermore, it is
possible to extend the system and provide additional data retrievers.
The time interval database of a model can thereby be defined in three
ways, i.e.:
– the database can be defined to be initially empty, and filled, e.g., by
insert statements, or
– the database can be filled by loading data from a data retriever (i.e.,
configuring a default data retriever or an extended implementation), or
– the database is defined to be static, i.e., the data is defined within the
model (internally the system utilizes the mentioned FixedStructureDa-
taRetriever).
From a configuration perspective, an empty database is configured by
changing nothing. The default configuration assumes that the data will be
loaded via the provided HTTP or JDBC interface, i.e., using insert state-
ments. If a data retriever should be used to load data from an external
source, the retriever has to be defined within the configuration. Listing 7.2
shows an excerpt, defining a data retriever with the identifier myDb using
the DbDataRetriever implementation (cf. DC-01).
Listing 7.2: Configuration of a data retriever within a model.
<dataretrievers>
<dataretriever id="myDb"
implementation="net.meisen.[...].dataretriever.DbDataRetriever">
<db:connection type="jdbc"
url="jdbc:hsqldb:hsql://localhost:300/db"
driver="org.hsqldb.jdbcDriver" username="SA" password="" />
</dataretriever>
</dataretrievers>
To configure the system to load data from the data retriever, it is necessary
to specify the query to be used, as well as the structure of the data records.
The former is specified within the data tag, which is positioned last in the
root. The latter is specified using the structure tag, associating the different
fields of the incoming data to descriptors or temporal information. Listing
7.3 shows an excerpt of the configuration defining a structure and a data
segment. The configuration defines data to be retrieved using the specified
query. The retrieved data is mapped according to the provided structure,
i.e., the field NAME contains the values to be used for the descriptor
PERSON, whereas the field START and END define the start and end val-
ues of the interval.
Listing 7.3: Configuration of a dataset and the structure of the set.
<structure>
<meta name="NAME" descriptor="PERSON" />
<interval name="START" type="start" />
<interval name="END" type="end" />
</structure>
<data dataretriever="myDb">
<db:query>SELECT START, END, NAME FROM TABLE</db:query>
</data>
The data retriever sample exemplifies how the system realizes extensibility.
The required information, like the structure of the data, the data retriever,
and the data to be retrieved, is fixed within the configuration. The kind of
data retriever, as well as the method used to retrieve the data, can be
extended. An extension for the system typically consists of a concrete
implementation and cut-points for the configuration, i.e., an XSLT and an
XSD file named after the con-
crete implementation. In the case of the DbDataRetriever the extension
consists of the concrete class extending the abstract class BaseDataRe-
triever, several additional classes (e.g., exceptions or default values), an
XSD specifying the schema of the additional information, and an XSLT
used to define the beans needed when loading the configuration (cf. Figure
7.4).
Figure 7.4: The complete package of the DbDataRetriever extension used to load data from a database.
Listing 7.4 shows the DbDataRetriever.xslt defined to transform the
db:query tag from Listing 7.3 into a DbQueryConfig bean. The created
bean is passed to the instance of the specified data retriever (defined by
the attribute dataretriever of the data tag).
Listing 7.4: XSLT template used to create the bean used by the DbDataRetriever to define the query.
<xsl:template match="db:query">
<bean class="net.meisen.[...].dataretriever.DbQueryConfig">
<property name="query">
<value><xsl:value‐of select="normalize‐space(.)" /></value>
</property>
<property name="language" value="SQL"/>
</bean>
</xsl:template>
The formal definition of a time interval database expects, besides the
dataset, the definition of the domains of the mapping functions, i.e., the
domain of the time mapping function and the domains of the descriptor
mapping functions. These definitions are needed in the case of a formal
definition. Nevertheless, in the case of the configuration of a model, the
domain of an entity is not of importance, as long as the mapping function
(i.e., the implementation) can be applied. More precisely, assuming the
definition of a time interval database to retrieve data from, the domain of
a specific descriptive value retrieved from the database is irrelevant, as
long as the value can be mapped to a valid descriptor value (e.g., a string
"5" can be mapped to the integer 5). This has to be ensured by the
implementation of the concrete descriptor and the mapping function (cf.
next subsection), and the system does not need the data type of the
descriptive value to be specified in the configuration.
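The mapping requirement described above (e.g., a string "5" mapped to the integer 5) can be sketched as follows. The class and method names are hypothetical and do not reproduce the system's descriptor API.

```java
// Sketch of a descriptor mapping function: the raw descriptive value's
// type is irrelevant as long as it can be mapped to a valid descriptor
// value of the descriptor's type.
public class IntegerDescriptorMapper {

    // Maps a raw value to an Integer descriptor value; returns null if no
    // valid mapping exists.
    public static Integer map(Object raw) {
        if (raw == null) return null;
        if (raw instanceof Integer) return (Integer) raw;
        try {
            return Integer.valueOf(raw.toString().trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```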
Defining Descriptors
The definition of a descriptor is based on a mapping function and a fact
function (cf. section 4.2). Within the configuration a descriptor is a child
of the descriptors tag, which itself is a child of the meta tag.
Listing 7.5 shows a definition of three descriptors. The BABY descriptor
contains string values, whereby the descriptor values are specified by a
CSV file loaded using the CsvDataRetriever. The integer descriptor named
DURATION does not initially contain any descriptor value. In addition, an
extended descriptor called TOYS is added. The TOYS descriptor allows null
values (the attribute null is set to true), contrary to the other descriptors
(the default value of the null attribute is false)53. Furthermore, the BABY
descriptor overrides the index used internally for the descriptor values.
Overriding is typically not necessary, because the internal implementation
tries to find the best fitting index for the type of the descriptor values. Nev-
ertheless, the performance may be increased if a type or domain-specific
implementation is provided, e.g., for spatial data an R-Tree (Guttman 1984)
may be more appropriate.
53 A null value is often used as result of the mapping function, if the descriptive value cannot
be mapped to a valid descriptor value. In addition, it may be possible to add records, which have no value for a specific descriptor (i.e., a null value is applied).
Listing 7.5: An excerpt of a configuration defining three descriptors and de-scriptor values for one of the descriptors.
<meta>
<descriptors>
<string id="BABY" idfactory="net.meisen.[...].UuIdsFactory" />
<integer id="DURATION" />
<ext:list id="TOYS" name="toy list" null="true" />
</descriptors>
<entries>
<entry descriptor="BABY" dataretriever="csvBabyNames">
<csv:selector column="firstname" />
</entry>
</entries>
</meta>
By default, the following commonly used types of descriptors are avail-
able: integer, double, long, and string. These descriptors provide a pre-de-
fined mapping function (i.e., using the identity function), a set of descriptor
values (which is configurable and extendable), and a pre-defined fact func-
tion (i.e., a constant function returning 1 for string descriptors and the iden-
tity function for numeric descriptors). In addition, the system provides strat-
egies as requested by the feature DI-02. These strategies define how to
handle unknown descriptor values occurring in data pushed into the sys-
tem. The strategy to apply is defined within the data tag specifying the at-
tribute metahandling with one of the values: handleAsNull, createDe-
scriptorValue, or fail. The supported strategies are:
– handleAsNull: the system will try to associate the data record to a null
value for the unknown descriptor (i.e., the descriptor must support null
values),
– createDescriptorValue: the system will create a new descriptor value
for the descriptor and refer to the newly created descriptor value, or
– fail: the system will throw an exception and the data record will not be
added.
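The three strategies above can be sketched in Java as follows; the enum and method names are illustrative only and do not reflect the system's internal classes.

```java
import java.util.Set;

// Sketch of the metahandling strategies for unknown descriptor values;
// the enum constants mirror the configuration values handleAsNull,
// createDescriptorValue, and fail.
public class MetaHandling {
    public enum Strategy { HANDLE_AS_NULL, CREATE_DESCRIPTOR_VALUE, FAIL }

    // Returns the descriptor value to associate with a record; may extend
    // the set of known values or throw, depending on the strategy.
    public static String resolve(Set<String> known, String value, Strategy s) {
        if (known.contains(value)) return value;
        switch (s) {
            case HANDLE_AS_NULL:
                return null; // the descriptor must support null values
            case CREATE_DESCRIPTOR_VALUE:
                known.add(value); // create the new descriptor value
                return value;
            default: // FAIL: the data record will not be added
                throw new IllegalArgumentException(
                        "Unknown descriptor value: " + value);
        }
    }
}
```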
If the mapping or fact function has to be modified, a new descriptor im-
plementation must be added, providing the mapping and fact function, e.g.,
as done with the list descriptor used in Listing 7.5. The extension of a de-
scriptor is similar to the extension of a data retriever, i.e., add a concrete
implementation, an XSD file for validation, and an XSLT file for transfor-
mation. In the case of descriptors, the system provides several base imple-
mentations, as well as useful base validations, and transformations.
Defining a Time Axis
The TIDAMODEL defines the time axis by a mapping function, a set of
chronons, and a granularity (cf. section 4.1). The definition of the
time axis is done using the timeline tag, which is a child of the time tag.
Listing 7.6 shows an example of a definition within the configuration. The
example shows that the definition of the granularity is done explicitly using
the granularity attribute. By default, the system provides the commonly used
and additional granularities, e.g., month, week, day, hour, minute, second, or
even attosecond. In addition, the configuration defines the start and the
end, which together with the granularity, defines the set of chronons.
Listing 7.6: An example of a configuration of the time axis.
<time>
<timeline start="01.01.2000 00:00:00" end="31.12.2020 23:59:00"
granularity="MINUTE" />
</time>
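Assuming that, at MINUTE granularity, the chronons simply enumerate the minutes from start to end inclusively, the size of the chronon set defined by a timeline can be computed as follows. This is an illustrative sketch with hypothetical names, not the system's implementation.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper computing the number of chronons of a timeline at
// MINUTE granularity: the minutes between start and end, both inclusive.
public class Timeline {
    public static long chrononCount(LocalDateTime start, LocalDateTime end) {
        return ChronoUnit.MINUTES.between(start, end) + 1; // inclusive bounds
    }
}
```

For Listing 7.6, the chronon set would accordingly contain one entry per minute between 01.01.2000 00:00:00 and 31.12.2020 23:59:00.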
Regarding the mapping function, the system provides three possibilities
to define one, i.e.:
– select a strategy on how to handle specific values,
– use a default mapping function, or
– provide a new mapping function.
The following strategies, as requested by feature DI-03, are realized to han-
dle missing endpoints of the defined interval:
– boundariesWhenNull: whenever a null value is found for an endpoint of
an interval, the system uses the time axis boundaries, i.e., the start (if
the start value of the interval is null) or the end value (if the end value of
the interval is null),
– useOther: if one of the endpoints is null, the other not null endpoint will
be used as value, i.e., a time point is used, or
– fail: the system will throw an exception and the data record will not be
added.
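These strategies can be sketched as follows, modeling chronons as plain long values; all names are illustrative and not taken from the system.

```java
// Sketch of the strategies for missing interval endpoints
// (boundariesWhenNull, useOther, fail).
public class EndpointHandling {
    public enum Strategy { BOUNDARIES_WHEN_NULL, USE_OTHER, FAIL }

    // Returns {start, end} after applying the strategy; axisStart and
    // axisEnd are the boundaries of the time axis.
    public static long[] resolve(Long start, Long end,
                                 long axisStart, long axisEnd, Strategy s) {
        if (start != null && end != null)
            return new long[] { start, end };
        switch (s) {
            case BOUNDARIES_WHEN_NULL:
                return new long[] { start == null ? axisStart : start,
                                    end == null ? axisEnd : end };
            case USE_OTHER:
                Long other = start == null ? end : start;
                if (other == null)
                    throw new IllegalArgumentException("both endpoints null");
                return new long[] { other, other }; // a time point
            default: // FAIL: the data record will not be added
                throw new IllegalArgumentException("missing endpoint");
        }
    }
}
```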
If the existing strategies are not sufficient, it is possible to implement a new
mapping function. In general, such an implementation maps an incoming
value to an integer value (i.e., a long data type). The extensions of the map-
ping functions works similar to all extensions in the system, i.e., provide a
concrete implementation of the abstract class BaseMapper or Base-
MapperFactory, as well as the validation and transformation files for the
configuration.
Defining Dimensions
The TIDAMODEL introduces and defines two different kinds of dimensions,
i.e., dimensions defined for a descriptor and a time dimension. In general,
it is not necessary to define any dimensions. In that case, roll-up or drill-
down operations are not available and data can only be retrieved at the lowest
granularity and aggregated on the defined descriptor values. Furthermore,
following the definition (cf. section 4.4), each dimension, independent of its
type (i.e., time or descriptor dimension), can have several hierarchies,
whereby each hierarchy has several levels. Finally, each level contains sev-
eral members. In the following, the configuration of the time and a de-
scriptor dimension is introduced.
The definition of a time dimension states several constraints which
have to be met by the dimension’s configuration, i.e.:
– the lowest level of a time hierarchy contains all chronons,
– each level of a time hierarchy forms a valid partition of the set of all
chronons, and
– a time hierarchy may be defined for a specific time zone.
The configuration of a time dimension is done within the timedimension
tag, which is a child of the dimensions tag. The configuration allows the
definition of at most one time dimension, which is configured by specifying
at least one hierarchy. A hierarchy is thereby defined by the different levels,
which are defined as partitions of the chronons of the time axis. The order
of the levels within the configuration defines the roll-up and drill-down order
(from top as top-level, to bottom as lowest-level). Listing 7.7 exemplifies a
hierarchy for the CET time zone. The hierarchy is defined from top to bot-
tom as: all (default) → Year → Month → Day → Half Day → Hour → 5-
Minutes → Minute.
Listing 7.7: A sample definition of a time hierarchy within the time dimension.
<timedimension id="TIME">
<hierarchy id="RASTER" all="Everytime" timezone="CET">
<level id="YEAR" template="YEARS" />
<level id="MONTH" template="MONTHS" />
<level id="DAY" template="DAYS" />
<level id="HALFDAY" template="RASTER_DAY_MINUTE_720" />
<level id="HOUR" template="RASTER_DAY_MINUTE_60" />
<level id="HALFHOUR" template="RASTER_DAY_MINUTE_30" />
<level id="MINUTE5" template="RASTER_DAY_MINUTE_5" />
<level id="MINUTE" template="RASTER_DAY_MINUTE_1" />
</hierarchy>
</timedimension>
The template design pattern provides an easy way to add new levels
to the system. A template has to define a valid partition of the time axis. In
addition, it has to fit into the current order, e.g., the DAYS template as-
sumes a predecessor template, which has a smaller granularity than days
(e.g., HALFDAY). Also, it expects the successor to have a granularity larger
than one day (e.g., MONTH). The raster template is a special template pro-
vided by the system. It is used to split a higher granularity into a partition
based on a smaller granularity, i.e., the RASTER_DAY_MINUTE_30 tem-
plate partitions each day into groups of 30 minute units. New templates for
a level can be easily added to the system by implementing the ITimeLev-
elTemplate interface54. Figure 7.5 illustrates the first three levels (from bot-
tom to top) of the defined hierarchy. The example shows the handling of the
time zone and DST.
Figure 7.5: Illustration of the first three levels (from bottom to top) of the hierar-chy defined in Listing 7.7.
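The raster logic described above can be sketched with simple integer arithmetic: RASTER_DAY_MINUTE_30 assigns a minute-of-day to group minuteOfDay / 30, yielding 48 groups per day. The following is a hypothetical helper, not an actual ITimeLevelTemplate implementation.

```java
// Hypothetical sketch of a raster computation: a day (1440 minutes) is
// partitioned into groups of rasterMinutes; the group index of a
// minute-of-day is obtained by integer division.
public class RasterTemplate {
    public static int groupOf(int minuteOfDay, int rasterMinutes) {
        if (1440 % rasterMinutes != 0)
            throw new IllegalArgumentException("raster must partition the day");
        return minuteOfDay / rasterMinutes;
    }
}
```

The divisibility check reflects the constraint that each level must form a valid partition of the set of chronons.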
The configuration of a descriptor’s hierarchy is also done within the di-
mensions tag. In contrast to the time dimension, the definition of a de-
scriptor’s hierarchy can be non-onto, non-covering, or non-strict. Because
of these differences, the configuration of a descriptor’s hierarchy differs
from the one of a time hierarchy. Furthermore, the configuration allows the
definition of several dimensions for descriptors, but at most one for each
descriptor. Listing 7.8 shows the configuration of a dimension for the de-
scriptor WORKAREA. The configuration contains one hierarchy with three
levels. The descriptor values are bound by regular expressions to the mem-
bers. If no regular expression is specified, the system assumes a member
of the hierarchy, i.e., the member is an element of V' (cf. section 4.4).
54 In addition, the system provides several helpful base implementations, e.g., BaseTimeLev-
elTemplate used to implement all templates provided.
Listing 7.8: A sample definition of a hierarchy of the descriptor WORKAREA.
<dimension id="DIM" descriptor="WORKAREA">
<hierarchy id="ROOMS" all="all rooms">
<level id="TYPE">
<member id="SUITE" rollUpTo="*" />
<member id="COMFORT" rollUpTo="*" />
<member id="STANDARD" rollUpTo="*" />
</level>
<level id="GUESTS">
<member id="BUSINESS" rollUpTo="*" />
<member id="PRIVATE" rollUpTo="*" />
</level>
<level id="FLOOR">
<member id="FLOOR1" reg="LVL1_.*" rollUpTo="PRIVATE, STANDARD" />
<member id="FLOOR2" reg="LVL2_.*" rollUpTo="BUSINESS, COMFORT" />
<member id="FLOOR3" reg="LVL3_.*" rollUpTo="PRIVATE, SUITE" />
</level>
</hierarchy>
</dimension>
The defined hierarchy is non-strict, because a member of the FLOOR level
rolls up to two different members, e.g., FLOOR1 rolls up to PRIVATE and
STANDARD. Figure 7.6 depicts the configured hierarchy of the dimensions
of the WORKAREA descriptor.
Figure 7.6: Illustration of the hierarchy defined in Listing 7.8.
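The non-strict roll-up relation of Listing 7.8 can be represented as a simple member-to-parents map; the data structures below are illustrative and not the system's model classes.

```java
import java.util.Map;
import java.util.Set;

// Roll-up relation of the FLOOR level from Listing 7.8: a hierarchy is
// non-strict if at least one member rolls up to more than one parent.
public class Hierarchy {
    static final Map<String, Set<String>> ROLL_UP = Map.of(
            "FLOOR1", Set.of("PRIVATE", "STANDARD"),
            "FLOOR2", Set.of("BUSINESS", "COMFORT"),
            "FLOOR3", Set.of("PRIVATE", "SUITE"));

    public static boolean isNonStrict() {
        return ROLL_UP.values().stream().anyMatch(p -> p.size() > 1);
    }
}
```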
Configuring the Pre-Processor, the Scheduler & Event Manager,
the Cache & Storage, and the Index Structure
Besides the configuration of the model, a model-definition-file can be used
to configure the behavior of the components of the system when handling
model-dependent data. All component-related configuration is done within
the config tag by adding the corresponding component’s tag as a child. In the
following the different tags and possibilities to configure a component are
introduced, starting with the pre-processor (cf. DI-04).
Configuring a pre-processor to be used for the transformation of the data
pushed into the model is done within the preprocessor tag. The pre-pro-
cessor can be defined as any class implementing the IPreProcessor inter-
face, using the implementation attribute. In addition, cut-points can be used
to extend the configuration and enable pre-processor related settings. By
default, the system provides a ScriptPreProcessor, useful to specify a
script55 transforming the incoming data, e.g., using JavaScript, Groovy, or
Python. Listing 7.9 shows an excerpt of a configuration, defining a pre-pro-
cessor using JavaScript. The script is used to trim the descriptive value of
the myString descriptor. All other descriptive values and time points are
kept untouched by the script.
Listing 7.9: A pre-processor configuration using the ScriptPreProcessor.
<preprocessor implementation="net.meisen.[...].ScriptPreProcessor">
<spp:script language="javascript">
var result = new net.meisen.[...].PreProcessedDataRecord(raw);
result.setValue('myString', raw.getValue('myString').trim());
</spp:script>
</preprocessor>
The scheduler and event manager can be used to define schedules fir-
ing specific queries, forwarding results, triggering events, and publishing
information to subscribed instances. The configuration supports the definition
55 In general, any scripting language which is supported by the Java Scripting API can be
used.
of different schedules, which may also be used to push events to the event
manager. In general, the system publishes several core events (e.g., when
a query is fired), which can be subscribed to through the JSON interface
or by a schedule. Listing 7.10 illustrates the configuration of three sample
schedules.
Listing 7.10: A configuration specifying three sample schedules.
<schedules>
<schedule cron="10 0 * * *"
implementation="net.meisen.[...].MyJob"/>
<schedule cron="*/15 4-16 * * 6,7"
implementation="net.meisen.[...].QueryJob">
<qj:query>SELECT COUNT(RECORDS) FROM myModel</qj:query>
<qj:handler>net.meisen.[...].QueryHandler</qj:handler>
</schedule>
<schedule event="core:query"
implementation="net.meisen.[...].EventJob" />
</schedules>
The first two schedules are based on a cron-expression56, whereas the third
one is assigned to a core event. The first schedule executes the mentioned
implementation every day ten minutes after midnight. The second schedule
fires the specified query every 15 minutes between 4am and 4pm on Sat-
urdays and Sundays. The result of the query is sent to the optionally spec-
ified handler, which, e.g., could create a report and send it to the manage-
ment or validate the result and notify a user via a message. The last sched-
ule is assigned to a core event, i.e., is triggered every time a query is fired.
The executed job retrieves event-specific information. In general, a job can
fire additional events, which then are handled by the event manager. As
already mentioned, the implementation is based on the Quartz Scheduler
and standard Java components.
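The semantics of the cron-expressions used above can be illustrated with a minimal matcher. This is a sketch for fixed values and the "*" wildcard of the first two fields (minute, hour) only; it is not the Quartz Scheduler's actual parser.

```java
// Minimal sketch illustrating the semantics of the cron-expressions used
// in Listing 7.10. Only fixed values and the "*" wildcard of the first two
// fields (minute, hour) are evaluated; this is NOT Quartz's parser.
public class CronSketch {

    public static boolean matches(String cron, int minute, int hour) {
        // fields: minute hour day-of-month month day-of-week
        String[] fields = cron.split("\\s+");
        return fieldMatches(fields[0], minute) && fieldMatches(fields[1], hour);
    }

    private static boolean fieldMatches(String field, int value) {
        return field.equals("*") || field.equals(String.valueOf(value));
    }
}
```

For instance, the expression "10 0 * * *" of the first schedule matches only times with minute 10 and hour 0, i.e., ten minutes after midnight.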
56 http://pubs.opengroup.org/onlinepubs/007904975/utilities/crontab.html
7.2 Configuration 143
The configuration of the Cache & Storage component allows the definition of the caches and storage implementations to be used for specific en-
tities of the system. These entities are raw records, record identifiers,
metadata, bitmaps, and sets of facts. To understand the configuration of
the caches and the storage, it is not important to understand these different
entities in detail. Nevertheless, a more detailed explanation of the entities
is given in the context of the implementation of indexes, i.e., in section
7.3.2. Listing 7.11 shows an example of a configuration, specifying the
cache and implicitly the storage to be used for the different entities. The
configuration defines a file-based storage for the record identifiers (identi-
fier), metadata (metadata), bitmaps (bitmap), and sets of facts (factsets).
For the storage of the raw records (records) a DBMS is utilized, using a
Hibernate57-based implementation. The extension and provision of new im-
plementations is done by implementing the provided interfaces (i.e.,
ICache) and specifying cut-points for the configuration as described before.
In addition, the example also shows the configuration of the caching algo-
rithm to be applied. In the example, the RandomCachingStrategy (cf. sec-
tion 3.3.3; cache algorithm RR) is explicitly used for the bitmap cache.
Whether the configuration of a caching algorithm is supported depends on the implementation used, e.g., some implementations may not support the modification of the caching algorithm. Other settings, like the cleaning factor or the maximal amount of cached entities, may be configurable. In the example, the default settings of the cache responsible for the sets of facts are overridden.
Listing 7.11: Example of a configuration of caches for all entities of the system.
<caches>
<identifier implementation="net.meisen.[...].FileIdentifierCache" />
<metadata implementation="net.meisen.[...].FileMetaDataCache" />
<bitmap implementation="net.meisen.[...].FileBitmapCache">
<bfile:config strategy="net.meisen.[...].RandomCachingStrategy" />
</bitmap>
<factsets implementation="net.meisen.[...].FileFactDescriptorCache">
<ffile:config cleaningFactor="0.2" size="1000000" />
</factsets>
<records implementation="net.meisen.[...].HibernateDataRecordCache">
<hib:config driver="org.hsqldb.jdbcDriver"
url="jdbc:hsqldb:hsql://localhost:7000/db"
username="SA" password="" />
</records>
</caches>
57 http://hibernate.org
Another component, which can be modified by configuration, is the In-
dex Structure. The configuration allows specifying a factory that decides which
index to use for specific use-cases. The default implementation of the fac-
tory, i.e., IndexFactory, permits determining the used indexes for specific
data types. In addition, the used bitmap implementation can be specified,
e.g., to change the used compression scheme (cf. section 3.3.1). By de-
fault, the system provides several indexes based on different high perfor-
mance collections useful for primitive data types, i.e., Trove58, FastUtil59, or
Hppc60. Several benchmarks were performed to set up the implemented
IndexFactory, and to ensure an overall best performance (cf. section 8.2.1).
Nevertheless, context-specific criteria may lead to better choices, which
can be applied via configuration or by providing a custom factory (cf. Listing 7.12).
Listing 7.12: An example configuration of the default IndexFactory, specifying the implementations used to index specific data types.
<indexes implementation="net.meisen.[...].IndexFactory">
<idx:config bitmap="net.meisen.[...].RoaringBitmap"
byte="net.meisen.[...].TroveByteIndexedCollection"
short="net.meisen.[...].TroveShortIndexedCollection"
int="net.meisen.[...].TroveIntIndexedCollection"
long="net.meisen.[...].TroveLongIndexedCollection"
default="java.util.HashMap" />
</indexes>
58 http://trove.starlight-systems.com
59 http://fastutil.di.unimi.it
60 http://labs.carrotsearch.com/hppc.html
7.2.2 System Configuration
In general, the configuration of a system should be as simple as possible
to increase the ease-of-use and help inexperienced users to get started.
Thus, the simplest configuration is one that is not needed at all61. Instead,
the system uses default settings, which can be overridden by providing a
configuration-file. A configuration-file can be used to define the default set-
tings for several components, replace an implementation or extend specific
features. The configuration-file is, like the model-definition-file, XML based
and has the skeleton shown in Listing 7.13.
Listing 7.13: The skeleton of a configuration-file of the information system.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
<!-- Configuration -->
</config>
The system’s configuration allows defining the implementation or the
settings of several components, i.e., the Authentication & Authorization, the
Service Handler, the Query Parser & Processor, the Index Structure, or the
Cache & Storage. In addition, the available templates for the time dimen-
sion, aggregation operators, analysis techniques, or granularities of time
can be defined. The structure of this section is as follows: First, the configuration possibilities of the different components are introduced. Special focus
is on the components that cannot be defined within a model-configuration-
file, because the configuration of other components is similar to the one
presented in the previous section. Afterwards, the configuration capabilities
regarding the templates, aggregation operators, analysis techniques, and
granularities are introduced.
61 Nevertheless, a sample of a complete configuration-file is presented in the appendix.
System Configuration of Components: Authentication & Authorization,
Service Handler, and Query Parser & Processor
The configuration of the Index Structure and the Cache & Storage compo-
nent is similar to the one presented in section 7.2.1 and therefore not fur-
ther discussed. A sample configuration of the Authentication & Authoriza-
tion component is shown in Listing 7.14. The sample illustrates the usage
of the AllAccessAuthManager, which is mainly used for testing purposes. The
implementation accepts any username and password combination and
grants all permissions available to the logged in user. The second imple-
mentation deployed with the default system is the ShiroAuthManager,
which is based on the already mentioned Apache Shiro framework. The
implementation is meant to be used in productive systems and allows the
creation and management of users, roles, and permissions. In general, the
component can be replaced via configuration and cut-points may be used
to extend the configuration capabilities.
Listing 7.14: A sample configuration of the Authentication & Authorization component.
<auth>
<manager implementation="net.meisen.[...].AllAccessAuthManager" />
</auth>
The settings of the Service Handler component, responsible for accepting and forwarding requests as well as delivering responses, can be
modified with regards to the ports, timeouts, and availability. Listing 7.15
shows an excerpt of a system configuration-file. Within the example, the
ports are specified for the three interfaces: http, tsql, and control. The con-
trol interface was not introduced so far. It can be enabled to shut down the
server remotely. In addition, the http interface offers the possibility of defin-
ing the document root directory, i.e., the directory to look for website files.
If the attribute is not specified, the system will just start the services to
retrieve data via http in JSON.
Listing 7.15: Example of the system configuration of the Service Handler component.
<server>
<http port="7000" timeout="30" enable="true" docroot="www" />
<tsql port="7001" timeout="1800000" enable="false" />
<control port="7002" enable="true" />
</server>
Last but not least, the configuration of the Query Parser & Processor is
shown. The configuration allows the user to replace the query language
with a custom, possibly domain-specific, language. The default implementation
supports the TIDAQL presented in chapter 5. The configuration is defined
as child of the factories tag using the queries tag. The implementation must
implement the IQueryFactory interface to be recognized by the system.
Listing 7.16 shows an excerpt defining the default QueryFactory to be used
by the system to parse and process incoming queries.
Listing 7.16: Example of the system configuration of the Query Parser & Processor component.
<factories>
<queries implementation="net.meisen.[...].QueryFactory" />
</factories>
Extending the Templates, Aggregation Operators, Analysis Techniques,
and Granularities
Instead of replacing complete implementations for specific components,
the system supports the capability to extend the functionality by configura-
tion. The extendable functionalities are:
– add new templates for the time dimension (cf. section 7.2.1),
– specify new aggregation operators useable within the query language
(cf. section 5.3.3),
– define new analysis techniques (cf. section 5.3.3), and
– allow additional granularities of the time axis (cf. section 4.1 and sec-
tion 7.2.1).
The integration of the different extensions differs regarding the configura-
tion. Nevertheless, all extensions have in common that an implementation
has to be provided implementing the corresponding interface, i.e.,
ITimeLevelTemplate, IAggregationFunction, IAnalysis, or ITimeGranularity.
Regarding the configuration, the different techniques are explained in the
following, starting with the extension of a template for the time dimension.
Templates can easily be added by adding the concrete implementation to
the configuration as shown in Listing 7.17.
Listing 7.17: Example of the system configuration to add an additional template.
<timetemplates>
<template implementation="net.meisen.[...].templates.WeekDays" />
</timetemplates>
Similarly, the extension of aggregation operators is defined (using the
aggregations tag instead of timetemplates tag and the function tag instead
of the template tag). Instead of providing a concrete implementation of a
template, a concrete implementation of an aggregation operator is provided. By default, the following operators are added: count, min, max, sum,
mean, median, and mode (cf. DA-01). In addition, temporal aggregation
operators are available, i.e., count started and count finished (cf. DA-02).
Depending on the form of aggregation (cf. section 2.1.2) the application of
the temporal operator may be possible or not. Thus, several extensions of
the IAggregationFunction interface are available to specify the utilization of an
operator (cf. section 7.3.4):
– ILowAggregationFunction (i.e., the operator must be applied to values
of the lowest granularity, e.g., SUM(DESC1)),
– IDimAggregationFunction (i.e., the operator aggregates results, e.g.,
SUM(MAX(DESC1, DESC2))), and
– IMathAggregationFunction (i.e., the operator is used to combine values
mathematically, e.g., SUM(5, 4, 7)).
Registering new analysis techniques to the system is also realized by
simply specifying the implementation (using analyses as parent tag and
analysis as child tag). The analyses manager collects the registered in-
stances and provides the implementation after resolving the name used,
e.g., within the query language. The configuration of the concrete analysis
may have additional configuration capabilities, which are defined using the
already presented cut-points technique.
Last but not least, the extension capabilities regarding the granularity of
the time axis are explained. The default implementation utilizes a time gran-
ularity factory, which applies several techniques to search for a granularity
definition on the class-path of the application. Thus, it is typically sufficient
to add the new implementation to the class-path and use the fully-qualified
name when referring to the granularity. It is also possible to place the con-
crete implementation in one of the pre-defined packages (e.g.,
net.meisen.dissertation.model.time.granularity). If none of these tech-
niques are sufficient, it is also possible to just replace the factory’s imple-
mentation and provide a custom factory instance.
7.3 Data Structures & Algorithms
This section deals with selected aspects of the realization, which were
challenging and are interesting regarding data structures and algorithms
used to create a performant, stable, and usable system. In section 7.3.1
selected features implemented to handle the configuration of models (i.e.,
validation and mapping) are introduced. In addition, the section presents
the internal handling of the time axis. Section 7.3.2 introduces the mainly
bitmap-based indexes used to process different query types. Several utili-
zations of the indexes are illustrated and discussed. The implementation of
the cache and storage interface is introduced in section 7.3.3. The presented implementation solves the handling of the garbage collection re-
garding cached items. In section 7.3.4, the algorithm to perform the ITA and
the TAT is introduced. The algorithm utilizes the different indexes to achieve
an excellent performance. The algorithm to calculate the distance and de-
termine the k-NN of an input query is introduced in section 7.3.5. The pre-
sented algorithm utilizes the provided indexes and introduces a pruning
technique to increase the performance.
7.3.1 Model Handling
A model is the heart of the information system. The handling of data pushed
into a concrete, model-specific structure is introduced and discussed in
sections 7.3.2 and 7.3.3. The utilization of the structures used to calculate ag-
gregations and distances is presented in section 7.3.4 and 7.3.5. However,
the internal representation of specific elements of the model is presented
in this section, i.e., the time axis and descriptors. In addition, this section
presents selected algorithms, i.e., processing of a raw data record, valida-
tion of descriptor’s dimensions, as well as mapping of descriptive values
and data time points.
TimeAxis Data Structure
The data associated with specific chronons is the most frequently requested
information within the system. As mentioned previously, internally a chro-
non represents, if the time axis is based on time, a time point in the UTC
time zone. Each chronon is thereby normalized, so that the start of the time
axis is represented by 0 and the end of the time axis is represented by the
amount of chronons - 1. If on the other hand the time axis is integer based,
i.e., the start and end values are specified by integers, no time zone is
applied. Figure 7.7 illustrates three configurations and the internal, normal-
ized representation. Assuming the definition of the time axis shown on top,
the value 2005-01-01 is mapped to the value 4. Regarding the time
2015-01-20 08:07:00 and the definition shown in the middle, the time is
normalized to 29,287. Using the definition of the time axis shown on the
bottom of Figure 7.7, the value 1981 is represented by 1931 (i.e.,
1981 - 50, because 50 is the defined start).
Figure 7.7: Three different time axis configurations and an illustration of the internal representation as array.
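The normalization illustrated above can be sketched as follows; the method names are illustrative, not the system's actual API.

```java
// Sketch of the chronon normalization described above: the start of the
// time axis maps to 0 and the end to the amount of chronons - 1. The
// method names are illustrative, not the system's actual API.
public class ChrononNormalizer {

    public static long normalize(long value, long axisStart) {
        return value - axisStart; // e.g., 2005 with start 2001 yields 4
    }

    public static long denormalize(long chronon, long axisStart) {
        return chronon + axisStart;
    }
}
```

Applied to the examples of Figure 7.7, the year 2005 with a start of 2001 normalizes to 4, and the value 1981 with a start of 50 normalizes to 1931.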
The data structure used to realize a time axis must be capable of han-
dling large amounts of chronons and performant when iterating over a time-
window or updating associated information. To use the best fitting structure,
an algorithm evaluates the defined time axis definition (cf. section 7.2.1) by
determining the amount of chronons to be handled. Based on the result of
the calculation and the available memory for a model (configuration de-
pendent), the structure chosen differs between
– a dynamic array (i.e., internally a list collection is used, which is ex-
tended if needed),
– a fixed array (i.e., a typical array), or
– an extended array (i.e., if the expected size exceeds the memory or the
maximal size of an array62, nested arrays are utilized).
Independent of the chosen type of array, the resulting structure is capable
of retrieving an element for a specific integer value (internally the primitive
data type long is used, which allows a maximum of 2⁶³ – 1 elements). Thus,
the retrieval of an element associated to a specific chronon is achieved,
independently of the chosen type, by simply calling the get(long) method.
62 Java can hold up to 2³¹ – 1 elements within one array, which needs a size of 8 GB main memory. Nowadays, this amount of memory is not a limit anymore.
Nevertheless, the runtime to retrieve a value from the internally used data
structure depends on the type of array and on whether the associated element is
cached or not. A (cached) value can be retrieved from a dynamic or an
extended array in O(1) and added in O(n). Regarding a fixed array, the per-
formance of retrieving and adding is O(1). Thus, the preferred type is the
fixed array, which is selected if enough memory is available and the amount
of chronons does not exceed the available size.
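The extended array mentioned above can be sketched with nested arrays addressed by a long index; the chunk size below is an assumption for illustration, the system's actual sizing depends on the available memory.

```java
// Sketch of the "extended array": nested arrays addressed by a long
// index, used when the chronon count exceeds a single array's limits.
// The chunk size is an assumption for illustration only.
public class ExtendedArray {

    private static final int CHUNK = 1 << 20; // assumed chunk size
    private final Object[][] chunks;

    public ExtendedArray(long size) {
        int amount = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new Object[amount][]; // inner arrays are created lazily
    }

    public Object get(long index) {
        Object[] chunk = chunks[(int) (index / CHUNK)];
        return chunk == null ? null : chunk[(int) (index % CHUNK)];
    }

    public void set(long index, Object value) {
        int outer = (int) (index / CHUNK);
        if (chunks[outer] == null) {
            chunks[outer] = new Object[CHUNK];
        }
        chunks[outer][(int) (index % CHUNK)] = value;
    }
}
```

The lazy creation of the inner arrays keeps the memory footprint proportional to the chunks actually used, which matches the cached retrieval behavior described above.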
Temporal Mapping Function
An important aspect is the handling of interval endpoints, which do not fit
to the time axis granularity or boundaries, e.g., assuming the top time axis
of Figure 7.7 and the value 2015-06-22. If a value does not fit neatly to a
specific granularity, the algorithm has to decide if the value has to be
mapped to the smaller or larger representative. Table 7.1 lists some results
of the mapping algorithm, introduced below, assuming the top time axis
definition of Figure 7.7. It should be stated that the presented results are
not showing the internal index values (i.e., the normalized value). Instead,
the actual year is shown (i.e., the de-normalized value).
Table 7.1: Results of the default temporal mapping algorithm, assuming the top time axis definition of Figure 7.7 (the visualization column of the original table is omitted).

# | Interval ([date, date]) | Result ([year, year])
1 | [2001-01-01, 2002-03-01] | [2001, 2003]
2 | [1981-01-20, 2081-01-20] | [2001, 2050]
3 | [2051-01-20, 2070-01-20] | discarded
4 | [2040-12-12, 2050-01-01] | [2040, 2050]
The mapping algorithm uses the following types of information to deter-
mine the mapped value:
– the normalized (or de-normalized) value, and
– the position of the value to be mapped regarding the interval (i.e., is
the value the start or the end endpoint of the interval).
If the value is the start value of an interval, it picks the smaller value, oth-
erwise the larger value is chosen (cf. Table 7.1, #1 and #4). Thus, looking
at the value 2015-06-22 and the top time axis of Figure 7.7, the mapping
algorithm would pick 2015 for the start value, and 2016 for the end value.
Another mismatch occurs, if the provided value exceeds the limits of the
time axis. In that case, the default mapping algorithm maps the value to the
boundary of the time axis, if and only if the other value of the interval does
not exceed the same boundary (cf. Table 7.1, #2). If both values exceed the
same boundary, the interval is discarded (cf. Table 7.1, #3). The last mis-
match that may occur addresses missing values. As already introduced in
section 7.2.1, three strategies are implemented, which can be picked by
configuration. By default, the algorithm applies the boundariesWhenNull
strategy for missing values.
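The endpoint mapping rules can be sketched as follows, representing dates as fractional years for simplicity. The boundaries correspond to the top time axis of Figure 7.7 (years 2001 to 2050); returning null models a discarded interval. This is an illustration of the rules, not the system's mapper.

```java
// Sketch of the default endpoint mapping rules, representing dates as
// fractional years for simplicity. The boundaries correspond to the top
// time axis of Figure 7.7; this is NOT the system's actual mapper.
public class TemporalMapping {

    private static final int MIN = 2001, MAX = 2050;

    public static int[] map(double start, double end) {
        // both endpoints exceed the same boundary: discard (Table 7.1, #3)
        if ((start < MIN && end < MIN) || (start > MAX && end > MAX)) {
            return null;
        }
        // a start value picks the smaller, an end value the larger year
        long s = (long) Math.floor(start);
        long e = (long) Math.ceil(end);
        // values exceeding the time axis are mapped to its boundary (#2)
        s = Math.max(s, MIN);
        e = Math.min(e, MAX);
        return new int[] { (int) s, (int) e };
    }
}
```

With these rules, the interval of Table 7.1, #1 maps its end 2002-03-01 up to 2003, while #3 (both endpoints beyond 2050) is discarded.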
Descriptor Data Structure
A descriptor is managed as a collection of descriptor values. The collection
is thereby optimized for the retrieval of descriptor values using an internally
used identifier, the value, or the unique string representation of the value
(cf. section 7.3.2 for an introduction of the used indexes). Whenever a new
descriptor value is added to the collection the following algorithm is applied:
the value is validated (i.e., according to the specifications defined, e.g., is
null allowed as value and is it unique), a unique identifier is generated using
the specified or default identifier factory (cf. section 7.2.1), and the value is
added to the indexes.
A descriptor value is also represented by a data structure63, which
63 The descriptor value is realized as a class, which is assumed to be a data structure as well
following Martin (2009, pp. 93–101).
provides the identifier, the value, the unique string representation, and the
fact function. The fact function is thereby optimized (i.e. the fact is only re-
trieved once, if the type is value- or record-invariant, cf. section 4.2 or 7.3.4)
to increase the performance.
Descriptive Mapping Function
The previous subsection described the data structure used to represent a
descriptor and its values. However, the creation of a new descriptor value
was not introduced. Whenever a descriptive value is pushed to the system,
the system picks up the descriptor the descriptive value belongs to, e.g.,
specified by the structure of the insert statement (cf. section 5.3.1). To de-
termine the descriptor values associated to the descriptive value, the de-
scriptor utilizes the defined mapping function (cf. section 4.2 and 7.2.1).
Figure 7.8 illustrates the handling of an insert statement and the utilization
of the descriptive mapping function to determine the involved descriptor
values.
Figure 7.8: Illustration of the algorithm used to map descriptive values, e.g., [flu, cold] to the descriptor values flu and cold.
Processing a Raw Data Record
Whenever a record is added to the system, the system validates the record
(i.e., by applying the different mapping functions and validation strategies)
and assigns a unique identifier to the record. Once assigned, the unique
identifier cannot be used again by any other record. Nevertheless, cleaning
procedures can be scheduled for any model to reset and reuse available
identifiers (e.g., if a record was deleted). The system is capable of creating
2⁶³ – 1 = 9,223,372,036,854,775,807 (i.e., in words more than nine quintil-
lion64) unique identifiers. However, it is worth mentioning, that the currently
available, different bitmap implementations only support the usage of int-
values as position, i.e., 2³¹ – 1 = 2,147,483,647 (in words more than two
billion). Because of the importance of bitmaps for the indexing (cf. section
7.3.2), the system is capable of handling 2 billion raw records with the valid
record index. The whole process of the assignment of a unique identifier is
thread-safe and thereby ensures that no identifier is used several times.
Figure 7.9 exemplifies the processing of a raw data record assuming the
specified time axis definition, the assignment of an identifier of 7 to the
descriptive value cleaning of the descriptor department, as well as the al-
location of a unique identifier of 5 to the record.
Figure 7.9: Example of a result of the processing of a raw data record.
Validating a Descriptor Dimension
The validation of a descriptor dimension is performed whenever a dimen-
sion is added to the system, e.g., by configuration (cf. section 7.2.1). The
64 Since Java 8 introduced support for unsigned int- and long-arithmetic, this number may be increased to 2⁶⁴ – 1 in the future, respectively 2³² – 1 for int-values.
algorithm checks every hierarchy of the dimension, by testing the criteria
specified in section 4.4, i.e.,
1. there is only one sink (a.k.a. root),
2. the sink is reachable from every node,
3. every source refers to a descriptive value, and
4. a partial order over a partition of all nodes is provided.
The validation of the first three criteria, i.e., 1 – 3, is performed by iteration
over the defined nodes. The algorithm starts by picking a node randomly. It
follows the paths to the sink and assigns the minimal and maximal distance
to the sink to each node. If an already assigned node is found, the algorithm
validates, if
– the node was assigned in the same iteration (if so an exception is
thrown, because a loop was found),
– the node is a sink (i.e., has no parents), in which case the algorithm stops, or
– the node cannot reach any sink (if so an exception is thrown, because
criterion (2) is not met).
Afterwards, the algorithm validates if exactly one sink was found (1) and if
every source refers to a descriptive value (3). In addition, the algo-
rithm checks if the partial order is provided, by checking the minimal and
maximal distances calculated (4).
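Criteria (1) and (2) can be sketched with a simple reachability check; the distance-based loop detection and criteria (3) and (4) are omitted for brevity, and the graph representation (a node mapped to its parents) is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of criteria (1) and (2): exactly one sink exists and the sink is
// reachable from every node. Edges point from a node to its parents
// (towards the sink); loop handling and criteria (3)-(4) are omitted.
public class HierarchyValidator {

    public static boolean validate(Map<String, List<String>> parents) {
        // a sink is a node without parents
        List<String> sinks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : parents.entrySet()) {
            if (e.getValue().isEmpty()) {
                sinks.add(e.getKey());
            }
        }
        if (sinks.size() != 1) {
            return false; // criterion (1) violated
        }
        String sink = sinks.get(0);
        for (String node : parents.keySet()) {
            if (!reaches(node, sink, parents, new HashSet<>())) {
                return false; // criterion (2) violated
            }
        }
        return true;
    }

    private static boolean reaches(String node, String sink,
            Map<String, List<String>> parents, Set<String> seen) {
        if (node.equals(sink)) {
            return true;
        }
        if (!seen.add(node)) {
            return false; // node already visited on this search
        }
        for (String parent : parents.get(node)) {
            if (reaches(parent, sink, parents, seen)) {
                return true;
            }
        }
        return false;
    }
}
```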
7.3.2 Indexes
The system utilizes several index structures to increase the performance
of filtered data, aggregation, and distance calculation. In this section, several indexes are introduced, which are held in main or secondary memory. The
decision regarding the type of memory and used index structure depends
on different aspects, i.e., the number of entities held within the index and
the type of data (e.g., descriptor values or data). Using secondary
memory typically implies the utilization of a cache to increase performance
(cf. 7.1.2 and 7.3.3). In the following, the index structure used for
descriptors, the bitmap-based index structure used to increase the perfor-
mance of data related tasks, and the indexing of raw data is introduced.
Indexing Descriptors
The collection of the different descriptor instances (i.e., the descriptors)
has to be searched by the unique identifier of a descriptor, which is typically
a string. The number of entities created within a model is, contrary to the
number of data records, expected to be small. Thus, a main memory index structure
is utilized. Several tests showed that a HashMap performs best in the case
of strings (cf. section 8.2.1) having on average a complexity of O(1) (cf.
Goodrich, Tamassia (2006, pp. 374–390)). Thus, the implementation of the
descriptors class is based on a hash map, to collect all the descriptor in-
stances and search for one using the unique identifier. Figure 7.10 depicts
the main memory index structure used by the implementation of the de-
scriptors.
Figure 7.10: Illustration of the index structure (HashMap) used by the descriptors index (cf. Goodrich, Tamassia (2006)).
In addition to the search for specific descriptors, it is also important to
be able to search for descriptor values. The different descriptor values are
managed and collected by a descriptor and to find a specific descriptor
value the following attributes are typically used:
– the internally used identifier (used internally by the indexes),
– the value, which might be an object, or a primitive value (used to detect
duplicates), or
– the unique string representation of the value (used when parsing que-
ries).
In general, a main memory index is created for all of these attributes using
the IndexFactory to select the best fitting index (cf. section 7.2.1 and 7.2.2).
In case of the indexes, utilized for the internal identifier and the value, high
performance collections are typically chosen. The index for the unique
string is, like the one for the descriptors and unless configured otherwise, a
HashMap.
Indexing Data for Filtering, Aggregation and Distance Calculation
When retrieving, aggregating, or calculating the distance between datasets, it is important that the selection of the dataset is performed fast. In
the field of data analysis, the dataset is typically filtered by several attrib-
utes and aggregated (Kimball, Ross 2002; Abdelouarit et al. 2013). In the
case of time (interval) data analysis, the dataset is additionally partitioned
over time prior to aggregation (Kline, Snodgrass 1995; Böhlen et al. 2008).
Figure 7.11 illustrates a typical processing of an analytical query. First, the
filter is applied to retrieve the subset of relevant data from the database.
The resulting subset is partitioned and the aggregation is applied for each
partition.
Figure 7.11: The different tasks (filtering, partitioning, and aggregating) to be performed to handle an analytical query.
It is a matter of common knowledge that bitmap indexes outperform typ-
ical tree-based index structures when the used filter addresses several at-
tributes (cf. section 3.3.1, Abdelouarit et al. (2013)). However, the usage of
bitmap indexes to apply different aggregation operators is, with the excep-
tion of count and some context specific operations (e.g., Kaser, Lemire
(2014)), not common. In this section, a bitmap-based index structure is pre-
sented which increases the performance of filtering, aggregation, and also
distance calculation (with regards to the introduced TIDADISTANCE, cf. chap-
ter 6).
The index structure consists of four indexes: valid record index, data
descriptor index, time axis index, and fact descriptor index. Each of the
indexes is motivated in detail in the following, starting with the valid record
index.
The valid record index is used to determine if a record is still valid, i.e.,
not deleted. It only consists of a bitmap (called the tombstone bitmap),
which contains a 1 at the position determined by the record’s unique iden-
tifier, if and only if the record is added correctly and not deleted. The index
is cached and stored, but typically resides in main memory, because of its
frequent usage.
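A sketch of the valid record index follows, using java.util.BitSet as a stand-in for the system's compressed bitmap implementation.

```java
import java.util.BitSet;

// Sketch of the valid record index; java.util.BitSet stands in for the
// compressed bitmap implementation actually used by the system.
public class ValidRecordIndex {

    private final BitSet tombstone = new BitSet();

    public void add(int recordId) {
        tombstone.set(recordId); // 1 iff the record is added and not deleted
    }

    public void delete(int recordId) {
        tombstone.clear(recordId);
    }

    public boolean isValid(int recordId) {
        return tombstone.get(recordId);
    }
}
```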
The second index to be introduced is the data descriptor index. It is used
to assign a record to its associated descriptor values. By default, the index
utilizes a HashMap to map a descriptor identifier (i.e., a string) to an array-
like index structure. The array-like index structure associates the internal
identifiers (typically primitives) to bitmaps. Normally, the array-like index
structure is realized by a high performance collection, i.e., by default one
of Trove’s array list implementations. Figure 7.12 depicts the data de-
scriptor index. The complexity of the retrieval of a bitmap, which may be
loaded from the secondary memory if not cached, is on average O(1)65.
65 The retrieval of the collection from the HashMap is O(1). In addition, the high performance collections typically utilize an array, which also has a search complexity of O(1). Furthermore, to determine the internal identifier of a specific descriptor value, the descriptors index may be utilized, which has an average search complexity of O(1).
When adding a new record to the system, the bitmaps of the descriptor values associated to the record are set to 1 at the position specified by the record's unique identifier.
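As a hedged sketch of this index, a plain ArrayList replaces Trove's primitive collections and java.util.BitSet replaces the compressed bitmaps; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the data descriptor index: descriptor identifier -> array-like
// structure mapping internal value identifiers to bitmaps.
class DataDescriptorIndex {
    private final Map<String, List<BitSet>> index = new HashMap<>();

    // Retrieve (or lazily create) the bitmap of a descriptor value; the
    // look-up is O(1) on average (HashMap get plus array access).
    BitSet bitmapOf(String descriptor, int valueId) {
        List<BitSet> values = index.computeIfAbsent(descriptor, d -> new ArrayList<>());
        while (values.size() <= valueId) {
            values.add(new BitSet());
        }
        return values.get(valueId);
    }

    // Adding a record sets, for each of its descriptor values, the bit at
    // the position given by the record's unique identifier.
    void indexRecord(int recordId, String descriptor, int... valueIds) {
        for (int valueId : valueIds) {
            bitmapOf(descriptor, valueId).set(recordId);
        }
    }
}
```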
Figure 7.12: The data descriptor index, using by default a HashMap and a high performance collection (Trove) to index bitmaps.
The third index used in the context of indexing a record is the time axis
index. The structure of the index, used to retrieve time related entities, is
presented in section 7.3.1. The used array structure ensures (in the fixed
form) a retrieval of the associated bitmap in O(1). The bitmap of a chronon
is set to 1 at the record’s identifier position, if and only if the interval of the
record contains the chronon.
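A rough sketch of this index, with java.util.BitSet in place of the compressed bitmaps and a plain array as the fixed structure (all names are illustrative):

```java
import java.util.BitSet;

// Sketch of the time axis index: one bitmap per chronon of a fixed time
// axis; the bitmap of a chronon contains a record iff the record's
// interval covers that chronon.
class TimeAxisIndex {
    private final BitSet[] chronons;

    TimeAxisIndex(int axisLength) {
        chronons = new BitSet[axisLength];
        for (int i = 0; i < axisLength; i++) {
            chronons[i] = new BitSet();
        }
    }

    // Index a record's interval [start, end] (chronon positions, both
    // inclusive) by setting its bit for every covered chronon.
    void indexInterval(int recordId, int start, int end) {
        for (int t = start; t <= end; t++) {
            chronons[t].set(recordId);
        }
    }

    // Fixed array structure: retrieval of a chronon's bitmap is O(1).
    BitSet bitmapOf(int chronon) {
        return chronons[chronon];
    }
}
```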
To ensure fast retrieval of the facts associated to a specific record, a
fourth index is created and maintained. The so-called fact descriptor index
retrieves all the facts associated to a descriptor if the facts are value- or record-invariant (cf. section 4.2). In addition, it provides a list of the descriptor values having the specified fact as a result of their fact function. More specifically, the index is used to retrieve all the facts for a specific descriptor and the corresponding descriptor values of each fact. The index is sorted ascending by fact and collects statistical values like the number of not-a-number and numeric facts. If the descriptor contains record-variant facts, the index returns a null-pointer. Because of the underlying TreeSet, the complexity of adding a value to the index is O(log n). The retrieval of specific values from the TreeSet is typically not performed. Instead, the minimum, the maximum, or an iterator is retrieved, whereby these operations have a complexity of
O(1). The index persists the sets and may have to load them from the sec-
ondary memory if not cached. Figure 7.13 illustrates an example of the
index structure. For each descriptor a reference to a tree-set like structure
is stored, which holds the value- or record-invariant facts associated to the
descriptor values.
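The fact-sorted structure can be sketched roughly as follows; a TreeMap keyed by fact stands in for the system's TreeSet-based structure, and a BitSet of descriptor-value identifiers replaces the descriptor-value list (all names are illustrative):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the fact descriptor index: per descriptor, a tree structure
// sorted ascending by fact, associating each value- or record-invariant
// fact with the descriptor values producing it.
class FactDescriptorIndex {
    private final Map<String, TreeMap<Double, BitSet>> index = new HashMap<>();

    void add(String descriptor, double fact, int descriptorValueId) {
        index.computeIfAbsent(descriptor, d -> new TreeMap<>())
             .computeIfAbsent(fact, f -> new BitSet())
             .set(descriptorValueId);
    }

    // The sorted structure makes the minimum and maximum fact directly
    // available, without scanning all facts of the descriptor.
    double minFact(String descriptor) { return index.get(descriptor).firstKey(); }
    double maxFact(String descriptor) { return index.get(descriptor).lastKey(); }
}
```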
Figure 7.13: Example of the structure of the fact descriptor index, associating facts with descriptor values.
Indexing Raw Data Records
The previously introduced indexes are used to retrieve specific information
about the stored records, i.e., the bitmaps are used to associate the record
to a specific value, whereby the fact descriptor index keeps track of the
facts used. In some situations the retrieval of raw records is necessary,
e.g., when dealing with record-variant facts or if requested by a query.
When retrieving a record, the system typically determines the unique identifiers first. Thus, the retrieval can easily be performed via the primary key (i.e., the unique identifier of the record). Modern DBMS are designed to
perform exactly these tasks. Thus, in a productive system, the information
system should outsource this task and utilize a DBMS. Nevertheless, for
non-productive systems the information system offers the functionality to
keep the records in-memory, use a map-based embedded database en-
gine66, or reconstruct a record from the known information available within
the bitmap indexes67.
Using the Indexes for Filtering and Grouping
This section describes the algorithm used to filter and group (i.e., creating
the subsets) the dataset. The algorithms used to aggregate, calculate the
distance, or apply analysis are based on this result. The process is shown
in Figure 7.11 for the case of aggregation. Figure 7.14 depicts an example
database and the state of the indexes (with the exception of the raw record index and the descriptors index). The time axis is assumed to have a minute
granularity, starting at 00:00, and ending at 23:59 (of some random day;
time zone UTC). In addition, two descriptors are defined: the type de-
scriptor using a record-invariant fact function (i.e., cleaning always is
mapped to the value 4 and fueling to 2) and the pos descriptor using a
value-invariant fact function (i.e., always returning 1). Furthermore, one of
the intervals is associated to two descriptor values of the type descriptor
(creating a many-to-many relationship, cf. the summarizability problem
mentioned in section 3.2.1).
66 http://www.mapdb.org
67 The reconstructed record does not reflect the raw record, but contains all data of the record known by the system, i.e., descriptor values, start and end time, and unique identifier.
Figure 7.14: An example database with data related indexes.
The following select statement is used to exemplify the filtering and group-
ing algorithm:
SELECT TIMESERIES OF SUM(type) FROM sampleModel IN [10:44, 10:45]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
After parsing the example query, the filtering and grouping algorithm is ap-
plied. First, the algorithm retrieves the bitmaps referred to in the WHERE-
part of the statement utilizing the data descriptor index, i.e., in the example
the fueling bitmap. The algorithm evaluates the specified logical conditions
(i.e., AND, OR, or NOT) by applying the equivalent logical bitmap operation
to the retrieved bitmaps. The result of these operations is called the filter
bitmap, in the example the filter bitmap is (0, 1, 1). In the next step, the
algorithm retrieves the tombstone bitmap from the valid record index and
AND-combines it with the filter bitmap, resulting in the valid-filter bitmap, in
the example the valid-filter bitmap is equal to the filter bitmap. Afterwards,
the different groups have to be determined. This is done using the de-
scriptors index, which is used to retrieve all descriptor value instances for
a specific descriptor. The algorithm combines the different descriptor val-
ues with each other, validates specified includes and excludes, and creates
for each group the resulting bitmap (using the data descriptor index) by
AND-combining all descriptor value bitmaps of a group. Table 7.2 shows
two examples (one as defined in the sample query) of resulting bitmaps
created for a group by expression.
Table 7.2: Examples of different group-bitmaps created for specific GROUP BY expressions based on the example database shown in Figure 7.14.
GROUP BY: type, pos EXCLUDE {('cleaning', '*')}
Groups: 1: (fueling, A32), 2: (fueling, B35); Bitmaps: 1: (0, 0, 1), 2: (0, 1, 0)
GROUP BY: pos, type INCLUDE {('B35', 'cleaning')}
Groups: 1: (cleaning, B35)
Thus, the final result of the algorithm returns two bitmaps, i.e., (0, 0, 1) for
the (fueling, A32) group and (0, 1, 0) for the (fueling, B35) group. Summarized, the algorithm performs the following steps:
1. evaluate filter condition (apply the descriptors index to retrieve the in-
ternally used identifiers) and create the filter bitmap (utilizing the data
descriptor index),
2. retrieve the tombstone bitmap (from the valid record index) and com-
bine it with the filter bitmap to retrieve the valid-filter bitmap,
3. determine the different groups (using the descriptors index to resolve
strings) and create a group-bitmap for each group entry, and
4. combine the valid-filter bitmap with each group-bitmap to create a set
of valid-filter-group bitmap instances for each specified group.
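These four steps can be sketched, for a single group, roughly as follows; java.util.BitSet stands in for the compressed bitmaps, the bit values replay the example from Figure 7.14, and all class and method names are illustrative:

```java
import java.util.BitSet;

// Sketch of the filtering and grouping steps, replayed on the example of
// Figure 7.14 (three records; bit i represents the record with id i).
class FilterGroupSketch {

    // Helper: build a bitmap from a tuple such as (0, 1, 1).
    static BitSet of(int... bits) {
        BitSet b = new BitSet();
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] == 1) b.set(i);
        }
        return b;
    }

    static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // Step 2: AND-combine the filter bitmap with the tombstone bitmap;
    // step 4: AND-combine the result with the group bitmap.
    static BitSet validFilterGroup(BitSet filter, BitSet tombstone, BitSet group) {
        return and(and(filter, tombstone), group);
    }
}
```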
If a level of a descriptor dimension is used within the group by expression, the algorithm performs the same steps. Instead of retrieving the bitmaps
for each descriptor value when creating the group bitmaps, the algorithm
fetches the bitmaps associated to each member of the level and creates a
group bitmap for each member. Figure 7.15 depicts the process assuming
that an additional descriptor value (i.e., B40) is added without having any
data associated.
Figure 7.15: Illustration of the group bitmap calculation, in the case of the usage of a dimension’s level within the group by expression.
To determine the final result of the query, the specified aggregation op-
erator has to be applied. The algorithm used to determine the final aggre-
gated results, based on the different valid-filter-group bitmaps, is presented
in section 7.3.4. The implementation of the frequently mentioned retrieval of data from the cache, regarding bitmaps (and the fact sets, which have not been utilized further so far), is introduced in the next section, i.e., section 7.3.3.
7.3.3 Caching & Storage
The caching technique and secondary memory utilization depends on the
configuration of the caches (cf. section 7.2). By default, caches are pro-
vided by, e.g., libraries like ehCache, any modern DBMS, or the object-
relational mapping framework Hibernate. Nevertheless, a concrete imple-
mentation of a cache for the information system should be independent
and may decide to use its own implementation or to utilize a caching library. The information system provides techniques enabling the usage of any cache with respect to the releasing of objects from the cache.
The important aspect is how a reference to a cached object is handled within the information system. In general, a reference (e.g., in Java) is a strong reference, i.e., the object referred to is not eligible for
garbage collection as long as the reference exists. Regarding caching,
such a strong reference is helpful as long as the entity is needed, i.e.,
whenever a query is processed. Nevertheless, keeping a strong reference to an object managed by an underlying cache may lead to memory problems, because the cache is not capable of removing the object from main memory as long as other instances hold a strong reference (Jones et al. 2012, pp. 11–15). If, on the other hand, the cache is capable of informing the instance keeping the strong reference, the instance is able to remove the reference. Thus, two different strategies have to be considered
by the information system: (1) a cache publishing the release of an object
to a listening instance or (2) a cache removing the reference to release
memory without any notification.
To support the different types of caches, the information system pro-
vides two interfaces, i.e., the IReleaseMechanismCache and the
IReferenceMechanismCache. The former is used by caches capable of informing another instance about the release. The interface forces the cache to offer a method to register an observer. The information system registers such an observer, and whenever the observer is informed, the strong reference is removed so that garbage collection can take place. The latter interface is used by caches which do not provide any information about removed objects. In that case, the information system holds an instance using a weak reference (Jones et al. 2012, pp. 221–226). Whenever an ob-
ject is requested (e.g., when processing a query), the weak reference is
validated and a strong reference is returned. As long as the object is
needed (e.g., by the query processor) a valid reference is available. When
the strong reference is removed (e.g., because the processing is finished),
the information system has only a weak reference left. Thus, the cache is
capable of managing the objects without publishing any information about
a release.
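The weak-reference handling for an IReferenceMechanismCache can be sketched as follows, using java.lang.ref.WeakReference; the class and method names are illustrative:

```java
import java.lang.ref.WeakReference;

// Sketch of the reference handling for a cache without release
// notifications: the system keeps only a weak reference; a strong
// reference is handed out on request and dropped again when the
// processing (e.g., of a query) is finished.
class WeakCacheEntry<T> {
    private final WeakReference<T> ref;

    WeakCacheEntry(T object) {
        this.ref = new WeakReference<>(object);
    }

    // Validates the weak reference and returns a strong reference, or
    // null if the underlying cache (i.e., the garbage collector) has
    // already released the object.
    T request() {
        return ref.get();
    }
}
```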
7.3.4 Aggregation Techniques
As mentioned, aggregating data is one of the predominant operations used in data analysis. The performance of the aggregation is crucial for any system. Several performance-increasing techniques have been introduced in recent years, as presented in section 3.3.2. In this section, the algorithms to calculate aggregates of the form STA and TAT, based on the indexes presented in section 7.3.2, are introduced. In particular, the array-based time axis index is of importance to quickly retrieve the bitmaps of the chronons.
Span Temporal Aggregation
The aggregation algorithm expects a set of valid-filter-group bitmap in-
stances to be passed, as well as the parsed query. Receiving these param-
eters, the algorithm determines the relevant chronons selected by the
statement. Furthermore, the algorithm checks if a partition, in the form of a dimension's level, is specified within the statement. Looking at the following, previously used, example statement, the algorithm determines the chronons representing 10:44 and 10:45, as well as the absence of a dimension's level:
SELECT TIMESERIES OF SUM(type) FROM sampleModel IN [10:44, 10:45]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
Based on this information and the passed parameters, the algorithm is ca-
pable of performing the aggregation for each single chronon, by applying,
for each bitmap associated to a chronon, a logical AND-operation with the
valid-filter-group bitmap of each group. The result is a list of final bitmaps, which can be used to calculate any aggregation using STA. Figure 7.16 illustrates the final bitmaps for the different chronons and groups, i.e., (fueling, A32, 10:44), (fueling, B35, 10:44), (fueling, A32, 10:45), and (fueling, B35, 10:45).
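The per-chronon combination can be sketched as follows (java.util.BitSet in place of compressed bitmaps; all names are illustrative):

```java
import java.util.BitSet;

// Sketch of the STA step: for each selected chronon, the chronon's bitmap
// is AND-combined with the valid-filter-group bitmap of a group, yielding
// one final bitmap per (group, chronon) combination.
class StaSketch {
    static BitSet finalBitmap(BitSet chronon, BitSet validFilterGroup) {
        BitSet r = (BitSet) chronon.clone();
        r.and(validFilterGroup);
        return r;
    }
}
```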
Figure 7.16: The four resulting bitmaps for the different chronons and groups.
Based on the final bitmap and the fact descriptor index, the algorithm
calculates the aggregated value for each chronon. Table 7.3 shows the bit-
map-based algorithms for each aggregation operator. Some operators uti-
lize the fact descriptor index (referred to as factDescIdx) to retrieve the facts
associated to the specified descriptor. The implementation provides the
possibility to iterate in ascending, descending, or random order. In addition,
the iterator returns descriptor values, which can easily retrieve their fact (if
record-variant, the raw data record index is used) and the associated bit-
map (using the descriptors index). The iterator retrieved from the fact de-
scriptor index is also bitmap-based and uses internally the final bitmap,
which is passed as a parameter when creating the iterator. The algorithm utilized for iteration combines the final bitmap with that of the current descriptor value (i.e., the one associated to the current fact) and returns the current fact as many times as indicated by the resulting bitmap's cardinality (i.e., its count). The complexity of the algorithms may be determined by considering that:
– the count-operator can be assumed to perform in O(1) ("computing the
cardinality of a Roaring bitmap can be done quickly: it suffices to sum
at most ceil(n/2^16) counters" (Chambi et al. 2015)),
– the iteration is done in O(m) (with m being the cardinality of the de-
scriptor), and
– the complexity of logical operations is "O(n1 + n2) time, where n1 and
n2 are the respective lengths of the two compared arrays" (Chambi et
al. 2015).
However, the latter statement depends on the data added to the system, i.e., the size of the arrays cannot be determined in advance. Thus, a simple average complexity cannot be provided. Nevertheless, Chambi et al. (2015) state that
"we can compute and write bitwise ORs at 700 million 64-bit words per
second", which sounds sufficient, even if they state further that if they "com-
pute the cardinality of the result as we produce it, our estimated speed falls
to about 500 million words per second".
Table 7.3: List of algorithms used to calculate the different aggregated values.
aggregation operator
Aggregation Algorithm bf ≙ final bitmap, bt ≙ bitmap of chronon
sum
it = factDescIdx.iterator(bf);
res = it.hasNext() ? 0 : NaN; // 0, not NaN: NaN would absorb every addition
while (dv = it.next())
  res += dv.fact ∙ count(dv.bitmap AND bf);
return res;
median it = factDescIdx.ascIterator(bf);
cnt = count(bf);
even = (cnt & 1) == 0;
firstPos = floor(cnt * 0.5) + (even ? -1 : 0);
curPos = 0;
while (curPos != firstPos) {
  it.next();
  curPos++;
}
if (even) {
  return 0.5 ∙ (it.next().fact + it.next().fact);
} else {
  return it.next().fact;
}
mode it = factDescIdx.ascIterator(bf);
lastFact = NaN;
mode = NaN;
maxAmount = 0;
counter = 0;
while (it.hasNext()) {
  fact = it.next().fact;
  if (lastFact == fact) {
    counter++;
    continue;
  } else if (counter > maxAmount) {
    maxAmount = counter;
    mode = lastFact;
  } else if (counter == maxAmount) {
    mode = NaN;
  }
  counter = 1;
  lastFact = fact;
}
if (counter > maxAmount) {
  return lastFact;
} else if (counter < maxAmount) {
  return mode;
} else {
  return NaN;
}
count return count(bf)
min it = factDescIdx.ascIterator(bf);
return it.getNextFact();
max it = factDescIdx.descIterator(bf);
return it.getNextFact();
mean return sum / count;
count finished return count((bf XOR b(t+1)) AND bf)
count started return count((b(t-1) XOR bf) AND bf)
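As a hedged, executable sketch of the sum algorithm from Table 7.3, a TreeMap sorted ascending by fact and java.util.BitSet stand in for the system's fact descriptor iterator and compressed bitmaps; all names are illustrative:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the bitmap-based sum: iterate the facts in ascending order
// and add each fact as often as the cardinality of its bitmap
// AND-combined with the final bitmap bf indicates.
class SumAggregator {
    // factToBitmap: fact -> bitmap of the records carrying that fact
    // (record bitmaps directly, for simplicity of the sketch).
    static double sum(TreeMap<Double, BitSet> factToBitmap, BitSet bf) {
        double res = 0;
        for (Map.Entry<Double, BitSet> e : factToBitmap.entrySet()) {
            BitSet b = (BitSet) e.getValue().clone();
            b.and(bf); // restrict the fact's bitmap to the final bitmap
            res += e.getKey() * b.cardinality();
        }
        return res;
    }
}
```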
Two-Step Aggregation Technique
As mentioned at the beginning of the previous section, the algorithm
checks the parsed query for the relevant chronons, as well as a defined
partition. If a partition is defined, the TAT can be applied to calculate the
aggregated value across the partitions. Modifying the previously used sample query by specifying a dimension's level (i.e., HOUR), a second aggre-
gation operator (i.e., MAX), and a different time window (i.e., [10:00,
12:00]), the following sample query is assumed:
SELECT TIMESERIES OF MAX(SUM(type)) ON TIME.PARTITION.HOUR
FROM sampleModel IN [10:00, 12:00]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
To determine the result of the query, the algorithm performs the same steps as previously explained in the case of the STA. After all values for a specific partition are retrieved, the algorithm simply applies the second operator to the set of retrieved numbers (which might have to be sorted, e.g., in the case of the median).
SELECT TIMESERIES OF SUM(type) ON TIME.PARTITION.HOUR
FROM sampleModel IN [10:00, 12:00]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
In that case, i.e., if only a single aggregation operator is specified together with a partition, the chronons of each partition are OR-combined prior to being combined with the valid-filter-group bitmap. Figure 7.17 illustrates TAT
and STA and the differences when aggregating.
Figure 7.17: Illustration of TAT and STA.
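The OR-combination per partition can be sketched as follows (java.util.BitSet in place of compressed bitmaps; names illustrative):

```java
import java.util.BitSet;

// Sketch of the TAT with a single operator and a partition (e.g., HOUR):
// the chronon bitmaps of the partition are OR-combined before being
// AND-combined with the valid-filter-group bitmap.
class TatSketch {
    static BitSet partitionBitmap(BitSet[] chrononsOfPartition, BitSet validFilterGroup) {
        BitSet or = new BitSet();
        for (BitSet chronon : chrononsOfPartition) {
            or.or(chronon);       // OR-combine all chronons of the partition
        }
        or.and(validFilterGroup); // then apply the valid-filter-group bitmap
        return or;
    }
}
```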
7.3.5 Distance Calculation
In this section, the algorithms for the different distance measures, intro-
duced in chapter 6, are presented. Two of the three distance measures (i.e.,
the temporal order distance and the temporal measure distance) are based
on results already presented. Nevertheless, an important aspect to increase the performance when calculating distance measures are abort criteria, which define when the calculation can be terminated early. In general, these criteria are based on bounds which ensure that the final distance cannot be smaller than the value calculated so far. Thus, for the two mentioned distances, the focus lies on the definition of such a bound, whereas for the third distance, i.e., the temporal relational distance, the focus is on the algorithm itself. The section is divided into three subsections. The first one intro-
duces the abort criterion for the temporal order and measure distance, as
well as the algorithm. The second discusses the temporal relational dis-
tance and the third discusses how to combine the different algorithms effi-
ciently.
Temporal Order and Measure Distance
The temporal order and measure distance can be calculated by retrieving the time series, applying the algorithm used to process a query. The temporal order distance can be seen as a special case of the measure distance. In general, the measure distance is calculated based on a time window, a measure, and optionally a level of the time dimension. In the case of the temporal
order distance, the measure is count calculated on the lowest granularity
(i.e., no level of the time dimension is selected). As defined by Definition
17 and Definition 19, the distance is the sum of the difference between the
mapped time points. The calculation of a distance can thereby be aborted
as soon as the current distance is larger than the largest distance of the
found k-NN. Figure 7.18 illustrates the abort criterion. The current dataset
is compared to the source by calculating the distance for each time point.
As soon as the distance value Dist is larger than 11,386 the calculation is
aborted. If the calculation reaches the end, the dataset is added to the list of k-NNs and the previously last entry is removed.
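The abort criterion can be sketched as follows, accumulating the distance time point by time point; the absolute difference and all names are illustrative assumptions of the sketch:

```java
// Sketch of the abort criterion: the distance is accumulated time point
// by time point and the calculation stops as soon as the running sum
// exceeds the distance of the current k-th nearest neighbor.
class AbortCriterion {
    // Returns the accumulated distance, or -1 if the calculation was
    // aborted because the candidate cannot enter the k-NN list anymore.
    static double boundedDistance(double[] source, double[] candidate, double worstKnn) {
        double dist = 0;
        for (int t = 0; t < source.length; t++) {
            dist += Math.abs(source[t] - candidate[t]);
            if (dist > worstKnn) {
                return -1; // abort: no further time points are evaluated
            }
        }
        return dist;
    }
}
```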
Figure 7.18: Illustration of the abort criterion for the temporal order and measure distance.
It should be mentioned that the time series is calculated iteratively. Thus, aborting early also ensures that no further values of time points are calculated. Furthermore, the calculation works analogously if groups are specified. In that case, the distance between each group is calculated and
summed up. As soon as the current distance exceeds the distance of the
last entry of the list of the k-NN, the calculation is terminated and the next
subset is evaluated.
Temporal Relational Distance
To calculate the temporal relational distance, it is necessary to determine
the relation between each pair of intervals of the subset and assign it to a
time point as specified in Table 6.1. Within the next step, the distance be-
tween the ordered set of vectors is calculated as defined by Definition 18.
Because of these two steps, it is necessary to scan the complete subset
and calculate all relations, prior to applying any abort criterion. The relations can thereby be determined in O(n), with n being the number of chronons covered by the subset, and with a memory usage of O(n + m^2), with m being the number of intervals contained within the subset.
Figure 7.19 illustrates the bitmap-based algorithm, which determines
the relations between two intervals. The algorithm iterates over each time
point, determining the bitmap for the current group of the current time point
(as described in section 7.3.2: Using the Indexes for Filtering and Group-
ing). In the next step, the determined bitmap is combined with the bitmap
of the previous time point68 to create three bitmaps: (1) a bitmap with all the
intervals just started, (2) a bitmap with all the intervals finished, and (3) a
bitmap with all the intervals still being active. Each bitmap can be easily
determined by logically combining the previous bpre and the current bitmap
Figure 7.19: Illustration of the algorithm used to determine the relations between intervals.
68 If no previous bitmap is present, e.g., in the case of the first time point, the bitmaps are assumed to be empty.
bcur, i.e.: (1) (bpre ⊕ bcur) ∧ bcur, (2) (bpre ⊕ bcur) ∧ bpre, and (3) bpre ∧ bcur. Within
the last step, the algorithm collects new information for each current pair,
i.e., start-relation, end-relation, and start-end-relation. Whenever all three
relations of a pair are known69, the algorithm is capable of specifying the
relation of the pair and the referred time point. A pair is thereby referred to
by a unique identifier, which is determined by the pairing-function shown in
Listing 7.18.
Listing 7.18: The pairing function used to determine a unique identifier for a pair of intervals.
long uniqueId(long recId1, long recId2) {
    long x = Math.max(recId1, recId2);
    long y = Math.min(recId1, recId2);
    // x * (x + 1) is always even, so integer division is exact; this
    // avoids the precision loss of 0.5 * (x * (x + 1)) for large ids
    return x * (x + 1) / 2 + y;
}
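The three bitmap combinations given above, i.e., started, finished, and still active, can be sketched as follows (java.util.BitSet in place of compressed bitmaps; names illustrative):

```java
import java.util.BitSet;

// Sketch of the three bitmap combinations per time point:
//   started  = (b_pre XOR b_cur) AND b_cur,
//   finished = (b_pre XOR b_cur) AND b_pre,
//   active   =  b_pre AND b_cur.
class RelationBitmaps {
    static BitSet started(BitSet pre, BitSet cur)  { return xorAnd(pre, cur, cur); }
    static BitSet finished(BitSet pre, BitSet cur) { return xorAnd(pre, cur, pre); }

    static BitSet active(BitSet pre, BitSet cur) {
        BitSet r = (BitSet) pre.clone();
        r.and(cur);
        return r;
    }

    private static BitSet xorAnd(BitSet pre, BitSet cur, BitSet mask) {
        BitSet r = (BitSet) pre.clone();
        r.xor(cur);
        r.and(mask);
        return r;
    }
}
```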
Having all relations and the endpoints of the pairs, it is easy to calculate a time series based on the formulas shown in Table 6.1. After the calculation of the time series, the distance is calculated by determining the distance between each pair of mapped time points. The abort criterion can be applied as described for the order and measure distance. Nevertheless, the termination has only a small impact on the performance, because the expensive detection of relations has to be performed anyway.
Calculating the Temporal Similarity Measure
The temporal similarity measure can be applied by calculating the different
distances using the algorithms presented in this section. A distance is
thereby only calculated if the weighting factor is larger than 0. In addition,
to increase performance, the temporal relational distance is only calculated
69 Figure 7.19 shows that it is not always necessary to know all three relations. In the case of,
e.g., ends-by it is enough to know the start and start-end relation.
if the weighted distance resulting from the temporal order and measure distance is not larger than the distance of the currently last k-NN.
7.4 User Interfaces
The GUI was implemented separately from the information system’s back
end. The aim of this separation was to ensure that the information system’s
interfaces, as well as the provided information, are sufficient regarding the
requirements of a user interface (cf. VIS-05). The implementation of the
GUI is web-based, i.e., utilizing HTML5, JavaScript, and CSS, and is based
on the Bootstrap framework70. In addition, the Highcharts library71 was uti-
lized to visualize time series and a Gantt-chart widget was implemented
using SVG and the jQuery library72.
Figure 7.20 illustrates different screenshots of the user console of the
GUI. The figure shows the login screen, the model, data, and user man-
agement (cf. VIS-03), as well as the UI for analytical tasks (cf. VIS-04). The
different sections of the user interface shown in the figure are described in the following:
– The model management is used to add new models to or delete models from the system. In addition, a model can be loaded and unloaded. The latter is useful if the model is not needed anymore and can thereby be removed from memory. An unloaded model is not available anymore, e.g., for querying.
– The data management enables the user to insert time interval data into or remove it from a model. It is possible to add data from the model itself, a CSV file, a DB query, or as a single record using the UI.
– The user management allows the creation of new, deletion of old, and
editing of available users or roles. A user is defined by a name, a pass-
word, the assigned roles, and the granted permissions. A role, on the
other hand, is specified by its name and the permissions.
70 http://getbootstrap.com
71 http://www.highcharts.com
72 https://jquery.com
– Last but not least, the interface for analytical tasks provides the possi-
bility to fire SELECT statements against the information system, which
are, depending on the statement, illustrated as time series or as a
Gantt-chart.
Figure 7.20: Overview of the user console of the implemented UI: top-left shows the login screen, top-right is a screenshot of the model management, middle-left is a picture of the data management, middle-right illustrates the user management, and the screenshots on the bottom show the time series visualization (left) and the Gantt-chart (right).
The UI provides additional sections which are not shown within the figure,
e.g., documentation (provides tutorials and explanations about the different
interfaces, the query language, services, and configuration) or home
(shows some general information on how to get in contact or participate).
Besides the actual GUI, a JDBC driver is provided to enable the usage
of the data within third party tools (cf. DC-02, PR-02, VIS-01). These tools
can be utilized to load data into the different models (e.g., a data integrator
firing INSERT statements), visualize information on a dashboard (e.g., us-
ing a modern web-framework), or create reports (e.g., by applying a BI
tool). The JDBC driver can be used to fire any query (cf. chapter 5) against
the information system.
7.5 Summary
In this chapter, the system architecture and its components were pre-
sented, which are motivated by multiple feature requests (e.g., DA-01, PD-
01, PD-02, MA-01, or MA-02) and performance considerations. Further-
more, the possibilities of configuring the model and system were intro-
duced. The configuration allows extending or replacing several components of the system, so that the information system is highly adaptable to domain-specific needs.
Several aspects of the realization were introduced, e.g., data structures,
indexes, and algorithms. The introduced indexes can be used in a holistic
way, i.e., the indexes can be used for different techniques like filtering,
grouping, aggregation, pre-aggregations, distance measures, and distrib-
uted calculations and are the answer to RQ4: "Which indexing techniques
can be used to process user queries and how should data be cached, as
well as persisted". In general, an optimized solution may be considered for
each individual task. Nevertheless, from a system perspective it is im-
portant to utilize an index, which is capable of supporting as many features
as needed. Furthermore, the performance measures, depicted in the next
chapter, show that the implementation outperforms state of the art propri-
etary software, which, e.g., does not support as many aggregation forms
and operators, needs expensive integration processes, and does not pro-
vide as many time related features.
Within section 7.4, a minimal but functional GUI was presented (cf. VIS-03 and VIS-04). The GUI can be used to analyze time interval data and
visualize requested results. The presented implementation is developed in-
dependently from the information system and utilizes the provided web-
services. Thus, it can be understood as a prototypical implementation of a
domain-specific GUI using the information system to enable users to ana-
lyze time interval data (cf. VIS-05). Furthermore, the JDBC driver was in-
troduced as another UI enabling the usage of the information system within
third party tools (cf. VIS-01).
Summarized, this chapter provides the answer to RQ6: "How should the
architecture of an information system for time interval data analysis be re-
alized, how should the system be configured, and which interfaces have to
be provided to support the analyzing process". As stated, the architecture
and the components are presented, the configuration capabilities are intro-
duced, and the different interfaces are shown.
8 Results & Evaluation
The realization of the feature requests (cf. section 2.2) in a performant way is an important criterion for the evaluation of the implementation of the information system. In addition, user acceptance and usability are criteria to be validated. In the first section of this chapter, i.e., section 8.1, the fulfill-
ment of the features requested is validated. Furthermore, the feedback of
current users regarding the features and the processing performance is
discussed. In section 8.2, available high performance collection libraries,
the performance and memory usage of the system, and selected algo-
rithms are evaluated. For comparison of the query language processing, a
main memory based version of the IntervalTree (cf. Edelsbrunner, Maurer
(1981), Kriegel et al. (2001)) was implemented. In addition, proprietary
tools were utilized to test the information system's performance against icCube and the Oracle DBMS (following Song et al. (2001), Mazón et al.
(2008), and Niemi et al. (2014) to detect and solve (if possible) occurring
summarizability problems).
8.1 Requirements & Features
Besides the performance of the system, the fulfillment of the features requested and the requirements formulated in section 2.2 is an important quality criterion of the implemented information system. Within the chapters,
the different features were addressed and used as motivation for specific
decisions. Table 8.1 shows the specified features, explains their realization, and states the degree of support. The comments presented for some features are
given by users of the information system. In general, these comments can
be understood as enhancement requests or new feature requests.
© Springer Fachmedien Wiesbaden GmbH 2016. P. Meisen, Analyzing Time Interval Data, DOI 10.1007/978-3-658-15728-9_8
Table 8.1: Overview of the different features requested, the realization of the feature, as well as comments of the users (if available), and the degree of realization.
DA-01, DA-02 (Aggregation of time interval data): The different aggregation operators for time interval data are supported by the query language (cf. section 5.3.3), and the bitmap-based implementation is introduced in section 7.3.4 (cf. Table 7.3). In addition, the extensibility of operators is presented in section 7.2.2.

DA-03 (Temporal operators to retrieve raw records): The introduced query language (cf. section 5.3.3) allows the specification of a time window using temporal operators (cf. Figure 5.1).

DA-04, DA-05 (Definition of dimensions, hierarchies, and levels): Descriptor dimensions and the time dimension are specified in detail in section 4.4 (cf. Definition 14 and Definition 15). The requested roll-up and drill-down operations are supported by the query language (cf. section 5.3.3). The implementation of the operations is aggregation-based and introduced in section 7.3.4.
Comments: The members of a level are specified by regular expressions. Regular expressions are sometimes difficult to formalize (especially for number ranges); an alternative, more user-friendly expression language is desired. In addition, it would be helpful to load dimensions, e.g., from a database table.

DA-06 (Support of different time zones): The support of different time zones is achieved by the presented dimensional model (cf. section 4.4). To retrieve the data, a time-zone-dependent view is utilized within a SELECT statement (cf. section 5.3.3).
Comments: To combine data from different time zones within one query, a UNION statement should be available.

DA-07 (Similarity measure): The comparison of different sets of time interval data is achieved by the presented similarity measure (cf. chapter 6). The measure can be used by firing an analytical query (cf. section 5.3.3). The implementation of the calculation is shown in section 7.3.5.

DA-08 (Query language): The query language is introduced in detail in chapter 5. It is divided into a DCL, DDL, and DML, covering all the requested features.
Comments: When a model is modified, it has to be removed and added as a new model. The language should be extended to support the update of models.

PD-01 (Notification system): The system is capable of triggering a job whenever a specific event occurs within the system (cf. Figure 7.1). The system itself supports the assignment of jobs to core events or user-defined events (cf. section 7.2.1), e.g., triggered as the result of an analysis.

PD-02 (Analytical algorithms): To allow the execution of analytical algorithms, the system provides an analysis manager (cf. Figure 7.2). In addition, the query language (cf. section 5.3.3), as well as the configuration (cf. section 7.2.2), support the usage and binding of analytical algorithms (e.g., pattern or association rule mining).

PR-01 (Prediction of upcoming situations): The prediction of an upcoming situation is an analytical task requiring a suitable algorithm. Thus, the analysis manager (cf. Figure 7.2) can be used once the model for the prediction is known. The concrete implementation can then fire an event, which is observed by a schedule and triggers the notification. The requirement is therefore not fulfilled, because no generic algorithm to predict an upcoming situation is available. Nevertheless, the system provides the functionality to notify a user once such a model is implemented.

PR-02 (Usage of third-party prescriptive analytics tools): The available JDBC driver (cf. section 7.4) and the defined query language (cf. section 5.3.3) allow the retrieval of data and other analytical results.

DC-01 (Data sources: CSV, XML, DBMS, or JSON): The system supports the definition of so-called data retrievers (cf. sections 7.1 and 7.2.1). By default, data retrievers for CSV files and DBMS queries are provided. Additional retrievers can easily be implemented against the provided interface.

DC-02 (JDBC driver, query language, and bulk loading): The feature requests statements to insert or delete records. Both statements are supported by the DML of the query language presented in section 5.3.1. In addition, bulk loading is described in that section, and the JDBC driver is shortly introduced in section 7.4.
Comments: The UPDATE and DELETE statements require the user to specify a record identifier. The identifier can be retrieved from the result set of an INSERT statement or using a SELECT RECORDS statement. It would be convenient to update or delete records by specifying criteria based on the descriptor values. Furthermore, the presented query language and its processing do not support any type of transaction. An inserted, updated, or deleted record is processed by the system as an atomic operation; rollbacks needed after several operations have to be performed manually. Thus, it would be better if the system supported transactions.

DC-03 (Pre-aggregates): The definition of pre-aggregates is currently not supported by the system. Performance tests showed that there is, so far, no need for such support. Nevertheless, in the future pre-aggregates may be needed to increase the performance. Thanks to the bitmap-based data representation (cf. section 7.3.2), as well as the caching and storage implementation (cf. section 7.3.3), such pre-aggregates can easily be added.

DI-01 (Complex data structures and many-to-many relationships): The support of complex data structures is achieved by several functionalities. First of all, a complex data structure can be pre-processed using scripts as defined in section 7.2.1. In addition, it is possible to extend the system and provide domain-specific descriptors (cf. section 4.2). The support of many-to-many relationships is achieved by allowing a descriptive mapping function to map a value to several descriptor values.

DI-02 (Validation of descriptive values): The system provides several possibilities to validate a descriptive value (cf. sections 4.2 and 7.3.1). The feature specifies several strategies used to validate descriptive values. These strategies are implemented and selectable by configuration (cf. section 7.2.1).

DI-03 (Validation of intervals): The validation of intervals is possible by applying available strategies (cf. section 7.2.1) or by implementing own time axis handlers (cf. section 7.3.1).

DI-04 (Pre-processing using scripts): Scripts can be utilized to pre-process raw data prior to the integration of the records into the system. A pre-processor can be defined via configuration as described in section 7.3.1.

MA-01, MA-02 (Apply models and schedule analysis): The scheduler and event manager introduced in section 7.1 are used to apply models or schedule an analysis. The configuration of the component is described in section 7.2.1.

VIS-01 (JDBC driver for third-party BI tools or visualizations): As stated in section 7.4, a JDBC driver is available and tested regarding its usage within BI tools.

VIS-02 (Subscribe to alerts): The feature requests a GUI to define schedules. Schedules are currently only definable within the configuration (cf. section 7.1). The feature was not realized, because the benefit does not justify the implementation effort.

VIS-03 (User management): The GUI provides a user management as illustrated in Figure 7.20.

VIS-04 (Minimal GUI to request query results): The GUI utilizes a line chart and a Gantt chart to visualize the results of time series or record queries (cf. Figure 7.20).

VIS-05 (JSON interface): The development of the GUI presented in section 7.4 was performed separately from the implementation of the back end. Nevertheless, the available JSON interface is utilized by the GUI.
The table shows that two of the features requested, DC-03 and VIS-02, were not realized within the presented information system. As already mentioned, the feature DC-03 is currently not needed, but may be added in the future. Thanks to the bitmap-based index, pre-aggregates can easily be calculated for often-used filters or selected members of a level of a hierarchy of a dimension. To keep the pre-aggregates up to date, it is necessary to add a mechanism allowing for the determination of the pre-aggregates to be updated when a change occurs (e.g., an insert or update is performed73). VIS-02 is not realized because of the effort needed to implement such a GUI element (within a research project).
8.2 Performance
In this section, several performance tests regarding runtime and memory
usage are presented. All tests were performed on an Intel Core i7-4810MQ
with a CPU clock rate of 2.80 GHz, 32 GB of main memory, an SSD, and
running 64-bit Windows 8.1 Pro. As Java implementation, a 64-bit JRE 1.6.45 was used, with -Xmx set to 4,096 MB and -Xms set to 512 MB. The tests were performed on the datasets listed in Table 8.2. Some tests used additional datasets not shown in the table; these are introduced in the context of the respective test.
Table 8.2: Overview of the datasets used for the performance tests, including the features of the model and the dataset.
Dataset (type): gh74 (real-world)
  raw records (∅ interval length): 1,122,097 (∅ 42 min)
  time axis (granularity & amount of granules): one year; minutes; 525,600
  descriptors (cardinality, i.e., amount of descriptor values): person (713), task-type (4), work-area (31)

Dataset (type): phone-calls (real-world)
  raw records (∅ interval length): 63,825 (∅ 47 min)
  time axis (granularity & amount of granules): two years; minutes; 1,051,200
  descriptors (cardinality, i.e., amount of descriptor values): caller (77), recipient (981), origin (50), destination (246)
The performed tests are organized in the following sections: tests regarding the performance of high performance collections are presented in section 8.2.1, tests measuring the loading performance of the system are summarized in section 8.2.2, results retrieved from tests concerning the selection performance are shown in section 8.2.3, the results of the distance performance measurements are outlined in section 8.2.4, and the tests evaluating the performance of the system in comparison to other proprietary systems are presented in section 8.2.5.

73 A delete may not have to be considered if the tombstone index is applied to the pre-aggregates.
74 Available online at: https://www.researchgate.net/publication/267979679
8.2.1 High Performance Collections
As mentioned in section 7.3.2, the system utilizes high performance collections to increase performance when retrieving indexed descriptors or bitmaps. The implemented default factory has to decide which index to provide for a specific setting. Thus, several tests using high performance collections and the Java default collections were performed to pick the best suited collection for a specific primitive data type and operation (i.e., retrieve, insert, or check containment). The tests added, retrieved, or checked the containment of 1,000,000 created descriptors. The descriptors were indexed by different primitive data types, i.e., int or long75. Each test was performed ten times and the average values are presented in Figure 8.1.
Figure 8.1: The results of the tests regarding the high performance collections for int and long data types.

75 The test was also performed for byte and short; the results were similar and are therefore not presented.
As illustrated, the Trove implementation outperforms all other high performance collections, as well as the Java collections. Thus, the factory selects Trove's high performance collections when indexing byte, short, int, or long values. Whenever a string value is used as key, a default HashMap is picked.
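The advantage of such collections stems mainly from avoiding the boxing of primitive keys. The following micro-benchmark sketch illustrates this; the Trove library itself is not shown here. Instead, a minimal open-addressing map with primitive int keys (the hypothetical IntObjectMap below, standing in for what a library class such as Trove's TIntObjectHashMap provides) is compared against the boxing java.util.HashMap. All names are illustrative, and the sketch assumes a fixed capacity without resizing.

```java
import java.util.HashMap;

// Minimal open-addressing map with primitive int keys: no Integer
// boxing on put/get, keys stored in a plain int[] array.
final class IntObjectMap<V> {
    private final int[] keys;
    private final Object[] vals;
    private final boolean[] used;

    IntObjectMap(int expected) {
        // power-of-two capacity with load factor <= 0.5; no resizing in this sketch
        int cap = Integer.highestOneBit(Math.max(16, expected * 2) - 1) << 1;
        keys = new int[cap];
        vals = new Object[cap];
        used = new boolean[cap];
    }

    private int indexOf(int key) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask; // Fibonacci-style hash, linear probing
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    void put(int key, V value) {
        int i = indexOf(key);
        used[i] = true;
        keys[i] = key;
        vals[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
        int i = indexOf(key);
        return used[i] ? (V) vals[i] : null;
    }
}

public class CollectionBench {
    public static void main(String[] args) {
        int n = 1_000_000;

        long t0 = System.nanoTime();
        IntObjectMap<String> primitive = new IntObjectMap<>(n);
        for (int i = 0; i < n; i++) primitive.put(i, "d" + (i & 1023));
        long tPrimitive = System.nanoTime() - t0;

        t0 = System.nanoTime();
        HashMap<Integer, String> boxed = new HashMap<>(2 * n);
        for (int i = 0; i < n; i++) boxed.put(i, "d" + (i & 1023));
        long tBoxed = System.nanoTime() - t0;

        System.out.printf("primitive keys: %d ms, boxed keys: %d ms%n",
                tPrimitive / 1_000_000, tBoxed / 1_000_000);
        System.out.println(primitive.get(42).equals(boxed.get(42))); // sanity check
    }
}
```

The boxed variant has to allocate an Integer object per key and hash through object indirection, which is the effect the measured tests expose.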
8.2.2 Load Performance
To measure the load performance, the gh dataset was used. The performance of the insertion of data is highly dependent on the caching used. Thus, the tests were performed using:
– no cache, i.e., everything was inserted in main memory,
– a file-based storage with a cache configured to use an RR cache algorithm76, a 20 % cleaning factor if the cache overflows, and a maximum size of 100,000 objects per cache, and
– a file-based storage with a cache configured to use an RR cache algorithm, an 80 % cleaning factor if the cache overflows, and a maximum size of 500,000 objects per cache.
Furthermore, the performance was measured by adding 1,000,000 records using several bulk loads of different sizes, i.e., the tests were performed using 100 chunks of 10,000, 20 chunks of 50,000, 10 chunks of 100,000, 5 chunks of 200,000, 2 chunks of 500,000, and finally 1 chunk of 1,000,000 records. Figure 8.2 illustrates the results of the load performance tests. The runtime performance of the memory and of the large file-based cache with high clean-up rate (i.e., File, 80 %, RR, 500k) stays almost constant in all scenarios (i.e., 0.046 ms per record in the case of the memory and 0.061 ms using the file-based cache). The smaller cache with a 20 % clean-up rate leads to many write operations on the hard drive, because the cache overflows more often. The small clean-up rate leads to several overflows within one bulk load. In addition, the random replacement removes entities which might be needed within the same load. However, even the worst average runtime performance (i.e., around 1 ms per record) enables the system to process 1,000 records per second.

76 Other tests performed showed that the RR algorithm is the best choice. All statistics-based algorithms were most of the time busy updating their statistics. Nevertheless, depending on the scenario these algorithms may be better suited, e.g., when retrieval is more important than loading.
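The cache behavior described above can be sketched as follows. RrCache, the cleaning factor parameter, and the eviction logic are illustrative names for this sketch, not the system's actual API; in the real system, evicted entries are written to the file-based storage rather than dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Random-replacement (RR) cache with a cleaning factor: when the cache
// is full, a fraction of randomly chosen entries is evicted at once.
final class RrCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final int maxSize;
    private final double cleaningFactor;
    private final Random rnd = new Random(42); // fixed seed for reproducibility

    RrCache(int maxSize, double cleaningFactor) {
        this.maxSize = maxSize;
        this.cleaningFactor = cleaningFactor;
    }

    void put(K key, V value) {
        if (entries.size() >= maxSize && !entries.containsKey(key))
            evictRandom((int) (maxSize * cleaningFactor));
        entries.put(key, value);
    }

    V get(K key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }

    private void evictRandom(int count) {
        List<K> keys = new ArrayList<>(entries.keySet());
        Collections.shuffle(keys, rnd);
        // in the real system, evicted entries would be written to the
        // file-based storage before being dropped from main memory
        for (int i = 0; i < count && i < keys.size(); i++)
            entries.remove(keys.get(i));
    }
}

public class RrCacheDemo {
    public static void main(String[] args) {
        // 20 % cleaning factor, 100,000 objects, as in the second test setup
        RrCache<Integer, String> cache = new RrCache<>(100_000, 0.2);
        for (int i = 0; i < 250_000; i++) cache.put(i, "record-" + i);
        System.out.println(cache.size() <= 100_000); // never exceeds the maximum
    }
}
```

The sketch makes the trade-off visible: a small cleaning factor triggers evictions (and thus storage writes) more often within a single bulk load, while a large factor clears room for many inserts at once.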
Figure 8.2: The results of the load performance tests.
Regarding the memory usage, the tests show that the caches perform as expected. Considering the garbage collector of Java and the settings used for the tests, the results have to be interpreted with care: whether the garbage collector runs or not is decided by the Java Runtime Environment. Thus, elements may stay in main memory because garbage collection has not removed them yet (cf. Jones et al. (2012)).
8.2.3 Selection Performance
To evaluate the runtime performance of the bitmap-based algorithm for the processing of select statements, three additional algorithms were implemented. These algorithms do not support group by, multiple measures, storage, multi-threading, or generic dimensions (i.e., the used dimensions are hard-coded):
– a naïve algorithm (performing a sequential scan), which is used as baseline,
– a main memory IntervalTree-based algorithm, which fills the tree once with the raw records, and
– another main memory IntervalTree-based algorithm, whereby this implementation creates an intermediate filtered tree.
The naïve algorithm, named Naïve, is shown in Listing 8.1. The algorithm filters the records according to the specified filter criteria of the query, i.e., the time window and the logical expression defined in the WHERE clause (cf. line 04). Next, the algorithm determines the ranges defined by the dimension and the time window and iterates over each partition (cf. line 06). In each iteration the algorithm determines the records for the current range by filtering the previously created set of records (cf. line 08). The algorithm calculates the value of the measure based on the set of records and the current range; the calculated value is assigned to the time series (cf. line 12).
Listing 8.1: The naïve algorithm.
01: TimeSeries naive(Query q, Set r) {
02: TimeSeries ts = new TimeSeries(q);
03: // filter time defined by IN [a, b] and WHERE expression
04: r = filter(r, q.time(), q.where());
05: // iterate over the ranges defined by IN and ON
06: for (TimeRange i : q.ranges()) {
07: // filter records for the range
08: r’ = filter(r, i);
11: // determine measures defined by OF
12:     ts.set(i, calc(i, r’, q.meas()));
13: }
14: return ts;
15: }
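As a complement to the pseudocode, the following is a runnable toy version of the naïve algorithm for a COUNT(1) measure on unit-length ranges. Interval, the single hard-coded descriptor, and the flattened query parameters are simplified stand-ins for the system's actual Query, TimeRange, and record types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveDemo {
    record Interval(int start, int end, String taskType) {} // half-open [start, end)

    // COUNT(1) per unit-length range over [from, to), restricted to one task-type
    static int[] naive(List<Interval> records, int from, int to, String taskType) {
        // filter: time window and WHERE expression (cf. line 04 of Listing 8.1)
        List<Interval> r = new ArrayList<>();
        for (Interval rec : records)
            if (rec.end() > from && rec.start() < to && rec.taskType().equals(taskType))
                r.add(rec);
        int[] ts = new int[to - from];
        for (int t = from; t < to; t++)          // iterate over the ranges (line 06)
            for (Interval rec : r)               // filter records per range (line 08)
                if (rec.start() <= t && t < rec.end()) ts[t - from]++;
        return ts;
    }

    public static void main(String[] args) {
        List<Interval> recs = List.of(
                new Interval(0, 4, "short"),
                new Interval(2, 6, "short"),
                new Interval(1, 5, "long"));
        System.out.println(Arrays.toString(naive(recs, 0, 6, "short")));
    }
}
```

The nested loop over ranges and records is exactly what makes the sequential scan a useful baseline: its cost grows with the product of the number of granules and the number of filtered records.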
The first IntervalTree-based algorithm, named IntTreeA, works similarly to the naïve algorithm. Instead of retrieving a set of records (cf. line 01 of the Naïve algorithm), the algorithm receives an IntervalTree. The tree is used to apply the filter criteria defined by the time window. The resulting set of records is filtered by the logical expression defined by the WHERE clause and processed as in the Naïve algorithm (cf. lines 06 - 15). The second IntervalTree-based algorithm, named IntTreeB, is shown in Listing 8.2. The algorithm receives the records organized in the IntervalTree and creates an intermediate tree containing all the filtered records (cf. line 04). Afterwards, the algorithm iterates over the defined ranges and uses the intermediate tree to pick the records valid for the specified range (cf. line 08). Within each iteration the measure for the selected subset is calculated and set (cf. line 12). After the iteration is finished, the algorithm returns the created time series.
Listing 8.2: The IntTreeB algorithm.
01: TimeSeries intTreeB(Query q, IntervalTree tree) {
02: TimeSeries ts = new TimeSeries(q);
03: // filter time defined by IN [a, b] and WHERE expression
04: IntervalTree interTree = filter(tree, q.time(), q.where());
05: // iterate over the ranges defined by IN and ON
06: for (TimeRange i : q.ranges()) {
07: // use the tree to get the filtered records within the range
08: Set records = filter(interTree, i);
11: // determine measures defined by OF
12:     ts.set(i, calc(i, records, q.meas()));
13: }
14: return ts;
15: }
All four algorithms were tested with the same Java settings, using the gh and phone-calls datasets. To assess the performance for different dataset sizes, the gh dataset was used to create several subsets of 10,000, 100,000, and 1,000,000 records. In addition, several query types, differing in selectivity, filter criteria, and measure complexity, were processed. Table 8.3 gives an overview of the different queries fired against the different datasets, showing the characteristics of each query and dataset combination. The categories simple, average, and high used for the complexity are indicators that help to classify the results.
Table 8.3: Overview of the different tests performed to validate the runtime performance.
nr. | dataset | #selected records | #total records | measure complexity | filter complexity

Query: COUNT(1); -; [01.01.2008, 01.02.2008); WORKAREA.LOC.TYPE='Gate'
#1a | gh | 147 | 10,000 | simple | average
#1b | gh | 1,572 | 100,000 | simple | average
#1c | gh | 15,391 | 1,000,000 | simple | average

Query: MAX(COUNT(1)); TIME.DEF.DAY; [01.01.2008, 01.02.2008); TASKTYPE='short'
#2a | gh | 503 | 10,000 | average | simple
#2b | gh | 5,058 | 100,000 | average | simple
#2c | gh | 51,461 | 1,000,000 | average | simple

Query: MAX(SUM(PERSON) / COUNT(1)); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); TASKTYPE='short'
#3a | gh | 102 | 10,000 | high | simple
#3b | gh | 995 | 100,000 | high | simple
#3c | gh | 9,727 | 1,000,000 | high | simple

Query: MIN(TASKTYPE); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); WORKAREA.LOC.TYPE='Ramp' OR PERSON='*9'
#4a | gh | 99 | 10,000 | average | average
#4b | gh | 1,002 | 100,000 | average | average
#4c | gh | 9,912 | 1,000,000 | average | average

Query: SUM(COUNT(1)); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); (WORKAREA.LOC.TYPE='Ramp' OR PERSON='*9') AND (TASKTYPE='long' OR TASKTYPE='short')
#5a | gh | 79 | 10,000 | high | high
#5b | gh | 846 | 100,000 | high | high
#5c | gh | 8,375 | 1,000,000 | high | high

Query: SUM(TASKTYPE); TIME.DEF.DAY; [01.01.2008, 01.02.2008); WORKAREA='BIE.W03' AND (TASKTYPE='long' OR TASKTYPE='very long')
#6a | gh | 79 | 10,000 | simple | high
#6b | gh | 846 | 100,000 | simple | high
#6c | gh | 8,375 | 1,000,000 | simple | high

Query: COUNT(1); -; [01.01.2014, 01.02.2013); -
#7 | phone-calls | 10,583 | 63,825 | simple | simple

Query: MAX(SUM(CALLER)); TIME.DEF.DAY; [01.01.2014 00:00:00, 01.02.2013); ORIGIN='Kansas'
#8 | phone-calls | 493 | 63,825 | average | simple

Query: SUM(COUNT(1)); TIME.DEF.DAY; [01.08.2013, 01.08.2013); CALLER='L*' AND (RECIPIENT='A*' OR RECIPIENT='M*')
#9 | phone-calls | 2,877 | 63,825 | average | high
The results of the runtime performance tests are depicted in Figure 8.3, and a detailed list showing all measured runtimes can be found in the appendix. The results show that the implementation presented in this book (i.e., chapter 7) outperforms the other implementations in all tests, except for #4a, #4b, #5a, #6a, and #6b. The reasons lie above all in the low selectivity of these queries (i.e., the ratio between the amount of selected and total records is small) and the fact that the records remained in main memory during the test. If a storage were utilized, the IntTreeB algorithm would have to retrieve each record from secondary memory and validate the filter. The TIDA algorithm, on the other hand, can apply the filters using bitmaps, which also may have to be retrieved from secondary memory. Nevertheless, the retrieval of a bitmap (including information about all records for the filtered attribute) is performed faster than any record retrieval (cf. Abdelouarit et al. (2013)). In addition, applying the TIDA algorithm to larger datasets (e.g., #1-6c, #7, #8, and #9) shows the superiority of the bitmap-based implementation.
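The principle behind the bitmap-based filtering can be sketched with java.util.BitSet, which here stands in for the compressed bitmap implementations used by the system; the records and descriptor values below are toy data. One bitmap per descriptor value marks all records carrying that value, so a WHERE clause becomes bitwise logic and COUNT(1) a cardinality count.

```java
import java.util.BitSet;

public class BitmapFilterDemo {
    // helper: build a bitmap over `size` records with the given bits set
    private static BitSet bits(int size, int... set) {
        BitSet b = new BitSet(size);
        for (int i : set) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        int records = 8;
        // one bitmap per descriptor value: bit i is set iff record i carries the value
        BitSet taskShort = bits(records, 0, 2, 3, 5, 7); // TASKTYPE='short'
        BitSet areaRamp  = bits(records, 1, 2, 3, 6, 7); // WORKAREA='Ramp'
        BitSet inRange   = bits(records, 0, 1, 2, 3, 4); // records valid in the time range

        BitSet filter = (BitSet) taskShort.clone();
        filter.and(areaRamp); // WHERE TASKTYPE='short' AND WORKAREA='Ramp'
        filter.and(inRange);  // restrict to the queried time window
        System.out.println(filter.cardinality()); // COUNT(1) for the range
    }
}
```

Each AND touches one bitmap per predicate instead of one record per candidate, which is why retrieving a single bitmap from secondary memory can replace many record retrievals.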
Figure 8.3: The results of the selection tests for the different queries shown in Table 8.3.
8.2.4 Distance Performance
Within this section, the performance of the algorithms used to determine the temporal order, relational, and measure distance is evaluated. The temporal order distance is compared to the IBSM algorithm introduced by Kotsifakos et al. (2013) (cf. section 3.5). The performance of the measure and relational distance is not compared, because an implementation of the ARTEMIS algorithm was not available when requested from Kostakis et al. Within the tests, the gh dataset was used, searching for the 1-NN, 3-NN, 5-NN, and 10-NN. The search was performed against the whole dataset. The source (i.e., the subset) for each test was selected per type of distance: for the temporal order distance, a day was randomly picked; for the measure distance, the following measures were calculated: (1) SUM on lowest granularity for a day, searching for similar days, and (2) MAX-COUNT on day level for a month, searching for similar months; for the relational distance, a day was randomly picked and filtered for a specific task-type and work-area. In addition, the results of IBSM and the presented bitmap-based algorithm to determine the temporal order distance were compared; the resulting nearest neighbors were equal in all tests. Figure 8.4 illustrates the results of the tests, i.e., the measured runtime, as well as the 3-NN for the temporal order and the two temporal measure similarities. A visualization of the 3-NN of the temporal relational similarity can be found in the appendix (cf. 3-NN of the Temporal Relational Similarity). The bitmap-based implementation outperforms the IBSM implementation. The figure also shows that the abort criterion increases the performance slightly.
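The idea behind the compared distance can be illustrated by a minimal sketch of an IBSM-style computation: rasterize each set of intervals into a per-chronon count vector and compare the vectors via the Euclidean distance. This is a deliberate simplification (IBSM operates on per-label matrices rather than a single count vector), and the toy intervals and names are illustrative.

```java
public class OrderDistanceDemo {
    // rasterize: per-chronon count of active intervals; iv = [start, end)
    static int[] rasterize(int axisLength, int[][] intervals) {
        int[] counts = new int[axisLength];
        for (int[] iv : intervals)
            for (int t = iv[0]; t < iv[1]; t++) counts[t]++;
        return counts;
    }

    // Euclidean distance between two count vectors of equal length
    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            double d = a[t] - b[t];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[][] day1 = {{0, 3}, {2, 5}}; // toy interval sets over a 6-granule axis
        int[][] day2 = {{0, 3}, {3, 5}};
        int[] c1 = rasterize(6, day1); // [1, 1, 2, 1, 1, 0]
        int[] c2 = rasterize(6, day2); // [1, 1, 1, 1, 1, 0]
        System.out.println(distance(c1, c2)); // vectors differ only at t = 2
    }
}
```

The bitmap-based variant evaluated above obtains the same per-chronon counts directly from the time bitmaps instead of rasterizing each raw record, which is where its runtime advantage originates.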
Figure 8.4: Illustration of the performance tests regarding the distance calculation, as well as the results of the temporal order and measure similarity; a visualization of the relational similarity can be found in the appendix.
8.2.5 Proprietary Solutions vs. TIDAIS
In this section, the runtime performance of the presented TIDAIS and of other proprietary solutions regarding the provision of answers to questions is measured. The following proprietary solutions are validated:
– icCube 5.1 Community Edition77,
– Oracle DBMS 12c Enterprise Edition with the OLAP option installed78, and
– TimeDB 2.279.

77 http://www.iccube.com
78 https://www.oracle.com/database/index.html
In the following, the different solutions are shortly introduced and their selection is motivated:
– icCube is a high performance, real-time analytical engine. It can be used to combine and transform data from multiple sources and to answer MDX queries. It supports non-strict, non-onto, and non-covering hierarchies, and the usage of time intervals for dimensions. The lowest selectable granularities are half-hour and hour. However, the lowest supported and working granularity using the test datasets was day (i.e., using hour or half-hour led to crashes). icCube is chosen as representative of MOLAP-based analytical systems.
– The Oracle DBMS 12c is one of the most sophisticated DBMSs. It supports many-to-many relationships (i.e., non-strict hierarchies) by solving summarizability issues on a physical and logical level (e.g., following ideas presented by Song et al. (2001)). The Oracle DBMS is selected because of its ROLAP-based data storage, as well as the possibility to analyze data using PL/SQL.
– The TimeDB solution is a product of the temporal database community supporting ATSQL2 as query language. It does not support dimensional models. Nevertheless, several questions can be answered using the tool, which is backed by an Oracle DBMS. Thus, the performance of the tool is mainly influenced by the underlying Oracle DBMS.
To ensure fairness among the different systems, all Java-based implementations (i.e., icCube, TimeDB, and TIDAIS) used a maximum of 512 MB of main memory and secondary memory for data storage. In addition, the kernel time used by each tool to parse a query, determine the results, and return them was measured80. Furthermore, the caches were reset prior to each test series, which consists of 20 query requests with other queries mixed in between. The Oracle DBMS was set up with indexes and partitions to increase the query performance. Several Oracle-based results (i.e., #1, #3, and #5) were confirmed by an Oracle specialist81, and further optimization strategies were discussed and applied if possible, e.g., data pre-processing, hybrid columnar compression (HCC), in-memory tables, flash cache, and external tables. Nevertheless, most of these optimizations were not applicable within the tests, because an Oracle Exadata solution82 would be needed, which is out of scope regarding an applicable solution to analyze time interval data.

79 http://www.timeconsult.com/Software/Software.html
80 To achieve that with the Java implementations, the kernel time of the different threads was measured using ThreadMXBean. In Oracle, the session was altered using "ALTER SESSION SET EVENT '10046 trace name context forever, level 12';" and evaluated using TKPROF.
Table 8.4 shows the performed tests and the possibility to run each test using the different solutions. The table shows two Oracle solutions, one using the available OLAP option, the other one using a specifically written PL/SQL statement queried against the raw dataset. Each test utilized the introduced ground-handling dataset gh (cf. section 8.2). In addition, a third dataset, based on the gh dataset, was created to test the icCube solution utilizing day granularity. The created dataset, named ghday, contains the same amount of intervals and the same descriptive values as the gh dataset. The minute granularity was resolved to a day granularity, keeping the average duration. Thereby, the amount of chronons is reduced to 366 (2008 was a leap year).

81 Eric Emrick worked as an Oracle DBA for several years, focusing on data analysis with Oracle technologies. After performing and analyzing several tests using the Oracle DBMS and the provided datasets, he stated that the "Tida results are all the more impressive" regarding the fast data retrieval.
82 The Exadata solution is the premium Oracle solution, not available under $300,000 (basic version, hardware included; cf. Ronald Weiss (2012) and Oracle Corporation (2015)).
Table 8.4: List of tests performed in the category "Proprietary Solutions vs. TIDAIS" (solutions considered: Oracle OLAP, Oracle PL/SQL, icCube, TimeDB, and TIDAIS).

#1 (ghday83): How many tasks were performed on each day of the year 2008 per task-type?
#2 (ghday83): How many tasks did each person execute per day in March?
#3 (gh): How many resources are needed within each hour on the 2008-12-05?
#4 (gh): How many hours are worked per day in January?
#5 (gh): What was the maximal amount of active resources between 2008-01-20 and 2008-01-25?
The results of the tests are shown in Figure 8.5. As illustrated, the implementation presented in this book outperforms the other proprietary tools. In general, the Oracle (OLAP) solution performs best among the proprietary tools when compared to the TIDAIS. The performance regarding query #4 is explained by the amount of records to be evaluated. The poor performance on the raw records using PL/SQL scripts is explained by the use of a pipelined table, which is used to generate the chronons and to ensure that all time points are covered within the result. Joining data with the virtual pipelined table is expensive and thereby slow. Nevertheless, the used PL/SQL queries only need the existence of the PL/SQL function and data types presented in the appendix (cf. Pipelined Table Function (PL/SQL Oracle)) and no additional pre-processing.

83 The queries were created and combined (i.e., appended to each other) programmatically.
Figure 8.5: Performance results of the queries used to answer the questions shown in Table 8.4.
8.3 Summary
In this chapter the TIDAIS, which is based on the presented TIDAMODEL and the introduced bitmap-based indexes, was evaluated. The requested features were checked regarding their fulfillment, including the usability of the TIDAQL. The performance of the system was tested with respect to memory usage and runtime. The results show that the system outperforms current state-of-the-art solutions. In general, the evaluation also shows that the presented holistic solution based on bitmap-based indexes is, in the majority of cases, faster than specialized data structures and algorithms.
9 Summary and Outlook
Time interval data is ubiquitous, and the need to analyze such data arises more and more frequently. In this book, an information system was introduced which enables the user to analyze time interval data using known techniques like OLAP, data mining, or similarity searches. At the heart of the presented system are bitmap-based index structures, enabling the system to process queries formulated in the presented query language. It is the first system focusing on this type of data, and it outperforms other data analysis tools. Furthermore, the evaluations have shown that the bitmap-based algorithms outperform techniques like the IntervalTree or IBSM.
The RQs mentioned at the beginning of the book are answered within the different sections. The needed features of an information system (cf. RQ1) are presented in section 2.2. The aspects to be covered by a model for time interval data analysis (cf. RQ2) are introduced in chapter 4 and motivated by the already mentioned features, as well as an extended literature review (cf. chapter 3). Chapter 5 introduces the answer to RQ3, which deals with the definition of a query language to enable time interval data analysis. The indexing techniques presented in section 7.3.2 enable the system to process the formulated queries fast, as shown in chapter 8. In addition, the book extensively discusses the possibilities to utilize caches and storage (cf. section 7.3.3). These aspects, i.e., indexing, caching, and persistency, are addressed by RQ4, which is answered by the presented results. The similarity search mentioned in RQ5 is enabled by the distance measures introduced in chapter 6 and section 7.3.5. The last question, RQ6, is answered within sections 7.1 (architecture), 7.2 (configuration), as well as 7.3 and 7.4 (interfaces).
Nevertheless, the presented solution has some limitations regarding the
processing of queries: (1) the concurrent processing of a query as an
atomic instance, (2) the processing of data at the lowest granularity, and
(3) the constraints of the used bitmap implementations. In the following,
each of these limitations is explained briefly, and the next paragraph lists
additional future research topics based on them. Regarding the pro-
cessing of a query (cf. (1)), the system is capable of processing queries in
parallel, but it cannot split the processing of a single query, e.g., to enable
distributed single-query processing. The system is also limited in that it
processes each query at the lowest granularity (cf. (2)). Whenever a query
is processed, the algorithm retrieves and combines the bitmaps at the
lowest granularity; the usage and intelligent creation of pre-aggregations
is not supported. Finally, the used bitmap implementations (cf. (3)) are
generally designed for the purpose of logically combining bitmaps. A spe-
cific implementation for the presented use case is neither discussed nor
introduced; thus, performance gains through specifically designed imple-
mentations may be possible.
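Limitation (2) can be made concrete with a small, hypothetical sketch (again using java.util.BitSet for illustration; this is not a feature of the presented system): if the 60 minute-level bitmaps of an hour were OR-combined once into a pre-aggregated hour-level bitmap, a query spanning full hours would touch one bitmap instead of sixty.

```java
import java.util.BitSet;

// Hypothetical sketch of pre-aggregating minute-level bitmaps to hours,
// i.e., the feature the presented system currently lacks.
class PreAggregation {
    // OR-combine the 60 minute-slices of the given hour into one bitmap
    static BitSet aggregateHour(BitSet[] minuteSlices, int hour) {
        BitSet hourSlice = new BitSet();
        for (int m = hour * 60; m < (hour + 1) * 60; m++) {
            hourSlice.or(minuteSlices[m]);
        }
        return hourSlice;
    }
}
```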
Based on these limitations, user feedback (cf. section 8.1), and current
research topics, several research questions remain open and should be
addressed in future work. An important question is how the system can
utilize techniques like load balancing or distributed query processing. The
partitioning of the time axis and the distribution of the different bitmaps
within a cluster should be investigated. Furthermore, research regarding
the visualization of time interval data should be pursued. Another topic is
the enhancement of the algorithm used for similarity search; regarding
the temporal relational distance, there is potential for optimization. Last
but not least, the development of special mining techniques using, e.g.,
OLAM techniques, should be investigated. The usage of time-dependent
information like vacation periods, global events, or local events may re-
veal new patterns aligned to these temporal artifacts. In addition, the
mentioned, unfulfilled feature regarding the calculation and provision of
pre-aggregates should be implemented and discussed in the near future.
The automatic generation of such pre-aggregates by the system, i.e., by
learning from the users' queries, should be researched.
Appendix
Pipelined Table Functions (PL/SQL Oracle)
DROP TYPE T_DATE;
DROP TYPE T_DATE_ROW;
CREATE TYPE T_DATE_ROW AS OBJECT (start_date DATE, end_date DATE);
/
CREATE TYPE T_DATE IS TABLE OF T_DATE_ROW;
/
-- Splits the interval [start_date, end_date] into one-minute granules,
-- usable e.g. as: SELECT * FROM TABLE(F_DATES(:start_date, :end_date))
CREATE OR REPLACE FUNCTION
F_DATES(start_date IN DATE, end_date IN DATE) RETURN T_DATE PIPELINED AS
  diff pls_integer;
  cur  DATE;
  nxt  DATE;
BEGIN
  -- number of minutes covered by the interval
  diff := (end_date - start_date) * 24 * 60;
  cur  := start_date;
  FOR i IN 1 .. diff LOOP
    nxt := start_date + (i / 24 / 60);
    PIPE ROW(T_DATE_ROW(cur, nxt));
    cur := nxt;
  END LOOP;
  RETURN;
END;
/
A Complete Sample Model-Configuration-File
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- offlinemode (i.e. what should happen if a data retriever is not available)
is optional and can be one of the following values (case-insensitive):
+ true, y, yes
+ false, n, no
+ auto
-->
<model xmlns="http://dev.meisen.net/xsd/dissertation/model"
xmlns:advDes="http://dev.meisen.net/xsd/dissertation/model/advancedDescriptors"
xmlns:idx="http://dev.meisen.net/xsd/dissertation/model/indexes"
xmlns:map="http://dev.meisen.net/xsd/dissertation/model/mapper"
xmlns:dim="http://dev.meisen.net/xsd/dissertation/dimension"
xmlns:spp="http://dev.meisen.net/xsd/dissertation/preprocessor/script"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://dev.meisen.net/xsd/dissertation/model
http://dev.meisen.net/xsd/dissertation/tidaModel.xsd
http://dev.meisen.net/xsd/dissertation/model/indexes
http://dev.meisen.net/xsd/dissertation/tidaIndexFactory.xsd
http://dev.meisen.net/xsd/dissertation/model/advancedDescriptors
http://dev.meisen.net/xsd/dissertation/tidaAdvancedDescriptors.xsd
http://dev.meisen.net/xsd/dissertation/model/mapper
http://dev.meisen.net/xsd/dissertation/tidaMapperFactory.xsd
http://dev.meisen.net/xsd/dissertation/preprocessor/script
http://dev.meisen.net/xsd/dissertation/tidaScriptPreProcessor.xsd
http://dev.meisen.net/xsd/dissertation/dimension
http://dev.meisen.net/xsd/dissertation/tidaDimension.xsd"
offlinemode="false" folder="_data/fullModel"
id="fullModel" name="My wonderful Model">
<config>
<caches>
<!-- Define the cache to be used for identifiers.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryIdentifierCache
+ net.meisen.dissertation.impl.cache.FileIdentifierCache
-->
<identifier
implementation="net.meisen.dissertation.impl.cache.MemoryIdentifierCache" />
<!-- Define the cache to be used for meta-information (i.e. the descriptors).
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryMetaDataCache
+ net.meisen.dissertation.impl.cache.FileMetaDataCache
-->
<metadata
implementation="net.meisen.dissertation.model.descriptors.mock.MockMetaDataCache" />
<!-- Define the cache to be used for bitmaps.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryBitmapCache
+ net.meisen.dissertation.impl.cache.MapDbBitmapCache
+ net.meisen.dissertation.impl.cache.FileBitmapCache
-->
<bitmap implementation="net.meisen.dissertation.impl.cache.MemoryBitmapCache" />
<!-- Define the cache to be used for fact-sets.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.FileFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.MapDbFactDescriptorModelSetCache
-->
<factsets
implementation="net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache" />
<!-- Define the cache to be used for records.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache
+ net.meisen.dissertation.impl.cache.MemoryDataRecordCache
+ net.meisen.dissertation.impl.cache.MapDbDataRecordCache
-->
<records
implementation="net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache" />
</caches>
<factories>
<!-- Define the factory to be used to determine which IndexFactory to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.indexes.IndexFactory
-->
<indexes implementation="net.meisen.dissertation.impl.indexes.IndexFactory">
<!-- Define the different indexes to be used.
The following bitmap-indexes are available by default:
+ net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap
+ net.meisen.dissertation.impl.indexes.datarecord.slices.RoaringBitmap
The following implementations are by default available for specific
primitive data types:
+ net.meisen.dissertation.impl.indexes.FastUtilIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.FastUtilLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection
-->
<idx:config
bitmap="net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap"
byte="net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection"
short="net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection"
int="net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection"
long="net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection" />
</indexes>
<!-- Define the factory to be used to determine which MapperFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.mapper.MapperFactory
-->
<mappers implementation="net.meisen.dissertation.impl.time.mapper.MapperFactory">
<!-- The inheritance of the default mappers (default means in this case the ones
defined in the global configuration) can be true or false.
-->
<map:config inheritDefault="true">
<!--
Adds mappers to the default ones. The default ones cannot be removed and are:
+ net.meisen.dissertation.impl.time.mapper.DateMapper
+ net.meisen.dissertation.impl.time.mapper.LongMapper
-->
<map:mapper
implementation="net.meisen.dissertation.impl.time.mapper.DateMapper" />
<map:mapper
implementation="net.meisen.dissertation.impl.time.mapper.LongMapper" />
</map:config>
</mappers>
<!--
Adds schedules executed when the model is loaded. By default, the following
jobs are available:
+ net.meisen.dissertation.impl.scheduler.QueryJob
+ net.meisen.dissertation.impl.scheduler.SendEmailJob
-->
<schedules>
<schedule cron="*/15 4-16 * * 6,7"
implementation="net.meisen.dissertation.impl.scheduler.QueryJob">
<qj:query>SELECT COUNT(RECORDS) FROM myModel</qj:query>
</schedule>
<schedule event="core:query"
implementation="net.meisen.dissertation.impl.scheduler.SendEmailJob" />
</schedules>
</factories>
<!-- Define the pre-processor used to modify the incoming record.
The following implementations are available by default:
+ net.meisen.dissertation.impl.dataintegration.IdentityPreProcessor
+ net.meisen.dissertation.impl.dataintegration.ScriptPreProcessor
-->
<preprocessor
implementation="net.meisen.dissertation.impl.dataintegration.ScriptPreProcessor">
<spp:script language="javascript">
/*
 * Here is my script:
 * - the script gets the raw record injected as "raw"
 * - the script must set a result with an IDataRecord instance
 * - the script should not modify the raw record
 */
var result = raw;
</spp:script>
</preprocessor>
</config>
<time>
<timeline start="20.01.1981" duration="100" granularity="YEAR" />
</time>
<meta>
<!-- As identifier-factory the following implementations are available:
+ net.meisen.dissertation.impl.idfactories.IntegerIdsFactory
+ net.meisen.dissertation.impl.idfactories.LongIdsFactory
+ net.meisen.dissertation.impl.idfactories.UuIdsFactory
The null-attribute (true or false) defines whether null values are allowed
within the model.
The failonduplicates-attribute (true or false) specifies whether duplicates
are simply ignored or an exception is thrown.
-->
<descriptors>
<string id="R1" failonduplicates="true" null="false" name="person"
idfactory="net.meisen.dissertation.impl.idfactories.UuIdsFactory" />
<string id="R2" null="false" name="toy"
idfactory="net.meisen.dissertation.impl.idfactories.UuIdsFactory" />
<string id="R3" null="true" />
<string id="D1" name="funFactor" />
<integer id="D2" name="smiles" />
<string id="D3" />
<advDes:list id="D4"
idfactory="net.meisen.dissertation.impl.idfactories.LongIdsFactory" />
</descriptors>
<entries>
<entry descriptor="R1" value="Philipp" />
<entry descriptor="R1" value="Debbie" />
<entry descriptor="R1" value="Edison" />
<entry descriptor="R2" value="rattle" />
<entry descriptor="R2" value="teddy" />
<entry descriptor="R2" value="cup" />
<entry descriptor="R2" value="doll" />
<entry descriptor="D1" value="no" />
<entry descriptor="D1" value="low" />
<entry descriptor="D1" value="average" />
<entry descriptor="D1" value="high" />
<entry descriptor="D1" value="very high" />
<entry descriptor="D2" value="1" />
<entry descriptor="D2" value="2" />
<entry descriptor="D2" value="3" />
<entry descriptor="D2" value="4" />
<entry descriptor="D2" value="5" />
<entry descriptor="D3" value="Some Value" />
<entry descriptor="D4" value="A,B,C" />
<entry descriptor="D4" value="D,E,F,G,H" />
<entry descriptor="D4" value="I" />
</entries>
</meta>
<dim:dimensions>
<dim:dimension id="PERSON" descriptor="R1">
<dim:hierarchy id="GENDER" all="All Persons">
<dim:level id="GENDER">
<dim:member id="MALE" reg="Philipp|Edison" rollUpTo="*" />
<dim:member id="FEMALE" reg="Debbie" rollUpTo="*" />
</dim:level>
</dim:hierarchy>
</dim:dimension>
<dim:dimension id="TOY" descriptor="R2">
<dim:hierarchy id="TYPE" all="All Types">
<dim:level id="TYPE">
<dim:member id="WOOD" reg="rattle" rollUpTo="*" />
<dim:member id="STUFF" reg="teddy|doll" rollUpTo="*" />
<dim:member id="MISC" reg="cup" rollUpTo="*" />
</dim:level>
</dim:hierarchy>
</dim:dimension>
</dim:dimensions>
<!-- no data is added, so we don't need any structure -->
<structure />
<!-- MetaDataHandling (i.e. what has to be done if no Descriptor is available so far)
is optional and can be one of the following values (case-insensitive):
+ handleAsNull, null
+ createDescriptor, create, add
+ failOnError, fail
IntervalDataHandling (i.e. what has to be done if an interval-value is null):
+ boundariesWhenNull, boundaries
+ useOther, other, others
+ failOnNull, fail
-->
<data metahandling="create" intervalhandling="boundariesWhenNull" />
</model>
A Complete Sample Configuration-File
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config xmlns="http://dev.meisen.net/xsd/dissertation/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:idx="http://dev.meisen.net/xsd/dissertation/model/indexes"
xmlns:map="http://dev.meisen.net/xsd/dissertation/model/mapper"
xsi:schemaLocation="http://dev.meisen.net/xsd/dissertation/config
http://dev.meisen.net/xsd/dissertation/tidaConfig.xsd
http://dev.meisen.net/xsd/dissertation/model/indexes
http://dev.meisen.net/xsd/dissertation/tidaIndexFactory.xsd
http://dev.meisen.net/xsd/dissertation/model/mapper
http://dev.meisen.net/xsd/dissertation/tidaMapperFactory.xsd">
<!-- Define the location where the server stores its data -->
<location folder="_data" />
<auth>
<!-- Define the manager used for authentication.
The following manager implementations are available:
+ net.meisen.dissertation.impl.auth.AllAccessAuthManager
+ net.meisen.dissertation.impl.auth.shiro.ShiroAuthManager
-->
<manager implementation="net.meisen.dissertation.impl.auth.AllAccessAuthManager" />
</auth>
<caches>
<!-- Define the cache to be used for identifiers.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryIdentifierCache
+ net.meisen.dissertation.impl.cache.FileIdentifierCache
-->
<identifier implementation="net.meisen.dissertation.impl.cache.MemoryIdentifierCache" />
<!-- Define the cache to be used for metadata.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryMetaDataCache
+ net.meisen.dissertation.impl.cache.FileMetaDataCache
-->
<metadata implementation="net.meisen.dissertation.impl.cache.MemoryMetaDataCache" />
<!-- Define the cache to be used for bitmaps.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryBitmapCache
+ net.meisen.dissertation.impl.cache.MapDbBitmapCache
+ net.meisen.dissertation.impl.cache.FileBitmapCache
-->
<bitmap implementation="net.meisen.dissertation.impl.cache.MemoryBitmapCache" />
<!-- Define the cache to be used for fact-sets.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.FileFactDescriptorModelSetCache
-->
<factsets
implementation="net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache" />
<!-- Define the cache to be used for records.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache
+ net.meisen.dissertation.impl.cache.MemoryDataRecordCache
+ net.meisen.dissertation.impl.cache.MapDbDataRecordCache
-->
<records implementation="net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache" />
</caches>
<factories>
<!-- Define the factory to be used to determine which IndexFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.indexes.IndexFactory
-->
<indexes implementation="net.meisen.dissertation.impl.indexes.IndexFactory">
<!-- Define the different indexes to be used.
The following bitmap-indexes are available by default:
+ net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap
+ net.meisen.dissertation.impl.indexes.datarecord.slices.RoaringBitmap
The following implementations are by default available for specific
primitive data types:
+ net.meisen.dissertation.impl.indexes.FastUtilIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.FastUtilLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection
-->
<idx:config bitmap="net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap"
byte="net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection"
short="net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection"
int="net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection"
long="net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection" />
</indexes>
<!-- Define the factory to be used to determine which MapperFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.mapper.MapperFactory
-->
<mappers implementation="net.meisen.dissertation.impl.time.mapper.MapperFactory">
<map:config>
<!--
Adds mappers to the default ones. The default ones cannot be removed and are:
+ net.meisen.dissertation.impl.time.mapper.DateMapper
+ net.meisen.dissertation.impl.time.mapper.LongMapper
-->
<map:mapper implementation="net.meisen.dissertation.impl.time.mapper.DateMapper" />
<map:mapper implementation="net.meisen.dissertation.impl.time.mapper.LongMapper" />
</map:config>
</mappers>
<!-- Define the factory to be used to determine the granularity-factory.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.granularity.TimeGranularityFactory
-->
<granularities
implementation="net.meisen.dissertation.impl.time.granularity.TimeGranularityFactory" />
<!-- Define the factory to be used to create queries.
The following factories are available by default:
+ net.meisen.dissertation.impl.parser.query.QueryFactory
-->
<queries implementation="net.meisen.dissertation.impl.parser.query.QueryFactory" />
</factories>
<!-- Adds additional aggregation-functions or overrides the ones available
by default.
The following aggregation-functions are added by default:
+ net.meisen.dissertation.impl.measures.Count
+ net.meisen.dissertation.impl.measures.Min
+ net.meisen.dissertation.impl.measures.Max
+ net.meisen.dissertation.impl.measures.Sum
+ net.meisen.dissertation.impl.measures.Mean
+ net.meisen.dissertation.impl.measures.Median
+ net.meisen.dissertation.impl.measures.Mode
-->
<aggregations>
<function implementation="net.meisen.dissertation.impl.measures.Count" />
</aggregations>
<!-- Adds additional templates. By default, the following are added (and
cannot be overridden):
+ net.meisen.dissertation.model.dimensions.templates.All
+ net.meisen.dissertation.model.dimensions.templates.Years
+ net.meisen.dissertation.model.dimensions.templates.Months
+ net.meisen.dissertation.model.dimensions.templates.Days
+ net.meisen.dissertation.model.dimensions.templates.Hours
+ net.meisen.dissertation.model.dimensions.templates.Minutes
+ net.meisen.dissertation.model.dimensions.templates.Seconds
+ net.meisen.dissertation.model.dimensions.templates.Rasters
-->
<timetemplates>
<template implementation="net.meisen.dissertation.model.dimensions.templates.Minutes" />
</timetemplates>
<!-- Specify the analysis techniques available -->
<analyses>
<analysis id="knn" implementation="net.meisen.dissertation.similarity.TidaKnnAnalysis" />
</analyses>
<!-- Server settings -->
<server>
<!-- the timeout is defined in minutes -->
<http port="7000" timeout="30" enable="true" docroot="docroot" />
<!-- the timeout is defined in milliseconds -->
<tsql port="7001" timeout="1800000" enable="true" />
<control port="7002" enable="false" />
</server>
</config>
Detailed Overview of the Runtime Performance
query  algorithm  avg [s]
#1a    TIDA       0.088
#1a    IntTreeB   0.093
#1a    Naive      3.297
#1a    IntTreeA   3.453
#1b    TIDA       0.095
#1b    IntTreeB   0.143
#1b    Naive      39.266
#1b    IntTreeA   39.453
#1c    TIDA       0.139
#1c    IntTreeB   0.812
#1c    Naive      633.422
#1c    IntTreeA   651.781
#2a    TIDA       0.011
#2a    IntTreeB   0.019
#2a    Naive      0.375
#2a    IntTreeA   0.375
#2b    TIDA       0.026
#2b    IntTreeB   0.059
#2b    IntTreeA   3.813
#2b    Naive      4.016
#2c    TIDA       0.063
#2c    IntTreeB   0.498
#2c    IntTreeA   42.328
#2c    Naive      43.234
#3a    TIDA       0.091
#3a    IntTreeB   0.093
#3a    IntTreeA   0.547
#3a    Naive      0.594
#3b    TIDA       0.103
#3b    IntTreeB   0.137
#3b    IntTreeA   5.109
#3b    Naive      5.125
#3c    TIDA       0.149
#3c    IntTreeB   0.549
#3c    IntTreeA   76.141
#3c    Naive      77.953
#4a    IntTreeB   0.074
#4a    TIDA       0.094
#4a    IntTreeA   0.516
#4a    Naive      0.547
#4b    IntTreeB   0.095
#4b    TIDA       0.105
#4b    IntTreeA   4.922
#4b    Naive      5.078
#4c    TIDA       0.169
#4c    IntTreeB   0.345
#4c    Naive      67.359
#4c    IntTreeA   70.891
#5a    IntTreeB   0.084
#5a    TIDA       0.087
#5a    IntTreeA   0.422
#5a    Naive      0.453
#5b    TIDA       0.099
#5b    IntTreeB   0.115
#5b    IntTreeA   4.172
#5b    Naive      4.234
#5c    TIDA       0.135
#5c    IntTreeB   0.384
#5c    IntTreeA   57.531
#5c    Naive      58.609
#6a    IntTreeB   0.002
#6a    Naive      0.016
#6a    IntTreeA   0.016
#6a    TIDA       0.020
#6b    IntTreeB   0.017
#6b    IntTreeA   0.031
#6b    TIDA       0.041
#6b    Naive      0.125
#6c    TIDA       0.163
#6c    IntTreeB   0.179
#6c    IntTreeA   0.359
#6c    Naive      1.203
#7     TIDA       0.088
#7     IntTreeB   0.267
#7     IntTreeA   353.813
#7     Naive      368.953
#8     TIDA       0.031
#8     IntTreeB   0.053
#8     IntTreeA   0.406
#8     Naive      0.469
#9     TIDA       0.272
#9     IntTreeB   0.473
#9     Naive      3.172
#9     IntTreeA   3.219
3-NN of the Temporal Relational Similarity
Bibliography
Abdelouarit, E.; El Merouani, M.; Medouri, A. (2013): Data Warehouse Tuning. The Supremacy
of Bitmap Index. In IJCA 79 (7), pp. 7–10. DOI: 10.5120/13751-1573.
Agarwal, S.; Agrawal, R.; Deshpande, P.; Gupta, A.; Naughton, J. F.; Ramakrishnan, R.; Sara-
wagi, S. (1996): On the Computation of Multidimensional Aggregates. In: Proceedings of the
22th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc (VLDB ’96), pp. 506–521.
Aigner, W.; Federico, P.; Gschwandtner, T.; Miksch, S.; Rind, A. (Eds.) (2012): Chal-
lenges of Time-oriented Data in Visual Analytics for Healthcare. IEEE VisWeek Workshop on
Visual Analytics in Healthcare (VAHC). Seattle, USA: IEEE.
Aigner, W.; Miksch, S.; Müller, W.; Schumann, H.; Tominski, C. (2007): Visualizing time-ori-
ented data - A systematic view. In Computers & Graphics 31 (3), pp. 401–409. DOI:
10.1016/j.cag.2007.01.030.
Aigner, W.; Miksch, S.; Schumann, H.; Tominski, C. (2011): Visualization of Time-Oriented
Data. Guildford, Surrey: Springer London (Human-Computer Interaction Series).
Allen, J. F. (1983): Maintaining Knowledge about Temporal Intervals. In Communications of
the ACM 26 (11), pp. 832–843. DOI: 10.1145/182.358434.
Alur, R.; Henzinger, T. A. (1992): Logics and models of real time: A survey. In J. W. de Bakker,
C. Huizing, W. P. de Roever, G. Rozenberg (Eds.): Real-Time: Theory in Practice, vol. 600:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 74–106.
Al-Zoubi, H.; Milenkovic, A.; Milenkovic, M. (2004): Performance evaluation of Cache Replace-
ment Policies for the SPEC CPU2000 Benchmark Suite. In S.-M. Yoo, L. H. Etzkorn (Eds.):
42nd annual Southeast Regional Conference. Huntsville, Alabama, pp. 267–272.
Apache Shiro Group (2015): Apache Shiro. Java Security Framework. Available online at
http://shiro.apache.org/, updated on 6/12/2015, checked on 6/12/2015.
Arroyo, J.; González-Rivera, G.; Maté, C. (2010): Forecasting with Interval and Histogram
Data. In A. Ullah, D. Giles (Eds.): Handbook of Empirical Economics and Finance, vol.
20103666: Chapman and Hall/CRC (Statistics: A Series of Textbooks and Monographs),
pp. 247–279.
Batal, I.; Valizadegan, H.; Cooper, G. F.; Hauskrecht, M. (2011): A Pattern Mining Approach
for Classifying Multivariate Temporal Data. In F.-X. Wu (Ed.): IEEE International Conference
on Bioinformatics and Biomedicine (BIBM 2011), vol. 2011. IEEE International Conference on
Bioinformatics and Biomedicine (BIBM). Atlanta, Georgia, USA, 12 - 15 Nov. 2011. Pisca-
taway, NJ: IEEE, pp. 358–365.
Bayer, R.; McCreight, E. M. (1972): Organization and Maintenance of Large Ordered Indexes.
In Acta Informatica 1 (3), pp. 173–189. DOI: 10.1007/BF00288683.
Bębel, B.; Morzy, M.; Morzy, T.; Królikowski, Z.; Wrembel, R. (2012): OLAP-Like Analysis of
Time Point-Based Sequential Data. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F.
Mattern, J. C. Mitchell et al. (Eds.): Advances in Conceptual Modeling, vol. 7518. Berlin, Hei-
delberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 153–161.
Bentley, J. L. (1977): Solutions to Klee’s Rectangle Problems. In Unpublished manuscript,
Dept of Computer Science, Carnegie-Mellon University, Pittsburgh PA.
Berendt, B. (1996): Explaining Preferred Mental Models in Allen Inferences with a Metrical
Model of Imagery. In G. W. Cottrell (Ed.): Proceedings of the 18th Annual Conference of the
Cognitive Science Society. 18th Annual Conference of the Cognitive Science Society: Law-
rence Erlbaum, pp. 489–494.
Berg, M. de; Cheong, O.; van Kreveld, M.; Overmars, M. (2008): More Geometric Data Struc-
tures. In M. de Berg, O. Cheong, M. van Kreveld, M. Overmars (Eds.): Computational Geom-
etry. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 219–241.
Bergeron, M.; Conklin, D. (2011): Subsumption of Vertical Viewpoint Patterns. In C. Agon, M.
Andreatta, G. Assayag, E. Amiot, J. Bresson, J. Mandereau (Eds.): Mathematics and Compu-
tation in Music, vol. 6726: Springer Berlin Heidelberg (Lecture Notes in Computer Science),
pp. 1–12.
Böhlen, M.; Gamper, J.; Jensen, C. S. (2006): Multi-dimensional Aggregation for Temporal
Data. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al.
(Eds.): Advances in Database Technology - EDBT 2006, vol. 3896. Berlin, Heidelberg:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 257–275.
Böhlen, M. H.; Gamper, J.; Jensen, C. S. (2008): Towards General Temporal Aggregation. In
A. Gray, K. Jeffery, J. Shao (Eds.): Sharing Data, Information and Knowledge, vol. 5071. Berlin,
Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 257–269.
Böhlen, M. H.; Jensen, C. S.; Snodgrass, R. T. (1995): A Seamless Integration of Time
into SQL. Technical Report. Aalborg University (TRR-96-2049).
Böhlen, M. H.; Jensen, C. S.; Snodgrass, R. T. (2000): Temporal statement modifiers. In ACM
Trans. Database Syst. 25 (4), pp. 407–456. DOI: 10.1145/377674.377665.
Moon, B.; Vega Lopez, I. F.; Immanuel, V. (2003): Efficient Algorithms for large-scale Tem-
poral Aggregation. In IEEE Trans. Knowl. Data Eng. 15 (3), pp. 744–759. DOI:
10.1109/TKDE.2003.1198403.
Boonstra-Hörwein, K.; Punzengruber, D.; Gärtner, J. (2011): Reducing understaffing and shift
work with Temporal Profile Optimization (TPO). In Applied Ergonomics 42 (2), pp. 233–237.
DOI: 10.1016/j.apergo.2010.06.008.
Wooldridge, B. (2015): JMH benchmarks for JDBC Connection Pools. GitHub.com. Availa-
ble online at https://github.com/brettwooldridge/HikariCP-benchmark, updated on 5/26/2015,
checked on 6/12/2015.
Carmel, E. (1999): Global software teams. Collaborating across borders and time zones. Up-
per Saddle River, NJ: Prentice Hall.
Catarci, T.; Santucci, G. (1995): Diagrammatic vs Textual Query Languages: A Comparative
Experiment. In S. Spaccapietra, R. Jain (Eds.): Visual Database Systems 3. Boston, MA:
Springer US, pp. 69–83.
Celko, J. (2006): Joe Celko's analytics and OLAP in SQL. San Francisco, Calif., Oxford: Mor-
gan Kaufmann; Elsevier Science [distributor] (The Morgan Kaufmann series in data manage-
ment systems).
Chamberlin, D. D.; Boyce, R. F. (1976): SEQUEL: A Structured English Query Language. In G.
Altshuler, R. Rustin, B. Plagman (Eds.): ACM SIGFIDET (now SIGMOD) workshop. Not
Known, pp. 249–264.
Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. (2015): Better bitmap performance with Roaring
bitmaps. In Softw. Pract. Exper. DOI: 10.1002/spe.2325.
Chan, C.-Y.; Ioannidis, Y. E. (1998): Bitmap Index Design and Evaluation. In L. Haas, P. Drew,
A. Tiwary, M. Franklin (Eds.): ACM SIGMOD International Conference. Seattle, Washington,
United States, pp. 355–366.
Chan, C.-Y.; Ioannidis, Y. E. (1999): An Efficient Bitmap Encoding Scheme for Selection Que-
ries. In S. B. Davidson, C. Faloutsos (Eds.): ACM SIGMOD International Conference. Phila-
delphia, Pennsylvania, United States, pp. 215–226.
Chaudhuri, S.; Dayal, U. (1997): An overview of data warehousing and OLAP technology. In
SIGMOD Rec. 26 (1), pp. 65–74. DOI: 10.1145/248603.248616.
Chen, Y.-C.; Peng, W.-C.; Lee, S.-Y. (2011): CEMiner - An Efficient Algorithm for Mining Closed
Patterns from Time Interval-Based Data. In: IEEE 11th International Conference on Data Min-
ing (ICDM 2011). Vancouver, BC, Canada, pp. 121–130.
Christie, R. D. (2003): Statistical classification of major event days in distribution system reli-
ability. In IEEE Trans. Power Delivery 18 (4), pp. 1336–1341. DOI:
10.1109/TPWRD.2003.810491.
Chui, C. K.; Kao, B.; Lo, E.; Cheung, D. (2010): S-OLAP: An OLAP system for analyzing se-
quence data. In A. Elmagarmid, D. Agrawal (Eds.): Proceedings of the 2010 ACM SIGMOD
International Conference on Management of Data. Indianapolis, Indiana, USA, pp. 1131–
1134.
Colantonio, A.; Di Pietro, R. (2010): Concise: Compressed ‘n’ Composable Integer Set. In In-
formation Processing Letters 110 (16), pp. 644–650. DOI: 10.1016/j.ipl.2010.05.018.
Combi, C.; Gozzi, M.; Juarez, J. M.; Marin, R.; Oliboni, B. (2007): Querying Clinical Workflows
by Temporal Similarity. In R. Bellazzi, A. Abu-Hanna, J. Hunter (Eds.): Artificial Intelligence in
Medicine, vol. 4594: Springer Berlin Heidelberg (Lecture Notes in Computer Science),
pp. 469–478.
Codd, E. F.; Codd, S. B.; Salley, C. T. (1993): Providing OLAP (On-line Analytical Processing)
to User-analysts: An IT Mandate. (White Paper). Available online at http://www.minet.uni-
jena.de/dbis/lehre/ss2005/sem_dwh/lit/Cod93.pdf, checked on 5/12/2015.
Cuzzocrea, A. (2011): Retrieving Accurate Estimates to OLAP Queries over Uncertain and
Imprecise Multidimensional Data Streams. In D. Hutchison, T. Kanade, J. Kittler, J. M. Klein-
berg, F. Mattern, J. C. Mitchell et al. (Eds.): Scientific and Statistical Database Management,
vol. 6809. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Sci-
ence), pp. 575–576.
Deliège, F.; Pedersen, T. B. (2010): Position List Word Aligned Hybrid: Optimizing Space and
Performance for Compressed Bitmaps. In I. Manolescu, S. Spaccapietra, J. Teubner, M.
Kitsuregawa, A. Leger, F. Naumann et al. (Eds.): 13th International Conference. Lausanne,
Switzerland, pp. 228–239.
DeWitt, D. J.; Katz, R. H.; Olken, F.; Shapiro, L. D.; Stonebraker, M. R.; Wood, D. (1984): Im-
plementation Techniques for Main Memory Database Systems. In D. Smith, B. Yormark (Eds.):
ACM SIGMOD International Conference. Boston, Massachusetts, p. 1.
Dignös, A.; Böhlen, M. H.; Gamper, J. (2014): Overlap Interval Partition Join. In C. Dyreson, F.
Li, M. T. Özsu (Eds.): ACM SIGMOD International Conference. Snowbird, Utah, USA,
pp. 1459–1470.
Dodge, Y.; Marriott, F. H. C. (2006): The Oxford dictionary of statistical terms. 6th ed. Oxford,
New York: Oxford University Press.
Dyreson, C.; Grandi, F.; Käfer, W.; Kline, N.; Lorentzos, N.; Mitsopoulos, Y. et al. (1994): A
Consensus Glossary of Temporal Database Concepts. In SIGMOD Rec 23 (1), pp. 52–64.
DOI: 10.1145/181550.181560.
Edelsbrunner, H.; Maurer, H. A. (1981): On the Intersection of Orthogonal Objects. In Infor-
mation Processing Letters 13 (4, 5), pp. 177–181.
Enderle, J.; Hampel, M.; Seidl, T. (2004): Joining Interval Data in Relational Databases. In P.
Valduriez, G. Weikum, A. C. König, S. Dessloch (Eds.): ACM SIGMOD International Con-
ference. Paris, France, pp. 683–694.
Espinosa, J. A.; Nan, N.; Carmel, E. (2007): Do Gradations of Time Zone Separation Make
a Difference in Performance? A First Laboratory Study. In: Second IEEE International
Conference on Global Software Engineering (ICGSE 2007), pp. 12–22.
Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. (1996): The KDD process for extracting useful
knowledge from volumes of data. In Communications of the ACM 39 (11), pp. 27–34. DOI:
10.1145/240455.240464.
Fricker, D.; Zhang, H.; Yu, C. (2011): Sequential Pattern Mining of Multimodal Data Streams in
Dyadic Interactions. In: 2011 IEEE Joint International Conference on Development and Learn-
ing and Epigenetic Robotics. Frankfurt am Main, Germany, 24-27 August 2011. Piscataway:
IEEE, pp. 1–6.
Frühwirth, T. (1996): Temporal Annotated Constraint Logic Programming. In Journal of Sym-
bolic Computation 22 (5–6), pp. 555–583. DOI: 10.1006/jsco.1996.0066.
Garcia-Molina, H.; Ullman, J. D.; Widom, J. (2014): Database Systems. The Complete Book.
Second edition. Harlow: Pearson Education Limited.
Goodchild, M. F. (1987): A Spatial Analytical Perspective on Geographical Information Sys-
tems. In International journal of geographical information systems 1 (4), pp. 327–334. DOI:
10.1080/02693798708927820.
Goodrich, M. T.; Tamassia, R. (2006): Data Structures and Algorithms in Java. 4th ed. Hobo-
ken, NJ: Wiley.
Gordevicius, J.; Gamper, J.; Böhlen, M. H. (2012): Parsimonious Temporal Aggregation. In The
VLDB Journal 21 (3), pp. 309–332. DOI: 10.1007/s00778-011-0243-9.
Grabbe, S. R.; Sridhar, B.; Mukherjee, A. (2014): Clustering Days with Similar Airport Weather
Conditions. In: 14th AIAA Aviation Technology, Integration, and Operations Conference. At-
lanta, GA.
Gui, H.; Au, G.; Bouloy, C. (2011): Aggregate Join Index Utilization in Query Processing:
Google Patents. Available online at https://www.google.com/patents/US7912833.
Guo, H.; Tang, Y.; Yang, X.; Ye, X. (2010): Improvement and Extension to ATSQL2. In Y. Tang,
X. Ye, N. Tang (Eds.): Temporal Information Processing Technology and Its Application. Berlin,
Heidelberg: Springer Berlin Heidelberg, pp. 245–259.
Gupta, A.; Mumick, I. S. (Eds.) (1999): Materialized Views. Cambridge, MA, USA: MIT Press.
Guttman, A. (1984): R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD
Rec. 14 (2), pp. 47–57. DOI: 10.1145/971697.602266.
Guyet, T.; Quiniou, R. (2008): Mining Temporal Patterns with Quantitative Intervals. In: 2008
IEEE International Conference on Data Mining Workshops (ICDMW). Pisa, Italy, pp. 218–227.
Han, J.; Lakshmanan, L.; Ng, R. T. (1999): Constraint-based, multidimensional data mining. In
Computer 32 (8), pp. 46–50. DOI: 10.1109/2.781634.
Handy, J. (1998): The Cache Memory Book. 2nd ed. San Diego: Academic Press.
Hashemi, A. H.; Kaeli, D. R.; Calder, B. (1997): Efficient procedure mapping using cache line
coloring. In M. Chen, R. K. Cytron, A. M. Berman (Eds.): ACM SIGPLAN 1997 conference. Las
Vegas, Nevada, United States, pp. 171–182.
Heuer, R. J., Jr.; Pherson, R. H. (2014): Structured Analytic Techniques for Intelligence Anal-
ysis. CQ Press.
Hu, Y.-H.; Cheng, C.; Wu, F.; Yang, C.-I. (2010): Mining Multi-level Time-interval Sequential
Patterns in Sequence Databases. In G. Kou (Ed.): 2nd International Conference on Software
Engineering and Data Mining (SEDM 2010). Chengdu, China, 23-25 June 2010. Piscataway,
NJ: IEEE, pp. 416–421.
Hudry, J. L. (2004): Is Time in Physics Discrete, Dense, or Continuous? In: Proceedings of the
First International Conference on the Ontology of Spacetime. Montréal, Canada, May 11-14,
2004.
Hutchison, D.; Kanade, T.; Kittler, J.; Kleinberg, J. M.; Mattern, F.; Mitchell, J. C. et al. (Eds.)
(2006): Data Warehousing and Knowledge Discovery. Berlin, Heidelberg: Springer Berlin Hei-
delberg (Lecture Notes in Computer Science).
IBM Corporation (2013): Descriptive, predictive, prescriptive: Transforming asset and facilities
management with analytics. Choose the right data analytics solutions to boost service quality,
reduce operating costs and build ROI. (White paper (external)-USEN). IBM Corporation. Avail-
able online at http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=TIW14162USEN,
updated on October 2013, checked on 5/7/2015.
Jones, R.; Hosking, A.; Moss, E. (2012): The Garbage Collection Handbook. The Art of Auto-
matic Memory Management. Boca Raton, FL: CRC Press (Chapman & Hall/CRC applied al-
gorithms and data structures series).
Kamaev, V.; Finogeev, A.; Finogeev, A.; Shevchenko, S. (2014): Knowledge Discovery in the
SCADA Databases Used for the Municipal Power Supply System. In A. Kravets, M. Shcher-
bakov, M. Kultsova, T. Iijima (Eds.): Knowledge-Based Software Engineering, vol. 466. Cham:
Springer International Publishing (Communications in Computer and Information Science),
pp. 1–14.
Kaser, O.; Lemire, D. (2014): Compressed Bitmap Indexes: Beyond Unions and Intersections.
In CoRR abs/1402.4466.
Keim, D. (2010): Mastering the information age. Solving problems with visual analytics. Goslar:
Eurographics Association.
Keogh, E.; Ratanamahatana, C. A. (2005): Exact Indexing of Dynamic Time Warping. In
Knowledge and Information Systems 7 (3), pp. 358–386. DOI: 10.1007/s10115-004-0154-9.
Kimball, R.; Ross, M. (2002): The Data Warehouse Toolkit. The complete guide to dimensional
modeling. 2nd ed. New York: Wiley.
Kline, N.; Snodgrass, R. T. (1995): Computing Temporal Aggregates. In: Eleventh International
Conference on Data Engineering. Taipei, Taiwan, 6-10 March 1995, pp. 222–231.
Koncilia, C.; Morzy, T.; Wrembel, R.; Eder, J. (2014): Interval OLAP: Analyzing Interval Data.
In L. Bellatreche, M. K. Mohania (Eds.): Data Warehousing and Knowledge Discovery, vol.
8646. Cham: Springer International Publishing (Lecture Notes in Computer Science),
pp. 233–244.
Kostakis, O.; Papapetrou, P.; Hollmén, J. (2011): ARTEMIS: Assessing the Similarity of Event-
Interval Sequences. In D. Gunopulos, T. Hofmann, D. Malerba, M. Vazirgiannis (Eds.): Machine
Learning and Knowledge Discovery in Databases, vol. 6912: Springer Berlin Heidelberg (Lec-
ture Notes in Computer Science), pp. 229–244.
Kotsifakos, A.; Papapetrou, P.; Athitsos, V. (2013): IBSM: Interval-Based Sequence Matching.
In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 596–604.
Kranjec, A.; Chatterjee, A. (2010): Are temporal concepts embodied? A challenge for cognitive
neuroscience. In Front Psychol 1, p. 240. DOI: 10.3389/fpsyg.2010.00240.
Kriegel, H.-P.; Pötke, M.; Seidl, T. (2001): Object-Relational Indexing for General Interval Re-
lationships. In G. Goos, J. Hartmanis, J. van Leeuwen, C. S. Jensen, M. Schneider, B. Seeger,
V. J. Tsotras (Eds.): Advances in Spatial and Temporal Databases, vol. 2121. Berlin, Heidel-
berg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 522–542.
Kuhn, H. W. (1955): The Hungarian method for the assignment problem. In Naval Research
Logistics Quarterly 2 (1-2), pp. 83–97. DOI: 10.1002/nav.3800020109.
Lammarsch, T.; Aigner, W.; Bertone, A.; Gärtner, J.; Mayr, E.; Miksch, S.; Smuc, M. (2009):
Hierarchical Temporal Patterns and Interactive Aggregated Views for Pixel-Based Visualiza-
tions. In: 13th International Conference Information Visualisation. Barcelona, Spain, pp. 44–
50.
Laxman, S.; Sastry, P. S. (2006): A survey of temporal data mining. In SADHANA - Academy
Proceedings in Engineering Sciences 31 (2), pp. 173–198.
Laxman, S.; Sastry, P. S.; Unnikrishnan, K. P. (2007): A fast algorithm for finding frequent epi-
sodes in event streams. In P. Berkhin, R. Caruana, X. Wu (Eds.): 13th ACM SIGKDD Interna-
tional Conference. San Jose, California, USA, pp. 410–419.
Lemire, D.; Kaser, O. (2011): Reordering Columns for Smaller Indexes. In Information Sci-
ences 181 (12), pp. 2550–2570. DOI: 10.1016/j.ins.2011.02.002.
Lemire, D.; Kaser, O.; Aouiche, K. (2010): Sorting improves word-aligned bitmap indexes. In
Data & Knowledge Engineering 69 (1), pp. 3–28. DOI: 10.1016/j.datak.2009.08.006.
Lenz, H.-J.; Shoshani, A. (1997): Summarizability in OLAP and statistical data bases. In: Pro-
ceedings of the 9th International Conference on Scientific and Statistical Database Manage-
ment, pp. 132–143.
Liu, H.; Hussain, F.; Tan, C.; Dash, M. (2002): Discretization: An Enabling Technique. In Data
Mining and Knowledge Discovery 6 (4), pp. 393–423. DOI: 10.1023/A:1016304305535.
Liu, M.; Rundensteiner, E.; Greenfield, K.; Gupta, C.; Wang, S.; Ari, I.; Mehta, A. (2011): E-
Cube: multi-dimensional event sequence analysis using hierarchical pattern query sharing. In
T. Sellis, R. J. Miller (Eds.): Proceedings of the 2011 ACM SIGMOD International Conference
on Management of data. Athens, Greece, pp. 889–900.
Liu, Z.; Jiang, B.; Heer, J. (2013): imMens. Real-time Visual Querying of Big Data. In Computer
Graphics Forum 32 (3pt4), pp. 421–430. DOI: 10.1111/cgf.12129.
Lorentzos, N. A.; Mitsopoulos, Y. G. (1997): SQL Extension for Interval Data. In IEEE Trans.
Knowl. Data Eng. 9 (3), pp. 480–499. DOI: 10.1109/69.599935.
Mansmann, S.; Scholl, M. H. (2006): Extending Visual OLAP for Handling Irregular Dimen-
sional Hierarchies. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mit-
chell et al. (Eds.): Data Warehousing and Knowledge Discovery, vol. 4081. Berlin, Heidelberg:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 95–105.
Martin, R. C. (2009): Clean Code. A Handbook of Agile Software Craftsmanship. Upper Saddle
River, NJ: Prentice Hall (Robert C. Martin series).
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2008): Solving Summarizability Problems in Fact-
Dimension Relationships for Multidimensional Models. In: Proceedings of the ACM 11th Inter-
national Workshop on Data Warehousing and OLAP. Napa Valley, California, USA. New York,
NY, USA: ACM (DOLAP ’08), pp. 57–64.
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2009): A survey on summarizability issues in mul-
tidimensional modeling. In Data & Knowledge Engineering 68 (12), pp. 1452–1469. DOI:
10.1016/j.datak.2009.07.010.
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2011): A Model-Driven Approach for Enforcing
Summarizability in Multidimensional Modeling. In D. Hutchison, T. Kanade, J. Kittler, J. M.
Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.): Advances in Conceptual Modeling. Recent
Developments and New Directions, vol. 6999. Berlin, Heidelberg: Springer Berlin Heidelberg
(Lecture Notes in Computer Science), pp. 65–74.
Meisen, P.; Keng, D.; Meisen, T.; Recchioni, M. (2015a): TIDAQL: A Query Language enabling
On-line Analytical Processing of Time Interval Data. In: Proceedings of the 17th International
Conference on Enterprise Information Systems (ICEIS 2015). Barcelona, Spain, 27-30 April
2015. INSTICC.
Meisen, P.; Keng, D.; Meisen, T.; Recchioni, M.; Jeschke, S. (2015b): Bitmap-Based On-Line
Analytical Processing of Time Interval Data. In Shahram Latifi (Ed.): Proceedings of the 12th
International Conference on Information Technology: New Generations (ITNG), 2015.
Meisen, P.; Recchioni, M.; Meisen, T.; Schilberg, D.; Jeschke, S. (2014): Modeling and pro-
cessing of time interval data for data-driven decision support. In: 2014 IEEE International
Conference on Systems, Man and Cybernetics (SMC). San Diego, CA, USA, October 5-8,
2014, pp. 2946–2953.
Meisen, T.; Meisen, P.; Schilberg, D.; Jeschke, S. (2012): Adaptive Information Integration:
Bridging the Semantic Gap between Numerical Simulations. In W. van der Aalst, J. Mylopou-
los, M. Rosemann, M. J. Shaw, C. Szyperski, R. Zhang et al. (Eds.): Enterprise Information
Systems, vol. 102. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture notes in business
information processing), pp. 51–65.
Mendoza, M.; Alegría, E.; Maca, M.; Cobos, C.; León, E. (2015): Multidimensional analysis
model for a document warehouse that includes textual measures. In Decision Support Sys-
tems 72, pp. 44–59. DOI: 10.1016/j.dss.2015.02.008.
Merriam-Webster (2015): Analysis - merriam-webster.com. Available online at
http://www.merriam-webster.com/dictionary/analysis, checked on 4/19/2015.
Moerchen, F. (2006a): Algorithms for Time Series Knowledge Mining. In: Proceedings of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New
York, NY, USA: ACM (KDD ’06), pp. 668–673.
Moerchen, F. (2006b): Time series knowledge mining. Marburg: Görich & Weiershäuser (Wis-
senschaft in Dissertationen, 813).
Moerchen, F. (2009): Tutorial CIDM-T: Temporal pattern mining in symbolic time point and time
interval data. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09),
p. xiv.
Moerchen, F.; Fradkin, D. (2010): Robust mining of time intervals with semi-interval partial
order patterns. In S. Parthasarathy, B. Liu, B. Goethals, J. Pei, C. Kamath (Eds.): Proceedings
of the 2010 SIAM International Conference on Data Mining. Philadelphia, PA: Society for In-
dustrial and Applied Mathematics, pp. 315–326.
Niemi, T.; Niinimäki, M.; Thanisch, P.; Nummenmaa, J. (2014): Detecting summarizability in
OLAP. In Data & Knowledge Engineering 89, pp. 1–20. DOI: 10.1016/j.datak.2013.11.001.
Oracle Corporation (2015): Oracle Technology Global Price List. Software Investment Guide.
Oracle Corporation. Available online at http://www.oracle.com/us/corporate/pricing/
technology-price-list-070617.pdf, updated on April 2015, checked on 6/14/2015.
Ossimitz, G.; Mrotzek, M. (2008): The Basics of System Dynamics: Discrete vs. Continuous
Modelling of Time. In B. G. Dangerfield (Ed.): Proceedings of the 2008 International Confer-
ence of the System Dynamics Society.
Wong, P. C.; Thomas, J. (2004): Visual Analytics. In IEEE Computer Graphics and Applications
24 (5), pp. 20–21. DOI: 10.1109/MCG.2004.39.
Papapetrou, P.; Kollios, G.; Sclaroff, S. (2005): Discovering frequent arrangements of temporal
intervals. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM’05):
IEEE Press, pp. 354–361.
Papapetrou, P.; Kollios, G.; Sclaroff, S.; Gunopulos, D. (2009): Mining frequent arrangements
of temporal intervals. In Knowledge and Information Systems 21 (2), pp. 133–171. DOI:
10.1007/s10115-009-0196-0.
Paramonov, V.; Fedorov, R.; Ruzhnikov, G.; Shumilov, A. (2013): Web-Based Analytical Infor-
mation System for Spatial Data Processing. In T. Skersys, R. Butleris, R. Butkiene (Eds.): In-
formation and Software Technologies, vol. 403. Berlin, Heidelberg: Springer Berlin Heidelberg
(Communications in Computer and Information Science), pp. 93–101.
Pascoe, C. (2011): Time, technology and leaping seconds. Available online at
http://googleblog.blogspot.de/2011/09/time-technology-and-leaping-seconds.html, updated
on 9/15/2011, checked on 4/24/2015.
Pedersen, T. B.; Jensen, C. S.; Dyreson, C. E. (1999): Extending Practical Pre-Aggregation in
On-Line Analytical Processing. In M. Atkinson (Ed.): Proceedings of the Twenty-fifth Interna-
tional Conference on Very Large Databases, Edinburgh, Scotland, UK, 7-10 September, 1999.
Orlando, FL: Morgan Kaufmann, pp. 663–674.
Pedersen, T. B. (2000): Aspects of Data Modeling and Query Processing for Complex Multidi-
mensional Data. Ph.D. thesis, 4 volumes. Aalborg: Aalborg Universitetsforlag, Department of
Computer Science.
Peter, S.; Höppner, F. (2010): Finding Temporal Patterns Using Constraints on (Partial) Ab-
sence, Presence and Duration. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mat-
tern, J. C. Mitchell et al. (Eds.): Knowledge-Based and Intelligent Information and Engineering
Systems, vol. 6276. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Compu-
ter Science), pp. 442–451.
Power, D. (2001): What are Analytical Information Systems? In DSS News 2 (22).
Power, D. (2012): What is analytics? In DSS News 13 (18).
Rego, H.; Mendes, A. B.; Guerra, H. (2015): A Decision Support System for Municipal Budget
Plan Decisions. In A. Rocha, A. M. Correia, S. Costanzo, L. P. Reis (Eds.): New Contributions
in Information Systems and Technologies, vol. 354. Cham: Springer International Publishing
(Advances in Intelligent Systems and Computing), pp. 129–139.
Rind, A.; Lammarsch, T.; Aigner, W.; Alsallakh, B.; Miksch, S. (2013): TimeBench: A Data
Model and Software Library for Visual Analytics of Time-Oriented Data. In IEEE Trans Vis
Comput Graph 19 (12), pp. 2247–2256. DOI: 10.1109/TVCG.2013.206.
Roddick, J. F.; Mooney, C. H. (2005): Linear temporal sequences and their interpretation using
midpoint relationships. In IEEE Transactions on Knowledge and Data Engineering 17 (1),
pp. 133–135. DOI: 10.1109/TKDE.2005.12.
Roh, J.-w.; Hwang, S.-w.; Yi, B.-K. (2012): Efficient bitmap-based Indexing of time-based In-
terval Sequences. In Information Sciences 194, pp. 38–56. DOI: 10.1016/j.ins.2011.08.013.
Weiss, R. (2012): A Technical Overview of the Oracle Exadata Database Machine and
Exadata Storage Server. (White Paper). Oracle Corporation. Available online at
http://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-
134575.pdf, updated on June 2012, checked on 6/14/2015.
Russo, M.; Ferrari, A. (2011): The Many-to-Many Revolution 2.0. (White Paper) (Version 2.0,
Revision 1). Available online at http://www.sqlbi.com/wp-content/uploads/The_Many-to-
Many_Revolution_2.0.pdf, updated on 10/10/2011, checked on 5/11/2015.
Sadasivam, R.; Duraiswamy, K. (2013): Efficient approach to discover interval-based sequen-
tial patterns. In Journal of Computer Science 9 (2), pp. 225–234. DOI:
10.3844/jcssp.2013.225.234.
Schutt, R.; O'Neil, C. (2014): Doing data science. First edition. Sebastopol, CA: O'Reilly Media,
Inc.
Shneiderman, B. (1996): The Eyes Have It: A Task by Data Type Taxonomy for Information
Visualizations. In: IEEE Symposium on Visual Languages. Boulder, CO, USA, 3-6 Sept. 1996,
pp. 336–343.
Snodgrass, R. T. (1995): The TSQL2 Temporal Query Language. New York: Springer.
Song, I.-Y.; Medsker, C.; Ewen, E.; Rowen, W. (2001): An Analysis of Many-to-Many Relation-
ships Between Fact and Dimension Tables in Dimensional Modeling. In: Proceedings of the
Int’l Workshop on Design and Management of Data Warehouses, p. 6.
Sorin, D. J.; Hill, M. D.; Wood, D. A. (2011): A Primer on Memory Consistency and Cache
Coherence. San Rafael, CA: Morgan & Claypool Publishers (Synthesis Lectures on Computer
Architecture, 16).
Spaccapietra, S.; Zimányi, E.; Song, I.-Y. (2009): Journal on data semantics XIII. Berlin:
Springer-Verlag (Lecture Notes in Computer Science, 5530).
Spofford, G. (2006): MDX Solutions with Microsoft SQL Server Analysis Services 2005 and
Hyperion Essbase. 2nd ed. Indianapolis, IN: Wiley Pub.
Stockinger, K.; Wu, K.; Shoshani, A. (2004): Evaluation Strategies for Bitmap Indices with Bin-
ning. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.):
Database and Expert Systems Applications, vol. 3180. Berlin, Heidelberg: Springer Berlin
Heidelberg (Lecture Notes in Computer Science), pp. 120–129.
Stroh, F.; Winter, R.; Wortmann, F. (2011): Method Support of Information Requirements Anal-
ysis for Analytical Information Systems. In Bus Inf Syst Eng 3 (1), pp. 33–43. DOI:
10.1007/s12599-010-0138-0.
Tao, Y.; Papadias, D.; Faloutsos, C. (2004): Approximate temporal aggregation. In: Proceed-
ings. 20th International Conference on Data Engineering. Boston, MA, USA, 30 March-2 April
2004, pp. 190–201.
Teiken, Y. (2012): Automatic model driven analytical information systems. Berlin: Logos-Verl.
Thomas, J. J.; Cook, K. A. (2005): Illuminating the Path. The Research and Development
Agenda for Visual Analytics. Los Alamitos, CA: IEEE Computer Society.
Toman, D. (2000): SQL/TP: A Temporal Extension of SQL. In G. Kuper, L. Libkin, J. Paredaens
(Eds.): Constraint Databases. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 391–399.
van Schaik, S. J.; Moor, O. de (2011): A memory efficient reachability data structure through
bit vector compression. In T. Sellis, R. J. Miller (Eds.): Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data. Athens, Greece, pp. 913–924.
van Wijk, J. J.; van Selow, E. R. (1999): Cluster and calendar based visualization of time series
data. In: IEEE Symposium on Information Visualization (InfoVis'99). San Francisco, CA, USA,
24-29 Oct. 1999, pp. 4–9.
Wang, Y.; Ye, X. (2014): Index-Based OLAP Aggregation for In-Memory Cluster Computing.
In: 2014 International Conference on Cloud Computing and Big Data (CCBD). Wuhan, China,
pp. 148–151.
Whibberley, P. B.; Davis, J. A.; Shemar, S. L. (2011): Local representations of UTC in national
laboratories. In Metrologia 48 (4), pp. S154–S164. DOI: 10.1088/0026-1394/48/4/S05.
White, C. (2005): Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated
Enterprise. TDWI (November). Available online at http://download.101com.com/tdwi/
research_report/DIRR_Report.pdf, updated on November 2005, checked on 5/12/2015.
Winarko, E.; Roddick, J. F. (2007): ARMADA – An algorithm for discovering richer relative
temporal association rules from interval-based data. In Data & Knowledge Engineering 63 (1),
pp. 76–90. DOI: 10.1016/j.datak.2006.10.009.
Wu, K.; Ahern, S.; Bethel, E. W.; Chen, J.; Childs, H.; Cormier-Michel, E. et al. (2009): FastBit.
Interactively Searching Massive Data. In J. Phys.: Conf. Ser. 180, p. 12053. DOI:
10.1088/1742-6596/180/1/012053.
Yang, J.; Widom, J. (2003): Incremental computation and maintenance of temporal aggre-
gates. In The VLDB Journal 12 (3), pp. 262–283. DOI: 10.1007/s00778-003-0107-z.
Zhang, D.; Markowetz, A.; Tsotras, V.; Gunopulos, D.; Seeger, B. (2001): Efficient computation
of temporal aggregates with range predicates. In P. Buneman (Ed.): The 12th ACM SIGMOD-
SIGACT-SIGART Symposium. Santa Barbara, California, United States, pp. 237–245.
Zhang, D.; Markowetz, A.; Tsotras, V. J.; Gunopulos, D.; Seeger, B. (2008): On computing tem-
poral aggregates with range predicates. In ACM Trans. Database Syst. 33 (2), pp. 1–39. DOI:
10.1145/1366102.1366109.
Zhou, S. (2010): An Efficient Simulation Algorithm for Cache of Random Replacement Policy.
In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.):
Network and Parallel Computing, vol. 6289. Berlin, Heidelberg: Springer Berlin Heidelberg
(Lecture Notes in Computer Science), pp. 144–154.
Zimányi, E. (2006): Temporal aggregates and temporal universal quantification in standard
SQL. In SIGMOD Rec. 35 (2), pp. 16–21. DOI: 10.1145/1147376.1147379.