Analyzing Time Interval Data
Introducing an Information System for Time Interval Data Analysis
Philipp Meisen
Aachen, Germany
ISBN 978-3-658-15727-2 ISBN 978-3-658-15728-9 (eBook) DOI 10.1007/978-3-658-15728-9
Library of Congress Control Number: 2016952631
Springer Vieweg © Springer Fachmedien Wiesbaden GmbH 2016. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer Vieweg imprint is published by Springer Nature. The registered company is Springer Fachmedien Wiesbaden GmbH. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany.
D82 (Diss. RWTH Aachen University, 2015)
For Edison and Isaac

Acknowledgments
First of all, I want to thank all the people who helped me make this
work possible. Especially, I want to mention Sabina Jeschke for her
supervision and advice, my managing director, friend, and brother Tobias Meisen
for sharing his knowledge and experience and pushing me whenever
needed, my co-worker and friend Christian Kohlschein for listening, having
endless discussions, and reviewing my work, Angelika Reimer for creating
the illustrations, and Diane Wittman for helping me format the book.
I also want to give some special thanks and dedications to the people
who have followed me my whole life like my own shadow: my elder brother Holger,
who helped me whenever I was in doubt, my already mentioned twin
brother Tobias for all the “Schokostreuselbrötchen” and discussions, my
parents for making all this possible by having, loving, and supporting me,
and also my dearest friends Tummel, Hoomer, Christian, Diane, and Marco
for every talk, time-out, and drink we had. Thank you all, for being there for
me whenever needed.
Last but not least, I want to express my deepest gratitude to my wife
Deborah for her support whenever it was needed. Without her this work
would never have been possible.
Philipp
Abstract
Time interval data is data that associates information with a specific time
range (i.e., a time window) defined by a start and an end time point. Thus,
time intervals are a generalization of time points, i.e., each time point is a
time interval whose start and end time point coincide. Nowadays, huge
sets of time interval data are collected in various situations, e.g., personnel
deployment, equipment usage, process control, or process management.
Common systems are not capable of analyzing these amounts of time interval
data. Questions like “How many resources were utilized on Mondays in
an annual average?” or “Which days overlap with the planning and which
run diametrically opposed?” cannot be answered using modern systems, or
require extensive data integration processes.
In this thesis, a model to analyze time interval data (TIDAMODEL) is
introduced. Based on this model, a query language (TIDAQL) is defined,
which can be utilized to answer complex questions such as those presented
above. Furthermore, a similarity measure based on different
types of distance measures (TIDADISTANCE) is presented. This similarity
measure enables users to search for similar situations within a time interval
database. The different solutions are combined to design and realize the
central result of the thesis, i.e., an information system to analyze time
interval data (TIDAIS). The introduced system utilizes different bitmap-based
indexes, which enable it to handle huge amounts of data.
The results of the evaluation show that the presented implementation
fulfills the requirements formulated by different stakeholders. In addition, it
outperforms state-of-the-art solutions (e.g., solutions based on the Oracle
database management system, icCube, or TimeDB).
Zusammenfassung
Time interval data is data recorded within a time window, i.e., between
a start and an end time point, and represents a generalization of time
point data. Nowadays, large amounts of time interval data are collected
ever more frequently in areas such as personnel deployment, equipment
usage, process control, or planning. The evaluation of such data poses
great challenges to common analysis systems. Questions like “How many
resources were needed in manufacturing on Mondays, distributed over the
day, in an annual average?” or “Which days are most accurate with respect
to the planning and which run diametrically?” can mostly not be modeled
at all with modern systems, or can only be answered by means of lengthy
integration processes.
This thesis first introduces a modeling approach based on discrete time
axes (TIDAMODEL). Based on this model, a query language (TIDAQL) is
defined, which enables answering complex questions such as those
indicated above. Besides answering questions, searching for similar
situations is an important capability of information systems. To enable
such similarity search, a similarity measure (TIDADISTANCE) is presented.
These individual results are used to design and realize the central result
of the thesis, an information system for the analysis of time interval data
(TIDAIS). The presented system is based on bitmaps, which enable the
evaluation of large amounts of time interval data. The evaluation results
show that the presented system outperforms other solutions (e.g., solutions
based on icCube, TimeDB, or modern database management systems
such as Oracle) with respect to evaluation performance.
Table of Contents
Acknowledgments V
Abstract VII
Zusammenfassung IX
Table of Contents XI
List of Abbreviations XV
List of Figures XIX
List of Tables XXV
List of Listings XXVII
List of Definitions XXXI
1 Introduction and Motivation 1
2 Time Interval Data Analysis 7
2.1 Time 7
2.1.1 Time Intervals 7
2.1.2 Time Interval Data Aggregation 10
2.1.3 Temporal Models 14
2.1.4 Temporal Operators 20
2.1.5 Temporal Concepts 22
2.1.6 Special Characteristics of Time 23
2.2 Features of Time Interval Data Analysis Information System 29
2.2.1 Analytical Capabilities 30
2.2.2 Time Interval Data Analysis Process 35
2.2.3 User Interface, Visualization, and User Interactions 42
2.3 Summary 43
3 State of the Art 45
3.1 Analytical Information Systems 45
3.2 Analyzing Time Interval Data: Different Approaches 46
3.2.1 On-Line Analytical Processing 47
3.2.2 Temporal Pattern Mining & Association Rule Mining 52
3.2.3 Visual Analytics 54
3.3 Performance Improvements 56
3.3.1 Indexing Time Interval Data 56
3.3.2 Aggregating Time Interval Data 60
3.3.3 Caching Time Interval Data 61
3.4 Analytical Query Languages for Temporal Data 62
3.5 Similarity of Time Interval Data 67
3.6 Summary 70
4 TIDAMODEL: Modeling Time Interval Data 73
4.1 Time Axis 73
4.2 Descriptors 76
4.3 Time Interval Database 80
4.4 Dimensional Modeling 82
4.5 Summary 87
5 TIDAQL: Querying for Time Interval Data 91
5.1 Data Control Language 92
5.2 Data Definition Language 95
5.3 Data Manipulation Language 96
5.3.1 Insert, Delete, & Update Statements 97
5.3.2 Get & Alive Statements 99
5.3.3 Select Statements 100
5.4 Summary 108
6 TIDADISTANCE: Similarity of Time Interval Data 111
6.1 Temporal Order Distance 113
6.2 Temporal Relational Distance 115
6.3 Temporal Measure Distance 117
6.4 Temporal Similarity Measure 118
7 TIDAIS: An Information System for Time Interval Data 121
7.1 System’s Architecture, Components, and Implementation 121
7.1.1 Data Repository 125
7.1.2 Cache & Storage 127
7.2 Configuration 129
7.2.1 Model Configuration 130
7.2.2 System Configuration 145
7.3 Data Structures & Algorithms 149
7.3.1 Model Handling 150
7.3.2 Indexes 156
7.3.3 Caching & Storage 165
7.3.4 Aggregation Techniques 167
7.3.5 Distance Calculation 171
7.4 User Interfaces 176
7.5 Summary 178
8 Results & Evaluation 181
8.1 Requirements & Features 181
8.2 Performance 187
8.2.1 High Performance Collections 188
8.2.2 Load Performance 189
8.2.3 Selection Performance 190
8.2.4 Distance Performance 196
8.2.5 Proprietary Solutions vs. TIDAIS 197
8.3 Summary 201
9 Summary and Outlook 203
Appendix 205
Pipelined Table Functions (PL/SQL Oracle) 205
A Complete Sample Model-Configuration-File 206
A Complete Sample Configuration-File 211
Detailed Overview of the Runtime Performance 215
3-NN of the Temporal Relational Similarity 217
Bibliography 219
List of Abbreviations
AD Active Directory
AIS Analytical Information System
AJAX Asynchronous JavaScript and XML
ANSI American National Standards Institute
ANTLR Another Tool for Language Recognition
API Application Programming Interface
ARTEMIS Assessing coRrespondence of Temporal Events Measure for
Interval Sequences
BI Business Intelligence
CET Central European Time (time zone)
CPU Central Processing Unit
CSS Cascading Style Sheets
CSV Comma Separated Value
DBMS Database Management System
DCL Data Control Language
DDL Data Definition Language
DML Data Manipulation Language
DSS Decision Support System
DST Daylight Saving Time
DTW Dynamic Time Warping
DW Data Warehouse
JDBC Java Database Connectivity
JMS Java Message Service
JSON JavaScript Object Notation
GB Gigabyte
GIS Geographic Information System
GPU Graphics Processing Unit
GTA General Temporal Aggregation
GUI Graphical User Interface
HCC Hybrid Columnar Compression
HOLAP Hybrid OLAP
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IBSM Interval Based Sequence Matching
ISO International Organization for Standardization
ITA Instant Temporal Aggregation
k-NN k-nearest neighbors
LDAP Lightweight Directory Access Protocol
LRU Least Recently Used (cache algorithms)
MB Megabyte
MDX Multidimensional Expressions
MOLAP Multidimensional OLAP
MRU Most Recently Used (cache algorithms)
MWTA Moving-Window Temporal Aggregation
NoSQL Not Only SQL
OLAM On-Line Analytical Mining
OLAP On-Line Analytical Processing
PDT Pacific Daylight Time (time zone)
PL/SQL Procedural Language/Structured Query Language
POJO Plain Old Java Object
ROLAP Relational OLAP
RR Random Replacement (cache algorithms)
RQ Research Question
SQL Structured Query Language
STA Span Temporal Aggregation
SVG Scalable Vector Graphics
TAT Two-step Aggregation Technique
TIDA Time Interval Data Analysis
UI User Interface
UTC Coordinated Universal Time (time zone)
XML Extensible Markup Language
XSD XML Schema Definition
XSLT Extensible Stylesheet Language Transformation
List of Figures
Figure 2.1 Apple falling from tree, example of a time interval and as-
sociated information observed, measured or calculated
during the process of an apple falling from a tree. 8
Figure 2.2 Machine performance, example of a time interval and as-
sociated information observed, measured, or calculated
during the execution of a task by a machine. 9
Figure 2.3 Example of ITA and MWTA (temporal aggregation forms
creating constant intervals). 12
Figure 2.4 Example of STA and TAT (temporal aggregation forms
creating constant intervals). 13
Figure 2.5 Overview of the different aspects of a temporal model. 15
Figure 2.6 The fall property using a discrete (left) and continuous
(right) temporal model. Within the discrete chart, the
diamonds mark the value of the property and the
triangles illustrate the indivisible delta between the
previous and the current time point. 16
Figure 2.7 The item property using a discrete (left) and continuous
(right) temporal model. Within the discrete chart, the
diamonds mark the value of the item property and the
triangles illustrate the indivisible delta between the
previous and the current time point. 17
Figure 2.8 Example of a mapping between data of a circular
temporal model to a linear temporal model. 19
Figure 2.9 Selection of a time window from an unbounded
temporal model to be presented and analyzable in
a bounded temporal model. 20
Figure 2.10 Overview of Allen’s (1983) temporal operators. 20
Figure 2.11 Illustration of the ambiguousness of Allen’s (1983)
temporal operators. 21
Figure 2.12 Examples of commonly used temporal concepts. 22
Figure 2.13 Example of the impact of different time zones within the
scope of temporal analytics. 24
Figure 2.14 Illustration exemplifying the error of calculating
statistical values, e.g., the amount of intervals per hour. 25
Figure 2.15 Overview of selected features defined in the category
descriptive analytics in the context of time interval data
analysis (cf. Table 2.1). 33
Figure 2.16 The data science process following Schutt, O'Neil
(2014). 36
Figure 2.17 The result of the workshops regarding the time interval
data analysis process. 38
Figure 3.1 Examples of the different types of hierarchies
(non-strict, non-covering, and non-onto). 48
Figure 3.2 Two examples of the summarizability problem. 49
Figure 3.3 Illustration of a scenario covered by I-OLAP as presented
by Koncilia et al. (2014). 51
Figure 3.4 Examples of the visualization techniques Cluster
Viewer (van Wijk, van Selow 1999) and GROOVE
(Lammarsch et al. 2009). 55
Figure 3.5 Example of a bitmap-index containing three bitmaps,
one for each possible value (i.e., red, green, and
yellow) of the color-property. 58
Figure 3.6 Illustration of the question to be answered by the
query: "How many resources are needed within
each hour of the first of January 2015?" 63
Figure 3.7 Comparison of the result of the query from a system
supporting non-strict relationships (right) and one that
does not (left). 64
Figure 3.8 The ARTEMIS distance calculated for two interval-sets
S and T. 68
Figure 3.9 The DTW distance calculated for two interval-sets S
and T. 69
Figure 3.10 Example of the IBSM distance calculated for two
interval-sets S and T. 70
Figure 4.1 Illustration of a time-axis = (time,minute). The
incoming data, i.e., timestamps (in milliseconds)
between 2000-01-01 00:00:00.000 and 2099-12-31
23:59:59.999 from the time zone CET, are mapped
to values 1-10 representing minutes. 76
Figure 4.2 Example of a descriptor dlang = (lang, lang, lang), which
uses an identity function to map the set of languages,
i.e., the descriptive values, to the descriptor values. 80
Figure 4.3 An example of a time interval database = (data, time,
team, department). The database contains tasks
performed by teams (a team consists of several team
members) and for the specified department. 82
Figure 4.4 Example of two descriptor hierarchies. The one on the
left is based on the descriptor values specified by country
and the one on the right is based on city. The example
shows a non-strict (left) and a non-covering hierarchy
(right). Both hierarchies are valid regarding the
definition of descriptor hierarchies. 84
Figure 4.5 Example of implicit information recognized for the
timestamp 2000-01-06 13:00 CET and the validity of
the information when rolling up a hierarchy. 85
Figure 4.6 Example of implicit information recognized for the
timestamp 2000-01-06 13:00 CET and the validity of
the information when rolling up a hierarchy. 86
Figure 4.7 Illustration of the TIDAMODEL showing all defined ele-
ments. 88
Figure 5.1 Illustration of the provided temporal operators and
their corresponding temporal relation. 103
Figure 5.2 Sample dimension showing one of two hierarchies with
three levels. 106
Figure 5.3 Usage of the query language features ON and
GROUP BY to enable roll-up and drill-down operations. 109
Figure 6.1 Overview of the different similarity types,
presenting an equality example for each type of
measure. 112
Figure 6.2 Illustration of two different matching strategies, i.e.,
weekday and order match. 113
Figure 6.3 Example of assignments of relations to time points
using Allen's (1983) relations. 116
Figure 7.1 The architecture of the information system showing
the high-level components. 122
Figure 7.2 Detailed architecture of the data repository component. 126
Figure 7.3 Illustration of the subcomponents of the main
component Cache & Storage. 128
Figure 7.4 The complete package of the DbDataRetriever
extension used to load data from a database. 133
Figure 7.5 Illustration of the first three levels (from bottom to top)
of the hierarchy defined in Listing 7.7. 139
Figure 7.6 Illustration of the hierarchy defined in Listing 7.8. 140
Figure 7.7 Three different time axis configurations and an
illustration of the internal representation as array. 151
Figure 7.8 Illustration of the algorithm used to map descriptive
values, e.g., [flu, cold] to the descriptor values flu and
cold. 154
Figure 7.9 Example of a result of the processing of a raw data
record. 155
Figure 7.10 Illustration of the index structure (HashMap) used by
the descriptors index (cf. Goodrich, Tamassia (2006)). 157
Figure 7.11 The different tasks (filtering, partitioning, and
aggregating) to be performed to handle an analytical
query. 158
Figure 7.12 The data descriptor index, using by default a HashMap
and a high performance collection (Trove) to index
bitmaps. 160
Figure 7.13 Example of the structure of the fact descriptor index,
associating facts with descriptor values. 161
Figure 7.14 An example database with data related indexes. 163
Figure 7.15 Illustration of the group bitmap calculation, in the case
of the usage of a dimension’s level within the group by
expression. 165
Figure 7.16 The four resulting bitmaps for the different chronons
and groups. 168
Figure 7.17 Illustration of TAT and STA. 171
Figure 7.18 Illustration of the abort criterion for the temporal order
and measure distance. 173
Figure 7.19 Illustration of the algorithm used to determine the
relations between intervals. 174
Figure 7.20 Overview of the user console of the implemented UI:
top-left shows the login screen, top-right is a screenshot
of the model management, middle-left is a picture of the
data management, middle-right illustrates the user man-
agement, and the screenshots on the bottom show the
time series visualization (left) and the Gantt-chart
(right). 177
Figure 8.1 The results of the tests regarding the high performance
collections for int and long data types. 188
Figure 8.2 The results of the load performance tests. 190
Figure 8.3 The results of the selection tests for the different
queries shown in Table 8.3. 195
Figure 8.4 Illustration of the performance tests regarding the
distance calculation, as well as the results of the
temporal order and measure similarity; a visualization
of the relational similarity can be found in the appendix. 197
Figure 8.5 Performance results of the queries used to answer the
questions shown in Table 8.4. 201
List of Tables
Table 2.1 Overview of the features requested in the category de-
scriptive analytics. 31
Table 2.2 Overview of the features requested in the category
predictive analytics. 34
Table 2.3 Overview of the features requested in the category
prescriptive analytics. 35
Table 2.4 List of requested features for the information system
considering data collection. 39
Table 2.5 List of requested features for the information system
considering data integration & cleansing. 40
Table 2.6 The features required to support the application of
models and analytical algorithms. 42
Table 2.7 Overview of the features requested for the UI,
visualization, and user interaction. 42
Table 5.1 Overview of the seven criteria used as basis for design
decisions regarding a query language. 91
Table 6.1 Overview of the time points calculation for a specific
relation. 116
Table 7.1 Results of the default temporal mapping algorithm,
assuming the top time axis definition of Figure 7.7. 152
Table 7.2 Examples of different group-bitmaps created for
specific GROUP BY expressions based on the
example database shown in Figure 7.14. 164
Table 7.3 List of algorithms used to calculate the different
aggregated values. 169
Table 8.1 Overview of the different features requested, the
realization of the feature, as well as comments of the
users (if available), and the degree of realization. 182
Table 8.2 List of algorithms used to calculate the different
aggregated values. 187
Table 8.3 Overview of the different tests performed to
validate the runtime performance. 193
Table 8.4 List of tests performed in the category "Proprietary
Solutions vs. TIDAIS". 200
List of Listings
Listing 3.1 MDX statement used to answer the question regarding
the needed resources. 63
Listing 3.2 ATSQL2 statement used to answer the question
regarding the needed resources. 65
Listing 3.3 SQL statement used to answer the question regarding
the needed resources. The presented solution is based
on additional PL/SQL functions and data types which are
shown in the appendix (cf. Pipelined Table Functions
(PL/SQL Oracle)). 66
Listing 3.4 The TIDAQL statement used to answer the question
regarding the needed resources. 67
Listing 5.1 Syntax of statements using the ADD command of the
DCL to add a user or a role. 93
Listing 5.2 Syntax of statements of the DCL, used to drop a user
or a role. 93
Listing 5.3 Syntax of the statements using the commands
MODIFY, GRANT, and REVOKE. 94
Listing 5.4 Syntax of statements for the commands ASSIGN and
REMOVE, used to modify the roles assigned to a user. 94
Listing 5.5 Syntax of statements using the LOAD, UNLOAD, and
DROP commands of the DDL. 95
Listing 5.6 Syntax of statements using the INSERT command
of the DML. 97
Listing 5.7 Syntax of the statement to enable or disable bulk load
for a model. 99
Listing 5.8 Syntax of the statement to delete a specified record
from a model. 99
Listing 5.9 Syntax of statements using the UPDATE command
of the DML. 99
Listing 5.10 Syntax of statements using the GET command of the
DML. 100
Listing 5.11 Syntax of the select statement to retrieve time series
of a specified time window. 101
Listing 5.12 Syntax of the select statement to retrieve time
interval records from the information system. 102
Listing 5.13 Syntax of the select statement to retrieve analytical
results from the information system. 104
Listing 7.1 The skeleton of a model-configuration-file of the
information system. 130
Listing 7.2 Configuration of a data retriever within a model. 131
Listing 7.3 Configuration of a dataset and the structure of the set. 132
Listing 7.4 XSLT template used to create the bean used by the
DbDataRetriever to define the query. 133
Listing 7.5 An excerpt of a configuration defining three descriptors
and descriptor values for one of the descriptors. 135
Listing 7.6 An example of a configuration of the time axis. 136
Listing 7.7 A sample definition of a time hierarchy within the
time dimension. 138
Listing 7.8 A sample definition of a hierarchy of the descriptor
WORKAREA. 140
Listing 7.9 A pre-processor configuration using the
ScriptPreProcessor. 141
Listing 7.10 A configuration specifying three sample schedules. 142
Listing 7.11 Example of a configuration of caches for all entities
of the system. 143
Listing 7.12 An example configuration of the default IndexFactory,
specifying the implementations used to index specific
data types. 144
Listing 7.13 The skeleton of a configuration-file of the information
system. 145
Listing 7.14 A sample configuration of the Authentication &
Authorization component. 146
Listing 7.15 Example of the system configuration of the Service
Handler component. 147
Listing 7.16 Example of the system configuration of the Query
Parser & Processor component. 147
Listing 7.17 Example of the system configuration to add an
additional template. 148
Listing 7.18 The pairing function used to determine a unique
identifier for a pair of intervals. 175
Listing 8.1 The naïve algorithm. 191
Listing 8.2 The IntTreeB algorithm. 192
List of Definitions
Definition 1 TIDAMODEL 73
Definition 2 Valid time points, chronon, and data time points 73
Definition 3 Temporal mapping function 74
Definition 4 Granularity 75
Definition 5 Time axis 75
Definition 6 Descriptive attribute and descriptive value 76
Definition 7 Set of and descriptor value 77
Definition 8 Descriptive mapping function 78
Definition 9 Fact function (value-invariant, record-invariant,
record-variant) 79
Definition 10 Descriptor 79
Definition 11 Time interval 80
Definition 12 Time interval dataset and time interval record 81
Definition 13 Time interval database 81
Definition 14 Descriptor dimension, hierarchies, levels, and members 83
Definition 15 Time dimension, hierarchies, levels, and members 87
Definition 16 Dimensions 87
Definition 17 Temporal Order Distance 114
Definition 18 Temporal Relational Distance 117
Definition 19 Temporal Measure Distance 117
Definition 20 Temporal Similarity Measure 118
1 Introduction and Motivation
The process of analyzing data has attracted increasing attention in recent
years. Data analysis techniques are used to recommend articles to users,
predict the outcome of elections, or understand causes. Over the
last years, discussions with industrial partners and feedback from several
companies showed that the analysis of time interval data created
various problems across different domains. Thus, the focus of this
book is on an information system capable of analyzing a specific,
content-independent type of data: time interval data1.
To understand the issues arising when using available, proprietary systems,
and to understand the requirements posed by analysts regarding an
information system to analyze time interval data, several workshops with
analysts from different domains were held over the last years. The participating
users dealt with time interval data on a daily basis in different
domains, e.g., aviation (e.g., KLM, Delta Airlines, Lufthansa, Bologna Airport,
or Düsseldorf Airport), logistics (e.g., DHL, FedEx, or Dnata), call centers,
and hospitals (e.g., the university hospitals of Aachen, Bonn, or Düsseldorf),
as well as linguists (e.g., experts from RWTH University, the Centre for
Research and Innovation in Translation and Translation Technology in Denmark,
or the VU Amsterdam University) and production workers (e.g., Audi,
Continental, or Porsche). The results of these workshops indicate that there
is a need for an information system to analyze time interval data and
that the main reasons why available systems are not suitable are:
– unsupported handling of temporal aspects (e.g., time zones, temporal
relations, or daylight saving time),
– performance issues (e.g., analyzing millions of intervals or using a lowest granularity of seconds),
1 source-code: https://github.com/pmeisen, binary-version: http://tida.meisen.net
© Springer Fachmedien Wiesbaden GmbH 2016. P. Meisen, Analyzing Time Interval Data, DOI 10.1007/978-3-658-15728-9_1
– limitations of available modeling capabilities (e.g., unsupported many-to-many relations, unavailable aggregation functions like median, or complex measures),
– unsustainable and expensive data integration processes (e.g., creating
enormous amounts of redundant data or discretizing the intervals), and
– faulty results (e.g., incorrect aggregation outcomes).
Over the past years and decades, several disciplines like data mining
(Moerchen 2009; Laxman, Sastry 2006), artificial intelligence (Allen 1983),
music (Bergeron, Conklin 2011), medicine (Combi et al. 2007; Aigner et al.
2012), finance (Arroyo et al. 2010), ergonomics (Boonstra-Hörwein et al.
2011), or cognitive science (Berendt 1996) have presented general or
application-specific techniques or methods dealing with time interval data2. In
simple terms, a time interval is given by two time points on an underlying
time axis, i.e., [t1, t2] with t1 ≤ t2. Time interval data is recorded, collected, or
generated in various situations and industrial fields, e.g., workload retrieved
from the records of man-hours, tasks planned in a project, actions executed
during a process, or event intervals noticed during an observation.
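The interval notion just introduced can be sketched as a tiny data structure; the class and method names below are illustrative choices, not part of the TIDAMODEL:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A closed interval [start, end] on a discrete time axis, with start <= end."""
    start: int
    end: int

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not be after end")

    def is_time_point(self) -> bool:
        # A time point is the special case [t, t].
        return self.start == self.end

    def overlaps(self, other: "TimeInterval") -> bool:
        # Two closed intervals overlap iff they share at least one time point.
        return self.start <= other.end and other.start <= self.end


task = TimeInterval(3, 7)
print(task.overlaps(TimeInterval(7, 9)))   # True: they share time point 7
print(TimeInterval(5, 5).is_time_point())  # True: a time point as an interval
```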
In general, analyzing is defined as "a careful study of something to learn
about its parts, what they do, and how they are related to each other" (Merriam-Webster
2015). Current research concerning the analysis of time intervals
addresses specific problems like pattern3 or association rule mining
(Winarko, Roddick 2007; Papapetrou et al. 2009; Sadasivam, Duraiswamy
2013), comparison (Kostakis et al. 2011; Kotsifakos et al. 2013), visualization
and interaction (Aigner et al. 2011; Heuer, Jr., Pherson 2014), modeling
(Koncilia et al. 2014; Meisen et al. 2014), or pre-processing (Kimball,
Ross 2002). Some of the techniques or methods consider the fact of handling
time interval data (instead of just interval data) to motivate the usage
of a temporal semantic (e.g., Allen’s scheme (Allen 1983)), which is important
so that terms like coincidence or synchronicity are well-defined.
Others use statistics like aggregated facts (e.g., yearly population, average
monthly temperatures, or yearly energy consumption per industrial sector)
from temporal data to enable a comparison between different days, months,
or years to measure quality (e.g., using key performance indicators).
2 Some literature also refers to time intervals as temporal intervals, event-intervals, interval-based events, time segments, time ranges, time periods, interval-based data, tasks, or activities.
3 In some literature a pattern of time interval sequences is defined as an arrangement.
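As a hedged illustration of such a temporal semantic, the following sketch classifies a few of Allen's (1983) relations between two closed intervals; the helper name and the restriction to a subset of the thirteen relations (the inverses are omitted) are choices made here for brevity, not the thesis's formulation.

```python
def allen_relation(a: tuple[int, int], b: tuple[int, int]) -> str:
    """Classify a subset of Allen's (1983) relations between two closed
    intervals a = [a1, a2] and b = [b1, b2]; inverse relations are omitted."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"    # a ends strictly before b starts
    if a2 == b1:
        return "meets"     # a ends exactly where b starts
    if a1 == b1 and a2 == b2:
        return "equals"
    if b1 < a1 and a2 < b2:
        return "during"    # a lies strictly inside b
    if a1 < b1 and b1 < a2 < b2:
        return "overlaps"  # a starts first, they share a part, b ends last
    return "other"         # starts, finishes, and inverse relations not handled


print(allen_relation((1, 3), (3, 5)))  # meets
print(allen_relation((1, 4), (2, 6)))  # overlaps
print(allen_relation((2, 3), (1, 5)))  # during
```

With such well-defined relations, terms like coincidence or synchronicity can be expressed precisely instead of being left to each system's interpretation.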
A holistic solution, like an information system, addressing the problem
of analyzing time intervals has to consider aspects of modeling and persistence,
visualization and interaction, comparison, aggregation, and mining.
An analyst must be able to, e.g., visualize and compare results, select
specific intervals, or find typical matches and discrepancies. The system
has to handle the time interval data in such a way that a first result (e.g., in
the form of a trend or projection) is calculated quickly and can thereby be
modified early by the analyst if needed. Furthermore, time aspects which may be
considered irrelevant, or are simply not recognized by context-free generic
algorithms, must be taken into account. Such aspects could be, e.g., holidays,
time zones, or daylight saving time, but also vacation periods, leap years,
calendar weeks which do not fit neatly into months or years, or the usage
of a financial instead of a calendar year. In summary, an information
system has to enable the analyst to get answers to questions
and to point out items of possible interest that arise across the whole analysis
process. Depending on the context of the analysis, it also has to support
the generation of generic representations (e.g., detecting patterns) or the
comparison of a set of time intervals using a specified distance measure
(e.g., complex search).
More specifically, the following research questions (RQ) are the focus
of this book:
1. Which features must be supported by an information system to enable
time interval data analysis?
2. Which aspects must be covered by a time interval data analysis model
and how can it be defined?
3. How can a query language be formulated for the purpose of analyzing time interval data, i.e., to select, filter, aggregate, generalize, or specialize it?
4. Which indexing techniques can be used to process user queries and
how should data be cached, as well as persisted?
5. What similarity measure can be used to compare time interval datasets, enabling the search for similar subsets?
6. How should the architecture of an information system for time interval
data analysis be realized, how should the system be configured, and
which interfaces have to be provided to support the analyzing process?
The questions, which arose during the study, implementation, and realization of the introduced information system, are used as a guideline. Each question will be addressed and answered within this book. The book is structured as follows: Chapter 2 describes the term of time interval data analysis by introducing several characteristics of and terminologies used in the context of time (cf. section 2.1). In addition, the chapter presents requirements for and the derived features of an information system demanded by analysts dealing with time interval data on a regular basis. Furthermore, these requirements and features are used to identify different research areas important to be examined in the context of time interval data analysis (cf. section 2.2). Chapter 3 reflects the state of the art of the identified research areas, i.e., proven architectures used for information systems in the context of data analysis (cf. section 3.1), different approaches applied in data analysis (cf. section 3.2), indexing and aggregation of time interval data (cf. section 3.3), as well as similarity and comparison of sets (cf. section 3.5). Chapters 4, 5, 6, and 7 present the aspects relevant to create an information system to analyze time interval data. These aspects are: the defined model TIDAMODEL, the query language TIDAQL, the similarity measure TIDADISTANCE, and selected parts (e.g., the architecture) of the realized information system TIDAIS. Each chapter is divided into multiple sections, discussing the important characteristics and results of the chapter's topic. Moreover, the different research questions mentioned previously in this chapter are answered. The presented solutions are evaluated and discussed in chapter 8: the solution is evaluated regarding the defined set of features (cf. section 8.1) and the performance of different implementations (cf. section 8.2), and compared to commercial solutions (e.g., database management systems (DBMS) or business intelligence (BI) solutions). The book concludes with an outlook in chapter 9.
2 Time Interval Data Analysis
This chapter is structured as follows: Section 2.1 introduces terms and temporal aspects relevant to be considered when analyzing time interval data. In section 2.2, the different features required by an information system are discussed. The introduced terms, temporal aspects, and presented features are results of several workshops with users from different domains (e.g., service providers like ground-handlers, airlines, call centers, and hospitals, as well as linguists and production workers) and are aligned with an extended literature research. The chapter is completed with a summary in section 2.3.
2.1 Time
When referring to time within the context of information systems and analytics, it is necessary to utilize a temporal framework. A temporal framework defines how time is represented (i.e., temporal models, section 2.1.3), how time can be used (i.e., temporal operators, section 2.1.4), and which semantic is applied (i.e., temporal concepts, section 2.1.5). In addition, constraints and limitations are implicitly defined within a temporal framework, i.e., circumstances which cannot be formalized are assumed to be invalid. In order to motivate a temporal framework in the context of time interval data analysis, section 2.1.1 introduces the term time interval informally (a formal definition is given in section 4.3) and section 2.1.2 presents the aggregation of time intervals, which is the predominant operator in the field of data analysis (cf. section 2.2.1 and section 7.3.4). Lastly, special characteristics of time like leap years, daylight saving, or time zones are discussed in section 2.1.6.
2.1.1 Time Intervals
A time interval can be specified by two endpoints (e.g., tstart and tend, with tstart ≤ tend). Generally, the interval's endpoints can be included or excluded, denoting the former by square and the latter by round brackets. As an example, the denotation [10:00, 12:12) is used to specify all time points between 10:00 (included) and 12:12 (excluded). In real life, time intervals are used to express the validity of, e.g., an observation, a state, or a more complex situation over a period of time:
– The red apple with a weight of 250.00g was falling from the tree between 09:45:12 and 09:45:57.
– The accused was out on bail from the first of January 2015 until the
fifth.
– The machine only produced 16 items between 09:00 and 12:28, even
though it could have produced 25.
– The translator typed the word ‘treasure’ and looked up the word
‘Schatzinsel’ within two minutes.
Looking at these sentences reveals some peculiarities to be considered when working with time intervals. For example, it may be impossible to tell if the endpoints are in- or excluded, or if they are absolute (e.g., 01/01/2015) or relative (e.g., "within two minutes"). In addition, the granularity used to express an endpoint may differ (e.g., 09:00 uses a minute granularity, whereas the granularity of 09:45:12 is seconds). Furthermore, the examples indicate that the information provided to describe an interval can vary (e.g., "red apple" as categorization vs. "16 items" as fact). Figure 2.1 illustrates a first example of a time interval and different types of associated information.
Figure 2.1 Apple falling from tree, example of a time interval and associated information observed, measured or calculated during the process of an apple falling from a tree.
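The endpoint notation introduced above can be made concrete with a small sketch (a hypothetical `Interval` class, assuming an integer time axis of minutes since midnight; none of these names appear in this book):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A time interval over an integer time axis, e.g., minutes since midnight."""
    start: int
    end: int
    start_included: bool = True
    end_included: bool = True

    def contains(self, t: int) -> bool:
        after_start = t >= self.start if self.start_included else t > self.start
        before_end = t <= self.end if self.end_included else t < self.end
        return after_start and before_end

# [10:00, 12:12) expressed in minutes since midnight: 600 included, 732 excluded
iv = Interval(600, 732, start_included=True, end_included=False)
print(iv.contains(600))  # True  (10:00 is included)
print(iv.contains(732))  # False (12:12 is excluded)
```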
The example shown in Figure 2.1 illustrates an observation which started at 09:45:12 and ended at 09:45:57 (i.e., a time interval of [09:45:12, 09:45:57]). During (or after) the observation, the properties color, class, weight, fall, and duration were measured. Without providing a formal classification at this point (cf. section 4.2), it is noticeable that properties may have to be handled differently from a semantic and analytical point of view. For example, the property color can be of interest when filtering, whereas the property class may be useful to determine a price, which can be important when aggregating. Other interesting properties are those which are not constant within the interval, e.g., the property fall. The presented value of 1.00 m is only valid for time points t ≥ tend. For time points tstart < t < tend, the property's value can be calculated using the formula fall = ½ · g · (t – tstart)², and for t ≤ tstart the value is 0.00 m.
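The piecewise definition of the fall property can be sketched as follows (a minimal illustration; the function name is hypothetical, and the clamp to 1.00 m mirrors the example's statement that this value holds for t ≥ tend):

```python
G = 9.81  # gravitational acceleration in m/s^2

def fall_distance(t: float, t_start: float, t_end: float, final_value: float = 1.00) -> float:
    """Piecewise definition of the non-constant 'fall' property (in metres)."""
    if t <= t_start:
        return 0.0          # before the interval the apple has not fallen
    if t >= t_end:
        return final_value  # after the interval the value stays at 1.00 m
    return 0.5 * G * (t - t_start) ** 2

# apple example, with t in seconds relative to the observation start
print(fall_distance(-5.0, 0.0, 45.0))  # 0.0
print(fall_distance(60.0, 0.0, 45.0))  # 1.0
```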
Another example is shown in Figure 2.2. The example illustrates tasks
(i.e., time intervals) performed by a machine. Such an example can
typically be found in production environments.
Figure 2.2 Machine performance, example of a time interval and associated information observed, measured, or calculated during the execution of a task by a machine.
Compared to the previously discussed apple falling from tree example, the time interval of the machine performance example uses a minute granularity, i.e., [09:00, 12:30]. The example defines four properties associated to the time interval: machine, items, maximal capacity, and needed resources. The items property is not constant (i.e., the value changes during the interval), whereas the maximal capacity property may be assumed to be constant (e.g., when filtering) or not (e.g., when used to calculate the utilization of the machine over time). In addition, the needed resources property is of special interest regarding aggregation. As introduced further in section 3.2.1 and discussed in more detail in section 7.3.4, this property can lead to summarizability problems if not aggregated correctly (Lenz, Shoshani 1997; Song et al. 2001; Mazón et al. 2008). The reason lies in the indivisibility of the value, i.e., the value is 4 for every time point of the interval, but it is still 4 even if several time points of the interval are selected (i.e., summarizability is not given).
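The summarizability issue can be illustrated in a few lines (a hypothetical sketch using hour chronons 9 to 12; the property value 4 is taken from the example):

```python
# 'needed resources' is 4 for every chronon of the interval (indivisible value)
interval_chronons = [9, 10, 11, 12]
needed_resources_per_chronon = {t: 4 for t in interval_chronons}

# naively summing over the selected chronons over-counts the indivisible value
naive_sum = sum(needed_resources_per_chronon[t] for t in interval_chronons)
print(naive_sum)  # 16 -- wrong, the machine never needs more than 4 resources

# a summarizable aggregation for such a property uses, e.g., max instead of sum
correct = max(needed_resources_per_chronon[t] for t in interval_chronons)
print(correct)  # 4
```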
Within the next sections, the introduced examples are used to exemplify time interval data aggregation, as well as to motivate the usage and exemplify the impact of temporal models, concepts, and operators.
2.1.2 Time Interval Data Aggregation
Data aggregation is the predominant operation in the field of data analysis (Zhang et al. 2008). Aggregating time interval data is more difficult than the aggregation of time point data. The reasons lie above all in the intricate semantic (cf. section 2.1.4), e.g., an interval typically expresses the validity of a fact or description over a period of time. When aggregating intervals within a specified time window, several questions have to be answered, e.g., "Should the time window be partitioned?" (e.g., using a time window of a year, it may be needed to aggregate data by day) or "What is the semantic meaning of the aggregation and does it fulfill the expectation?" (e.g., is count a useful aggregation to determine the needed resources within a time window). In literature, different forms of temporal aggregation are introduced in the field of temporal databases and data analysis, i.e., Instant Temporal Aggregation (ITA), Moving-Window Temporal Aggregation (MWTA), Span Temporal Aggregation (STA), General Temporal Aggregation (GTA) (Böhlen et al. 2008), as well as the Two-step Aggregation Technique (TAT) (Meisen et al. 2015b).
When aggregating time interval data, the set of intervals to be grouped is defined by the values of specified properties (e.g., the color of the apple in the apple falling from tree example (cf. Figure 2.1)) and, in addition, by a temporal grouping criterion (e.g., month, day, or hour) used to partition the time axis. Depending on the form of temporal aggregation, the returned result of a query might contain so-called constant intervals (ITA, MWTA, and GTA) or fixed partitions (STA, TAT, and GTA). A constant interval is an interval in which the aggregated value4 is constant, i.e., consecutive time partitions are coalesced. Conversely, a fixed partition is defined by the specification of the aggregation (e.g., group by month) and the result contains a value for each partition (e.g., each month).
Figure 2.3 illustrates the ITA and MWTA forms, both returning constant intervals. In the figure, the intervals are grouped by the machine property, i.e., two groups are identified: furnace and impeller. Furthermore, the time axis is on month granularity, and the example counts the amount of machines per month. As mentioned, ITA and MWTA both create constant intervals. Thus, in the case of ITA the result contains, e.g., the constant interval [3, 5] for the value 2. MWTA, on the other hand, uses a defined time window [t – w, t – w'] for each instance t of the defined temporal grouping and determines the set of intervals to be grouped. Thus, the example illustrated in Figure 2.3 calculates the aggregated values for the impeller group, and the different time windows are, e.g.: count([1, 2]) = 1, count([2, 3]) = 2, count([3, 4]) = 2, …, count([11, 12]) = 1, and count([12, 12]) = 0. The created constant values are shown in the table of the figure.
4 Some implementations consider lineage information, i.e., the implementation validates if
the resulting aggregated value is based on the same time intervals (cf. Böhlen et al. 2008).
Figure 2.3 Example of ITA and MWTA (temporal aggregation forms creating constant intervals).
In general, ITA uses the defined temporal grouping criterion to determine the set of intervals for a specific group. On the other hand, MWTA uses a defined time window [t – w, t – w'] for each instance t of the defined temporal grouping and determines the set of intervals to be grouped. Thus, using MWTA with w = 0 and w' = 0 leads to the same results as ITA provides. Empty groups are typically not included within the result (e.g., cf. Figure 2.3: (impeller; 0; [12, 12]) and (furnace; 0; [12, 12]) are not included; Snodgrass (1995), Böhlen et al. (2000)).
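A minimal sketch of ITA producing constant intervals might look as follows (the function, the interval data, and the month axis are hypothetical illustrations, not the book's implementation):

```python
from itertools import groupby

def ita_count(intervals, t_min, t_max):
    """Instant Temporal Aggregation (count) over a discrete, bounded time axis.

    Counts, for each time point, the intervals [s, e] (both endpoints included)
    overlapping it, and coalesces consecutive equal counts into constant intervals.
    """
    counts = [(t, sum(1 for s, e in intervals if s <= t <= e))
              for t in range(t_min, t_max + 1)]
    result = []
    for value, run in groupby(counts, key=lambda tc: tc[1]):
        run = list(run)
        result.append((value, (run[0][0], run[-1][0])))  # (count, constant interval)
    return result

# hypothetical group of three intervals on a month axis 1..12
print(ita_count([(1, 4), (3, 5), (7, 12)], 1, 12))
# [(1, (1, 2)), (2, (3, 4)), (1, (5, 5)), (0, (6, 6)), (1, (7, 12))]
```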
In contrast to ITA or MWTA, the application of STA or TAT leads to fixed partitions. Consequently, the result contains one aggregated value for each instance of the temporal grouping specified, if at least one time interval overlaps with the instance. It depends on the chosen implementation whether the result contains empty groups or not. Meisen et al. (2015b) present a bitmap-based implementation for TAT which ensures that the result contains all empty groups. Regarding STA, empty groups are not included, referring to Snodgrass (1995) and Böhlen et al. (2000). Figure 2.4 illustrates STA and TAT. As exemplified, STA determines the set of intervals for each instance within the specified temporal grouping criterion (i.e., instance [1, 6] overlaps with two intervals, whereas [7, 12] overlaps with three). The same result could be achieved using TAT with a count operator. Within the example shown in Figure 2.4, TAT applies the max-count operator. Thus, the aggregated value of count is determined for each instance of the lowest granularity of the underlying time axis (i.e., for each chronon, cf. section 2.1.3). Next, the results of each month are aggregated using the maximum operator (i.e., max). Therefore, the result for [7, 12] is 2 (i.e., max({2, 2, 2, 2, 2, 1})) instead of, as with STA, 3.
Figure 2.4 Example of STA and TAT (temporal aggregation forms creating fixed partitions).
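The difference between STA and TAT's max-count operator can be sketched like this (hypothetical interval data chosen so that the numbers mirror the narrative above; this is not the bitmap-based implementation of Meisen et al.):

```python
def sta_count(intervals, partitions):
    """Span Temporal Aggregation: count the intervals overlapping each fixed partition."""
    return [((ps, pe), sum(1 for s, e in intervals if s <= pe and e >= ps))
            for ps, pe in partitions]

def tat_max_count(intervals, partitions):
    """Two-step Aggregation Technique: count per chronon, then aggregate each
    fixed partition with the maximum (the 'max-count' operator)."""
    result = []
    for ps, pe in partitions:
        per_chronon = [sum(1 for s, e in intervals if s <= t <= e)
                       for t in range(ps, pe + 1)]
        result.append(((ps, pe), max(per_chronon)))
    return result

# hypothetical intervals on a month axis, two half-year partitions
intervals = [(1, 4), (3, 6), (7, 9), (7, 11), (10, 12)]
partitions = [(1, 6), (7, 12)]
print(sta_count(intervals, partitions))      # [((1, 6), 2), ((7, 12), 3)]
print(tat_max_count(intervals, partitions))  # [((1, 6), 2), ((7, 12), 2)]
```

As in the figure's narrative, three intervals overlap the partition [7, 12], so STA counts 3, whereas the per-chronon maximum used by max-count never exceeds 2.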
The earlier mentioned, but so far not further discussed, GTA is a generalized framework for temporal aggregation accommodating ITA, MWTA, and STA, as well as partly TAT. Generally, the framework allows the specification of any kind of partition over the time axis. In addition, it is possible to define mapping functions in order to manipulate the instances of the partition. The framework covers TAT only partly because it only allows the definition of one aggregation function. Nevertheless, considering GTA, several challenges have not been solved. In addition, GTA is a theoretical definition which "offers a uniform way of expressing concisely the various forms of temporal aggregation" and "does not imply an efficient implementation" (Böhlen et al. 2008).
Temporal aggregations are discussed several times within this book: section 2.2.1 introduces features which are required regarding temporal aggregation, and section 3.2.1 discusses the usage of temporal aggregators, as well as summarizability problems. Chapter 5 introduces a query language supporting the usage of temporal aggregations.
2.1.3 Temporal Models
In literature about time, various temporal models have been proposed to represent physical time. Generally, it can be stated that physical time can be modeled as discrete, dense, or continuous (Dyreson et al. 1994; Hudry 2004). In addition, literature introduces further aspects, namely linear, branching, or circular temporal models, as well as bounded or unbounded temporal models (Frühwirth 1996). Within this section, the different aspects of a model are introduced and discussed in matters of time interval data analysis. Also, the usage of a discrete, linear, bounded temporal model in the context of time interval data analyses is motivated. Figure 2.5 depicts the different temporal models which are introduced in detail in this section.
Figure 2.5 Overview of the different aspects of a temporal model.
Discrete, Dense, and Continuous Temporal Models
A discrete time implies that a point in time can be represented by an integer (i.e., time is isomorphic to the natural numbers). If a dense or continuous temporal model is used, it infers that another time point exists between any two 'unequal' time points (i.e., time is isomorphic to the rational or real numbers)5. To understand the impact of the decision of which temporal model to use, it is necessary to understand the main differences between the models in the context of analyzing time interval data. Because of the isomorphic behavior of dense and continuous temporal models and the fields of application concerning dense temporal models (i.e., mainly model checking), the following discussion considers the usage of a discrete or continuous temporal model, whereby dense temporal models are – regarding the argumentation – 'covered' by the latter.
5 As stated by Hudry (2004), a dense temporal model is isomorphic to the rational numbers, whereas a continuous temporal model is isomorphic to the real numbers. In the context of analytics this differentiation is not important and is therefore not further mentioned.
To illustrate the differences between the temporal models, the apple falling from tree example (cf. Figure 2.1) is used. Applying a discrete temporal model to the example would let the apple 'fall in steps', i.e., at each discrete time point the apple would have a different falling distance, i.e., a different value of the fall property. The model would not clarify the apple's position 'in between' two directly successive time points, because in a discrete temporal model nothing exists between two directly following time points. Thus, within a discrete temporal model the falling distance of the apple would be specified for each discrete time point of the interval (e.g., at tend the apple's falling distance is 1.00 m). Furthermore, it would be possible to calculate an indivisible delta, which would be specified by the absolute value of the difference of the falling distances of two directly successive time points. Using a continuous temporal model, the falling distance would be specified for every moment t (using ½ · g · (t – tstart)²). A delta between two time points can still be calculated, but within such a model the delta is not indivisible. Figure 2.6 illustrates the falling distance in a discrete and continuous temporal model and shows the indivisible delta calculated for the discrete case (triangles).
Figure 2.6 The fall property using a discrete (left) and continuous (right) temporal model. Within the discrete chart, the diamonds mark the value of the property and the triangles illustrate the indivisible delta between the previous and the current time point.
Regarding the apple falling from tree example, it may be intuitive to say that the information available when using the continuous temporal model is more precise. Nevertheless, looking at the machine performance example and the items property, this intuition may be different. Figure 2.7 shows the results recorded from an employee who checked the amount of created items every 15 minutes, using both the discrete and the continuous temporal model.
Figure 2.7: The item property using a discrete (left) and continuous (right) temporal model. Within the discrete chart, the diamonds mark the value of the item property and the triangles illustrate the indivisible delta between the previous and the current time point.
In this example, the information provided by the continuous model is too precise. Depending on the used function (e.g., if interpolation is used) it may even be invalid6. From an analytical point of view, one may argue that: 'as long as the granularity of a discrete time-axis is selected correctly, the discrete temporal model is at least as good as the continuous one'. In addition, it has to be considered that data is typically collected by sensors (using a discrete sampling rate). Thus, the measured data is discrete and the use of a continuous model is unnecessary. It should also be mentioned that a continuous property (e.g., a value based on a mathematical function) can be easily transformed into a discrete property using discretization techniques (Liu et al. 2002). Another aspect that should be considered when reaching a decision regarding a temporal model is the context. State of the art indicates that analyses dealing with temporal data are mostly based on discrete temporal models (cf. section 3.2).
6 Figure 2.7 allows for the conclusion that the value at t = ½ · (t1 + t2) is 0.5. Such an invalid value can be avoided by using a piecewise-defined continuous function. Nevertheless, from a domain-specific point of view, the correctness of the value is still not guaranteed because the employee did not check the amount at every time point.
As a result of these conclusions, the temporal model used within this book is discrete. Thus, the time axis consists of a finite number of chronons (i.e., "a nondecomposable [indivisible, remark of author] time interval of some fixed, minimal duration" (Dyreson et al. 1994, p. 55)).
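The mapping of time points onto such a chronon axis can be sketched as follows (the one-minute chronon, the origin, and the function name are assumptions for illustration):

```python
from datetime import datetime, timezone

CHRONON_SECONDS = 60  # assumed minimal granularity: one chronon = one minute

def to_chronon(ts: datetime, origin: datetime) -> int:
    """Map a time point onto a discrete, bounded time axis of fixed-duration
    chronons, counted from the axis origin (the smallest time point)."""
    return int((ts - origin).total_seconds()) // CHRONON_SECONDS

origin = datetime(2015, 1, 1, tzinfo=timezone.utc)
t = datetime(2015, 1, 1, 9, 45, 12, tzinfo=timezone.utc)
print(to_chronon(t, origin))  # 585 -- 9 h 45 min after the origin, seconds truncated
```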
Linear, Branching, and Circular Temporal Models
Another aspect of temporal models addresses the future. Within a linear temporal model only one future is assumed, whereas a branching temporal model allows the existence of one or multiple futures (paths). Moreover, a circular temporal model defines the future to be recurring. In the majority of cases regarding temporal data analysis, a linear temporal model is used. This is plausible because of the temporal concepts and operators mostly used within the field. If a branching or circular temporal model is utilized, simple concepts like before or after may be difficult to apply7. Thus, within this book a linear temporal model is assumed.
It should be mentioned that most data based on a circular temporal model can be pre-processed to fit a linear temporal model. If, e.g., data is retrieved from a simulation which is based on a circular temporal model, it is necessary to 'roll out' the circular time, i.e., map time intervals of the circular time to time intervals of the linear time, as indicated in Figure 2.8. The figure depicts a circular temporal model of a week and data generated in five iterations. The applied mapping links each circular week (i.e., each week of each iteration) to a week of the linear time.
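Such a roll-out can be sketched in a few lines (assuming, for illustration, a circular week modeled in hour chronons; the names are hypothetical):

```python
WEEK_CHRONONS = 7 * 24  # assumed: a circular week modeled in hour chronons 0..167

def roll_out(iteration: int, circular_start: int, circular_end: int):
    """Map an interval of a circular week model to the linear time axis,
    placing iteration i into the i-th linear week."""
    offset = iteration * WEEK_CHRONONS
    return (offset + circular_start, offset + circular_end)

# the same circular interval observed in five iterations lands in five linear weeks
print([roll_out(i, 10, 20) for i in range(5)])
# [(10, 20), (178, 188), (346, 356), (514, 524), (682, 692)]
```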
7 For discussions within other research areas the interested reader is referred to Alur, Henzinger (1992), Frühwirth (1996), Hudry (2004), and Ossimitz, Mrotzek (2008).
Figure 2.8: Example of a mapping between data of a circular temporal model to a linear temporal model.
Bounded and Unbounded Temporal Models
The discussion about bounded or unbounded temporal models is, in the context of data analysis, more or less philosophical. A bounded temporal model is a model which has a defined start (i.e., a smallest time point) and a defined end (i.e., a greatest time point). Within an unbounded temporal model, infinite time points are allowed, i.e., the interval [01.01.2015 09:00, ∞] is infinite considering its end. If data from an unbounded temporal model should be analyzed, it implies that there is no beginning or ending of time, i.e., there is always an earlier or later time point. Analyzing data within such a model would mean that unlimited data is available (i.e., defined by a discrete or continuous function); if not, the limited data can be analyzed within a bounded temporal model by using the minimal and maximal time point of the limited data as boundaries. Nevertheless, unlimited data which is, e.g., defined by a recursively defined discrete function, could be analyzed within a time window which defines the boundaries used for the bounded temporal model (as illustrated in Figure 2.9).
Figure 2.9: Selection of a time window from an unbounded temporal model to be presented and analyzable in a bounded temporal model.
Taking into consideration the above-mentioned findings, a bounded
temporal model is used within this book.
2.1.4 Temporal Operators
A temporal operator for time intervals expresses the relation between, typically but not exclusively, two intervals. Within the last decades, several temporal operators were defined (cf. Moerchen (2009) for an extensive overview). In the majority of cases, the temporal operators of Allen (1983) are used. The primary reason for this is that the list of 13 defined operators is complete regarding possible combinations. Figure 2.10 depicts the defined operators.
Figure 2.10: Overview of Allen’s (1983) temporal operators.
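For illustration, the seven base relations (before, meets, overlaps, starts, during, finishes, and equals; the remaining six are their inverses) can be sketched for closed, point-based intervals (a hypothetical helper, not from the book):

```python
def allen_relation(a, b):
    """Determine the Allen (1983) relation between two closed intervals
    a = (a_s, a_e) and b = (b_s, b_e); only the seven base relations are
    named, the six inverse relations are reported as 'inverse'."""
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if a_e == b_s:
        return "meets"
    if (a_s, a_e) == (b_s, b_e):
        return "equals"
    if a_s == b_s and a_e < b_e:
        return "starts"
    if a_e == b_e and a_s > b_s:
        return "finishes"
    if a_s > b_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_s < a_e < b_e:
        return "overlaps"
    return "inverse"  # b relates to a by one of the base relations above

print(allen_relation((1, 3), (5, 8)))  # before
print(allen_relation((1, 5), (5, 8)))  # meets
print(allen_relation((1, 6), (5, 8)))  # overlaps
```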
Nevertheless, Moerchen (2009) states that Allen's operators are not robust considering small changes and are ambiguous regarding one's intuition. The first point can be ignored if exact boundaries are requested. The latter point, however, refers to the problem that the size of overlaps or gaps is not taken into account by Allen's relations. Figure 2.11 illustrates the concerns mentioned by Moerchen. The relation between the intervals A and B is considered to be equal to the one between C and D (both overlap). The same problem can be observed by looking at the relation between the intervals E and F and the relation between G and H, which are both considered to be equal.
Figure 2.11: Illustration of the ambiguousness of Allen's (1983) temporal operators.
As already mentioned, several other temporal operators were published over the last decades. These other approaches mainly focus on overcoming the problems of Allen's definition regarding robustness and ambiguousness. Some try to achieve that by adding additional relations (e.g., Roddick, Mooney (2005), who define a total of 49 relations, of which nine are different types of overlaps), others split intervals to generate partial relations (cf. Moerchen (2006a); Moerchen, Fradkin (2010); Peter, Höppner (2010)). Despite the doubts mentioned by Moerchen, this book uses the temporal operators of Allen, if not stated differently. If needed, additional precautions are introduced to overcome the mentioned problems (e.g., the distance measure used to find similar time interval datasets introduced in chapter 6 utilizes the coverage ratio or spacing).
2.1.5 Temporal Concepts
Temporal concepts are used to define semantic categories for arrangements of temporal operators (Moerchen 2009). Several temporal concepts like past, present, or future, as well as order (i.e., before or after), duration, concurrency, coincidence, or synchronicity are commonly known and often used in natural language (cf. Moerchen (2006b), Kranjec, Chatterjee (2010)). In the context of time interval data analysis, and especially in the field of knowledge discovery (i.e., data mining) or, even more specific, in the field of temporal pattern mining, temporal concepts are often used to explain or classify patterns found within a time interval dataset. For example, the frequent occurrence of five periodically arranged time intervals may indicate an interesting observation. Nevertheless, searching for interesting and infrequent patterns may also be of interest, regarding coincidences or abnormal situations. A detailed discussion regarding temporal pattern mining as a part of time interval data analysis is provided in sections 2.2.1 and 3.2.2. However, within this book commonly known temporal concepts, as exemplarily depicted in Figure 2.12, are used to express temporal arrangements of temporal operators.
Figure 2.12: Examples of commonly used temporal concepts.
2.1.6 Special Characteristics of Time
In this section, several characteristics of time are introduced which have to be handled with special care with regard to time interval data analysis. Depending on the context of the analysis, some characteristics may be irrelevant. Thus, it is advisable to validate the impact of the characteristics within each analytical context. The introduced characteristics are: time zones, special days (like weekends, holidays, or vacation periods), leap seconds, leap years, absolute and relative time, as well as the general complexity of the time dimension.
Time Zones and the Coordinated Universal Time (UTC)
The world is divided into several time zones, each defined by the specification of an offset from the Coordinated Universal Time (UTC). When analyzing temporal data, the time zone information is of great importance to ensure the validity of the analytical results (cf. Kimball, Ross 2002, p. 240; Carmel 1999; Espinosa et al. 2007). Figure 2.13 illustrates an example which exemplifies the importance. The figure shows time interval data recorded within three time zones (i.e., UTC+1, UTC-8, and UTC-5). The example implies that data collected in the time zones UTC+1 and UTC-8 represent tasks performed at different airports. The interval shown within the UTC-5 time zone indicates an event having significant impact (e.g., 9/11, a stock market crash, or the moon landing). Analyzing the pictured scenario without taking the time zones into consideration is possible and valid, e.g., if the dataset of one airport is analyzed separately from the other. To compare the work performance between the two airports (e.g., in the morning), it is necessary to analyze the time interval dataset using local times, ignoring any time zone information. If, on the other hand, the goal of the analysis is to determine the impact of the event which occurred within the UTC-5 time zone, it is necessary to perform the analysis using a normalized time (e.g., UTC).
Figure 2.13: Example of the impact of different time zones within the scope of temporal analytics.
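The two analysis goals translate into two different time handlings, sketched here with hypothetical task data (the dates, offsets, and variable names are illustrative only):

```python
from datetime import datetime, timezone, timedelta

# hypothetical task start times recorded as local wall-clock time per airport
tz_plus1 = timezone(timedelta(hours=1))    # UTC+1
tz_minus8 = timezone(timedelta(hours=-8))  # UTC-8

task_a = datetime(2015, 3, 2, 9, 0, tzinfo=tz_plus1)
task_b = datetime(2015, 3, 2, 9, 0, tzinfo=tz_minus8)

# comparing 'the morning' across airports: use the local wall-clock time
print(task_a.hour == task_b.hour)  # True -- both start at 09:00 local time

# relating both tasks to one global event: normalize to UTC first
print(task_a.astimezone(timezone.utc))  # 2015-03-02 08:00:00+00:00
print(task_b.astimezone(timezone.utc))  # 2015-03-02 17:00:00+00:00
```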
In order to meet the requirements, it is necessary for an information system and the underlying data model to understand the difference between normalized and local time, as well as the concept of time zones. The impact of time zones is addressed in sections 4.1 (regarding the modeling of the time axis), 4.4 (with regard to different dimensional modeling), and 7.2.1 (concerning the implementation).
Daylight Saving Time (Summer Time)
Changing the time during summer to increase the duration of daylight into the evening is a common practice in several countries. Nowadays, there are ongoing discussions about whether this practice is still meaningful, and a minority of countries decided to abandon daylight saving time (DST). Nevertheless, from an analytical point of view, DST is a difficulty which has to be considered and managed (cf. Celko 2006, pp. 26–27). The main issues while dealing with temporal data and DST occur during two days a year (i.e., one when the time must be adjusted back one hour, the other when it is moved forward). These days have 23 or 25 hours, which makes it difficult to compare them to any others. The problem can be exemplified by assuming a company utilizing an app to measure the employees' performed tasks during a day. Analyzing the average amount of performed tasks within an hour may lead to false results and therefore to erroneous decisions. Figure 2.14 illustrates the problem regarding DST and statistical values. Calculating the amount of time intervals between 03:00:00 and 04:00:00 results in 1 for the default (DEF), 2 for the forward (DST), and 0 for the backward case (DST).
Figure 2.14: Illustration exemplifying the error of calculating statistical values, e.g., the number of intervals per hour.
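The 23- and 25-hour days can be reproduced with standard library means (a sketch using Python's zoneinfo; the Europe/Berlin zone and the 2015 transition dates are merely illustrative):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")

def elapsed_hours(year, month, day):
    """Real (elapsed) duration of a local calendar day in hours."""
    start = datetime(year, month, day, tzinfo=berlin)
    end = start + timedelta(days=1)  # wall-clock "+1 day": next local midnight
    # Convert to UTC before subtracting so the DST shift is accounted for.
    return (end.astimezone(timezone.utc)
            - start.astimezone(timezone.utc)).total_seconds() / 3600

assert elapsed_hours(2015, 3, 28) == 24.0   # default day (DEF)
assert elapsed_hours(2015, 3, 29) == 23.0   # clocks set forward: hour skipped
assert elapsed_hours(2015, 10, 25) == 25.0  # clocks set back: hour repeated
```

Any hourly statistic computed naively over such a day compares 23 or 25 real hours against the usual 24.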
In general, several other statistical measures (depending on the con-
text) may be affected by DST, e.g., in the context of work time management:
the daily performance, workload, or throughput. In addition, similarity
measures (e.g., for searching for similar days) which do not consider DST
may provide incorrect matches. A further discussion on how to analyze
days with DST is presented in chapter 6 and section 7.3.4.
Weekends, Holidays, Vacation Periods and Special Days
Depending on the context of the analysis, weekends, holidays, vacation
periods, and context-specific special days may be of importance to
understand specific observations, patterns, or anomalies. As already mentioned
in the case of time zones, an event like a holiday or the beginning or ending
of a vacation period can have a significant impact. For example, a travel
agency's number of customers, and therefore the number and duration of
consultations, may increase during such periods. Analyzing the workload
without considering vacation periods may lead to invalid conclusions.
Patterns searched for across days may differ meaningfully between holidays,
weekends, and working days.
Supporting different types of days8 is an important feature when analyzing
time interval data (cf. Kimball, Ross 2002, pp. 38–41). The need for or
importance of this additional information in the context of time interval data
analysis may depend on the location the data is recorded at (e.g., a municipal
holiday) and/or the goal of the analysis (e.g., 9/11 may be an important
date for cause studies, cf. Figure 2.13). Some ideas on how to
handle this additional information are discussed in chapter 9.
Leap Seconds
Leap seconds are applied to UTC to keep it close to mean solar time;
without them, UTC would drift away (Whibberley et al. 2011).
Thus, a leap second is inserted whenever the International Earth Rotation
8 An aspect not discussed further is the detection of special days within a specific domain using, e.g., cluster or classification analysis. For further information, the reader may consider Grabbe et al. (2014), which applies clustering techniques to find related days based on weather information, and Christie (2003), which uses classification techniques to identify days with outlying performance, so-called major event days.
and Reference Systems Service (IERS) decides to apply one. In the
majority of cases, leap seconds are not relevant for analysis. However, Google
states in its blog post "Time, technology and leaping seconds" that "having
accurate time is critical to everything we do at Google". Furthermore,
Pascoe states that "keeping replicas of data up to date, correctly reporting
the order of searches and clicks, and determining which data-affecting
operation came last are all examples of why accurate time is crucial to our
products and to our ability to keep your data safe" (Pascoe 2011). To
achieve that, Google introduced the concept of the leap smear. The idea
behind a leap smear is to spread the additional (or removed) second over a
specific time window (e.g., the last minute before midnight) instead of
repeating or skipping a second. It was mainly introduced so that developers
and engineers can rely on the system time without considering leap seconds
at all. Common operating systems and programming languages do not
support leap seconds, i.e., the clock or internal counter neither displays nor
handles them. Instead, the second is added by counting the last second of
the minute for which the leap second is scheduled twice.
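The arithmetic behind a leap smear can be illustrated as follows (a simplified linear smear over a hypothetical 1000-second window; Google's actual implementation differs, e.g., in the length and placement of the window):

```python
def smeared_fraction(seconds_to_midnight: float, window: float = 1000.0) -> float:
    """Fraction of the leap second already absorbed by a linear smear.

    Instead of counting 23:59:60, each second within the smear window is
    stretched slightly, so the extra second accumulates gradually.
    """
    if seconds_to_midnight >= window:
        return 0.0
    return (window - seconds_to_midnight) / window

assert smeared_fraction(2000.0) == 0.0  # smear not yet started
assert smeared_fraction(500.0) == 0.5   # halfway: half a second absorbed
assert smeared_fraction(0.0) == 1.0     # at midnight the full second is absorbed
```

Within the window, the smeared clock deviates from UTC by at most one second, which bounds the error of any second-level statistic.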
In summary, leap seconds may influence the results of temporal analytics.
This is the case if the selected time granularity is in the range of seconds
or below and the operating system handles leap seconds by counting the
last second twice. If the leap smear concept is applied or specialized time
protocols (e.g., the Precision Time Protocol) are used, leap seconds should
not lead to any problems, although statistical calculations may still be off
by up to one second. Within this book, the handling of leap seconds in
association with the introduced information system is discussed in section 4.1.
Leap Years
The Gregorian calendar differentiates between common years and leap
years. The former has 365 days, whereas the latter has 366 days (adding
the 29th of February, namely the leap day). Depending on the level of
aggregation used when analyzing temporal data, the existence of a leap day
within a year may or may not invalidate the results. For instance, statistical
measures aggregated at the year level (e.g., sum or count) are not comparable
between a leap year and a common year. One solution to this problem is to
use relative values (e.g., the mean or median) or to compare at a valid level
(e.g., by comparing sorted sets ignoring the additional day). In this book,
the handling of leap years is discussed in chapter 6 and section 7.3.4.
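The effect on year-level aggregates can be illustrated as follows (a sketch with fabricated, uniform daily counts; yearly_summary is a hypothetical helper, not part of the introduced system):

```python
import calendar
from statistics import mean

def yearly_summary(daily_counts, year):
    """Absolute (sum) and relative (mean) aggregates at the year level."""
    assert len(daily_counts) == (366 if calendar.isleap(year) else 365)
    return {"sum": sum(daily_counts), "mean": mean(daily_counts)}

# Fabricated, uniform daily task counts for a common and a leap year.
common = yearly_summary([10] * 365, 2015)
leap = yearly_summary([10] * 366, 2016)

assert leap["sum"] != common["sum"]    # absolute totals are not comparable
assert leap["mean"] == common["mean"]  # relative measures remain comparable
```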
Absolute vs. Relative Time
Time dependent data can be collected in an absolute or a relative manner.
In general, an absolute time interval consists of two time points each spec-
ified by date, time, and time zone9. Contrary to this, a relative time interval
consists of two time points, each typically specified by an integer or a
floating point number. Relative time interval data is thus mostly found in
scenarios in which the absolute time is irrelevant, e.g., when comparing time
interval data collected from several process runs, each starting at a
normalized moment in time such as 0. Most research in the field of data
mining assumes relative time interval data for its pattern mining algorithms.
Nevertheless, in the context of on-line analytical processing (OLAP) and
mining (OLAM), both of which consider the existence of dimensions,
absolute time interval datasets are mostly used. Thus, an information system
has to be capable of handling relative as well as absolute time interval data
(cf. section 4.1).
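The two variants can be captured by a single record type (a minimal sketch; the Interval class is a hypothetical illustration, not the model introduced in chapter 4):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union

@dataclass
class Interval:
    """A time interval; endpoints are absolute (datetime) or relative (float)."""
    start: Union[datetime, float]
    end: Union[datetime, float]

    @property
    def is_absolute(self) -> bool:
        return isinstance(self.start, datetime)

# Absolute: a task recorded with wall-clock timestamps.
absolute = Interval(datetime(2015, 1, 12, 8, 0), datetime(2015, 1, 12, 9, 30))
# Relative: the same task measured in minutes from a normalized start (t = 0).
relative = Interval(0.0, 90.0)

assert absolute.is_absolute and not relative.is_absolute
```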
Complexity of Time Dimension
The time dimension is an important and probably the most frequently used
dimension within multidimensional models (cf. Kimball, Ross 2002, pp. 38–
41). Considering OLAP and temporal data, aggregating data along the time
9 The time zone information is often omitted because the system’s local time zone is expected to be implicitly used.
dimension is one of the predominant operations (Agarwal et al. 1996;
Chaudhuri, Dayal 1997; Zhang et al. 2001), e.g., analyzing the different
months, detecting anomalies, and understanding their causes by looking
at the days of a month. In the field of temporal pattern mining, the different
levels of the time dimension are often used to specify time dependent filters
or ranges, e.g., to detect frequent patterns occurring on Mondays. Using
the time dimension in the context of analytics reveals several problems.
One of the problems to deal with is the fact that a calendar week fits
neatly into neither a month nor a year. Thus, a time hierarchy like day →
calendar week → month → year entails summarizability and comparison
problems (Hutchison et al. 2006; Mansmann, Scholl 2006; Mazón et al.
2008). Solving this problem, or at least revealing it to the querying user, is
an important aspect of ensuring the correct usage of provided results. In
section 3.2.1, several solutions on a conceptual or logical level are presented.
In section 4.4, the modeling of the time dimension for an information system
for time interval data analysis is introduced and the handling of the
mentioned problem is discussed further.
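The mismatch can be demonstrated with the ISO calendar (a short Python sketch using the standard library):

```python
from datetime import date

# ISO week 53 of 2015 contains days of two different calendar years.
d1 = date(2015, 12, 31)  # a Thursday
d2 = date(2016, 1, 1)    # a Friday
assert tuple(d1.isocalendar())[:2] == (2015, 53)
assert tuple(d2.isocalendar())[:2] == (2015, 53)
assert d1.year != d2.year  # same calendar week, different years
```

A roll-up day → calendar week → year therefore assigns these days differently than a roll-up day → month → year, which is precisely the summarizability problem mentioned above.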
Another problem when dealing with the time dimension is the already
mentioned variety of additional information attached to a member. A day
may be, e.g., a global or municipal holiday, a memorial day, or a special
event like tax day or 9/11 (cf. Weekends, Holidays, Vacation Periods and
Special Days). Considering the time dimension, such additional information
may be used to define special hierarchies (e.g., days may be rolled-up to
a level containing members like none, municipal, national, and international
holiday). Special time hierarchies are discussed in section 4.4.
2.2 Features of Time Interval Data Analysis Information System
As noted in the introduction of this chapter, several workshops with
analysts from different domains were organized to address the issues
occurring when analyzing time interval data. The first workshop "Business
Intelligence: How do you use your temporal data?" was held with 64
international companies (mainly from the aviation industry, logistics providers,
and ground-handling service providers) during the "Inform Users Conference
2012". Additional workshops were organized during the following years,
aiming to reveal further insights, understand specific problems (e.g., those
occurring when using proprietary software products), or specify requirements
(e.g., regarding the query language or special visualizations). The number of
participants varied according to the purpose of the workshop and was
distributed among a number of different sectors, i.e., aviation, logistics,
ground-handling, call centers, hospitals, temporary employment, and
linguistics. Altogether, more than 20 workshops, organized as expert
discussions (i.e., between three and six experts from one or several companies),
as business user workshops (i.e., up to 10 managers and experts invited to
discuss expected results), or as part of a users’ conference (i.e., more than
20 experts), were held between 2012 and 2015.
The following sections present features derived from the results of the
workshops and complemented by an extended literature review. The
features are categorized into analytical features (section 2.2.1), features
defined along a time interval data analysis process (section 2.2.2), and
features associated with the user interface (UI) of an information system
for time interval data analysis (section 2.2.3). These features can also be
understood as functional requirements. Non-functional requirements (e.g.,
regarding performance or robustness) are not discussed in detail; instead,
relevant non-functional requirements are discussed and motivated implicitly
within the different sections and used to motivate specific implementation
strategies (i.e., authorization and authentication in section 5.1, indexing in
section 7.3.2, and caching in section 7.3.3).
2.2.1 Analytical Capabilities
In the field of data analysis, a distinction is made between different
analytical techniques. In general, techniques are categorized into descriptive
("What has happened"), predictive ("What could happen"), and prescriptive
("What should happen") analytics (IBM Corporation 2013). One goal of
the workshops was to determine which techniques must be supported and
how that support may be realized, by specifying desired features. The
results indicate that, regarding the analysis of time interval data, a demand
for all three categories exists. Nevertheless, none of the categories is
currently covered satisfactorily by any available information system, and
the importance differs between the three categories.
Descriptive Analytics
The results of the workshops indicate that the need for descriptive analytics
is very high. Experts stated that "understanding the current situation and
past observations", as well as "being able to determine causes for
anomalies", are important first tasks. The feature requests assigned to the
category of descriptive analytics are listed in Table 2.1.
Table 2.1: Overview of the features requested in the category descriptive analytics.
DA-01 (critical): As an analyst, I want to aggregate the time interval data along the time axis, using different aggregation methods (must: SUM, COUNT, MAX, MIN, MEAN; should: MEDIAN; can: MODE). The aggregation must be correct considering summarizability.

DA-02 (high): As an analyst, I want to be able to use temporal aggregation methods along the time axis (must: COUNT STARTED, COUNT FINISHED).

DA-03 (high): As an analyst, I want to be able to retrieve the raw time interval data within a specified time window (i.e., by using a query language). In addition, it should be possible to specify the temporal operator defining the relation between the interval to be retrieved and the time window (e.g., retrieve all intervals equal to the specified time window).

DA-04 (critical): As an analyst, I want to roll up and drill down the time dimension. The levels of the different time hierarchies should support the definition of buckets for lower granularities (i.e., minutes and seconds).

DA-05 (critical): As an analyst, I want to specify dimensions for the different properties associated with the time interval. Furthermore, I want to use these dimensions to generalize or specialize the result.

DA-06 (medium): As an analyst, I want to analyze data from different time zones. More specifically, I want to be able to analyze data from different time zones using local time zones, as well as a generalized time zone like UTC.

DA-07 (medium): As an analyst, I want to be able to compare, e.g., hours, days, or weeks. In addition, I should be capable of searching for similar situations by selecting a template, e.g., an hour, day, or week.

DA-08 (critical): As an analyst, I want the system to provide a query language to retrieve analytical results (i.e., time series, mining results).
Figure 2.15 exemplifies selected features, i.e., DA-01 (aggregate),
DA-03 (select records), DA-04 (roll-up & drill-down of the time dimension),
and DA-05 (roll-up to department & drill-down to work area). The raw
intervals (top left, DA-03) are aggregated by applying a count aggregation
at the lowest granularity (top middle, DA-01). The roll-up and drill-down
operations are applied in the illustrations in the lower part of the figure
(DA-04). The realization of these features is addressed in the context of
modeling the time axis (cf. section 4.1) and dimensional modeling (cf.
section 4.4). In addition, solutions for overcoming the summarizability
problems occurring while realizing these features10 are presented in section 7.3.4.
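The count aggregation at a given granularity can be sketched as follows (an illustrative Python implementation over fabricated intervals; count_per_hour is a hypothetical helper, not the system's implementation):

```python
from datetime import datetime, timedelta

# Fabricated raw intervals (cf. DA-03): two tasks on the same day.
intervals = [
    (datetime(2015, 1, 12, 8, 15), datetime(2015, 1, 12, 10, 30)),
    (datetime(2015, 1, 12, 9, 0),  datetime(2015, 1, 12, 9, 45)),
]

def count_per_hour(intervals, day_start, hours=24):
    """COUNT aggregation: number of intervals overlapping each hour granule."""
    counts = []
    for h in range(hours):
        lo = day_start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        counts.append(sum(1 for s, e in intervals if s < hi and e > lo))
    return counts

series = count_per_hour(intervals, datetime(2015, 1, 12))
assert series[8] == 1   # 08:00-09:00: only the first task runs
assert series[9] == 2   # 09:00-10:00: both tasks overlap this hour
assert series[10] == 1  # 10:00-11:00: the first task is still running
```

A single interval contributes to several granules, i.e., a many-to-many relationship between intervals and granules, which is the source of the summarizability problems mentioned above.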
10 The problems occur when using available proprietary software (cf. Mazón et al. (2008)) or algorithms presented in the field of temporal databases (cf. section 2.1.2). Lately, several proprietary tools like icCube, Microsoft Analysis Services, or IBM Cognos have presented features to support many-to-many relationships (cf. Russo, Ferrari (2011)). Nevertheless, as discussed in section 3.2.1, these solutions cannot be applied satisfactorily in the context of time interval data.

Figure 2.15: Overview of selected features defined in the category descriptive analytics in the context of time interval data analysis (cf. Table 2.1).

At this point, the features DA-02, DA-06, DA-07, and DA-08 are not
presented in the figure. A detailed introduction to these features is given in
the relevant sections, which introduce a concrete solution, several examples,
as well as modeling, definition, and implementation aspects, i.e., section
7.3.4 (DA-02), section 4.4 (DA-06), chapter 6 (DA-07), and section
5.3.3 (DA-08).
Predictive Analytics
In the case of predictive analytics, the workshops have shown that the
need is not rated as high as for descriptive analytics. One of the reasons
stated by experts is the assumption that without appropriate descriptive
analysis tools, features regarding predictive or prescriptive analysis are
difficult to formulate. Another reason indicated by experts may be the
availability of appropriate proprietary software. For example, in the case of
workforce management, several software products are available, e.g., for
creating rule-based rosters or simulating defined scenarios. The issues
arising when using these tools are the definition of the rule sets or the
scenario’s parameters. To formulate such a rule set or determine the
parameters, a better understanding of current and past situations is required,
which supports the necessity of descriptive analytics. Nevertheless, some
aspects of predictive analytics were classified as meaningful; they are
summarized in Table 2.2.
Table 2.2: Overview of the features requested in the category predictive analytics.
PD-01 (medium): As a manager/supervisor, I want to be able to observe specified measures and be alerted if a defined threshold may be reached in the near future.

PD-02 (low): As an analyst, I want to be able to find patterns or rules within a time interval dataset. Thus, it is necessary to specify the scope of the mining (e.g., just Mondays or holidays). In addition, it is of interest to validate whether a pattern found within Mondays can also be found within other sets, e.g., Tuesdays, weekdays, or the days of July.
Prescriptive Analytics
The aim of prescriptive analytics is to optimize upcoming situations by
knowing what should ideally happen and rating different outcomes. The
arguments mentioned in the case of predictive analytics apply in the case
of prescriptive analytics as well. There are several tools used by data
scientists enabling prescriptive analytics; however, accessing time interval
data with them is quite difficult. Thus, an information system, as introduced
in this book, is needed to provide easy access and to help analyze data in
a descriptive way prior to any prescriptive analysis. Regarding the results
of the workshops, the requests expressed in the field of predictive analytics
overlap with the ones of prescriptive analytics. Table 2.3 shows a concise
summary of the mostly openly formulated feature requests.
Table 2.3: Overview of the features requested in the category prescriptive analytics.
PR-01 (low): As a manager, I want the system to be able to predict upcoming situations (e.g., staff shortages) and provide solutions to the responsible dispatcher.

PR-02 (low): As an analyst, I want the system to be usable with other tools useful for prescriptive analytics (e.g., R11, Apache Spark12, or Watson Analytics13).
2.2.2 Time Interval Data Analysis Process
Another purpose of the workshops was the determination of a generalized
process for time interval data analysis, applicable to an information system.
11 http://www.oracle.com/technetwork/database/database-technologies/r
12 https://spark.apache.org
13 http://www.ibm.com/analytics/watson-analytics
In general, the process of data analysis14, also known as the data science
process, is defined by several iterative phases (Schutt, O'Neil 2014, pp. 41–
44). Figure 2.16 depicts this process.
Figure 2.16: The data science process following Schutt, O'Neil (2014).
The process starts with the "Raw Data Collection" step, which is
followed by the "Processed Data" step. Typically, data integration techniques
are used by an analyst to process the data into an organized form ready
for analysis. Nevertheless, the organized data may contain missing
information, invalid entries, or duplicates. Thus, a clean dataset is derived
during the second step by applying, e.g., data enrichment, outlier detection,
or plausibility check techniques. In order to obtain a clean dataset or
understand the data, it may be necessary to use exploratory data analysis
(EDA) techniques, which reveal further insights and clarify the validity.
Having a clean dataset and understanding it enables the analyst to detect,
e.g., relationships, patterns, or causalities ("Apply Models & Algorithms").
Models may be generated and applied during this step to simplify the
analysis. During the last steps, i.e., "Data Product" and "Communicate,
Visualize, Report", the results created (e.g., a model, a rule, or a cause) and
the insights gained are used by a data product (i.e., an application) to create
14 The process is comparable to the knowledge discovery in databases (KDD) process (Fayyad et al. 1996) or the more general visual analytics process (Keim 2010, pp. 10–11).
(automated) results (e.g., recommendations) or are presented to a decision
maker.
The data science process aims to encapsulate the tasks performed by
an analyst when analyzing any kind of data and is thus applicable to time
interval data analytics. Nevertheless, from an information system point of
view, the process is too generic and broad. Discussions during the different
workshops have shown that, from an analyst's point of view, several steps
should be redefined or narrowed. In addition, it was pointed out that an
information system may have to perform tasks automatically on each single
time interval data record pushed into the system (cf. feature request PD-01).
Figure 2.17 illustrates the time interval data analysis process based
on the results of the workshops. The figure differentiates between steps
which should be supported by an information system (colored boxes) and
steps performed by other systems, an analyst, or a user (white boxes).
Supporting describes the ability of the information system to perform the step
automatically (e.g., based on configuration or modeling). In contrast to the
data science process, the depicted time interval data process describes
the steps from an information system or data point of view instead of from
the perspective of an analyst. The analyst uses the information system to
query, interact with, or understand the time interval dataset and additionally
configures and models the system (which is a cross-sectional task and
therefore not illustrated).
Figure 2.17: The result of the workshops regarding the time interval data analysis process.
The process starts with the collection of time interval data from an
available and configured source. The collection might be a recurring task (i.e.,
load the data whenever new data is available) or a one-off task (i.e., load
the data once into the system to analyze the set). The information system
processes the incoming data using defined data integration techniques
(step: "Processed Data"). Within the next step, the processed data is
cleaned and a valid dataset is obtained (step: "Clean Dataset"). At this point,
the analyst is capable of interacting with the system, e.g., by firing queries
or using a provided UI, to perform hypothesis testing, validation, or
monitoring (step: "Retrieve, Visualize"). In addition, the analyst might retrieve
and visualize results created by defined exploratory data analysis tasks,
data mining algorithms, or machine learning concepts (step: "Apply
Algorithms & Models"). Depending on the configuration of the information
system, the defined algorithms and models are applied automatically to
determine whether an alert has to be generated (step: "Data Observer") or
to report results to a decision maker (step: "Communicate, Visualize, Report").
In the following, the requested features for the steps "Raw Time Interval
Dataset" (Data Linkage & Collection), "Processed Data" and "Clean Dataset"
(Data Integration & Cleansing), and "Apply Algorithms & Models"
(Application of Models & Algorithms) are introduced and discussed. Features
demanded in the context of visualization and interaction (i.e., the steps
"Retrieve, Visualize" and "Communicate, Visualize, Report") are presented
in section 2.2.3. Requirements considering the "Data Observer" step are
covered in section 2.2.1 (cf. Predictive and Prescriptive Analytics).
Data Linkage & Collection
An information system for time interval data analysis has to provide
interfaces enabling the loading of data into the system. During the first
development phases and workshops, several different ways of loading data
into the system were discussed. Furthermore, scalability and data integrity
were important topics when discussing data collection. Table 2.4 subsumes
the requested features.
Table 2.4: List of requested features for the information system considering data collection.
DC-01 (high): As a system provider, I want the system to support different data sources, e.g., databases (i.e., relational DBMS), files (i.e., CSV or XML), and streams (i.e., JSON). If a source is not supported, a simple application programming interface (API) must be available to enable me to add unsupported data sources.

DC-02 (critical): As an analyst, I want the provision of a Java Database Connectivity (JDBC) driver and a query language which allows the insertion and deletion of data. In addition, bulk loading operations should be supported.

DC-03 (high): As a system provider, I want to be able to specify pre-aggregates to be calculated by the system, to increase query performance.
Although the requested features are mostly self-explanatory, it should
be mentioned that their realization is presented and discussed further in
section 7.2.1 (DC-01), section 5.3.1 (DC-02), and section 7.3.4 (DC-03).
Data Integration & Cleansing
Whenever data is loaded into the information system, it is important that
the data is integrated and cleaned, so that invalid entries are detected,
missing data is enriched, and the internally needed data structure is
applied. The discussions considering data integration and cleansing were
diverse, especially regarding the question: "Which data integration
techniques must be provided by the system, and at which point should
dedicated data integration tools be applied as pre-processors?" Table 2.5
shows the results of the discussions and the additional feature requests
defined within the workshops.
Table 2.5: List of requested features for the information system considering data integration & cleansing.
DI-01 (critical): As an analyst, I want the system to be capable of handling complex data structures, in particular many-to-many relationships (cf. Kimball, Ross (2002), Mazón et al. (2008)).

DI-02 (high): As an analyst, I want to be able to validate the descriptive values (properties) associated with the time interval. Validation must ensure that the value is not empty (i.e., mark a property as required), that the value is allowed to be used (i.e., by providing a white-list), or how a new value is handled (i.e., add it, use null, or fail).

DI-03 (high): As an analyst, I want to be able to define how undefined intervals (i.e., intervals which have no start, no end, or neither defined) are handled. Typically, I should be able to pick one of the following strategies: use the time axis boundaries, use the (other) specified value (i.e., create a time point), or fail.

DI-04 (medium): As an analyst, I want to be able to write scripts applied to the raw data prior to any processing or cleansing. Thus, I am able to manipulate the incoming data without pre-processing it using integration tools.
The feature requests DI-02 and DI-03 cover strategies that are important
and often applied in the context of time interval data analysis. The specified
strategies are used to ensure data quality (by plausibility checks) or to offer
the possibility of enriching missing values. DI-04 is requested as a last
resort, i.e., the information system should offer a scripting interface useful
for implementing integration or cleansing techniques. This interface enables
an analyst to apply techniques prior to using additional data integration
tools. In addition, the interface might even be used to trigger a more complex
integration process defined with a proprietary integration tool (cf. Meisen
et al. (2012)).
The requirement formulated by feature request DI-01 addresses the
already mentioned summarizability problem, which occurs when using
many-to-many relationships and is introduced in detail in section 3.2.1.
Regarding the model introduced in chapter 4, the feature request DI-02
is partly covered by so-called mapping functions (cf. sections 4.1 and 4.2).
In addition, the final implementation provides additional strategies to fulfill
the request (cf. section 7.2.1).
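The DI-03 strategies can be sketched as follows (an illustrative Python function; the axis boundaries and the resolve helper are hypothetical, not the system's implementation):

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical time axis boundaries of the modeled system.
AXIS_START = datetime(2015, 1, 1)
AXIS_END = datetime(2015, 12, 31, 23, 59, 59)

def resolve(start: Optional[datetime], end: Optional[datetime],
            strategy: str) -> Tuple[datetime, datetime]:
    """Resolve an undefined (open) interval using one of the DI-03 strategies."""
    if strategy == "boundaries":  # use the time axis boundaries
        return (start or AXIS_START, end or AXIS_END)
    if strategy == "other":       # use the other value, i.e., create a time point
        if start is None and end is None:
            raise ValueError("both endpoints undefined")
        return (start or end, end or start)
    if strategy == "fail":
        if start is None or end is None:
            raise ValueError("undefined endpoint")
        return (start, end)
    raise ValueError(f"unknown strategy: {strategy}")

t = datetime(2015, 6, 1, 12, 0)
assert resolve(None, t, "boundaries") == (AXIS_START, t)
assert resolve(None, t, "other") == (t, t)  # collapsed to a time point
```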
Application of Models & Algorithms
The requested capabilities of the information system considering
descriptive, predictive, and prescriptive analytics are listed in section 2.2.1. In
addition, this section specifies architectural requirements to be met by the
system to support these analytical capabilities. The requested features are
listed in Table 2.6 and the implementation is introduced in section 7.1.
Table 2.6: The features required to support the application of models and analytical algorithms.
MA-01 (medium): As an analyst, I want to be able to apply models or algorithms to the data stream, i.e., I want to determine problems, generate alerts, report anomalies, or classify the current data.

MA-02 (medium): As an analyst, I want to be able to schedule analyses (e.g., daily) using the currently available data. Depending on the result of the analysis, I want to trigger an action (e.g., send an email).
2.2.3 User Interface, Visualization, and User Interactions
An important criterion regarding the user acceptance of a system is its
interface. The UI may be graphical (e.g., showing a graph) or a query
language. In general, the user needs capabilities to interact with the system,
so that a request can be specified or an alert understood. Table 2.7
shows the features relevant for the information system. Features dealing
with specific visualizations15 are not listed, because the development of
specific visualizations is not within the scope of this book. Nevertheless, the
interested reader is referred to section 3.2.3, which introduces current
state-of-the-art visualizations regarding time interval data and time series.
Ideas considering the usage of visual analytics techniques in the context of
time interval data analysis are discussed in section 7.4.
Table 2.7: Overview of the features requested for the UI, visualization, and user interaction.
VIS-01 (high): As an analyst, I want to be able to retrieve data from the information system using a JDBC driver to visualize the results, e.g., using a third-party business intelligence tool, a visualization, or another analytical framework. Thus, I implicitly request a query language useful to retrieve data as needed.

VIS-02 (medium): As an analyst, I want to be able to subscribe to the system’s alerts and analytical results. The system must publish the requested information to any subscribed instance.

VIS-03 (critical): As a system provider, I want to have a UI for user management (i.e., delete or add users, define roles, grant or revoke a permission).

VIS-04 (high): As an analyst, I want to have a minimal graphical user interface (GUI) useful for requesting and visualizing results (e.g., a time series, resulting datasets, or a Gantt chart).

VIS-05 (high): As a web developer, I want the system to provide web-friendly services, i.e., requesting and receiving data through a JSON interface.

15 E.g., a specific request for a line chart was to show the involved time intervals in a tooltip when hovering over a value.
2.3 Summary
Within this chapter, several important terms in the context of time in-
terval data analysis were introduced. In addition, features related to an in-
formation system supporting analytical tasks were presented. These fea-
tures are motivated by temporal aspects and characteristics of time
(e.g., temporal models, leap years, or time zones) and subsume the
results of several workshops and an extended literature review. Some
subordinate features mentioned during the workshops, such as
specific requirements regarding particular statements of the query language,
are not listed. Nevertheless, these feature requests are stated in the
upcoming chapters where relevant.
This chapter also provides the answer to the first RQ: "Which features
must be supported by an information system to enable time interval data
analysis?" An information system has to support the time characteristics,
as well as provide the specified features in a performant way. An evaluation
regarding the fulfillment of the features is presented in section 8.1. In addi-
tion, these features provide the basis for the other research questions. A
model for time interval data analysis (as mentioned in RQ2) is needed as
formal framework for such an information system. The need for a query
language (as addressed by RQ3) is explicitly or implicitly mentioned in sev-
eral features (e.g., DA-01, DA-02, DA-03, DA-08, PR-02, DC-02, or VIS-
01). The performance of an analytical information system is, even if not
explicitly mentioned, of importance and is the core issue of RQ4. The simi-
larity among different sets of time interval data is requested by feature DA-
07 and is the topic of RQ5. The architecture and configuration of an infor-
mation system are aspects to consider when realizing such a system. In
addition, the needed interfaces (e.g., JDBC, JSON, or visualization) for
time interval data and the results of analyses are addressed by, e.g., DC-01,
DC-02, VIS-01, VIS-04, and VIS-05. RQ6 subsumes the mentioned aspects
regarding the architecture, configuration, and interfaces.
3 State of the Art
Time interval data has been a focus of research over the past years and
decades. Several aspects dealing with (time) interval data have been ad-
dressed and are introduced in this chapter. As motivated in chapter 2, the
following research areas are of interest when implementing an information
system useful to analyze time interval data: concepts applied when creat-
ing analytical information systems (section 3.1), different approaches re-
garding the analysis of time interval data (section 3.2), query languages
used to answer analytical questions (section 3.4), and similarity measures
(section 3.5). In addition, the so far only peripherally mentioned perfor-
mance improvements (section 3.3) are an important research area regarding
the performance of query processing.
3.1 Analytical Information Systems
The term analytical information systems (AIS) is used in general as a "de-
scriptor for a broad set of information systems that assist managers in per-
forming analyses" (Power 2001), which is often used in conjunction with BI,
Decision Support Systems (DSS), Data Warehouses (DW), or OLAP (Stroh
et al. 2011; Teiken 2012, p. 7). In general, "analytics software encompasses
three main technologies: (1) database management, (2) mathematical and
statistical analysis and models, and (3) data visualization and display"
(Power 2012).
In science, the term AIS is used in different areas, e.g., in the field of
spatial data processing (e.g., Goodchild (1987) or Paramonov et al.
(2013)), regarding solutions for specific domains like power supply or
budget planning (e.g., Kamaev et al. (2014) or Rego et al. (2015)), or, gen-
erally, as already mentioned, as a synonym for DSS, BI, DW, or OLAP. Thus,
an AIS for a specific type of data is only considered in the field of spatial data
and geographic information systems (GIS). The architectures presented in
the different domain-specific or BI related solutions are based on several
components like databases, integration tools, a meta layer, data ware-
houses, and an application (Teiken 2012, pp. 8–15). A holistic solution en-
capsulating these different components to analyze specific data is not pre-
sented.
3.2 Analyzing Time Interval Data: Different Approaches
Within the field of data analysis several technologies, techniques, and
methodologies have been introduced. From an algorithmic point of view the
developed solutions can be categorized into statistical analysis (i.e., de-
fined by Dodge, Marriott (2006) as "the study of the collection, analysis,
interpretation, presentation and organization of data"), data mining (i.e.,
defined by Fayyad et al. (1996) as "a step in the KDD process that consists
of applying data analysis and discovery algorithms that produce a particu-
lar enumeration of patterns"), machine learning (i.e., defined by Arthur
Samuel in 1959 as " [the] field of study that gives computers the ability to
learn without being explicitly programmed"), and visual analytics (i.e., de-
fined by Thomas, Cook (2005, p. 4) as "the science of analytical reasoning
facilitated by interactive visual interfaces"). Within the context of AIS and
time interval data analysis, the following research topics are of special in-
terest16, i.e., OLAP (section 3.2.1) useful to perform hypothesis testing,
temporal pattern and association rule mining (section 3.2.2) suitable to find
patterns, and visual analytics (section 3.2.3) appropriate to enable the user
to visualize data and discover new insights by using innovative interaction
techniques. Other topics like, e. g., clustering, supervised learning, or re-
gression, known from machine learning or data mining, are not further dis-
cussed nor introduced17.
16 The fields were selected according to the formulated feature requests listed in section 2.2.
17 The information system provides an interface to apply models or algorithms as requested by feature MA-01 and MA-02 (cf. section 2.2.2). Thus, the algorithms or models are not in the focus and are assumed to be applied. Nevertheless, the information system may be used to create models or algorithms by providing data and deeper understandings.
3.2.1 On-Line Analytical Processing
For several years, business intelligence and analytical tools have been
used by managers and business analysts, inter alia, for data-driven deci-
sion support on an operational, tactical, and strategic level. An important
technology used within this field is OLAP, used especially for hypothesis
testing. OLAP enables the user to interact with the stored data by querying
for answers. This is achieved by selecting dimensions, applying different
operations to selections (e.g. roll-up, drill-down, or drill-across), or compar-
ing results. The heart of every OLAP system is a multidimensional data
model (MDM), which defines the different dimensions, hierarchies, levels,
and members (Codd et al. 1993). Recent research dealing with OLAP is
focused on: summarizability problems (Lenz, Shoshani 1997; Mazón et al.
2008, 2009; Niemi et al. 2014) and MDM (Kimball, Ross 2002; Chui et al.
2010; Koncilia et al. 2014; Meisen et al. 2014). In addition, different solu-
tions for specific scenarios were presented, e.g., in the context of big data,
Wang, Ye (2014) introduce an in-memory cluster computing environment
based on a key-value index, Mendoza et al. (2015) present new textual
measures useful to handle unstructured textual information with OLAP, and
Cuzzocrea (2011) proposes a framework to be used to estimate the result
of OLAP queries in uncertain and imprecise data. In the following, the most
relevant developments for the context of AIS and time interval data are pre-
sented, i.e., research addressing summarizability problems and MDM.
Summarizability Problems
In the field of OLAP, researchers discuss the importance of summarizabil-
ity, which "refers to the possibility of accurately computing aggregate val-
ues with a coarser level of detail from values with a finer level of detail"
(Mazón et al. 2008), and the problems occurring when violating it. In addi-
tion, summarizability is a necessary pre-condition for performance optimi-
zation using pre-aggregation techniques (Pedersen et al. 1999). The sum-
marizability problem addresses the issue of violating summarizability,
which is always the case if non-strict hierarchies are used within the multi-
dimensional model. Furthermore, summarizability problems may occur if
non-covering or non-onto hierarchies are defined, depending on the tech-
nique used to support this type of hierarchy within the logical model (cf.
Spaccapietra et al. 2009, p. 73). Figure 3.1 illustrates the different types of
hierarchies.
Figure 3.1: Examples of the different types of hierarchies (non-strict, non-cover-ing, and non-onto).
In general, the summarizability problem denotes the multiplication of a
fact if the fact is associated with multiple members of a higher level (as illus-
trated in Figure 3.1). The problem also arises if a member refers to several
members on a higher level. In both cases, the fact is multiplied within the
aggregation on the higher level. Considering time interval data, the problem
of many-to-many relationships is always present, because a fact of the in-
terval is associated with multiple members of the time dimension (i.e., all time
points the interval covers). Figure 3.2 shows two examples of the summa-
rizability problem. On the left side, the number of patients (fact) is associ-
ated with one or multiple diagnoses (cf. Pedersen (2000), Song et al.
(2001)). When selecting all patients, a non-aware system would return a
number of 29 patients (5 cancer, 12 stroke, and 12 cancer). On the right
side, an example of a time interval is illustrated. In that case, the resources
(fact) associated with the interval are counted multiple times, i.e., once for each
chronon covered by the interval.
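The right-hand example can be illustrated with a minimal sketch (hypothetical data; each interval carries a resources fact of 1, and a summarizability-unaware roll-up counts that fact once per covered chronon of the time dimension):

```python
# Sketch: how a summarizability-unaware aggregation multiplies facts.
# Each interval carries a fact (resources = 1) and covers several chronons.
intervals = [
    {"start": 0, "end": 3, "resources": 1},  # covers chronons 0..3
    {"start": 2, "end": 5, "resources": 1},  # covers chronons 2..5
]

# Naive roll-up: the fact is summed once per covered chronon, so it is
# multiplied by the interval's length.
naive_total = sum(
    iv["resources"] for iv in intervals for _ in range(iv["start"], iv["end"] + 1)
)

# Summarizability-aware roll-up: each interval's fact is counted exactly once.
correct_total = sum(iv["resources"] for iv in intervals)

print(naive_total, correct_total)  # 8 2
```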
Figure 3.2: Two examples of the summarizability problem.
Lately, several proprietary tools like icCube, Microsoft Analysis Ser-
vices, or IBM Cognos implemented the support for non-strict hierarchies
(Russo, Ferrari 2011). As mentioned by Meisen et al. (2014), the presented
implementations are not sufficient when using time interval data. Reasons
are:
– insufficient tooling support (i.e., inadequate lowest granularity and poor
query performance),
– expensive data integration processes (i.e., enormous redundant data
creation, costly discretization of intervals, and unmaintainable configu-
rations),
– non user-friendly query language (i.e., complex language structure
and unsupported temporal semantics), as well as
– inapplicable requirements (i.e., unsupported context specific aggrega-
tions and unsatisfying linkage between intervals and aggregated val-
ues).
Thus, some OLAP applications can interpret non-strict hierarchies and
overcome summarizability problems. However, in the context of time inter-
val data, these solutions are not applicable.
Multidimensional Models
An MDM defines the dimensions, hierarchies, levels, members, and facts
within data. Such a model enables the use of operations like roll-up, drill-
down, slicing, or dicing and facilitates rapid data access using relational
databases (ROLAP), multidimensional array structures (MOLAP), or a hy-
brid implementation (HOLAP). Typically, data integration techniques are
needed to map the raw data to a specified MDM. In addition, further meth-
ods, e.g., data cleansing, data enrichment, or aggregation, are applied
within the integration process to ensure data validity, completeness, and
quality (White 2005). In the field of OLAP, several systems capable of an-
alyzing sequences of data have been introduced over the last years. Chui
et al. (2010) introduced S-OLAP for analyzing sequence data. Liu et al.
(2011) analyzed event sequences using hierarchical patterns, enabling
OLAP on data streams of time point events. Bębel et al. (2012) presented
an OLAP like system enabling time point-based sequential data to be an-
alyzed. Nevertheless, these systems and their models neither support time
intervals, nor temporal operators. Recently, Koncilia et al. (2014) and
Meisen et al. (2014) presented a MDM focusing on time interval data anal-
ysis. Both claim to be the first to present such a model.
Koncilia et al. (2014) presented a system named I-OLAP, claiming to be
the first to propose a model for processing interval data. An interval is de-
fined as the gap between two events18. Furthermore, the introduced meta-
model consists of events, dimensions, hierarchies, members, intervals, se-
quences of intervals, and so-called I-Cubes. A definition of which types of
hierarchies are supported is not presented. Thus, the support of non-strict
hierarchies and how these would be handled is unclear. In addition, Kon-
cilia et al. assume that the intervals of a specific event-type (e.g., apple
falling) for a set of specific properties (e.g., color and weight) are non-over-
lapping and consecutive (i.e., form a non-overlapping sequence of inter-
vals). This assumption is valid in the specific case of event sequences.
Nevertheless, in the more general case of time interval datasets, the as-
sumption of Koncilia et al. is not valid. E.g., assuming a work-area with
18 A more detailed definition of the term event is presented in section 3.2.2.
several workers performing several tasks in parallel19 is one of many pos-
sible scenarios in which the assumption does not hold true.
To support the specific handling of facts and measures, Koncilia et al.
introduce two types of functions, i.e., compute value functions and fact cre-
ating functions, which are used to determine the measure between two consecu-
tive events (i.e., e1 and e2, with e1.t < e2.t, so that there is no other e with
e1.t < e.t < e2.t) for all chronons t which fulfill e1.t < t < e2.t. In addition, two
different aggregation techniques are presented, time point aggregation, as
well as aggregation along time. The former is used to calculate the aggre-
gated value for a specified time point (i.e., chronon) and the latter is used
to determine the aggregated value for a specified time range (cf. TAT intro-
duced in section 2.1.2).
Figure 3.3 illustrates an example supported by I-OLAP. The example
shows several values measured by a temperature sensor (i.e., 3, 4, 2, 5,
and 1; shown as dots). To determine the intervals between the gaps of the
events, the already mentioned compute value functions and fact creating
functions are applied. In the example shown in Figure 3.3 the average func-
tion is used to determine the value for each chronon (i.e.,
(e1.value + e2.value) · 0.5).
Figure 3.3: Illustration of a scenario covered by I-OLAP as presented by Koncilia et al. (2014).
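The computation behind Figure 3.3 can be sketched as follows (the event times are hypothetical; the average function (e1.value + e2.value) · 0.5 follows the example above):

```python
# Sketch of I-OLAP's compute value function idea: the value for every
# chronon between two consecutive events is derived from both events.
events = [(0, 3), (4, 4), (6, 2), (9, 5), (12, 1)]  # (chronon, measured value)

def interval_values(events):
    """Return a value per chronon for each gap between consecutive events."""
    values = {}
    for (t1, v1), (t2, v2) in zip(events, events[1:]):
        for t in range(t1 + 1, t2):        # chronons t with e1.t < t < e2.t
            values[t] = (v1 + v2) * 0.5    # average of the two events
    return values

print(interval_values(events))
```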
19 Analyzing several time interval data from service providers showed that even one worker performs several tasks simultaneously (e.g., check-in and customer service).
In summary, it can be stated that the model defined in the context of I-
OLAP:
– supports the TAT aggregation technique,
– can be used to define measures computed from events and intervals,
– is limited regarding the supported data, i.e., only sets of intervals over
sequential data are considered,
– does not define which types of hierarchies are supported (cf. sec-
tion 3.2.1: Summarizability Problems),
– does not introduce the handling of temporal aspects (cf. DA-02, DA-03,
DA-06, or DA-07),
– may not be capable of supporting all required aggregation methods, e.g.,
MEAN or MEDIAN (cf. DA-01), and
– cannot be applied to larger datasets in a performant way, i.e., the pre-
sented ideas and remarks suggest that the runtime is at least polyno-
mial in the number of intervals.
At the same time as the presentation of I-OLAP, Meisen et al. (2014)
presented the TIDAMODEL. The introduced model covers all types of hierar-
chies (i.e., non-strict, non-covering, and non-onto). In addition, a perfor-
mant implementation capable of overcoming summarizability problems is
outlined and further specified in Meisen et al. (2015b). In chapter 4 of this
book, the TIDAMODEL is introduced and discussed in detail. In addition, sev-
eral new aspects not addressed by Meisen et al. (2015b) are introduced
and aligned against the requests mentioned in section 2.2.
3.2.2 Temporal Pattern Mining & Association Rule Mining
Research in the field of data mining and in the context of time interval da-
tasets mainly focuses on temporal pattern mining and association rule mining
(Moerchen 2009; Papapetrou et al. 2009). The different mining algorithms
presented over the last years differ in the representation of time interval
data (i.e., the model used to represent the time intervals), the type of pat-
terns searched for (i.e., frequent, closed, or complete-set patterns, cf. Hu
et al. (2010), Chen et al. (2011)), the performance (i.e., number of data-
base scans needed), or constraints (i.e., applying specific constraints to
the patterns to find, cf. Laxman et al. (2007), Peter, Höppner (2010)). In
addition, other topics like clustering (Guyet, Quiniou 2008; Fricker et al.
2011), classification (Batal et al. 2011), or predictions are of interest to re-
search.
In this book, the primary focus is on the application of the algorithms
presented in the context of mining time interval datasets (cf. section 2.2).
Thus, the information system must be capable of providing the time intervals
in a way such that the mining algorithms can be applied. All mining techniques
regarding temporal sequential pattern mining or temporal association rule
mining in the context of time interval data are based on a definition provided
by Papapetrou et al. (2005). Papapetrou et al. were among the first to intro-
duce the problem of "discovering frequent arrangements of temporal inter-
vals". The problem stated by Papapetrou et al. is based on so-called e-se-
quences. An e-sequence is a (temporally) ordered set of events, whereby
an event is defined by a start value, an end value, as well as a label. In
addition, an e-sequence database is defined as a set of e-sequences. The
definition of an event given by Papapetrou et al. is close to the definition of
an interval outlined in section 2.1.1 and the formal definition presented in
section 4.3. In addition to the definition of Papapetrou et al., the definition
presented in this book allows the categorization of an event by multiple
properties20 (i.e., labels), as well as the assignment of facts (i.e., values
which can be aggregated).
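The definitions above can be sketched in a few lines (the names and example data are illustrative, not taken from the original paper):

```python
# Sketch of an e-sequence following Papapetrou et al.'s definition: a
# temporally ordered set of events, each with a start, an end, and a label.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    start: int
    end: int
    label: str

def make_e_sequence(events):
    """An e-sequence orders its events temporally (by start, then end)."""
    return sorted(events, key=lambda e: (e.start, e.end))

# An e-sequence database is defined as a set of e-sequences.
db = [
    make_e_sequence([Event(5, 9, "B"), Event(1, 4, "A")]),
    make_e_sequence([Event(2, 6, "A"), Event(3, 8, "C")]),
]
print(db[0][0].label)  # A
```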
In summary, the model of time interval data commonly used in
the field of pattern or association rule mining does not recognize any di-
mensional aspects. Nevertheless, regarding the increasing usage of di-
mensional information within the field of pattern mining - often referred to
20 One may argue that the support of a single label is sufficient. In the context of pattern min-
ing, multiple labels might be transformed to a concatenated single label. However, applying dimensional information within the mining process is not possible. Thus, the differentiation is mentioned at this point.
as on-line analytical mining (OLAM, cf. Han et al. (1999)) - it will only be a
matter of time until algorithms take hierarchies into account when search-
ing for patterns within time interval datasets.
3.2.3 Visual Analytics
The term visual analytics was coined by Wong, Thomas (2004).
In general, visual analytics has the purpose of analytical reasoning by us-
ing interactive visual interfaces (Thomas, Cook 2005). To create a good
interactive visual interface, Shneiderman (1996) stated that "a useful start-
ing point for designing advanced GUIs is the Visual Information-Seeking
Mantra: overview first, zoom and filter, then details on demand". In addition,
Shneiderman stated that a good visualization is task dependent. Thus, the
key task of an information system is to provide aggregated information in
real-time and requested filtered data on demand. To achieve that, a flexible
and performant data structure is necessary (cf. section 3.3). To enable the
creation of task dependent visual interfaces, it is also necessary that the
information system offers an interface to request and receive data (cf. VIS-
01, VIS-05). Several proprietary software tools are commonly used to cre-
ate such interfaces, e.g., Tableau©21, Google Fusion Tables22, or Datawrap-
per23. Nevertheless, several publications introduce new visualization tech-
niques in the field of time interval data analysis, so far unsupported by any
proprietary software.
Aigner et al. (2007) give an overview of the variety of techniques pre-
sented over the last years that are useful to visualize time-oriented data, in-
cluding time interval data. One of the techniques presented in the context
of time interval data is the Cluster Viewer introduced by van Wijk, van Se-
low (1999). The visualization shows a combined representation of daily pat-
terns and clusters, whereby patterns are shown as graphs and clusters are
shown on a calendar. Lammarsch et al. (2009) introduced an interactive
21 http://www.tableau.com/
22 https://support.google.com/fusiontables/
23 https://datawrapper.de/
visual method incorporating the structures of time within a pixel-based vis-
ualization called GROOVE (granular overview overlay). The visualization
enables the users to gain new insights into different temporal patterns by
interactively changing the order of granularities while keeping the same set
of granularities. Figure 3.4 shows examples of the two visualization tech-
niques.
Figure 3.4: Examples of the visualization techniques Cluster Viewer (van Wijk, van Selow 1999) and GROOVE (Lammarsch et al. 2009).
Regarding the handling of time-oriented data within the context of visual
analytics, Rind et al. (2013) developed a software library called
TimeBench24. The library provides data structures and algorithms to handle
time-oriented data in the context of visual analytics. TimeBench is available
as open-source project and the underlying data model is based on a dis-
crete, linear, bounded temporal model (cf. section 2.1.3). Furthermore, the
implementation utilizes relational data tables and time-specific indexing
structures to increase performance. As mentioned by the authors, it is "de-
signed mainly for developing research prototypes". Considering the perfor-
mance, the publication mentions runtime tests with up to 5,115 temporal
objects. Thus, the library has not been tested using larger datasets (i.e.,
several million temporal objects as the real life dataset used in section 8.2).
24 http://www.timebench.org
In general, different techniques (e.g., binned aggregations, statistical sum-
maries, or sampling) are used to realize real-time visualization of large da-
tasets (Liu et al. 2013). To apply these techniques, pre-aggregates are cal-
culated and held in memory. Thus, the possibility of calculating and provid-
ing pre-aggregates may be an important feature when applying visual an-
alytics on large datasets (cf. DC-03).
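The idea of binned aggregation can be sketched as follows (bin width and data are illustrative):

```python
# Sketch of binned aggregation (cf. Liu et al. 2013): a large dataset is
# pre-aggregated into fixed-width bins, so a visualization only has to
# render the small set of pre-aggregates instead of every raw point.
def binned_counts(timestamps, bin_width):
    """Pre-aggregate raw timestamps into per-bin counts."""
    bins = {}
    for t in timestamps:
        bins[t // bin_width] = bins.get(t // bin_width, 0) + 1
    return bins

# 10,000 raw points collapse into a handful of drawable bins.
data = [i % 97 for i in range(10_000)]
print(binned_counts(data, bin_width=25))
```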
3.3 Performance Improvements
The performance of an implementation is typically improved by optimiza-
tion, i.e., using enhanced, faster, and specialized algorithms. In the case of
an information system useful for time interval data analysis, the algorithmic
part is one optimization criterion. However, the system - as an information pro-
vider - has to ensure that the requested data is provided as fast as possible
(cf. section 3.2.3). Thus, special data structures, i.e., indexes, have to be imple-
mented to ensure fast data retrieval. In addition, the aggregation of data
is one of the pre-dominant operations used in the context of data analysis
(cf. section 2.1.2). Therefore, increasing the performance of aggregate
computation or pre-computing frequently used aggregates are
other possibilities to increase performance. Finally, caching strategies can
be applied to increase the performance.
In the following sections, the current state of the art regarding the men-
tioned capabilities available to increase the system’s performance is intro-
duced. In section 3.3.1, different indexing techniques used within the context
of temporal data are introduced. In section 3.3.2, ideas on how to increase
aggregation performance are presented, and in section 3.3.3, different cach-
ing strategies are discussed.
3.3.1 Indexing Time Interval Data
In general, an index is a data structure used to increase the query perfor-
mance when retrieving data from a dataset (or a database). Typically, the
increased performance for the retrieval decreases the performance when
inserting or updating data. The reason is the additional effort needed to
insert or update the index (i.e., the data structure) based on the added or
modified data. Depending on the type of data (e.g., primitive, strings, ob-
jects, key-value pairs, documents, spatial, temporal, or multimedia), the
storage type (i.e., main memory, secondary storage, clustered, or distrib-
uted), as well as the type of usage (e.g., mostly data retrieval vs. excessive
data updates/inserts) numerous data structures and handling strategies
(e.g., query optimization, pre-aggregates, or join-indexes) were presented
over the last decades (DeWitt et al. 1984; Chan, Ioannidis 1998; Gui et al.
2011; Garcia-Molina et al. 2014, pp. 333-360; 607-688). Regarding the field
of time intervals, several indexes were introduced enhancing the perfor-
mance when retrieving data using specific temporal operators. In general,
the different types of indexes can be categorized as tree-based or bitmap-
based.
Tree-Based Indexes
The IntervalTree (Edelsbrunner, Maurer 1981; Kriegel et al. 2001; Enderle
et al. 2004) is a tree-based data structure, which is optimized for overlap-
queries (i.e., which of the stored intervals overlap with a given interval). Never-
theless, the tree is capable of supporting all 13 temporal operators (Kriegel et
al. 2001). The relational implementation (Enderle et al. 2004) is based on
two B+-tree indexes (Bayer, McCreight 1972) and processes queries ap-
plying two steps. In a first step, the interval query is translated into several
range queries. Combining these queries into a single valid SQL query, which
is processed by the underlying DBMS, is the second, final step.
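The overlap-query idea can be sketched with a minimal in-memory interval tree; this is a simplified illustration of the data structure, not the relational implementation by Enderle et al.:

```python
# Minimal sketch of an interval tree answering overlap-queries: each
# node stores the intervals containing its center point; intervals left
# and right of the center go into subtrees.
class IntervalTree:
    def __init__(self, intervals):
        self.root = self._build(sorted(intervals))

    def _build(self, ivs):
        if not ivs:
            return None
        center = ivs[len(ivs) // 2][0]
        left = [iv for iv in ivs if iv[1] < center]
        right = [iv for iv in ivs if iv[0] > center]
        mid = [iv for iv in ivs if iv[0] <= center <= iv[1]]
        return {"center": center, "mid": mid,
                "left": self._build(left), "right": self._build(right)}

    def overlapping(self, lo, hi, node="root"):
        """Return all stored intervals that overlap [lo, hi]."""
        node = self.root if node == "root" else node
        if node is None:
            return []
        hits = [iv for iv in node["mid"] if iv[0] <= hi and lo <= iv[1]]
        if lo < node["center"]:
            hits += self.overlapping(lo, hi, node["left"])
        if hi > node["center"]:
            hits += self.overlapping(lo, hi, node["right"])
        return hits

tree = IntervalTree([(1, 4), (3, 8), (10, 12)])
print(sorted(tree.overlapping(4, 9)))  # [(1, 4), (3, 8)]
```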
Another data structure introduced to store interval data is the Seg-
mentTree (Bentley 1977). The structure is based on a segmentation of the
underlying time axis (i.e., a partition of the time axis induced by the distinct
values of the intervals’ endpoints). Each node of the binary tree is a union
of its children. In general, the tree is optimized to perform contain-queries
(i.e., which of these intervals contain a given time point). Several optimiza-
tions for, e.g., higher dimensions or other temporal operators were pre-
sented during the last years (Berg et al. 2008; Dignös et al. 2014).
Bitmap-Based Indexes
In addition to the tree-based indexes, different bitmap-based indexes were
introduced within the field of data analysis and the area of DW, as well as
DSS. A bitmap is an array-like data structure containing 0s and 1s. In general,
a 1 indicates that the entity associated with the position in the array is an
element of the set. A bitmap-index uses this feature, creating a bitmap for
each possible value of a property of an entity. Figure 3.5 illustrates a bit-
map-index for a color-property having three possible values: red, green, or
yellow. The bitmap-index indicates that the apple associated with position 3
(zero-based) is red.
Figure 3.5: Example of a bitmap-index containing three bitmaps, one for each possible value (i.e., red, green, and yellow) of the color-property.
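The index of Figure 3.5 can be sketched as follows (plain Python integers stand in for the bitmaps; the apple colors are example data):

```python
# Sketch of a bitmap-index: one bitmap per possible value of the color
# property; a set bit at position i means the apple at position i has
# that color.
colors = ["green", "yellow", "green", "red", "yellow"]  # apples 0..4

# Build one bitmap (a plain int used as a bit array) per value.
index = {}
for pos, value in enumerate(colors):
    index[value] = index.get(value, 0) | (1 << pos)

def positions(bitmap):
    """Decode a bitmap into the zero-based positions of its set bits."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

print(positions(index["red"]))  # [3]  (the apple at position 3 is red)
```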
Several different bitmap implementations have been published over re-
cent years, differing regarding the compression or encoding schemes. The
selection of the right schemes is crucial, considering the performance
gained and storage needed. Important criteria to select the best compres-
sion and encoding scheme are the queries to be expected (Chan, Ioannidis
1999), the order of data (Lemire, Kaser 2011), and the complexity consid-
ering the logic operations used within queries (Kaser, Lemire 2014).
Wu et al. (2009) implemented FastBit, a software tool used to query
scientific data efficiently using bitmap indices, out-performing popular com-
mercial DBMS in selected scenarios by a factor higher than ten. In addition,
several compression schemes based on run-length encoding (RLE) were
introduced, i.e., PLWAH (Deliège, Pedersen 2010), CONCISE (Colantonio,
Di Pietro 2010), EWAH (Lemire et al. 2010), and PWAH (van Schaik, Moor
2011). Recently, Chambi et al. (2015) presented a compression scheme
named Roaring based on packed arrays for compression instead of RLE.
Several evaluations indicate that Roaring can increase the performance by
a factor of 25 (Chambi et al. 2015; Meisen et al. 2015b).
Considering encoding schemes, Chan, Ioannidis (1999) introduced four
encoding schemes25: the equality, range, interval, and membership encoding
scheme. The schemes define the constraints to be evaluated, i.e., equality:
propvalue = v, range: propvalue ≤ v, interval: v1 ≤ propvalue ≤ v2, and membership:
propvalue ∈ {v1, …, vn}. The presented encoding schemes are meant to be
used with discrete point data and are not directly applicable to time interval
data. In addition, Stockinger et al. (2004) developed evaluation strategies
to optimize the usage of bitmap-indexes for floating numbers by utilizing
binned bitmaps.
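The difference between equality and range encoding can be sketched with hypothetical data:

```python
# Sketch of equality vs. range encoding (cf. Chan, Ioannidis 1999): with
# equality encoding, the bitmap for v marks rows where prop == v; with
# range encoding, it marks rows where prop <= v, so a constraint
# prop <= c needs only a single bitmap.
values = [2, 0, 3, 1, 2]   # property value per row (example data)
domain = range(4)          # possible values 0..3

equality = {v: [1 if x == v else 0 for x in values] for v in domain}
range_enc = {v: [1 if x <= v else 0 for x in values] for v in domain}

# Query "prop <= 1": equality encoding must OR two bitmaps,
# range encoding reads a single bitmap directly.
le1_equality = [a | b for a, b in zip(equality[0], equality[1])]
le1_range = range_enc[1]
print(le1_equality, le1_range)  # [0, 1, 0, 1, 0] twice
```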
Considering temporal data, Roh et al. (2012) introduced an efficient bit-
map-based index for time-based interval sequences. The index aims to in-
crease the performance of similarity searches. Roh et al. assume that an
interval sequence consists of non-overlapping and consecutive events, e.g.,
phone calls handled by an operator. As mentioned and argued in section
3.2.1, this assumption is generally not valid. The first bitmap-based index
for time interval data was proposed by Meisen et al. (2015b). The index is
based on an array-like structure partitioning the time axis into its chronons
utilizing compressed bitmaps (Lemire, Kaser 2011) for each partition. The
index is presented and explained in detail in section 7.3.2.
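A strongly simplified sketch of the underlying idea (plain integers stand in for the compressed bitmaps of the actual implementation; the intervals are hypothetical):

```python
# Sketch of a chronon-partitioned bitmap index for time intervals: one
# bitmap per chronon marks which intervals cover it, so a
# contain-query is a single bitmap lookup.
intervals = [(0, 2), (1, 4), (3, 5)]   # (start, end), inclusive
axis = range(6)                        # chronons 0..5

index = {t: 0 for t in axis}
for iv_id, (start, end) in enumerate(intervals):
    for t in range(start, end + 1):
        index[t] |= 1 << iv_id         # set the bit of the covering interval

def covering(t):
    """Contain-query: ids of all intervals covering chronon t."""
    return [i for i in range(len(intervals)) if index[t] >> i & 1]

print(covering(1), covering(3))  # [0, 1] [1, 2]
```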
25 The encoding scheme is the definition determining which of the bits are set to 1 in each
bitmap of an index.
3.3.2 Aggregating Time Interval Data
Aggregating data is one of the pre-dominant operations used in data anal-
ysis. To speed-up the execution of queries, techniques such as pre-com-
puting aggregates (Pedersen et al. 1999) or materialized views (Gupta,
Mumick 1999) have been proposed. In this section, techniques to increase
aggregation performance are introduced. The different aggregation tech-
niques (i.e., ITA, MWTA, STA, and TAT) are introduced in section 2.1.2.
In the field of temporal databases, Kline, Snodgrass (1995) presented
a data structure called the AggregationTree, useful to store temporal aggre-
gates along pre-defined levels of the time dimension. Over the past years,
different enhancements for the different forms of temporal aggregations (cf.
section 2.1.2) were presented (Zhang et al. 2001; Zimányi 2006; Zhang et
al. 2008; Gordevicius et al. 2012). Furthermore, other data structures like
the balanced tree (Moon et al. 2003), the SB-Tree (Yang, Widom 2003),
or the multi-version SB-Tree (Zhang et al. 2001; Tao et al. 2004) were intro-
duced. Nevertheless, these solutions typically focus on one aggregation op-
erator (e.g., SUM(A)), do not support complex expressions (e.g.,
MAX(SUM(A + B))), cannot handle multiple filter criteria (e.g., aggregating
all red apples), or do not consider data gaps (e.g., missing values cannot
be handled). Böhlen et al. (2006) presented a tree-based implementation
for a temporal multi-dimensional aggregation technique (TMDA). The de-
fined TMDA operator supports ITA and MWTA aggregations, as well as dif-
ferent aggregation operators. Nevertheless, MODE or MEDIAN, along with
complex expressions are not supported. In addition, the presented imple-
mentation does not clarify how filter criteria are recognized.
Regarding the usage of bitmap-indexes, several publications introduce the capability to speed up aggregation queries (Kaser, Lemire 2014). In addition, the result of an aggregation using bitmap indexes can easily be kept in memory and reused when applying further operations to the result, e.g., a drill-down (Abdelouarit et al. 2013). Recently, Meisen et al. (2015b) introduced a bitmap-based implementation for TAT (cf. section 2.1.2). The implementation utilizes the bitmaps used for indexing, together with the logical and aggregation operators available for bitmaps (i.e., AND, OR, XOR, NOT, and COUNT). The algorithm distinguishes between three strategies depending on the property to be aggregated. A detailed explanation is presented in section 7.3.4.
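The interplay of the logical and aggregation operators can be sketched for a simple filtered COUNT. This is an assumed, simplified illustration of the general mechanics (not the cited implementation): Python integers serve as bitmaps, record i corresponds to bit i, and AND/COUNT map to `&` and counting the set bits.

```python
def count_per_chronon(time_bitmaps, filter_bitmap):
    # AND each chronon bitmap with the filter bitmap, then COUNT the set bits
    return [bin(b & filter_bitmap).count("1") for b in time_bitmaps]

# chronon bitmaps: which of the records 0..3 are active in each chronon
time_bitmaps = [0b0011, 0b0111, 0b1110]
# filter bitmap: records 0, 1, and 2 match the descriptive filter criteria
filter_bitmap = 0b0111

counts = count_per_chronon(time_bitmaps, filter_bitmap)
print(counts)       # [2, 3, 2]
print(max(counts))  # 3, i.e., a two-step MAX(COUNT(...)) aggregation
```

The two-step combination shown in the last line corresponds to the MAX(COUNT(...)) style of aggregation used throughout the running example of this chapter.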
3.3.3 Caching Time Interval Data
Caching data can drastically increase the performance of an information system. An important criterion is the frequent usage of the same data (e.g., the same query or the same data entity). In addition, incremental calculations (i.e., reusing a previous result) can be boosted by the utilization of a cache. Research focuses on different aspects of caching, i.e., types of caches (e.g., CPU, GPU, or main memory; cf. Handy (1998)), cache algorithms (e.g., random replacement (RR), least recently used (LRU), or most recently used (MRU); cf. Al-Zoubi et al. (2004)), or cache handling (i.e., coherence, coloring, or virtualization; cf. Hashemi et al. (1997), Sorin et al. (2011)).
In the field of information systems, the focus considering caching is on the utilized cache algorithm. As already mentioned, several different algorithms were introduced over the last decades, defining which elements to discard when the cache is full and new ones should be added. In general, the most frequently used algorithms are the ones already mentioned, i.e., RR, LRU, and MRU. LRU and MRU are both algorithms which need a statistic to be maintained and updated whenever an item is retrieved from or discarded from the cache. In contrast, the RR algorithm does not need any additional implementation effort when being utilized (Zhou 2010).
In research, a specific caching strategy for time interval data has not been investigated, and such a strategy is also not discussed in this book. Instead, different caching implementations to ensure a fast retrieval from secondary memory are discussed (cf. section 7.1.2), an extendable framework is introduced (cf. section 7.3.3), and the use of cache algorithms is discussed (cf. section 8.2.2).
3.4 Analytical Query Languages for Temporal Data
A query language is generally utilized to retrieve data from, manipulate
data of, or define the schema of data contained in a dataset. In addition,
some statements defined within the query language may be used for au-
thorization purposes (e.g., grant access to a specific type of data) or or-
ganizational tasks (e.g., start and stop a transaction or use a bulk load).
Regarding temporal datasets, several query languages were defined within
the context of temporal databases, e.g., IXSQL (Lorentzos, Mitsopoulos
1997), ATSQL2 (Böhlen et al. 1995; Guo et al. 2010), SQL/TP (Toman
2000), or TSQL2 (Snodgrass 1995). More general query languages, like the multidimensional expressions (MDX) defined for OLAP or the structured query language (SQL) used in the context of relational databases, are often used by analysts to solve analytical issues (Spofford 2006; Chamberlin, Boyce 1976). Recently, a formal language for time interval data analysis
named TIDAQL was introduced by Meisen et al. (2015a). The language
TIDAQL is introduced and discussed in detail in chapter 5.
In the following, several statements are presented, each retrieving the needed resources for specific work-areas (i.e., the work-areas of the department GH) and task types within each hour of a specific day (i.e., the first of January 2015). The statements are formulated using different languages, i.e., MDX, ATSQL226, SQL, and TIDAQL. In addition, the issues arising when using these types of query languages to analyze time interval data are explained27. The used database and the question to be answered are illustrated in Figure 3.6. The figure shows the intervals of the database for the specified day (already filtered by the specified work-area for clarity) and the expected answer, i.e., the needed resources for each hour of the day for each work-area and task-type group (these are GH.Cleaning, long;
26 ATSQL2 is a query language supported by the only currently available temporal database system (TimeDB, http://www.timeconsult.com/Software/Software.html).
27 The processing performance is not considered an issue in this chapter. A detailed evaluation of the processing performance of different systems using different languages is presented in section 8.2.5.
GH.Cleaning, average; and GH.Cleaning, short).
Figure 3.6: Illustration of the question to be answered by the query: "How many resources are needed within each hour of the first of January 2015?"
The MDX statement used to retrieve the data from a cube having a
TIME, ORGA (i.e., WORKAREA), and TASK (i.e., TASKTYPE) dimension
defined, as well as a simple count measure is shown in Listing 3.1.
Listing 3.1: MDX statement used to answer the question regarding the needed resources.
WITH
MEMBER [MEASURES].[NEED] AS
MAX(DESCENDANTS([TIME].[RASTER].CurrentMember, , LEAVES),
[MEASURES].[COUNT]), FORMAT_STRING = '#.##'
SELECT
CROSSJOIN(FILTER([ORGA].[UNIT].[WORKAREA].Members,
INSTR([ORGA].[UNIT].CurrentMember.UniqueName,
'[ORGA].[UNIT].[All].&[GH.') > 0),
{[TASK].[DUR].[TYPE]}) ON COLUMNS,
CROSSJOIN([TIME].[RASTER].[DAY].Children, [MEASURES].[NEED]) ON ROWS
FROM [GH_DATA]
The first part (i.e., WITH MEMBER) of the statement defines the measure
used to calculate the maximum of all count-values for all leaves of the cur-
rent member. The second part (i.e., SELECT) specifies the dimensions to
be selected in the result, which are the filtered work-area (i.e.,
[ORGA].[UNIT].[WORKAREA].Members) and the task’s type (i.e.,
[TASK].[DUR].[TYPE]) on the columns, as well as the hours (i.e.,
[TIME].[RASTER].[DAY].Children) on the rows. Besides the obvious com-
plexity of the statement, the following issues regarding the query language
should be considered:
– the query is not intuitive (regarding, e.g., the order or the combination
of members), i.e., only an expert may be capable of understanding and
formalizing it,
– the name-based filter has to be applied using a special FILTER function
instead of being defined within the WHERE part, and
– the calculation of the measure is not intuitive and error-prone, i.e., the
selection of the children of the lowest granularity.
In addition, the result may be incorrect if summarizability problems occur,
which is the case if the used tool does not support non-strict relationships.
Figure 3.7 illustrates the incorrect (left side) and the correct result (right
side), using the sample dataset.
Figure 3.7: Comparison of the result of the query from a system supporting non-strict relationships (right) and one that does not (left).
The ATSQL2 language was defined in the field of temporal databases as an extension of SQL. The syntax distinguishes between temporal and standard statement modifiers. The language itself supports neither dimensional aspects nor two-step aggregations. Thus, it is difficult to realize the mentioned query. In addition, the only available tool (i.e., TimeDB) does not support all language features, e.g.:
– the supported aggregation forms are limited to ITA and MWTA (i.e., constant intervals),
– LIKE expressions cannot be used as filter criteria,
– ORDER BY is not applicable, and
– multiple filter criteria for the same attribute are not considered.
Nevertheless, Listing 3.2 shows the ATSQL2 statement determining the intermediate count-results for each minute and combining the intermediate results of an hour using MAX.
Listing 3.2: ATSQL2 statement used to answer the question regarding the needed resources.
NONSEQUENCED VALIDTIME
PERIOD [DATE 2015/1/1~00:00:00-DATE 2015/1/1~01:00:00)
SELECT WORKAREA, TASKTYPE, MAX(VALUE)
FROM (
VALIDTIME PERIOD [DATE 2015/1/1~00:00:00‐DATE 2015/1/1~00:01:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
VALIDTIME PERIOD [DATE 2015/1/1~00:59:00‐DATE 2015/1/1~01:00:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
) HOUR_01 GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
NONSEQUENCED VALIDTIME
PERIOD [DATE 2015/1/1~23:00:00‐DATE 2015/1/2~00:00:00)
SELECT WORKAREA, TASKTYPE, MAX(VALUE)
FROM (
VALIDTIME PERIOD [DATE 2015/1/1~23:00:00‐DATE 2015/1/1~23:01:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
UNION
[…]
UNION
VALIDTIME PERIOD [DATE 2015/1/1~23:59:00-DATE 2015/1/2~00:00:00)
SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE
) HOUR_24 GROUP BY WORKAREA, TASKTYPE
The ATSQL2 query is not flexible regarding the selected dimensional level and the time-window. In addition, writing such a query manually is significantly difficult because of the number of statements to be united (i.e., one for each chronon). Nevertheless, the query could easily be generated programmatically using a loop (i.e., iterating over the chronons and grouping them by the selected dimensional level).
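Such a generator can be sketched as follows: one inner SELECT per chronon (minute), grouped into one outer MAX query per hour. The quoted syntax fragments follow Listing 3.2; the generator itself is an assumption for illustration, not part of the cited work.

```python
from datetime import datetime, timedelta

def ts(d):
    # ATSQL2-style timestamp literal, e.g., DATE 2015/1/1~00:00:00
    return f"DATE {d.year}/{d.month}/{d.day}~{d:%H:%M:%S}"

INNER = ("SELECT WORKAREA, TASKTYPE, COUNT(VALUE) FROM GH_DATA\n"
         "WHERE WORKAREA LIKE 'GH.%' GROUP BY WORKAREA, TASKTYPE")

def hour_query(day, hour):
    start = day + timedelta(hours=hour)
    minutes = []
    for minute in range(60):  # one statement per chronon
        s = start + timedelta(minutes=minute)
        minutes.append(
            f"VALIDTIME PERIOD [{ts(s)}-{ts(s + timedelta(minutes=1))})\n{INNER}")
    return (f"NONSEQUENCED VALIDTIME\n"
            f"PERIOD [{ts(start)}-{ts(start + timedelta(hours=1))})\n"
            f"SELECT WORKAREA, TASKTYPE, MAX(VALUE)\nFROM (\n"
            + "\nUNION\n".join(minutes)
            + f"\n) HOUR_{hour + 1:02d} GROUP BY WORKAREA, TASKTYPE")

day = datetime(2015, 1, 1)
query = "\nUNION\n".join(hour_query(day, h) for h in range(24))
print(query.count("UNION"))  # 24*59 inner + 23 outer = 1439
```

The size of the generated statement (1439 UNIONs for a single day at minute granularity) illustrates why such queries are impractical to write, and hard to maintain, by hand.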
The next statement presented utilizes SQL to retrieve an answer regarding the needed resources. Listing 3.3 shows the statement, which is based on additional PL/SQL functions and data types (cf. appendix: Pipelined Table Functions (PL/SQL Oracle)). The statement creates a virtual table (i.e., TABLE(F_DATES([…]))) containing all the chronons within a specific time-window. These chronons are combined with the descriptive values (i.e., WORKAREA and TASKTYPE) using cross joins. The resulting table is joined with the actual interval data, which is finally grouped in two steps (i.e., first counting and then determining the maximum). The query itself has to be substantially adapted whenever the descriptive values change (i.e., instead of looking for work-areas and task types). In summary, such a statement may be formalized by an expert to retrieve some insights (as mentioned, the performance is not considered at this point).
Listing 3.3: SQL statement used to answer the question regarding the needed resources. The presented solution is based on additional PL/SQL functions and data types which are shown in the appendix (cf. Pipelined Table Functions (PL/SQL Oracle)).
SELECT
"DATA"."HOUR" "HOUR", "DATA"."WORKAREA" "WORKAREA",
"DATA"."TASKTYPE" "TASKTYPE", MAX("DATA"."COUNT") "NEED"
FROM
(SELECT
META."START" "DATE", META."HOUR" "HOUR", META.WORKAREA "WORKAREA",
META.TASKTYPE "TASKTYPE", COUNT(1) "COUNT"
FROM
(SELECT
WORKAREAS.WORKAREA "WORKAREA", TASKTYPES.TASKTYPE "TASKTYPE",
DATES.start_date "START", DATES.end_date "END",
TO_DATE(TO_CHAR(DATES.start_date, 'yyyy‐MM‐dd hh24'),
'yyyy‐MM‐dd hh24') "HOUR"
FROM
(SELECT DISTINCT WORKAREA FROM GH_DATA
WHERE WORKAREA LIKE 'GH.%') WORKAREAS,
(SELECT DISTINCT TASKTYPE FROM GH_DATA) TASKTYPES,
TABLE(F_DATES(
TO_DATE('2015‐01‐01', 'yyyy‐MM‐dd'),
TO_DATE('2015‐01‐02', 'yyyy‐MM‐dd'))
) DATES
) META LEFT OUTER JOIN GH_DATA INTERVALS ON
META."START" < INTERVALS."END" AND
META."END" > INTERVALS."START" AND
META.WORKAREA = INTERVALS.WORKAREA AND
META.TASKTYPE = INTERVALS.TASKTYPE
GROUP BY META."START", META."HOUR", META.WORKAREA, META.TASKTYPE
) "DATA"
GROUP BY "DATA"."HOUR", "DATA".WORKAREA, "DATA".TASKTYPE
ORDER BY "DATA"."HOUR", "DATA".WORKAREA, "DATA".TASKTYPE
Last but not least, the query using TIDAQL is formalized in Listing 3.4.
As mentioned, the language itself is presented in detail in chapter 5 and is
illustrated here for the sake of completeness.
Listing 3.4: The TIDAQL statement used to answer the question regarding the needed resources.
SELECT TIMESERIES OF
MAX(COUNT(TASKTYPE)) AS "NEED" ON TIME.RASTER.HOUR
FROM GH_DATA IN [2015‐01‐01, 2015‐01‐02)
GROUP BY WORKAREA, TASKTYPE INCLUDE {('GH.*')}
3.5 Similarity of Time Interval Data
DA-07 formulates the requirement that an analyst has to be able to find similar situations within the provided dataset. To implement and fulfill the requested feature, it is necessary to define what similarity means. Regarding sets of temporal interval data, three similarity measures have been defined: (1) an implementation based on relations among the intervals named ARTEMIS (Kostakis et al. 2011), (2) an approach based on dynamic time-warping (DTW) (Kostakis et al. 2011), and (3) IBSM (Kotsifakos et al. 2013), a similarity measure based on the number of so-called active intervals. In the following, the three measures are introduced.
The similarity of ARTEMIS is defined on the basis of Allen's interval relations (cf. section 2.1.4). ARTEMIS calculates the distance between two sets of determined event-interval relations using the Hungarian algorithm (Kuhn 1955), i.e., the minimal assignment costs are defined as the distance. To speed up the distance calculation, Kostakis et al. introduce a lower bound for ARTEMIS, useful when searching for, e.g., the k-nearest neighbors (k-NN). Figure 3.8 illustrates the calculation of the ARTEMIS distance.
Figure 3.8: The ARTEMIS distance calculated for two interval-sets S and T.
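The assignment step at the core of the ARTEMIS distance can be sketched as follows. A brute-force search over all permutations stands in for the Hungarian algorithm (Kuhn 1955), and the cost matrix between the relation sets of the two interval sets is illustrative, not taken from the publication.

```python
from itertools import permutations

def minimal_assignment_cost(cost):
    # cost[i][j]: cost of matching element i of set S to element j of set T;
    # the distance is the cost of a minimal one-to-one assignment
    n = len(cost)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        best = min(best, total)
    return best

# illustrative pairwise costs between the event-interval relations of S and T
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(minimal_assignment_cost(cost))  # 5
```

The Hungarian algorithm computes the same minimum in O(n^3) instead of O(n!), which is what makes ARTEMIS practicable for larger interval sets.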
In addition, Kostakis et al. present a distance measure based on DTW (cf. Keogh, Ratanamahatana (2005)). This measure relies on a sequence of vectors created for an interval set. Each vector is derived from the start and end values of the intervals, i.e., the vector contains a 1 if the interval covers the chronon and 0 otherwise. Each interval has a specific pre-defined position within the vector, and a vector is created for each chronon at which a state change of an interval occurs (i.e., an interval starts or ends). The distance of two vector sequences is calculated using the vector-based DTW distance. Figure 3.9 exemplifies the calculation of the DTW distance for two interval sets. The figure shows the determined vector sequences and the mapping using the technique known as DTW.
Figure 3.9: The DTW distance calculated for two interval-sets S and T.
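The alignment of two such vector sequences can be sketched with the classical DTW recurrence. The sequences and the per-vector distance (Manhattan) are illustrative assumptions; the cited work defines its own vector construction.

```python
def dtw(seq_a, seq_b, dist):
    # classical DTW: d[i][j] = dist(a_i, b_j) + min of the three predecessors
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# 0/1 vectors per state change: one component per interval of the set
s = [(1, 0), (1, 1), (0, 1)]
t = [(1, 0), (0, 1)]
print(dtw(s, t, manhattan))  # 1.0
```

DTW thus tolerates sequences of different lengths, which matches the fact that the two interval sets may exhibit different numbers of state changes.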
In 2013, Kotsifakos et al. presented IBSM (i.e., Interval-Based Sequence Matching). A set of intervals is represented by a matrix, which contains the number of active intervals of a specific label for each chronon of the discrete time axis. The distance between two sets is defined as the Euclidean distance between the two matrices. Figure 3.10 illustrates the calculation of the IBSM distance and the created matrices.
Figure 3.10: Example of the IBSM distance calculated for two interval-sets S and T.
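The construction of the matrices and the Euclidean comparison can be sketched as follows; the labels, intervals, and time axis length are illustrative assumptions.

```python
from math import sqrt

def ibsm_matrix(intervals, labels, n_chronons):
    # intervals: list of (label, start, end) with closed endpoints;
    # one row per label, one column per chronon, counting active intervals
    m = {lab: [0] * n_chronons for lab in labels}
    for lab, start, end in intervals:
        for chronon in range(start, end + 1):
            m[lab][chronon] += 1
    return m

def ibsm_distance(s, t, labels, n_chronons):
    ms = ibsm_matrix(s, labels, n_chronons)
    mt = ibsm_matrix(t, labels, n_chronons)
    # Euclidean distance over all cells of the two equally sized matrices
    return sqrt(sum((ms[lab][c] - mt[lab][c]) ** 2
                    for lab in labels for c in range(n_chronons)))

labels = ["A", "B"]
s = [("A", 0, 2), ("B", 1, 3)]
t = [("A", 0, 3), ("B", 2, 3)]
print(ibsm_distance(s, t, labels, 4))  # sqrt(2) ≈ 1.414
```

Because the matrices are compared cell by cell, IBSM is sensitive to the duration of the intervals per label, in contrast to the purely relation-based view of ARTEMIS.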
The results of the publications suggest that the application context is important when deciding which similarity measure to use. ARTEMIS uses the relations as an indicator for similarity, whereas the DTW vector-based approach compares the intervals point by point and out of their context. IBSM explicitly considers the duration of the intervals for comparison and implicitly their relations. Nevertheless, each implementation by itself may be insufficient for a specific application context. In chapter 6, a combined, bitmap-based similarity measure is introduced, which allows the user to weigh the importance of the different factors, i.e., relation, duration, or group. The latter factor is not explicitly mentioned in any of the presented implementations, i.e., a label is assumed to be either equal or not. Nevertheless, regarding similarity it may be an important criterion to define how similar a label is, e.g., by using dimensional information.
3.6 Summary
In this chapter, the state of the art regarding analytical information systems was presented, covering the different approaches applied when analyzing data (i.e., OLAP, pattern and association rule mining, as well as visual analytics), performance improvements (i.e., indexes, aggregation techniques, and caches), query languages used to analyze time interval data, and similarity measures.
The chapter forms the basis for the answers to the research questions presented in chapter 1. In addition, it reveals the gaps regarding a holistic solution to analyze time interval data and, implicitly, the steps needed to close the identified gaps. On the one hand, the requirements to apply the different approaches available when analyzing data in general must be supported by the information system, i.e., data must be retrieved fast and be available in the needed form, summarizability must be guaranteed, and generalizations, as well as specializations, must be selectable. On the other hand, performance improvements must be holistically applicable, and the system must provide a domain-specific query language, so that queries are simply defined and easily understood. In the following chapters, these gaps are closed and a holistic solution in the form of an information system useful to analyze time interval data is introduced. The following chapter deals with the basis to achieve this goal: a formal model of time interval data.
4 TIDAMODEL: Modeling Time Interval Data
This chapter presents the answer to RQ2 "Which aspects must be covered by a time interval data analysis model and how can it be defined?". This is achieved by defining a model based on the terms time interval, time interval record, time interval dataset, descriptive value, descriptor, time axis, dimensions, descriptor hierarchy, and time hierarchy. These different terms are categorized by the different elements of the tuple defining a TIDAMODEL.
Definition 1: TIDAMODEL
A TIDAMODEL is a 4-tuple (DB, D, TA, DIM) containing the time interval database DB, the descriptors D, the time axis TA, and the dimensions DIM.
In the following sections, the time axis (section 4.1), the descriptors (section 4.2), the time interval database (section 4.3), and the dimensions (section 4.4) are defined. The definitions are motivated by the introduced features requested for an analytical information system useful for time interval data and the different aspects introduced in chapters 2 and 3. The definitions follow the model defined in Meisen et al. (2014).
4.1 Time Axis
As motivated in section 2.1.3, a discrete, linear, bounded temporal model is assumed for the context of time interval data analysis. Thus, the terms valid time points, chronon, and data time points are defined as follows:
Definition 2: Valid time points, chronon, and data time points
The valid time points Ttime are a finite, totally ordered set with the relation ≤. A time point t ∈ Ttime is called a chronon28. In addition, the data time points
28 The presented definition of a chronon is consistent with the definition of Dyreson et al. (1994, p. 55).
Tin are defined as the set of possible values representing time information within the raw data. A single data time point is typically denoted by tin ∈ Tin.
The definition of Tin could give the impression that an unbounded or continuous temporal model29 is valid. This impression is correct regarding the raw data. Nevertheless, the definition of Ttime ensures that the data available for the analysis are bounded and discrete (i.e., the set of valid time points is defined to be finite). Based on the definitions of Ttime and Tin, the term temporal mapping function is defined as follows:
Definition 3: Temporal mapping function
A temporal mapping function μtime is a function that relates each data time point tin ∈ Tin to a chronon t ∈ Ttime, i.e., μtime: Tin → Ttime.
It should be mentioned that the implementation presented in section 7.3.1 always uses the UTC time zone on the lowest granularity and supports other time zones by modeling an additional level within the dimensional model (cf. sections 4.4 and 7.2.1). Thus, the valid time points are assumed by the system to be UTC-based time points. Time points of other time zones are mapped internally. The presented definition of a temporal mapping function enables the realization of the feature requested as DA-06. In addition, the existence of a mapping function is closely related to the feature request DI-03.
Prior to providing a formal definition of the term time axis, the term granularity has to be defined. The granularity is important information to realize dimensional modeling (cf. section 4.4), as well as the features DA-01 and DA-04. Without a granularity, the system cannot provide the correct calculations required for aggregations. In addition, a roll-up to a higher level is difficult to validate without knowing anything about the lowest granularity of the system.
29 As argued in section 2.1.3, the usage of a continuous temporal model is, from an analytical point of view, not reasonable.
Definition 4: Granularity
The granularity tgrain is a unit of time. The information system has to pro-
vide a list of valid and supported units. In general, the following units
have to be supported: second, minute, hour, day, week, month, and
year.
The definition of a time axis is the basis for several feature requests and
further definitions presented in this chapter. As mentioned already, the fea-
ture requests DA-01, DA-04, DA-06, and DI-03 and the presented solutions
are closely related to the time axis definition. Thus, based on the definitions
presented in this section, the term time axis is defined as follows.
Definition 5: Time axis
A time axis TA is a 2-tuple (μtime, tgrain) containing the temporal mapping function μtime used to relate the incoming data time points to the valid chronons. In addition, the granularity tgrain specifies the unit of time of the chronons.
Figure 4.1 illustrates an example of a time axis definition. The figure shows a discrete, linear, bounded time axis containing values between 0 and 9 (cf. the definition of Ttime). In the example, a data time point is a timestamp (in milliseconds) between 2000-01-01 00:00:00.000 and 2099-12-31 23:59:59.999 of the CET time zone. The defined mapping function maps each data time point, i.e., timestamp, to a value between 0 and 9. More precisely, the timestamp is mapped to the "ones place" of the minutes of the timestamp, e.g., 2000-01-01 10:56:12.432 CET is mapped to 6.
Figure 4.1: Illustration of a time axis TA = (μtime, minute). The incoming data, i.e., timestamps (in milliseconds) between 2000-01-01 00:00:00.000 and 2099-12-31 23:59:59.999 from the time zone CET, are mapped to values 0-9 representing minutes.
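The temporal mapping function of this example can be sketched in a few lines; the function name is illustrative, and time zone handling is omitted for brevity.

```python
from datetime import datetime

def map_time(t_in: datetime) -> int:
    # map the timestamp to the "ones place" of its minutes,
    # i.e., to a chronon between 0 and 9
    return t_in.minute % 10

# e.g., 2000-01-01 10:56:12.432 -> minute 56 -> chronon 6
print(map_time(datetime(2000, 1, 1, 10, 56, 12, 432000)))  # 6
```

Any other many-to-one mapping onto the finite set of chronons would fit the definition equally well; the "ones place" mapping merely keeps the example small.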
4.2 Descriptors
As stated in the informal definition of a time interval (cf. section 2.1.1), properties are used to associate descriptive information with a time interval, e.g., to describe what was observed during that time. In this section, the term descriptor is defined, which is based on the definitions of the terms descriptive attribute, descriptive value, descriptor values, descriptive mapping function, and fact function. In general, a descriptor is used to describe a state, an observation, a statement, or a measurement being valid within the time interval. Such a description can be defined by a simple data type (i.e., a string, a number, an integer, or a logical value). Nevertheless, the incoming data may contain complex structures (e.g., arrays, lists, or objects) associating multiple values of the same property with an interval (e.g., for a task performed, several qualifications, like speaking English, having a driver's license, or not being pregnant, may be needed). The following definitions of a descriptive attribute and a descriptive value cover these points.
Definition 6: Descriptive attribute and descriptive value
A descriptive attribute is a property defined by a label, naming the property, and a set of possible values allowed for the attribute. In general, a not further specified descriptive attribute is denoted by Ai, whereby a named descriptive attribute is referred to by using its label, e.g., the descriptive attribute gender is denoted by Agender = {male, female}. A value of a descriptive attribute is called a descriptive value of the attribute, i.e., ain ∈ Ai.
From an analytical point of view, possibly complex structures have to be mapped to (multiple) simple data types (cf. feature request DI-01 and section 3.2.1), so that the analytical information system is capable of answering queries correctly. For example, assume a descriptive attribute qualification, defined as the power set of all possible qualifications, i.e., Aqualification = ℙ({cleaning, fueling, check-in, English, French, German}), and a task requiring the qualifications specified by the descriptive value {cleaning, English}. If the user queries for all tasks requiring the qualification cleaning, the system is not capable of replying correctly without understanding that the descriptive value is described by a set. Thus, the following formal definition of descriptor values is presented.
Definition 7: Set of descriptor values and descriptor value
Vi denotes the set of descriptor values of the descriptive attribute Ai. As in the case of descriptive attributes, a labeled set of descriptor values is denoted by the specified label, e.g., Vgender. A descriptor value v ∈ Vi is an atomic entity, i.e., a comparable30 and not divisible data type or structure. In addition, the value has to be referable by a unique name, i.e., useful as a unique identifier.
To bring descriptors and descriptor values together, a mapping function is necessary. A descriptive mapping function is defined in the context of a descriptive attribute Ai. It is used to map a descriptive value ain ∈ Ai to a subset of the defined descriptor values. The formal definition is as follows:
30 At least comparable regarding equality, i.e., an equality relation exists.
Definition 8: Descriptive mapping function
A descriptive mapping function μi of a descriptive attribute Ai and the set of descriptor values Vi is defined as μi: Ai → ℙ(Vi). A descriptive mapping function of a labeled descriptive attribute (e.g., Agender) is denoted by using the label as annotation (e.g., μgender).
As motivated, the function maps a single descriptive value to a subset of descriptor values. This enables the system to support many-to-many relationships, as requested by feature DI-01. In addition, the feature request DI-02 is covered by the existence of a mapping function, which can also be used for validation, transformation, or cleansing purposes.
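The qualification example above can be sketched as follows: a set-valued descriptive value is mapped to the subset of atomic descriptor values it contains, so a query for a single qualification can be answered correctly. The record names and data are illustrative assumptions.

```python
def map_qualification(descriptive_value):
    # descriptive mapping function: the complex structure (a set) is
    # split into its atomic descriptor values
    return set(descriptive_value)

# illustrative raw records: task -> set-valued descriptive value
records = {
    "task1": frozenset({"cleaning", "English"}),
    "task2": frozenset({"fueling"}),
}

def tasks_requiring(qualification):
    # correct answering is possible because the mapping function exposes
    # the atomic descriptor values of each record
    return sorted(task for task, value in records.items()
                  if qualification in map_qualification(value))

print(tasks_requiring("cleaning"))  # ['task1']
```

Without the mapping step, a naive comparison of the whole set against the queried value would return no result, which is exactly the problem described above.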
To enable data aggregation along a descriptor value (or a specified subset of descriptor values), it is necessary to associate a numeric value with a specific descriptor value. For example, assuming a descriptor value squad ∈ VgroupSize, one would expect that the value 8 is aggregated for each data element being described as a squad31. On the other hand, a descriptor value v ∈ VpersonnelNr = {00001, …, 99999} would best be related to a constant fact of 1, e.g., to sum up the number of resources needed. Last but not least, assume a descriptive attribute Atemp ≙ ℝ, the descriptor values Vtemp = {high, middle, low}, and the descriptive mapping function μtemp: Atemp → ℙ(Vtemp) defined by μtemp(v) = {low} for v < 30, μtemp(v) = {middle} for 30 ≤ v < 60, and μtemp(v) = {high} otherwise. An aggregation based on temp, e.g., MEAN(temp), should aggregate the raw values, i.e., the descriptive values.
Thus, when aggregating data, the grouped data is combined based on a defined aggregation function and an attribute specifying the values to be aggregated. Therefore, a fact function is introduced, which is used to specify a fact value for a specific descriptive or descriptor value. Based on the previous example, three different types of fact functions are introduced: value-invariant, record-invariant, and record-variant. The implementation regarding the aggregation of time interval data using these different fact functions is presented in section 7.3.4.
31 The typical group size of a squad is considered to be 8.
Definition 9: Fact function (value-invariant, record-invariant, record-variant)
A fact function fi is a function defined for a descriptive attribute Ai. A value-invariant fact function relates every descriptor value v ∈ Vi to a constant number, i.e., fi(v) = n, with n ∈ ℝ. A record-invariant fact function relates each descriptor value v ∈ Vi to a specific number, i.e., fi: Vi → ℝ. Finally, a record-variant fact function is defined by fi: (Vi, Ai) → ℝ. The latter relates a 2-tuple, containing the descriptor value and the descriptive value, to a fact.
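The three types can be sketched with the examples given above; all concrete values (group sizes, skill levels) are illustrative assumptions.

```python
def fact_value_invariant(descriptor_value):
    # value-invariant: every descriptor value maps to the same constant,
    # e.g., counting resources via personnelNr
    return 1

GROUP_SIZES = {"squad": 8, "platoon": 30}  # illustrative sizes

def fact_record_invariant(descriptor_value):
    # record-invariant: the fact depends on the descriptor value only,
    # e.g., the groupSize example
    return GROUP_SIZES[descriptor_value]

def fact_record_variant(descriptor_value, descriptive_value):
    # record-variant: the fact additionally depends on the record's raw
    # descriptive value, e.g., the skill level stored with each language
    return dict(descriptive_value)[descriptor_value]

print(fact_value_invariant("00042"))   # 1
print(fact_record_invariant("squad"))  # 8
print(fact_record_variant(
    "French", (("German", 1.0), ("English", 0.9), ("French", 0.2))))  # 0.2
```

The distinction matters for the implementation in section 7.3.4, since only the record-variant case has to consult the raw record during aggregation.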
Based on the definition of a descriptive mapping function μi, a set of descriptor values Vi, and a fact function fi, the term descriptor is defined as follows:
Definition 10: Descriptor
A descriptor di is a 2-tuple (μi, fi) containing the descriptive mapping function μi used to relate elements of the descriptive attribute Ai, i.e., descriptive values, to elements of the descriptor values Vi. In addition, the tuple contains the fact function fi, which is used to relate a descriptor value to a number. Furthermore, D is defined as the set of all descriptors of the model.
Figure 4.2 illustrates a descriptor dlang. The descriptor describes languages spoken by persons and maps each language to the constant fact 1, using a value-invariant fact function. The descriptive mapping function used in the example is the identity function, i.e., it maps each element of Alang to itself. Thus, regarding the example, Vlang = Alang. Modifying the example by assuming that Alang contains sets of 2-tuples defining the language spoken and a skill-level, i.e., {(German, 1.0), (English, 0.9), (French, 0.2)}, exemplifies the need for a record-variant fact function. Questions like "What was the minimal skill-level of the French-speaking persons during 10:00 – 11:00?" could be answered. Regarding the latter example, it is necessary to modify the mapping function as well, so that a set of tuples is mapped to a set of languages, e.g., {(German, 1.0), (English, 0.9), (French, 0.2)} would be mapped to {German, English, French}.
Figure 4.2: Example of a descriptor dlang = (μlang, flang), which uses an identity function to map the set of languages, i.e., the descriptive values, to the descriptor values.
4.3 Time Interval Database
This section aims to define the structure and modeling of the time interval data handled by the information system. To achieve this, the term time interval is formally introduced, following the definition presented in section 2.1.1.
Definition 11: Time interval
Based on the definition of a time axis TA = (μtime, tgrain), a closed time interval is a subset of Ttime denoted by [tstart, tend] and defined as [tstart, tend] = { t | t ∈ Ttime, tstart ≤ t ≤ tend }. In addition, an open time interval is denoted by (tstart, tend), and half-open intervals are denoted by [tstart, tend) or (tstart, tend].
It should be stated, that any half-open or open interval can be, because of
the discrete time axis, transformed to a closed interval by excluding the
open endpoint(s), i.e., (tx, tx+n) ≡ [tx+1, tx+n-1], [tx, tx+n) ≡ [tx, tx+n-1], and
(tx, tx+n] ≡ [tx+1, tx+n]. Thus, when generally using the term time interval, a
closed time interval is assumed.
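A minimal sketch of this normalization, modeling chronons as integers (an assumption made for illustration only):

```python
# Sketch: normalizing open/half-open intervals to closed intervals on a
# discrete time axis, as stated in Definition 11. Chronons are modeled
# as integers; one step corresponds to the granularity tgrain.

def to_closed(start, end, open_start=False, open_end=False):
    """Return the equivalent closed interval [start', end']."""
    if open_start:
        start += 1  # exclude the open start endpoint
    if open_end:
        end -= 1    # exclude the open end endpoint
    return (start, end)

# (tx, tx+n)  ->  [tx+1, tx+n-1]
assert to_closed(3, 8, open_start=True, open_end=True) == (4, 7)
# [tx, tx+n)  ->  [tx, tx+n-1]
assert to_closed(3, 8, open_end=True) == (3, 7)
# (tx, tx+n]  ->  [tx+1, tx+n]
assert to_closed(3, 8, open_start=True) == (4, 8)
```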
As mentioned in previous sections, the time interval alone is of no rele-
vance for analytical purposes. An important asset is the descriptive infor-
mation revealing what was observed, measured, stated, or collected. Thus,
a data model combining the temporal with the descriptive information is
needed. Therefore, a time interval dataset is introduced to define the
structure of the data.
Definition 12: Time interval dataset and time interval record
A time interval dataset Ddata is defined as a subset of Ttime × Ttime × A1 ×
… × An, with the data time points Ttime and the different descriptive at-
tributes Ai. An ordered tuple r ∈ Ddata is called a time interval record. The
objects of a time interval record are denoted by (rstart, rend, r1, …, rn). In
addition, the objects rstart and rend form a valid time interval [rstart, rend].
Based on the definition of a dataset, the definition of a time interval da-
tabase can be formulated.
Definition 13: Time interval database
A time interval database is a tuple (Ddata, Ttime, A1, …, An), containing
the time interval dataset Ddata, the data time points Ttime, and the descriptive
attributes Ai. Thus, a time interval database contains all data added to
the information system, as well as the possible values of the different
descriptive attributes and data time points.
Figure 4.3 shows an example database. Each time interval record of the
dataset stands for a task performed by a team for a department. The pos-
sible descriptive values are specified by the respective descriptive attrib-
utes Ateam and Adepartment. Furthermore, the possible incoming data time
points are defined to be of second granularity and within the year 2010,
cf. Ttime.
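The example database can be sketched with plain tuples; the concrete records below are illustrative and not taken from Figure 4.3:

```python
# Sketch: a time interval database as a set of records
# (start, end, team, department); values are illustrative.
from datetime import datetime

data = {
    (datetime(2010, 3, 1, 9, 0, 0), datetime(2010, 3, 1, 11, 30, 0),
     "teamA", "maintenance"),
    (datetime(2010, 3, 1, 10, 15, 0), datetime(2010, 3, 1, 12, 0, 0),
     "teamB", "logistics"),
}

# The possible data time points: second granularity within the year 2010.
def in_time_axis(t):
    return datetime(2010, 1, 1) <= t <= datetime(2010, 12, 31, 23, 59, 59)

# Every record must form a valid interval within the time axis.
assert all(in_time_axis(s) and in_time_axis(e) and s <= e
           for s, e, _, _ in data)
```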
Figure 4.3: An example of a time interval database (Ddata, Ttime, Ateam, Adepartment). The database contains tasks performed by teams (a team consists of several team members) for the specified department.
4.4 Dimensional Modeling
Regarding the dimensional modeling introduced by Codd et al. (1993), a
dimension consists of hierarchies, which contain different levels, which
themselves are defined by their members. In addition, the different relations
(i.e., generalization or specialization) are specified. Several publications
stated that it is important to avoid summarizability problems when modeling
a dimension (Lenz, Shoshani 1997; Mazón et al. 2008, 2009, 2011; Niemi
et al. 2014). Nevertheless, many-to-many relationships between members
of different levels exist in real-life scenarios. Thus, the conceptual model
should not treat a many-to-many relationship as a problem. Instead,
such problems have to be solved on a lower level of modeling (i.e., within
the logical or physical model by adding intermediate levels, bridging tables,
or denormalization, cf. Song et al. (2001)). However, the solution presented
in section 7.3.2 avoids any summarizability issues and ensures correct aggregation when rolling up or drilling down.
In this section, a dimensional model for descriptors, as well as the time
axis is defined. The time dimension is thereby regarded as an exceptional
case, because of the special characteristics of time (cf. section 2.1.6). First,
a descriptor’s dimension is defined following Meisen et al. (2014).
Definition 14: Descriptor dimension, hierarchies, levels, and members
A descriptor dimension Δi of a descriptor di = (δi, φi) is a non-empty finite
set of descriptor hierarchies, i.e., Δi = { h1, …, hm }, whereby a descriptor
hierarchy hk is defined as a 3-tuple (V, G, L) satisfying the following
statements:
– V denotes the set of members, and the descriptor values Di are a subset
of V. The members not being a descriptor value are denoted by V' := V \ Di.
– G is a directed acyclic graph G := (V, E) with edges E ⊆ V × V denoting
the relations among the members of the hierarchy. rG denotes the one
member v ∈ V' satisfying ∃!v ∈ V' : deg+(v) = 0. Additionally, G satisfies
∃v ∈ Di : deg–(v) = 0 and ∀v ∈ V' : deg–(v) > 0. These assumptions
ensure that exactly one sink (a.k.a. root) exists, that this root is
reachable from every member, and that every source (a.k.a. leaf) is
a descriptor's value, i.e., is an element of Di.
– L specifies the hierarchy's levels and is defined as a partially ordered
partition of V with binary relation ≼G and {rG} ∈ L. Additionally, L sat-
isfies:
∀l1, l2 ∈ L, l1 ≺G l2 :
(∀n1 ∈ l1, n2 ∈ l2 : max-dist(rG, n1) > dist(rG, n2)
⋀ ∃n1 ∈ l1 ∃n2 ∈ l2 : dist(n1, n2) ≠ ∞
⋀ ∀n2 ∈ l2 ∄n1 ∈ l1 : dist(n2, n1) ≠ ∞)
This assumption guarantees that a descendant of a level (according
to the partial order ≺G) increases the distance to the root and that
at least one node of a level has a path to a precedent level.
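The sink and source constraints of the definition can be checked with a short sketch (reachability is omitted; all names are illustrative):

```python
# Sketch: checking two structural constraints of Definition 14 for a
# descriptor hierarchy given as a directed graph whose edges point
# towards the root. Reachability of the root is not verified here.

def is_valid_hierarchy(nodes, edges, descriptor_values):
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    sinks = [n for n in nodes if out_deg[n] == 0]
    sources = [n for n in nodes if in_deg[n] == 0]
    # exactly one sink (the root) and every source is a descriptor value
    return len(sinks) == 1 and all(s in descriptor_values for s in sources)

values = {"Germany", "France"}
nodes = values | {"Europe"}
edges = {("Germany", "Europe"), ("France", "Europe")}
assert is_valid_hierarchy(nodes, edges, values)
# an isolated extra member creates a second sink and violates the definition
assert not is_valid_hierarchy(nodes | {"x"}, edges, values)
```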
Figure 4.4 shows two descriptor hierarchies. Each is defined for a dif-
ferent descriptor, i.e., the one on the left is defined for a descriptor having
countries as descriptor values, whereby the one on the right is defined for
cities. Both hierarchies are valid according to the definition provided, i.e.,
only one sink exists, the leaves are elements of the descriptor values, and
each member of a level has a successor decreasing or keeping the dis-
tance to the sink.
Figure 4.4: Example of two descriptor hierarchies. The one on the left is based on the descriptor values specified by country and the one on the right is based on city. The example shows a non-strict (left) and a non-covering hierarchy (right). Both hierarchies are valid regarding the definition of descriptor hierarchies.
Next, a dimensional model for the time axis is introduced. As already
mentioned, the dimensional modeling of time is considered to be an ex-
ceptional case, because of the special characteristics of time. A chronon of
the time axis may contain additional information implicitly recognized. In
addition, when moving up a hierarchy, this implicit information may become
invalid. Figure 4.5 illustrates the implicitly recognized information and the
validity of information when rolling up the hierarchy, e.g., 2000-01-06 is a
regional holiday, which does not apply on the month level for the member
January. When defining a hierarchy for the time dimension, the implicitly
recognized information may be taken into account, e.g., by specifying a
holiday level.
Figure 4.5: Example of implicit information recognized for the timestamp 2000-01-06 13:00 CET and the validity of the information when rolling up a hierarchy.
In addition, it must be possible to define the time zone32 a hierarchy ap-
plies to (cf. DA-06). When analyzing data across different time zones, it is
necessary to analyze data from a time zone perspective, as well as a
global, i.e., UTC, perspective (cf. section 2.1.6). Furthermore, it should be
mentioned that the implicitly recognized information may differ depending
on the time zone (cf. Figure 4.5, January is not a month of winter in every
time zone). Figure 4.6 illustrates three hierarchies and the different infor-
mation depending on the time zone. The time axis is based on the UTC,
whereby two of the three hierarchies use a different time zone, i.e., PDT
and CET. Thus, the value of "part of day" changes according to the time
zone. This observation also applies to the "type of day" value, which is set
to "school holiday" for the specified region "Poland, CET".
32 In addition, the region may be important information as well. However, the region has no impact on the time. Thus, it can be recognized by labeling the hierarchy, e.g., hGermany, CET.
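How such implicit information depends on the time zone can be sketched as follows; fixed offsets (CET = UTC+1, PDT = UTC-7) are assumed for brevity instead of proper time-zone rules:

```python
# Sketch: deriving a time-zone-dependent "part of day" for a UTC
# chronon, as illustrated in Figure 4.6. Fixed offsets are an
# assumption for the example; real hierarchies would apply full
# time-zone rules (including daylight saving time).
from datetime import datetime, timedelta

def part_of_day(utc, offset_hours):
    local = utc + timedelta(hours=offset_hours)
    if 6 <= local.hour < 12:
        return "morning"
    if 12 <= local.hour < 18:
        return "afternoon"
    return "night"

chronon = datetime(2000, 1, 6, 12, 0)          # 12:00 UTC
assert part_of_day(chronon, 1) == "afternoon"  # 13:00 in UTC+1 (CET)
assert part_of_day(chronon, -7) == "night"     # 05:00 in UTC-7
```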
Figure 4.6: Example of three different hierarchies for a time-axis. The values of the shown hierarchies differ, based on the time zone selected and the region utilized.
Definition 15: Time dimension, hierarchies, levels, and members
A time dimension Δtime of a time axis (Ttime, tgrain) is a non-empty set of
time hierarchies, i.e., Δtime = { h1, …, hm }, whereby a time hierarchy hk is
defined as a 3-tuple (N, T, L) satisfying the following statements:
– N denotes the members of the hierarchy. The chronons of the
time axis are a subset of N, i.e., Ttime ⊂ N.
– T is a rooted plane tree T := (N, E) with edges E ⊆ N × N, defining the
relations among the members of the hierarchy. In addition, the depth of
all leaves is equal and denoted by Tdepth. Furthermore, the set of all
nodes of depth k is denoted by Nk and, to be consistent33, T is directed
towards the root. The leaves of the tree, specified by NTdepth, are the
chronons of the time axis, i.e., NTdepth ≡ Ttime.
– L is a totally ordered partition of N, i.e., L := { Nk | 0 ≤ k ≤ Tdepth }, and
defines the levels of a time hierarchy. The relation is denoted by ≺T
and defined as NTdepth ≺T … ≺T N1 ≺T N0. In addition, a total order for
each set Nk with 0 ≤ k < Tdepth is assumed, and for NTdepth the total order
defined for Ttime is applied.
33 A hierarchy of a descriptor dimension is also directed towards the root.
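Deriving the levels Nk as the sets of nodes of equal depth can be sketched as follows (the parent mapping and node names are illustrative):

```python
# Sketch: computing the levels Nk of a time hierarchy from a rooted
# tree, following Definition 15. The tree is given as a child-to-parent
# mapping; the root has no parent.

def levels(parent):
    def depth(n):
        return 0 if parent[n] is None else 1 + depth(parent[n])
    result = {}
    for n in parent:
        result.setdefault(depth(n), set()).add(n)
    return result

parent = {"2000": None, "Jan": "2000", "Feb": "2000",
          "2000-01-06": "Jan", "2000-02-01": "Feb"}
lvls = levels(parent)
assert lvls[0] == {"2000"}                      # N0: the root
assert lvls[1] == {"Jan", "Feb"}                # N1: months
assert lvls[2] == {"2000-01-06", "2000-02-01"}  # NTdepth: chronons
```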
The presented definition does not imply an explicit declaration of a time
zone. Nevertheless, the definition supports multiple hierarchies, e.g., one
hierarchy for each time zone needed. Supporting multiple hierarchies also
makes it possible to define different hierarchies for the same time zone but
different regions (e.g., it is possible to define a hierarchy explicitly for the
region "Bavaria, Germany, CET" and another one for the region "Hesse,
Germany, CET"). Figure 4.6 outlines three time hierarchies defined for the
UTC, PDT, and CET time zones. The hierarchies of the UTC and PDT time
zones are equal, except for the additional level needed to map the UTC
chronons to the PDT time zone. Within the example, the hierarchy defined
for the CET time zone uses a different structure (i.e., after the mapping to
the time zone a "type of day" level is utilized).
Definition 16: Dimensions
The dimensions Δ are defined as the set containing all descriptor di-
mensions (i.e., a maximum of one dimension per defined descriptor)
and a maximum of one time dimension, e.g., Δ = { Δtime, Δ1, …, Δn }.
4.5 Summary
To summarize, this chapter presented the TIDAMODEL, which is the answer
to RQ2 "Which aspects must be covered by a time interval data analysis
model and how can it be defined?". The model is based on four aspects, i.e.,
– the time interval database: defining data pushed into the system,
– the time axis: modeling the discrete, linear, bounded temporal model,
– the descriptors: specifying the attributes (properties) describing the ob-
served, measured, or stated information, and
– the dimensions: defining the dimensional model for descriptors and the
time axis.
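The four aspects can be gathered in one structure; the following sketch uses illustrative field names, not the names of the actual implementation:

```python
# Sketch: the four aspects of the TIDAMODEL gathered in one structure.
from dataclasses import dataclass, field

@dataclass
class TimeAxis:
    start: int          # first chronon of the bounded axis
    end: int            # last chronon of the bounded axis
    granularity: str    # e.g., "second"

@dataclass
class Descriptor:
    identifier: str
    mapping: object     # descriptive mapping function
    fact: object        # fact function

@dataclass
class TidaModel:
    time_axis: TimeAxis
    descriptors: list
    dimensions: dict = field(default_factory=dict)
    database: set = field(default_factory=set)

# A model with one descriptor using a value-invariant fact function.
model = TidaModel(TimeAxis(0, 59, "second"),
                  [Descriptor("lang", lambda v: {v}, lambda v: 1)])
assert model.descriptors[0].fact("German") == 1
```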
Figure 4.7 depicts the TIDAMODEL and its different elements. Besides
the mentioned elements, the figure illustrates a time interval data record
with n descriptors, one time dimension Δtime, and one descriptor dimension
Δ1.
Figure 4.7: Illustration of the TIDAMODEL showing all defined elements.
As already mentioned, the presented model is motivated by the features
listed in section 2.2, the characteristics of time (cf. section 2.1), and the
literature research regarding time interval data analysis (cf. chapter 3). Be-
low, several feature requests are enumerated and their impact, relating to
the definitions, is explained:
– DA-01 influenced the definition of the time axis, i.e., the definition of
chronons, the provision of a total order, and the mapping function.
– DA-03 was considered when specifying the time interval database, i.e.,
raw records have to be available, as well as the time axis, i.e., regard-
ing the support of temporal operators.
– DA-04 motivated the definition of the dimensional model, in particular
the modeling of the time dimension.
– DA-05 explains the need for descriptor dimensions.
– DA-06 was extensively discussed in this section. The support of multi-
ple hierarchies and the understanding of time zones are important as-
pects for the implementation of the model (cf. section 7.3.1).
– DC-03 did not have an immediate impact. Nevertheless, the dimensional
model defined was reviewed regarding the fulfillment of this require-
ment, i.e., whether pre-aggregates may be applied.
– DI-01 forces the descriptive mapping function to relate descriptive val-
ues to a set of descriptor values.
– DI-02 motivated the introduction of a descriptive mapping function.
However, the implementation provides additional strategies to define
default behaviors (cf. section 7.2.1).
– DI-03 was recognized within the time axis definition, i.e., to support
such strategies, the time axis must provide the needed information,
e.g., its boundaries must be known and intervals must be verifiable.
– DI-04 is partially covered by the existence of mapping functions. Nev-
ertheless, as introduced in section 7.2.1 additional solutions are avail-
able.
5 TIDAQL: Querying for Time Interval Data
A query language allows the user to access data of the information system,
e.g., for further processing, visualization, backups, or to test a hypothesis
by additional analysis. In any case, the acceptance of a query language
depends on several design criteria. Snodgrass (1995, pp. 282–284) intro-
duced six measures useful to make appropriate design decisions when
specifying a language: expressive power, consistency, clarity, minimality,
orthogonality, and independence. In addition, Catarci, Santucci (1995)
added the criterion ease-of-use. Table 5.1 lists the criteria and gives a short
description.
Table 5.1: Overview of the seven criteria used as basis for design decisions re-garding a query language.
– expressive power: The language must be suitable for its intended application and should not "impose undesirable restrictions".
– consistency: The syntax should be "internally consistent" and systematically extendable. In addition, it should be inspired by standards.
– clarity: The syntax should "clearly reflect the semantics" and facilitate "formulating and understanding queries".
– minimality: The syntax should only add "as few as possible new reserved words".
– orthogonality: The reasonable numbers in a design are zero, one, and infinity. Thus, "it should be possible to freely combine query language constructs that are semantically independent".
– independence: Each function should be "accomplished in only one way".
– ease-of-use: The query language should be "closer to the user view of the reality". It should be "attractive and graspable". In addition, it should fit the user's knowledge and expectation.
Besides the features requested regarding a query language (cf.
DA-01 – 05, DA-08, PD-02, DC-02), the criteria of Catarci, Santucci, and
Snodgrass are used as a guideline. In the sections of this chapter, the time
interval data analysis query language (TIDAQL) is described. Meisen et al.
(2015a) outlined selected features of the language, which are introduced
in this chapter in detail. Furthermore, additional language elements, like
analytical results, are presented.
Following the SQL language, the statements of the language are cate-
gorized in three groups: data control language (DCL), data definition lan-
guage (DDL), and data manipulation language (DML). The chapter is di-
vided according to this classification, i.e. DCL is introduced in section 5.1,
the DDL is described in section 5.2, and the DML is presented in section
5.3.
5.1 Data Control Language
Today, every system available within a network needs authorization and
authentication mechanisms to ensure the correct and intended usage of the
system. The DCL is used to control the access to the available data. Addi-
tionally, it is used to define which statements a specific user or a user group
is allowed to execute. As mentioned in section 2.2, specific features
considering the security aspects of the system were not listed. However,
during the workshops several requirements were specifically formulated.
With regard to the DCL, two important aspects were mentioned: (1) the
existence of security mechanisms, e.g., granting and revoking permissions,
supporting roles, or deleting users; (2) the permissions must be grantable
for a specific model or on a general level, e.g., a user group should not be
able to add intervals to a specific model, but should generally be capable
of selecting data. Applying the design criteria mentioned, the presented DCL is close
to the one known from SQL. Thus, the commands: ADD, DROP, MODIFY,
GRANT, REVOKE, ASSIGN, and REMOVE are defined within the lan-
guage.
To add a user or a role to the system an ADD command is provided.
The syntax of statements using the command is shown in Listing 5.1. When
adding a user, a name and a password must be declared. In addition, per-
missions can be granted and roles can be assigned to the created user. A
role is added by providing a name and, if needed, a comma separated list
of permissions. The language does not define the syntax of a permission,
i.e., any string is allowed. Nevertheless, a concrete implementation may
validate if the assigned permission is known and specify what kind of per-
missions are allowed (e.g., wildcards may be supported to grant all permis-
sions of a specific model to a user: 'MODEL.myModel.*' 34).
Listing 5.1: Syntax of statements using the ADD command of the DCL to add a user or a role.
ADD USER 'name' WITH PASSWORD 'password'
[WITH PERMISSIONS 'permission1' [, 'permission2', ...]]
[WITH ROLES 'role1' [, 'role2', ...]]
ADD ROLE 'name' [WITH PERMISSIONS 'permission1' [, 'permission2', ...]]
It may be necessary to drop a created user or role. In that case, the
DROP command can be utilized. The syntax of statements is given in List-
ing 5.2. In general, a user or a role should be droppable at any time. It
depends on the processing whether a logged-in user can be dropped or
whether the session has to be closed prior to the deletion. The same applies
to a role, which might be assigned to a logged-in user.
Listing 5.2: Syntax of statements of the DCL, used to drop a user or a role.
DROP [ROLE|USER] 'name'
The modification of a role or a user is limited to specific values, i.e., the
name of a role or a user cannot be modified. Thus, the only value that can
be modified within the DCL is the user's password.
34 myModel is an example for a unique identifier of a model loaded into the system (cf. section 7.2.1).
One may argue that
granting or revoking a permission from a user or role is also a modification.
However, granting and revoking of permissions are processes that are
logically separated from the modification of an entity's attributes. Thus,
the DCL introduces different commands to revoke and grant permissions,
namely REVOKE and GRANT. Listing 5.3 shows the syntax of statements
for all three commands, useful to modify a user’s password and grant or
revoke a permission from a user or a role.
Listing 5.3: Syntax of the statements using the commands MODIFY, GRANT, and REVOKE.
MODIFY USER 'name' SET PASSWORD = 'name'
GRANT 'permission1' [, 'permission2', ...] TO [ROLE|USER] 'name'
REVOKE 'permission1' [, 'permission2', ...] FROM [ROLE|USER] 'name'
The last commands of the DCL introduced are used to assign and re-
move roles from a user. When creating a user it is possible to assign spe-
cific roles to the user. However, so far it is not possible to assign new roles
to or remove a role from a user. Therefore, the commands ASSIGN and
REMOVE are presented in Listing 5.4. The syntax shows that the words
ROLE or ROLES are allowed. Tests have shown that inexperienced
users tend to use the keyword ROLES instead of ROLE when they assign
or revoke multiple roles at once. Regarding the ease-of-use criterion, both
keywords are valid according to the defined syntax35.
Listing 5.4: Syntax of statements for the commands ASSIGN and REMOVE, used to modify the roles assigned to a user.
ASSIGN [ROLE|ROLES] 'role1' [, 'role2', ...] TO USER 'name'
REMOVE [ROLE|ROLES] 'role1' [, 'role2', ...] FROM USER 'name'
35 The syntax also shows that the statement ASSIGN ROLE 'role1', 'role2' TO USER 'philipp' is valid. From a system perspective it does not matter if the statement is grammatically correct.
As mentioned at the beginning of this section and briefly discussed in
the context of the ADD command, one of the requests specified the kind of
permissions needed, namely that permissions must be grantable on a global
or a model-specific level. Even if not specified by the syntax, two different
types of permissions were implemented: GLOBAL.<permission> and
MODEL.<model>.<permission>. The first one is used to grant a permission
on a global level, e.g., the retrieval of data is generally allowed. The latter is
used to grant a permission for the specified model, e.g.,
MODEL.myModel.MODIFY would allow the user to modify the one and
only model with the name myModel.
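A permission check covering both types can be sketched as follows; the helper name is_permitted is hypothetical, and the wildcard handling follows the 'MODEL.myModel.*' example mentioned above:

```python
# Sketch: resolving the two implemented permission types,
# GLOBAL.<permission> and MODEL.<model>.<permission>, including the
# model-level wildcard. The helper name is illustrative.

def is_permitted(granted, model, permission):
    return (f"GLOBAL.{permission}" in granted
            or f"MODEL.{model}.{permission}" in granted
            or f"MODEL.{model}.*" in granted)

granted = {"GLOBAL.SELECT", "MODEL.myModel.*"}
assert is_permitted(granted, "otherModel", "SELECT")  # global grant
assert is_permitted(granted, "myModel", "MODIFY")     # model wildcard
assert not is_permitted(granted, "otherModel", "MODIFY")
```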
5.2 Data Definition Language
The DDL is used for defining the TIDAMODELs made available by the information
system. A model is defined by its database, time axis, descriptors, and di-
mensions (cf. chapter 4). Instead of defining statements to create or modify
each of these entities, the DDL provides three commands: LOAD,
UNLOAD, and DROP. The first command is used to load a specific
model by providing a definition-file, whereby the latter two are used to un-
load or delete a model. The UNLOAD command is used to remove the
model from memory, i.e., the model is not available anymore, but can be
loaded if needed. In contrast, the DROP command removes all data belonging
to the model. Listing 5.5 shows the syntax of statements using the com-
mands.
Listing 5.5: Syntax of statements using the LOAD, UNLOAD, and DROP com-mands of the DDL.
LOAD [modelId|"modelId"|FROM 'location']
[SET autoload = [true|false] [, force = [true|false]]]
UNLOAD [modelId|"modelId"]
DROP MODEL [modelId|"modelId"]
As mentioned, the LOAD command can be used to load a model into
the system by providing the location of a model-definition-file. In all other
cases, i.e., when providing a model-identifier like LOAD "myModel", the
model must be known to the system, i.e., must have been loaded from a
location before. Irrespective of whether or not the model was loaded from
a location or the system, additional properties can be set. These properties
are autoload (i.e., specifying if the system should load the model on start-
up) or force (i.e. specifies that the model has to be loaded from the location,
independent if another model with the same identifier exists already).
To utilize the statements to unload or drop a model from the information
system, a model-identifier has to be declared. When a model is actually un-
loaded or dropped depends on the implementation. For example, it may happen
that a manipulation query is running while another user fires a drop query
for the same model. Depending on the implementation, the drop may be
performed and an exception thrown, or the drop may be delayed until
all operations dealing with the model are handled. An implementation re-
garding these issues, as well as a definition of the model-definition-file, is
presented in section 6.2.
5.3 Data Manipulation Language
A DML is used to insert, update, or select data from the database. Even if
the selection of data does not manipulate the persisted data directly, raw
data is manipulated (e.g., aggregated) during the processing. In this sec-
tion, the defined statements are divided into three groups. The first group con-
tains statements used to manipulate raw data, i.e., utilizing the INSERT,
DELETE, and UPDATE commands (cf. 5.3.1). The second group, which en-
closes the GET and ALIVE commands, defines statements useful to retrieve
metadata, e.g., the defined models or dimensions, as well as the sys-
tem's health (cf. 5.3.2). In section 5.3.3, statements utilizing the SELECT
command are introduced, useful to retrieve aggregated data along the de-
fined dimensions, raw data, as well as analytical results. The latter was
added to the DML to apply analytical functions, like data mining algorithms,
to selected groups of datasets (cf. section 2.2.1 and 2.2.2). Several feature
requests regarding the DML were already mentioned in section 2.2. In ad-
dition, the following subordinate features were requested: (1) the language
should provide a construct to enable a type of bulk load to increase insert
performance, (2) the language should support a construct to receive meta-
information from the system like the actual version, available users, or
loaded models, and (3) the syntax of the query language should support
intervals defined as open, e.g., (0, 5), closed, e.g., [0, 5], or half-open, e.g.,
(0, 5].
5.3.1 Insert, Delete, & Update Statements
In an analytical information system, the insertion of data is the most fre-
quently used statement to manipulate the raw data of the database. In gen-
eral, delete statements are performed much less frequently and update
statements are rare. The reasons are clear: data is added to the system
whenever the interval is closed and the associated descriptive values are
known. Adding incomplete or uncertain time interval data to the system
would affect the quality of the analysis. Nevertheless, it occurs that added
data is classified as noise, e.g., by applying clustering algorithms, and
therefore has to be deleted. In addition, users may be able to update infor-
mation, which was assumed to be complete, within a source system. Thus,
these updates must be reflected within the information system.
Listing 5.6 illustrates the syntax of statements using the INSERT com-
mand. The statement specifies the identifier of the model, the structure of
the data to be added, and the values.
Listing 5.6: Syntax of statements using the INSERT command of the DML.
INSERT INTO [modelId|"modelId"] (id1 [, id2, ...])
VALUES (value1 [, value2, ...]) [,(value1 [, value2, ...]), ...]
The structure is defined by the identifiers of the descriptors, as well as the
reserved words [START] and [END], which specify the position of the tem-
poral start and end value (i.e., the interval). It is also possible to add a
minus (i.e., -) to specify the interval as open, e.g., [START-] or [END-]. An
example of a statement using the INSERT command exemplifies the men-
tioned aspects:
INSERT INTO myAppleObservations
(COLOR, CLASS, [START], WEIGHT, [END‐], FALL, DURATION)
VALUES ('red', '2', 09:45:12, '220', 09:45:48, '1.00', '0.45') .
The statement adds the time interval data used in the apple falling from
tree example (cf. section 2.1.1) into a model, which is loaded into the sys-
tem and named myAppleObservations36. It is noticeable that the temporal
information provided within the list of values does not use any apostrophe.
A temporal value is generally not marked and can be a date-time (the syn-
tax allows several different formats, i.e., ANSI INCITS 30-1997 (R2008),
NIST FIPS PUB 4-2, ISO 8601, and some non-standardized) or integer
value. The handling of integer values is defined by the time axis, i.e., the
semantic meaning of the number (cf. section 7.3.1). In the example, the
interval is defined as half-open, i.e., [START, END). Thus, the system has
to interpret the temporal information 09:45:48 as 09:45:47 (assuming that
a second granularity is defined).
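The interpretation of a half-open end value at second granularity can be sketched as follows (the helper name is illustrative):

```python
# Sketch: interpreting a half-open end value [START, END) at second
# granularity, as in the example where 09:45:48 is interpreted as
# 09:45:47.
from datetime import datetime, timedelta

def closed_end(end, open_end, granularity=timedelta(seconds=1)):
    return end - granularity if open_end else end

end = datetime(2000, 1, 1, 9, 45, 48)
assert closed_end(end, open_end=True).strftime("%H:%M:%S") == "09:45:47"
assert closed_end(end, open_end=False).strftime("%H:%M:%S") == "09:45:48"
```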
To add, e.g., several thousand time interval data records into the sys-
tem, a bulk load can be enabled. If the bulk load is enabled, the system
only updates indexes or persists data when needed, e.g., because it is run-
ning low on memory, until the bulk load is finished. Listing 5.7 shows the
syntax of the statement to enable (i.e., bulkload = true) and disable (i.e.,
bulkload = false) the bulk load.
36 The name of a model is specified within the configuration file of a model (cf. section 7.2.1).
Listing 5.7: Syntax of the statement to enable or disable bulk load for a model.
MODIFY MODEL [modelId|"modelId"] SET bulkload = [true|false]
The deletion of time interval data records added to the system is per-
formed using statements utilizing the DELETE command. The syntax of
such statements is illustrated in Listing 5.8. As shown, the declaration of a
record identifier is necessary. The deletion of records by filter criteria (e.g.,
as known from SQL) is not supported. As mentioned, the deletion of a rec-
ord is decided on a record level. Thus, the record identifier is known, e.g.,
by selection or from a result of an analysis.
Listing 5.8: Syntax of the statement to delete a specified record from a model.
DELETE recordId FROM [modelId|"modelId"]
Updating a time interval record is, like a delete statement, based on the
record's identifier. Within an update statement, all information can be mod-
ified with the exception of the record's identifier. The syntax of a statement
using the UPDATE command is illustrated in Listing 5.9. Unlike an insert
statement, an update statement can only include a single record. Thus, the
syntax only supports one value list.
Listing 5.9: Syntax of statements using the UPDATE command of the DML.
UPDATE recordId FROM [modelId|"modelId"] SET (id1 [, id2, ...])
VALUES (value1 [, value2, ...])
5.3.2 Get & Alive Statements
For an information system used in a productive environment, some addi-
tional non-data-related information must be available. On the one hand,
this information may be provided by an API (e.g., via a web interface using
JSON or libraries); on the other hand, the user may want to use the infor-
mation within a report, a dashboard, or any other proprietary tool using a
database connection. To support the latter, the GET and ALIVE commands
are added to the DML. Some may argue that such commands are not part
of a DML. However, read-only queries are often considered to be part of
the DML.
Listing 5.10 shows the available syntax for statements based on the
GET command. The language supports five different types of meta-infor-
mation to be retrieved. GET VERSION is used to retrieve the version of the
information system, GET MODELS provides a set of records containing the
available models, GET USERS returns a list of all users together with the
assigned permissions and roles, GET ROLES lists the roles and assigned
permissions, and GET PERMISSIONS responds with a set of all permis-
sions defined for the information system.
Listing 5.10: Syntax of statements using the GET command of the DML.
GET [VERSION|MODELS|USERS|ROLES|PERMISSIONS]
In addition, the availability of the system is of importance, e.g., to mon-
itor the service's health. To provide a quick possibility to check the system's
health, the ALIVE command is added to the DML. The system replies to an
alive statement with an empty set. If the system's health is critical, the sys-
tem will not reply at all or will throw an exception, which would lead to an
exception on the client side.
5.3.3 Select Statements
Most of the requested features mentioned regarding the analytical capabil-
ities of the information system deal with select statements, e.g.,
several aggregation methods must be available (cf. DA-01, DA-02), the raw
time interval data records must be retrievable (cf. DA-03), dimensional op-
erations like roll-up and drill-down must be provided (cf. DA-04, DA-05),
time zones must be supported (cf. DA-06, DA-07), and analytical results must
be creatable (cf. DA-08). To satisfy in particular the ease-of-use, consistency,
and clarity criteria, the select statements are grouped into three types: time
series, records (i.e., raw data), and analytical results.
Select Time Series
Listing 5.11 outlines the syntax of a statement to retrieve time series from the system within a specified time window. The query determines a time series for each group and measure specified. In addition, it is possible to retrieve a transposed time series, which is necessary for some third-party tools or libraries, e.g., the JFreeChart37 library expects transposed time series and is used by several Java based reporting and business intelligence tools38. Also, the statement specifies the model to retrieve the data from, as well as the interval. An interval can thereby be defined using open, closed, or half-open notation. Depending on the time axis, the values of the intervals' endpoints must be integers or date-time values, e.g.:
[5, 10], [13.10.1981, 08.04.2005), (2014/10/05 09:58:00, 2014/10/09 16:12:00) .
Listing 5.11: Syntax of the select statement to retrieve time series of a specified time window.
SELECT [TRANSPOSE(TIMESERIES)|TIMESERIES]
OF measureExpr1 [AS "alias1"] [, measureExpr2 [AS "alias2"], ...]
[ON timeDimensionalExpr] FROM [modelId|"modelId"] IN interval
[WHERE logicalExpr] [GROUP BY groupExpr]
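To illustrate the syntax of Listing 5.11, the following is a hypothetical query; the model id sales, the measure expression SUM(VALUE), and the time dimensional expression TIME.DEF.DAY are illustrative assumptions (only WORLD.GEO.COUNTRY is taken from the sample dimension used later):

```
SELECT TIMESERIES OF SUM(VALUE) AS "daily total"
ON TIME.DEF.DAY
FROM sales IN [01.01.2015, 31.01.2015]
GROUP BY WORLD.GEO.COUNTRY
```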
The syntax of the statement to select time series is based on several expressions not explained so far: the measure expression (i.e., measureExpr), the dimensional expression (i.e., dimensionalExpr, timeDimensionalExpr, or descDimensionalExpr), the logical expression (i.e., logicalExpr), and the group expression (i.e., groupExpr). Prior to introducing these different expressions, the syntax of the statements to select time interval records and analytical results is introduced.
Select Records
Selecting records from the system is an important feature for analytical purposes (e.g., data mining algorithms), as well as for explanation, e.g., to help the analyst understand the result of an aggregation by presenting the involved records. Listing 5.12 shows the syntax of a statement to select records from the information system. Instead of retrieving the raw records, it is also possible to count them or to retrieve only their identifiers.
37 http://www.jfree.org/jfreechart/
38 E.g., Pentaho (pentaho.com), JasperSoft (jaspersoft.com), or YellowFin (yellowfinbi.com).
Listing 5.12: Syntax of the select statement to retrieve time interval records from the information system.
SELECT [RECORDS|COUNT(RECORDS)|IDS(RECORDS)]
FROM [modelId|"modelId"]
[EQUALTO|BEFORE|AFTER|MEETING|DURING|CONTAINING|STARTINGWITH|
FINISHINGWITH|OVERLAPPING|WITHIN] interval
[WHERE [logicalExpr|idExpr]] [LIMIT int[, int]]
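Analogously, a hypothetical select records statement following Listing 5.12 could read as follows; the model id sales and the descriptor LOCATION are illustrative assumptions:

```
SELECT IDS(RECORDS) FROM sales
WITHIN [01.01.2015, 31.01.2015]
WHERE LOCATION = 'Aachen*'
LIMIT 0, 10
```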
The syntax introduces ten temporal operators: EQUALTO, BEFORE, AFTER, MEETING, DURING, CONTAINING, STARTINGWITH, FINISHINGWITH, OVERLAPPING, and WITHIN. The interested reader may notice that Allen introduced thirteen temporal relationships (cf. section 2.1.4). When using a temporal relationship within a query, the user defines one of the intervals used for comparison. Thus, the inverse relationships (i.e., the inverses of meets, overlaps, starts, and finishes) were removed, because they are not needed; instead, the user can modify the self-defined interval. Furthermore, the WITHIN operator is added to retrieve all intervals having at least one common chronon with the time window. Figure 5.1 depicts the available operators and the relations covered. In addition, an example is provided illustrating the intervals fulfilling the query.
Figure 5.1: Illustration of the provided temporal operators and their corresponding temporal relation.
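The semantics of a few of these operators on a discrete time axis with closed integer intervals can be sketched as follows. The class is illustrative and not the system's implementation; in particular, the discrete readings of MEETING (end exactly one chronon before the window) and BEFORE (at least one chronon gap) are assumptions:

```java
// Sketch: selected temporal operators on closed integer intervals
// [s, e], evaluated against a query window [ws, we]. Method names
// mirror the query keywords; the implementation is illustrative.
public final class TemporalOps {
    // EQUALTO: interval and window cover exactly the same chronons
    public static boolean equalTo(long s, long e, long ws, long we) {
        return s == ws && e == we;
    }
    // BEFORE (assumed Allen-style): interval ends with a gap before the window
    public static boolean before(long s, long e, long ws, long we) {
        return e + 1 < ws;
    }
    // MEETING (assumed discrete reading): interval ends one chronon before the window
    public static boolean meeting(long s, long e, long ws, long we) {
        return e + 1 == ws;
    }
    // DURING: interval lies strictly inside the window
    public static boolean during(long s, long e, long ws, long we) {
        return s > ws && e < we;
    }
    // WITHIN: interval shares at least one common chronon with the window
    public static boolean within(long s, long e, long ws, long we) {
        return s <= we && e >= ws;
    }
}
```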
Regarding the utilized expressions, the select records statement uses a
logical expression or an identifier expression to filter the received set of
records. Within the next subsection, the statements to select analytical re-
sults are presented. Thereafter, the different expressions are introduced
and discussed in detail.
Select Analytical Results
Analytical results can be queried by using the ANALYTICALRESULT keyword within a SELECT statement. An analysis is defined within the information system, i.e., by providing a script or an implementation. The system fires the specified select time series or select records statements and streams the results to the specified algorithm. In addition, parameters may be defined to configure the algorithm. Listing 5.13 illustrates the syntax of the select analysis statement. The algorithm is referred to by name (cf. section 7.2.2) or directly by specifying the fully qualified class.
Listing 5.13: Syntax of the select statement to retrieve analytical results from the information system.
SELECT ANALYTICALRESULT OF /statement1/ [, /statement2/, ...]
USING ['algorithm'|'class']
[SET param1 = 'value1' [, param2 = 'value2', ...]]
In the following, the different expressions are defined and examples are presented, starting with the measure expressions used in the statement to select time series.
Measure Expressions
Measure expressions are based on facts provided and associated with the descriptors of the model (cf. section 4.4). An expression is defined by descriptors, mathematical operators, and aggregation operators, e.g.:
SUM(DESC1 * (DESC2 / DESC3)) + MIN(DESC4) .
In general, the aggregation operator is not specified within the syntax of a measure expression. The reason is extensibility regarding new operators. The implementation presented in section 7.3.4 supports the definition of new operators programmatically. These operators can directly be used within the query language without any additional effort. In addition, a measure expression can also be applied for a specific dimensional level. To support the TAT aggregation technique presented in section 2.1.2, a second aggregation operator can be specified, if and only if a dimensional expression is specified within the query39, e.g.:
MAX(SUM(DESC1 * (DESC2 / DESC3)) + MIN(DESC4)) + MIN(COUNT(DESC1)) .
The select time series statement supports the STA and TAT aggregation techniques for measures, using levels to specify the partition of the time axis. A time series cannot apply any aggregation of equal results along the time axis, as done by ITA or MWTA; if it did, the result of the query would not be a time series, i.e., it would not have a value calculated for each time point of the time window. However, ITA and MWTA can be calculated in linear time by iterating over the sorted values (e.g., by using an analytical function introduced later in this section).
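The two-level (TAT-style) aggregation described above can be sketched as follows: the inner operator is applied per partition of the time axis, the outer operator across the partition results. The class, the fixed partition size, and the choice of SUM/MAX are illustrative assumptions, not the system's implementation:

```java
// Sketch: two-level aggregation, e.g., MAX(SUM(...)) over a partition
// of the time axis (partition size and operators are illustrative).
public final class TwoLevelAggregation {
    public static double maxOfSums(double[] values, int partitionSize) {
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i += partitionSize) {
            double sum = 0;
            for (int j = i; j < Math.min(i + partitionSize, values.length); j++) {
                sum += values[j]; // inner aggregation: SUM per partition
            }
            max = Math.max(max, sum); // outer aggregation: MAX across partitions
        }
        return max;
    }
}
```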
Dimensional Expressions
In general, a dimensional expression is used to refer to a defined level of a dimension (cf. section 4.4). A user utilizes a dimensional expression to roll up (generalize) or drill down (specify) the different levels of a hierarchy. Depending on the type of dimension expected, the expression can be specified to be a time or a descriptor dimensional expression. Independent of the type is the syntax of such an expression, which is exemplified as follows:
DimensionIdentifier.HierarchyIdentifier.LevelIdentifier .
The expression consists of three parts, each referring to a specified part of the dimension using a unique identifier40. Figure 5.2 shows a sample dimension named "World" that is identified by WORLD. The illustrated dimension has two hierarchies, of which only one is shown, namely the hierarchy Geographic location, identified by GEO. The hierarchy GEO has three levels, i.e., World (identified by *), Country (identified by COUNTRY), and City (identified by CITY). Each of the defined levels has at least one member.
39 The TAT expects the specification of a partition of the time axis, cf. Figure 2.4.
40 Unique according to its context, i.e., the dimension's identifier is unique among all dimensions, the hierarchy's identifier is unique among all hierarchies of the specified dimension, and the level's identifier is unique among all levels of the hierarchy.
Figure 5.2: Sample dimension showing one of two hierarchies with three levels.
Following the presented syntax of a dimensional expression, an expression
to select, e.g., the level named Country, would be:
WORLD.GEO.COUNTRY .
Logical Expressions
A logical expression is used within a select statement to filter the time interval data records retrieved. The query language supports the following logical connectives: AND, OR, and NOT. In addition, the system supports the equal operator and the usage of parentheses to formalize complex logical expressions. Furthermore, to specify multiple values, wildcards are supported by the equal operator, e.g.:
NOT(DESC1 = 'A*' OR DESC2 = 'LESS') AND DESC3 = 'VALID' .
The example shows a logical expression filtering data by the descriptor values of the specified descriptors. In addition, it is possible to use dimensional expressions as filter criteria. In that case, the information system selects all time interval records which have a member on the specified level with the specified value, e.g., assuming the dimension shown in Figure 5.2, the following logical expression filters all intervals associated to the USA:
WORLD.GEO.COUNTRY = 'COUNTRY_USA' .
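One possible way to evaluate the equal operator with '*' wildcards is to translate the pattern into a regular expression. The helper class below is hypothetical and only illustrates the idea; it is not the system's parser:

```java
import java.util.regex.Pattern;

// Sketch: evaluating an equal operator with '*' wildcards, as used in
// logical expressions like DESC1 = 'A*' (illustrative translation).
public final class WildcardEqual {
    public static boolean matches(String value, String pattern) {
        // quote the literal parts, turn each '*' into '.*'
        StringBuilder regex = new StringBuilder();
        for (String part : pattern.split("\\*", -1)) {
            if (regex.length() > 0) regex.append(".*");
            regex.append(Pattern.quote(part));
        }
        return value.matches(regex.toString());
    }
}
```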
Identifier Expressions
Using a logical expression to filter data does not provide any possibility to select records by their identifier. To enable the user to do so, identifier expressions are introduced. An identifier expression specifies a list of identifiers that should be returned, e.g.:
[ID] = 1, 5, 7, 12 .
Group Expressions
Group expressions are used to specify the groups of data to be aggregated. A group expression can be based on several descriptors or a level of a dimension. It is also possible to specify several criteria to form a group, e.g., assuming a model with two descriptors temp = {high, middle, low} and gender = {male, female}, the following group expression would generate six groups, namely (male, high), (male, middle), (male, low), (female, high), (female, middle), and (female, low):
GENDER, TEMP .
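The generation of the groups as the cross product of the specified value sets can be sketched as follows; the helper class is illustrative, not the system's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a group expression over several descriptors generates the
// cross product of the descriptors' value sets (illustrative only).
public final class GroupGenerator {
    public static List<List<String>> groups(List<List<String>> valueSets) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start with the empty group
        for (List<String> values : valueSets) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String v : values) {
                    List<String> g = new ArrayList<>(prefix);
                    g.add(v);
                    next.add(g);
                }
            }
            result = next;
        }
        return result;
    }
}
```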
As already mentioned, it is also possible to use a level of a dimension as grouping criterion. For example, assume a third descriptor city = {Aachen, Cologne, Jacksonville, San Francisco} within our model and the dimension depicted in Figure 5.2. The following group expression generates ten groups ((Germany, male), (Germany, female), (USA, male), (USA, female), (Vatican City State, male), (Vatican City State, female), (Unknown, male), (Unknown, female), (France, male), and (France, female)):
WORLD.GEO.COUNTRY, GENDER .
A group expression generates all groups, independent of whether data is associated to the group or not. To include or exclude specific groups, a group expression utilizes the include and exclude keywords, e.g.:
WORLD.GEO.COUNTRY, GENDER
include {('Germany', 'male')} exclude {('*', 'male')} .
The above example would select six groups, excluding all the groups containing male, but, because of the higher priority, including the group ('Germany', 'male'). The higher priority of include is chosen for usability reasons. Users who were asked stated that a specified include is typically more specific than a specified exclude, i.e., when both keywords are used, the include keyword defines the values which should still be included, even if the exclude keyword states otherwise.
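The include-over-exclude priority can be sketched as a filter that keeps a group if it matches an include pattern, or if it matches no exclude pattern; '*' matches any value. The class is illustrative, not the system's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: include/exclude filtering of generated groups, with include
// taking priority over exclude; '*' matches any value (illustrative).
public final class GroupFilter {
    static boolean matches(List<String> group, List<String> pattern) {
        for (int i = 0; i < group.size(); i++) {
            if (!"*".equals(pattern.get(i)) && !pattern.get(i).equals(group.get(i))) {
                return false;
            }
        }
        return true;
    }
    public static List<List<String>> filter(List<List<String>> groups,
            List<List<String>> includes, List<List<String>> excludes) {
        return groups.stream()
            .filter(g -> includes.stream().anyMatch(p -> matches(g, p))
                      || excludes.stream().noneMatch(p -> matches(g, p)))
            .collect(Collectors.toList());
    }
}
```

Applied to the ten groups of the example above with include {('Germany', 'male')} and exclude {('*', 'male')}, the filter keeps the five female groups plus ('Germany', 'male').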
5.4 Summary
In this chapter, the TIDAQL was presented. The following overview lists the feature requests involving or addressing aspects relevant for the query language. As shown and argued, the query language covers the desired features.
– DA-01 and DA-02 influenced the definition of the query language regarding the aggregation operators. The query language supports the application of any kind of aggregation operator; thus, from a language perspective the requirement is fulfilled. The processing of aggregation operators is introduced in section 7.3.4.
– DA-03 requests the existence of a mechanism to retrieve raw time interval data. Thus, the selection of records is added to the DML. The temporal operators were introduced and explained in detail (cf. Figure 5.1).
– DA-04 and DA-05 formalize requirements regarding OLAP operators (i.e., roll-up and drill-down). As introduced, the selection of time series supports the usage of dimensions and therefore roll-up and drill-down operations. Figure 5.3 illustrates the operations for a time dimension (from the lowest granularity (minutes) to hours) and a descriptor dimension (from work-area to an organization type).
Figure 5.3: Usage of the query language features ON and GROUP BY to enable roll-up and drill-down operations.
– DA-08 and PD-02 require the definition of a SELECT command to re-
trieve time series, as well as analytical results. Thus, the part of the
DML covering these requests is based on the defined requirements.
– DC-02 requests the existence of INSERT and DELETE commands.
Both commands are introduced and part of the language (cf. section
5.3.1).
In addition, the introduced language follows the guidelines of Snodgrass and of Catarci and Santucci regarding the mentioned design criteria: expressive power (e.g., covering the requested features), consistency (e.g., following the SQL standard, which is well known by most analysts; using the same keywords across different statements), clarity (e.g., all statements can be easily understood even by non-experts41), minimality (e.g., most of the keywords are well known from SQL; additional keywords increase readability and therefore the ease-of-use and clarity of the language), orthogonality, independence, and ease-of-use (e.g., adding synonyms for specific tokens like ROLES, or FILTER BY instead of WHERE).
TIDAQL is the answer to the third RQ "How can a query language for the purpose of analyzing time interval data […] be formulated". The presented language is, as mentioned, designed to fulfill the formulated features of analysts working with time interval data on a daily basis. Nevertheless, further features will arise in the future, and the presented language has to adapt to these new requirements. In section 8.1, the fulfillment of the different features is evaluated and user comments regarding enhancements are shown.
41 The feedback of the inexperienced users during the development of the language was very positive regarding the readability.
6 TIDADISTANCE: Similarity of Time Interval Data
The similarity between time interval datasets, or e-sequences as named by Kostakis et al. (2011) and Kotsifakos et al. (2013), is a domain-specific measure. Thus, a flexible distance measure is needed to determine the similarity between two sets of time interval data. So far, three similarity measures have been introduced, i.e., DTW and ARTEMIS (Kostakis et al. 2011), as well as IBSM (Kotsifakos et al. 2013). As described in section 3.5, these measures differ regarding the produced results. However, which of these three techniques is the most accurate regarding similarity is context-dependent, even if Kotsifakos et al. (2013) describe IBSM as the more precise technique42. In general, three different types of similarity can be distinguished: order similarity, measure similarity, and relational similarity. ARTEMIS is a similarity measure fitting into the category of relational similarity, whereas IBSM and DTW are measures categorized as order similarity. More specifically, the order similarity is a special case of measure similarity, using count as measure (both DTW and IBSM utilize count as measure). However, for some domains the order similarity may be useful as a base similarity needed to implicitly include, e.g., gaps between intervals. Figure 6.1 illustrates the different types and an example of equal datasets, i.e., the similarity is 100 % or, in other words, the distance between the sets is 0.
Regarding an information system, the examples depicted in Figure 6.1 motivate the need for a context dependent configuration of a similarity measure. In this chapter, a similarity measure combining order, measure, and relational similarity is introduced. The user is capable of weighting the influence of the different similarities, depending on the context. In section 7.3.5, the bitmap-based implementation is explained, which, as shown in section 8.2.4, outperforms DTW, ARTEMIS, and IBSM. In the following sections, the different types of similarities are defined by introducing a distance
42 Which is the case comparing IBSM and DTW. The DTW implementation has several "false hits" because of the possibility to warp. Nevertheless, comparing IBSM with ARTEMIS is difficult, because the algorithms compare different aspects of time interval datasets; thus, it is like comparing apples and oranges.
measure for each type, i.e., temporal order distance in section 6.1, tem-
poral relational distance in section 6.2, and temporal measure distance in
section 6.3. In section 6.4, the similarity measure used to combine the dif-
ferent distances is defined.
Figure 6.1: Overview of the different types of similarity, presenting an equality example for each type of measure.
6.1 Temporal Order Distance
The temporal order similarity ensures that the intervals are ordered similarly according to the temporal order. Equally labeled intervals which meet each other are, regarding the temporal order, considered to be equal to one interval covering the same time span (cf. Figure 6.1). Thus, the number of intervals is not considered to be a criterion for similarity. Instead, the number of occurrences of equally labeled intervals at a specific time point is used to determine similarity. To compare the different amounts at a specific time point, it is important to define which time points are matched. Regarding temporal data, this is mostly dataset, or more precisely time axis, dependent. A possible strategy is to compare the time points with the same offset, i.e., the first amount of the first dataset is compared to the first amount of the second dataset, the second with the second, and so on. However, other strategies may be better suited, like starting the comparison on the first Monday. Therefore, the definition must include a function to match time points. Figure 6.2 illustrates two different matching strategies, which may be utilized depending on the domain. The weekday match is used to match the first weekday (e.g., 2015-01-01 was a Thursday) to the first equal weekday (e.g., 2015-02-05 was the first Thursday in February 2015).
Figure 6.2: Illustration of two different matching strategies, i.e., weekday and order match.
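Assuming a daily time axis represented by java.time.LocalDate, the two matching strategies can be sketched as follows; the class and method names are illustrative, and an unmatchable time point is mapped to null:

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch of two matching strategies for daily time axes: order match
// (same offset) and weekday match (first equal weekday), illustrative.
public final class TimePointMatcher {
    // order match: the i-th day of S is compared with the i-th day of T
    public static LocalDate orderMatch(LocalDate sStart, LocalDate tStart,
            LocalDate tEnd, LocalDate s) {
        LocalDate t = tStart.plusDays(ChronoUnit.DAYS.between(sStart, s));
        return t.isAfter(tEnd) ? null : t; // null: unmatchable time point
    }
    // weekday match: align the first day of S with the first day of T
    // having the same weekday, then match by offset
    public static LocalDate weekdayMatch(LocalDate sStart, LocalDate tStart,
            LocalDate tEnd, LocalDate s) {
        DayOfWeek dow = sStart.getDayOfWeek();
        LocalDate anchor = tStart;
        while (anchor.getDayOfWeek() != dow) anchor = anchor.plusDays(1);
        LocalDate t = anchor.plusDays(ChronoUnit.DAYS.between(sStart, s));
        return t.isAfter(tEnd) ? null : t;
    }
}
```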
In addition, the handling of unmatchable time points has to be defined (e.g., when comparing daily values from January with values from June, the 31st value cannot be matched with any time point from June). Several strategies may be considered, e.g.:
– comparing the unmatchable time point with 0,
– ignoring the unmatchable time point entirely (i.e., using a distance of 0),
– resizing the series using, e.g., bilinear interpolation (cf. IBSM), or
– using a special technique as matching strategy (cf. DTW).
However, regarding temporal data, bilinear interpolation or the usage of a special technique like DTW is typically a bad choice. In general, when comparing, e.g., months on a daily basis, it makes sense to ignore unmatchable time points and consider only matching time points. Based on this explanation, the definition of the temporal order distance is presented.
Definition 17: Temporal Order Distance
Let S and T be two sets of time intervals. Furthermore, let $\mathbb{T}_S$ and $\mathbb{T}_T$ be the totally ordered sets of time points for each set and let L be the set of all labels (i.e., groups) defined. In addition, the function $match\colon \mathbb{T}_S \to \mathbb{T}_T \cup \{null\}$ is defined as the function used to map a time point of S to a time point of T, or to null if the time point cannot be mapped. Let the function $count\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{N}_0$ be the function used to count the intervals with a specific label at a specific time point. The distance $\mathrm{TODist}$ between S and T is defined as
$$\mathrm{TODist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} to(l, t)$$
with
$$to(l, t) := \left|\, count(l, S, t) - count(l, T, match(t)) \,\right| .$$
The definition covers the need for the possibility to specify a matching function (i.e., the match function), as well as a possibility to define how to handle unmatched time points (i.e., the count function). The match function also covers the usage of an interpolation function. The DTW-based distance presented by Kostakis et al. (2011) is not covered by this definition. Nevertheless, the result of applying DTW within the context of temporal order is questionable, and IBSM showed that a fixed time point based approach achieves better results.
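A sketch of Definition 17 for integer time points is given below, representing each dataset by per-label count arrays and the match function by an index mapping (a negative index marks an unmatchable time point, which is ignored here). The class and the data representation are illustrative assumptions:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.IntUnaryOperator;

// Sketch of Definition 17: sums, over all labels and time points of S,
// the absolute difference of per-label interval counts at matched time
// points (illustrative representation, not the system's implementation).
public final class TemporalOrderDistance {
    // counts.get(l)[t] = number of intervals with label l at time point t;
    // match maps a time point of S to one of T, or -1 if unmatchable
    public static int distance(Map<String, int[]> s, Map<String, int[]> t,
            IntUnaryOperator match) {
        Set<String> labels = new HashSet<>(s.keySet());
        labels.addAll(t.keySet());
        int lenS = s.values().iterator().next().length; // assumes s is non-empty
        int dist = 0;
        for (String l : labels) {
            for (int i = 0; i < lenS; i++) {
                int j = match.applyAsInt(i);
                if (j < 0) continue; // ignore unmatchable time points
                int cs = s.containsKey(l) ? s.get(l)[i] : 0;
                int ct = t.containsKey(l) ? t.get(l)[j] : 0;
                dist += Math.abs(cs - ct);
            }
        }
        return dist;
    }
}
```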
6.2 Temporal Relational Distance
The list of possible temporal relations between two intervals is presented in section 2.1.4. As mentioned, several definitions of relations exist. Therefore, the definition of a distance measure should not prescribe a specific set of temporal relations. Nevertheless, a specific set of temporal relations and the possibility to determine a unique relation between two intervals has to be selected to apply the distance measure. The algorithm to calculate the distance determines the relations of a provided dataset and compares them with the relations of the second dataset. In the case of ARTEMIS, the Hungarian algorithm is applied to match the different relations between the intervals of the two sets. The definition presented in this section utilizes the temporal order given by the time axis to define how a set of relations is matched with another. A relation is thereby associated to time points. This ensures that the distance is comparable to the other time point based distances introduced in this chapter. Thus, a vector of the count of all relations can be determined for each time point. Figure 6.3 shows an example of assignments of relations to time points.
Figure 6.3: Example of assignments of relations to time points using Allen's (1983) relations.
The figure exemplifies that a relation is associated to specific time points, e.g., the overlaps relation between A (4) and A (2) is associated to the time points covered by [1, 4]. In addition, to avoid redundancy, only one of the paired relations is recognized, e.g., instead of using the relations ends and ends-by, only the relation ends is considered (cf. section 2.1.4, Figure 2.10). Table 6.1 shows the formulas used to calculate the time points covered by a relation.
Table 6.1: Overview of the time points calculation for a specific relation (A := [a1, a2], B := [b1, b2] with A rel B).

relation rel              covered time points
overlaps                  [b1, a2]
begins                    [b1, b2]
includes                  [b1, b2]
ends directly before      [a2, b1]
ends                      [b1, b2]
equal                     [b1, b2]
before                    [a2 + 1, b1 – 1]
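The time points covered by a relation, following the pairs listed in Table 6.1, can be sketched as follows (closed integer intervals; the class and the returned {from, to} representation are illustrative assumptions):

```java
// Sketch: time points covered by a relation A rel B with A = [a1, a2]
// and B = [b1, b2], following Table 6.1 (illustrative only).
public final class RelationTimePoints {
    public static long[] covered(String rel, long a1, long a2, long b1, long b2) {
        switch (rel) {
            case "overlaps":             return new long[]{b1, a2};
            case "begins":
            case "includes":
            case "ends":
            case "equal":                return new long[]{b1, b2};
            case "ends directly before": return new long[]{a2, b1};
            case "before":               return new long[]{a2 + 1, b1 - 1};
            default: throw new IllegalArgumentException("unknown relation: " + rel);
        }
    }
}
```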
As mentioned in section 6.1, the support of matching strategies, as well as of unmatchable time points, should be covered by the distance. Thus, the temporal relational distance is defined as follows.
Definition 18: Temporal Relational Distance
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. Furthermore, let the function $rel_{type}\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{N}_0$ be the function used to count the relations of a specific type (i.e., overlaps, begins, includes, ends directly before, or equal) with a specific label at a specific time point. The distance $\mathrm{TRDist}$ between S and T is defined as
$$\mathrm{TRDist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} tr(l, t)$$
with
$$tr(l, t) := \sum_{type} \left|\, rel_{type}(l, S, t) - rel_{type}(l, T, match(t)) \,\right| .$$
6.3 Temporal Measure Distance
The measure distance between two sets of intervals is determined by calculating the distance between each measure for each time point of a group. Thus, the challenges mentioned in section 6.1, regarding the matching of time points as well as the handling of unmatchable time points, also apply to this measure.
Definition 19: Temporal Measure Distance
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. In addition, let the function $measure\colon L \times ((\{S\} \times \mathbb{T}_S) \cup (\{T\} \times (\mathbb{T}_T \cup \{null\}))) \to \mathbb{R}$ be the function used to determine the measure of the intervals with a specific label at a specific time point. The distance $\mathrm{TMDist}$ between S and T is defined as
$$\mathrm{TMDist}(S, T) := \sum_{l \in L,\; t \in \mathbb{T}_S} tm(l, t)$$
with
$$tm(l, t) := \left|\, measure(l, S, t) - measure(l, T, match(t)) \,\right| .$$
The definition of the temporal measure distance shows that it is a generalized version of the temporal order distance. However, as argued earlier, using the count function as measure implicitly adds several temporal aspects to the distance. In addition, the existence of a measure distance allows the comparison of specific, e.g., business-related, measures (e.g., finding a day with the same use of resources).
6.4 Temporal Similarity Measure
All presented distance measures support the usage of a matching function and the definition of unmatchable time points. Nevertheless, to combine the different distance measures into a single similarity measure (cf. DA-07), it is necessary that the different values are normalized. Thus, each distance calculated for a specific label at a specific time point is normalized using the maximal achievable distance.
Definition 20: Temporal Similarity Measure
Let S, T, $\mathbb{T}_S$, $\mathbb{T}_T$, L, and match be defined as stated in Definition 17. In addition, let $max_{to}$, $max_{tr}$, and $max_{tm}$ be defined as the maximal distance possible for a specific label and time point, i.e.,
$$max_{to}(l, t) := \max\bigl( count(l, S, t),\; count(l, T, match(t)) \bigr),$$
$$max_{tr}(l, t) := \max\Bigl( \sum_{type} rel_{type}(l, S, t),\; \sum_{type} rel_{type}(l, T, match(t)) \Bigr), \text{ and}$$
$$max_{tm}(l, t) := \max\bigl( measure(l, S, t),\; measure(l, T, match(t)) \bigr).$$
Based on the maximal distance, the similarity is defined as
$$sim := 1 - \frac{\displaystyle \sum_{l \in L,\; t \in \mathbb{T}_S} \left( w_{to}\,\frac{to(l, t)}{max_{to}(l, t)} + w_{tr}\,\frac{tr(l, t)}{max_{tr}(l, t)} + w_{tm}\,\frac{tm(l, t)}{max_{tm}(l, t)} \right)}{\text{amount of matched time points} \cdot \text{amount of labels}}$$
with $w_{to}$, $w_{tr}$, and $w_{tm}$ being the weighting factors, with $w_{to} + w_{tr} + w_{tm} = 1$.
For simplicity, the division by zero (i.e., when the maximal distance is zero) is not handled within the formula. Nevertheless, if the maximal distance is zero, the quotient is assumed to be zero, i.e., the distances are assumed to be equal. A similarity of 1 means that the results are equal (i.e., a similarity of 100 %), whereas a similarity of 0 indicates that the sets are as different as possible (i.e., a similarity of 0 %).
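Definition 20 can be sketched for a single label with pre-computed per-time-point distances and maxima as follows; the class and the array-based representation are illustrative assumptions, and a maximal distance of zero contributes zero, as stated above:

```java
// Sketch of Definition 20: weighted combination of normalized order,
// relational, and measure distances (illustrative representation).
public final class TemporalSimilarity {
    static double norm(double dist, double max) {
        return max == 0 ? 0 : dist / max; // zero maximum: contribution of zero
    }
    // arrays hold per-time-point distances and maxima for one label;
    // the weights must sum to 1
    public static double similarity(double[] to, double[] maxTo,
            double[] tr, double[] maxTr, double[] tm, double[] maxTm,
            double wTo, double wTr, double wTm, int labels) {
        double sum = 0;
        for (int t = 0; t < to.length; t++) {
            sum += wTo * norm(to[t], maxTo[t])
                 + wTr * norm(tr[t], maxTr[t])
                 + wTm * norm(tm[t], maxTm[t]);
        }
        return 1.0 - sum / (to.length * labels);
    }
}
```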
The temporal similarity measure is the answer to RQ5 "What similarity measure can be used to compare time interval datasets, enabling the search for similar subsets". The presented solution covers three different aspects of similarity: temporal order, temporal relation, and temporal measure. The importance of each aspect can be weighted by factors, depending on the use case. In addition, it enables the analyst to use matching functions and unmatchable time points to specify which time points are relevant for similarity.
7 TIDAIS: An Information System for Time Interval Data
In this chapter, an information system to analyze time interval data is pre-
sented. The system realizes the previously introduced TIDAMODEL, TIDAQL,
and TIDADISTANCE. The heart of the system is a bitmap-based data struc-
ture, which ensures a high performance when filtering and aggregating.
The chapter is structured as follows: First, the architecture of the system is presented and motivated along the requested features, as well as the already presented requirements arising from the definitions. In section 7.2, an XML configuration for a model and the system is introduced. The following section, i.e., section 7.3, presents selected challenges regarding the implementation of the system's components. In section 7.4, a prototype of a web-based GUI is shown. The chapter concludes with a summary of the presented results.
7.1 System’s Architecture, Components, and Implementation
The system’s architecture is depicted in Figure 7.1. The figure illustrates
the components and interfaces of the information system. Furthermore, the
provided services of the components are shown and the connections be-
tween consumers and services are illustrated.
The different components are motivated by the different features and requirements defined within the previous chapters. First of all, a JDBC and an HTTP interface providing the data of the system are requested (cf. VIS-01 and VIS-05). In addition, a default GUI should be available to perform monitoring tasks (e.g., checking the system health), administrative tasks (e.g., creating users or roles), and to visualize results (cf. CIS-03 and VIS-04). Another request deals with the possibility to subscribe to events triggered by the system; thus, a scheduler and an event manager must be available (cf. VIS-02, PD-01, and MA-02). To support a query language (cf. chapter 5), the system needs to parse and process the queries. In addition, an authentication and authorization instance is needed, ensuring the correct access
to and controlled usage of the system. Another needed component is responsible for pushing data into the system; more specifically, a data retriever is needed that loads the generated data into the system. The heart of the system is a data repository and a model manager. The former is needed to handle data internally which is pushed into the system (e.g., pre-processing (cf. DI-02, DI-03, and DI-04), event generation, applying aggregation operators, analyses (cf. MA-02 and PD-02), or indexing), whereas the latter manages the models (e.g., validation, loading, unloading, and deletion).
Figure 7.1: The architecture of the information system showing the high-level components.
In the following, the components which are realized using available open-source or proprietary libraries are listed and explained, and the used implementation is mentioned. Afterwards, i.e., within the subsections, components that are challenging to realize are introduced and described in detail.
– Authentication & Authorization: The component validates any access to the system. Thus, the most important tasks are user management (i.e., managing users and roles, defining permissions), session management (i.e., providing an HTTP interface which communicates across several connections forces the usage of sessions), and validation (i.e., who is accessing and which permissions are given). The implementation is based on the Apache Shiro43 framework, which "is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management" (Apache Shiro Group 2015). Apache Shiro supports authentication using pluggable data sources, e.g., Lightweight Directory Access Protocol (LDAP), JDBC, or Active Directory (AD). The information system integrates the framework through an API, so that a replacement can be performed transparently for the rest of the system.
– Data Retriever: The data retriever component is used to pull (e.g., by
polling or any wake-up) or push data into the system. In general, the
implementation offers an API to add pull or push data retrievers to the
system. Three base implementations of the API are implemented: read
data from file (i.e., CSV), retrieve data from a database (i.e., using a
SQL query), and load data directly from the configuration (cf. section
7.2.1). The implementation to retrieve data from a database is based
on the HikariCP44 connection pool manager, which is reported to be
one of the fastest connection pools available (cf. Brett Wooldridge
(2015)).
– Scheduler & Event Manager: To enable the system to perform sched-
uled tasks and trigger notification on certain events, the scheduler and
event manager component is added. The scheduler utilizes the Quartz
Scheduler45 and offers the planned creation of services based on the
43 http://shiro.apache.org 44 https://github.com/brettwooldridge/HikariCP 45 http://quartz-scheduler.org/
124 7 TidaIS: An Information System for Time Interval Data
available data. In addition, the event manager is a simple publish-sub-
scribe implementation using the default Java libraries (e.g., thread ex-
ecutor pools). The information system provides an API to integrate other
event managers or schedulers. Thus, the use of, e.g., a Java Message
Service (JMS) based approach could easily be realized.
– Service Handler: Providing services to the outer world is an important
aspect of the system. The service handler component is responsible
for the provided services, i.e., starting, stopping, handling requests, and
providing the results. Because of the features requested, the default
implementation provides two services: (1) an HTTP service handling
data requests (e.g., using asynchronous JavaScript and XML (AJAX))
and (2) a JDBC service capable of handling requests using the available
JDBC driver. The HTTP service is based on the Apache HTTPCompo-
nents46 library, using the HttpCore component of the library to handle
HTTP requests. In addition, a minimal, fast, lightweight, and simple
JSON library, namely minimal-json47, is used to wrap the results when
responding. JDBC requests are, after authentication and authorization,
forwarded to the parser and processor of the query language. Thus,
further implementations are not needed.
– TIDAQL Parser & Processor: The language introduced in chapter 5 is
parsed and processed by this component. The parser of the language
was created using ANTLR448, a tool to create parsers based on a spec-
ified grammar. The processing utilizes the data repository to, e.g., re-
trieve aggregated data, or results of analyses. Thus, the processing is
not further introduced in the context of the language. Instead, the dif-
ferent aspects to create a result are presented while explaining the
data repository in detail (cf. section 7.1.1).
– TIDAMODEL Manager & Loader: The model manager and loader are
responsible for providing the definitions of a model, e.g., the descriptors,
46 https://hc.apache.org 47 https://github.com/ralfstx/minimal-json 48 http://www.antlr.org
the integration processes, and concrete implementations, as well as to
manage the availability of a model. These different responsibilities are
introduced in more detail in section 7.3.1. Nevertheless, from an imple-
mentation point of view the component is realized by handling the dif-
ferent objects representing a model. The creation and assembling of
these objects is done using the Spring framework49. Specifically, a con-
figuration following the definition presented in section 7.2 is trans-
formed into a bean configuration and loaded using a default bean-fac-
tory provided by the Spring framework.
– TIDAUI: The GUI is shown in the figure as an external component, i.e.,
not part of the TIDAIS. In general, the GUI utilizes the provided HTTP
interface to retrieve data from and interact with the system. Nevertheless, the in-
formation system is completely separated from the GUI and another
implementation could be utilized without changing the information sys-
tem. The GUI is presented in detail in section 7.4.
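The publish-subscribe approach of the Scheduler & Event Manager component listed above can be sketched as follows. This is a minimal, synchronous Java sketch; the class and method names are illustrative only, as the actual implementation additionally dispatches asynchronously via thread executor pools.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal publish-subscribe sketch: handlers subscribe to named events
// (e.g., a core event like "core:query") and are notified on publish.
public class EventManager {
    private final Map<String, List<Consumer<Object>>> subscribers =
            new ConcurrentHashMap<>();

    public void subscribe(String event, Consumer<Object> handler) {
        subscribers.computeIfAbsent(event,
                e -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String event, Object payload) {
        subscribers.getOrDefault(event, List.of())
                   .forEach(h -> h.accept(payload));
    }
}
```

Replacing such a simple dispatcher with, e.g., a JMS-based implementation only requires another implementation behind the same API, which is exactly the extensibility the component offers.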
As mentioned earlier, within the next subsections, the not yet discussed
components are introduced, i.e., the Data Repository, as well as the Cache
& Storage component. These components are presented in more detail,
because their architecture is more complex (i.e., several subcomponents
are needed and open-source or proprietary solutions are not generally
available).
7.1.1 Data Repository
The data repository is the component responsible for all data related tasks
like pre-processing, aggregation, or analyses. In addition, the internally
used data representation, as well as the index structure is managed and
utilized. The component consists of the following subcomponents: pre-pro-
cessor, aggregator, analyses manager, TIDADISTANCE calculator, and the
index structure. Figure 7.2 illustrates the components and the connections
between them.
49 http://projects.spring.io/spring-framework/
Figure 7.2: Detailed architecture of the data repository component.
For reasons of clarity, the figure shows only the connections regarding the
external interfaces update, get, retrieve and modify. The external interfaces
of the Scheduler and Event Manager (i.e., inform and assign) are con-
nected with every component capable of being observed (i.e., firing events).
In addition, the retrieve interface is used by all components, which need to
retrieve model information (i.e., the Analyses Manager, Aggregator, and
Pre-Processor). Each of these components is explained in the following:
– Pre-Processor: The pre-processor component is utilized whenever
data is loaded into the system. It is capable of accessing any available
data, so that complex integration processes can be realized. In addi-
tion, default cleansing steps, as requested by DI-02 and DI-03, are ap-
plied (cf. section 7.2.1). Finally, the mapping functions, as defined by
the model, are used to create a processed time interval data record.
The implementation is outlined in section 7.2.1.
– Aggregator: The aggregator component is responsible for providing ag-
gregation techniques (as mentioned and argued in section 5.3.3 the
supported techniques are STA and TAT). The component has to evalu-
ate the type of aggregation (i.e., the type of the aggregation depends
on the fact function of the descriptor), retrieve the needed data using
the index, and calculate the result. The algorithms used to determine
the result of an aggregation are presented in section 7.3.4.
– Analyses Manager: The main responsibility of the component is the
retrieval of results created through data analysis techniques. The man-
ager registers and instantiates the algorithms implemented against an
API provided by the system and defined by a model or the system’s
configuration. Whenever an analytical result is requested, the manager
checks the availability of the specified algorithm and triggers the exe-
cution. An analysis can be performed asynchronously and even on dif-
ferent machines. The implementation of the manager is not presented
any further, because it is mainly based on available core Java libraries,
i.e., collections, reflection, thread executor pools, and JMS.
– TIDADISTANCE Calculator: The component represents a concrete imple-
mentation of the distance introduced in chapter 6. The component is
developed against the analysis API of the system and the reference
implementation of it. The implementation is presented in detail in sec-
tion 7.3.5.
– Index Structure: The core of the data repository is the index structure.
The component ensures fast data retrieval. The different parts of the
implementation are presented in section 7.3.2.
7.1.2 Cache & Storage
The Cache & Storage component is responsible for storing different entities
(e.g., a bitmap or a fact descriptor; cf. section 7.3.2) of the information sys-
tem. Figure 7.3 depicts the different subcomponents of the component, i.e.,
Cache, Storage Layer, and Usage Statistic Manager.
Figure 7.3: Illustration of the subcomponents of the main component Cache & Storage.
In the following the different components and their responsibilities are
introduced:
– The Storage Layer to be used differs based on the usage (i.e., type of
operations performed) and the type of data (e.g., complex objects,
plain old Java objects (POJOs)). It is generally not possible to select a
"best" storage. Thus, the system provides an API to implement any
storage, e.g., SQL databases, NoSQL databases, or other persistency
layers.
– The Cache is used to increase the retrieval performance from the stor-
age by caching the retrieved entities in memory. In section 3.3.3 several
caching algorithms are listed. The "best" algorithm to be used depends
on several factors, e.g., the amount of entities, the size of the available
memory, or the storage type. Thus, the component has to be flexible
regarding the used cache implementation and algorithm. Several open-
source caching libraries and frameworks are widely used, e.g.,
ehCache50 or OSCache51. The component’s data structure, API, and
used design patterns are presented in section 7.3.3.
50 http://ehcache.org 51 https://java.net/projects/oscache
– The Usage Statistic Manager is an optional component. It may be nec-
essary to provide the cache with a usage statistic so that the algorithm
can decide which entities to remove from memory. In general, the
maintenance of this statistic decreases the performance of the system.
The performance of the different cache algorithms is briefly discussed
in section 8.2.2.
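As a sketch of one commonly used caching algorithm (LRU, one of the candidates among those listed in section 3.3.3), the following hypothetical Java snippet builds a least-recently-used cache on top of LinkedHashMap's access-order mode. It is illustrative only and not the system's actual cache class, whose data structure and API are presented in section 7.3.3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache sketch: LinkedHashMap in access-order mode keeps the least
// recently used entry as the eldest, which is evicted when the capacity
// is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity; // maximum number of entities kept in memory

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true: order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entity
    }
}
```

The "best" algorithm still depends on the factors named above (number of entities, available memory, storage type), which is why the component keeps the implementation exchangeable.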
7.2 Configuration
The configuration of the information system can be separated into two differ-
ent parts. The first part deals with the configuration of the used compo-
nents. As mentioned in the previous section, it is important to ensure that
specific components of the system can be extended (e.g., add another
analysis algorithm), replaced (e.g., use a different authentication and au-
thorization framework), or modified regarding the behavior (e.g., change
the caching algorithm used by a cache). The second part addresses the
configuration of a model. A model is formally specified by a 4-tuple
as defined in chapter 4 and loaded using a load statement (cf. section 5.2)
containing the location of a model-definition-file. Such a model-definition-
file must cover the formal definition and, in addition, override system spe-
cific settings, e.g., it may be necessary to utilize a specific indexing algo-
rithm for a model.
In this chapter both parts of the configuration are introduced and exam-
ples are given using excerpts of configurations. In section 7.2.1 the model
configuration is shown and in section 7.2.2 the system configuration is pre-
sented. The order in which the different configurations are presented is
inverse to the inheritance hierarchy (i.e., the model configuration overrides
the system configuration), because it is easier to motivate several
configurable settings from a model perspective first.
7.2.1 Model Configuration
A model is defined using XML52. The root element of each model definition
is the model tag as shown in Listing 7.1. In addition, an identifier for the
model has to be specified using the id attribute. The identifier is used to
refer to the model, e.g., when requesting data using a select statement.
Optionally, a readable name for the model can be provided.
Listing 7.1: The skeleton of a model-configuration-file of the information system.
<?xml version="1.0" encoding="UTF‐8" standalone="no"?>
<model id="myFirstModelId" name="My first Model"
xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
<!‐‐ Model Definition ‐‐>
</model>
The definition of a model is based on a time interval database, de-
scriptors, a time-axis, and dimensions. Within a model-configuration-file all
these items may be specified. In addition, several other components can
be configured, e.g., the Pre-Processor, the Cache & Storage, and the Index
Structure. In the following subsections, the configuration settings dealing
with different aspects of a model are presented. Afterwards, the additional
configurable settings regarding components are explained.
Defining a Time Interval Database
The definition of a TIDAMODEL includes the definition of the source, i.e., the
database from which raw data is retrieved (cf. section 4.3). In general, it is
important that the system supports several possibilities to load data into
the system. In order to provide the user with the greatest possible flexibility
and ensure usability for an inexperienced user, a source for the time inter-
val database, a so-called data retriever, can be utilized. Asking the users
about the commonly used sources revealed that time interval data is typically
stored in operational databases or CSV files. In addition, users mentioned
52 A complete model-configuration-file can be found in the appendix.
that for training purposes the definition of records within a model would be
desirable. Therefore, by default, the system provides a FixedStructureDa-
taRetriever, a DbDataRetriever, and a CsvDataRetriever. Furthermore, it is
possible to extend the system and provide additional data retrievers.
The time interval database of a model can thereby be defined in three
ways, i.e.:
– the database can be defined to be initially empty, and filled, e.g., by
insert statements, or
– the database can be filled by loading data from a data retriever (i.e.,
configuring a default data retriever or an extended implementation), or
– the database is defined to be static, i.e., the data is defined within the
model (internally the system utilizes the mentioned FixedStructureDa-
taRetriever).
From a configuration perspective, an empty database is configured by
changing nothing. The default configuration assumes that the data will be
loaded via the provided HTTP or JDBC interface, i.e., using insert state-
ments. If a data retriever should be used to load data from an external
source, the retriever has to be defined within the configuration. Listing 7.2
shows an excerpt, defining a data retriever with the identifier myDb using
the DbDataRetriever implementation (cf. DC-01).
Listing 7.2: Configuration of a data retriever within a model.
<dataretrievers>
<dataretriever id="myDb"
implementation="net.meisen.[...].dataretriever.DbDataRetriever">
<db:connection type="jdbc"
url="jdbc:hsqldb:hsql://localhost:300/db"
driver="org.hsqldb.jdbcDriver" username="SA" password="" />
</dataretriever>
</dataretrievers>
To configure the system to load data from the data retriever, it is necessary
to specify the query to be used, as well as the structure of the data records.
The former is specified within the data tag, which is positioned last in the
root. The latter is specified using the structure tag, associating the different
fields of the incoming data to descriptors or temporal information. Listing
7.3 shows an excerpt of the configuration defining a structure and a data
segment. The configuration defines data to be retrieved using the specified
query. The retrieved data is mapped according to the provided structure,
i.e., the field NAME contains the values to be used for the descriptor
PERSON, whereas the field START and END define the start and end val-
ues of the interval.
Listing 7.3: Configuration of a dataset and the structure of the set.
<structure>
<meta name="NAME" descriptor="PERSON" />
<interval name="START" type="start" />
<interval name="END" type="end" />
</structure>
<data dataretriever="myDb">
<db:query>SELECT START, END, NAME FROM TABLE</db:query>
</data>
The data retriever sample exemplifies how the system realizes extensibility.
The required information, like the structure of the data, the data retriever,
and the data to be retrieved, is fixed within the configuration. The kind of
data retriever, as well as the method used to retrieve the data, can be
extended. An extension for the system typically consists of a concrete
implementation and cut-points for the configuration, i.e., an XSLT and an
XSD file named after the con-
crete implementation. In the case of the DbDataRetriever the extension
consists of the concrete class extending the abstract class BaseDataRe-
triever, several additional classes (e.g., exceptions or default values), an
XSD specifying the schema of the additional information, and an XSLT
used to define the beans needed when loading the configuration (cf. Figure
7.4).
Figure 7.4: The complete package of the DbDataRetriever extension used to load data from a database.
Listing 7.4 shows the DbDataRetriever.xslt defined to transform the
db:query tag from Listing 7.3 into a DbQueryConfig bean. The created
bean is passed to the instance of the specified data retriever (defined by
the attribute dataretriever of the data tag).
Listing 7.4: XSLT template used to create the bean used by the DbDataRetriever to define the query.
<xsl:template match="db:query">
<bean class="net.meisen.[...].dataretriever.DbQueryConfig">
<property name="query">
<value><xsl:value‐of select="normalize‐space(.)" /></value>
</property>
<property name="language" value="SQL"/>
</bean>
</xsl:template>
The formal definition of a time interval database expects, besides the
dataset, the definition of the domains of the mapping functions, i.e., the
domain of the time mapping function and the domains of the descriptor
mapping functions. These definitions are needed in the case of a formal
definition. Nevertheless, in the case of the configuration of a model, the
domain of an entity is not of importance, as long as the mapping function
(i.e., the implementation) can be applied. More precisely, assuming the
definition of a time interval database to retrieve data from, the domain of
a specific descriptive value retrieved from the database is irrelevant, as
long as the value can be mapped to a valid descriptor value (e.g., a string
"5" can be mapped to the integer 5). This has to be ensured by the
implementation of the concrete descriptor and the mapping function (cf.
next subsection), and the system does not need the data type of the
descriptive value to be specified in the configuration.
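The mapping requirement described above (e.g., a string "5" mapped to the integer 5) can be sketched as follows. The class and method names are hypothetical and do not reproduce the system's descriptor API.

```java
// Sketch of a descriptor mapping function: the raw descriptive value's
// type is irrelevant as long as it can be mapped to a valid descriptor
// value of the descriptor's type.
public class IntegerDescriptorMapper {

    // Maps a raw value to an Integer descriptor value; returns null if no
    // valid mapping exists.
    public static Integer map(Object raw) {
        if (raw == null) return null;
        if (raw instanceof Integer) return (Integer) raw;
        try {
            return Integer.valueOf(raw.toString().trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```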
Defining Descriptors
The definition of a descriptor is based on a mapping function and a fact
function (cf. section 4.2). Within the configuration a descriptor is a child
of the descriptors tag, which itself is a child of the meta tag.
Listing 7.5 shows a definition of three descriptors. The BABY descriptor
contains string values, whereby the descriptor values are specified by a
CSV file loaded using the CsvDataRetriever. The integer descriptor named
DURATION does not initially contain any descriptor value. In addition, an
extended descriptor called TOYS is added. The TOYS descriptor allows null
values (the attribute null is set to true), contrary to the other descriptors
(the default value of the null attribute is false)53. Furthermore, the BABY
descriptor overrides the index used internally for the descriptor values.
Overriding is typically not necessary, because the internal implementation
tries to find the best fitting index for the type of the descriptor values. Nev-
ertheless, the performance may be increased if a type or domain-specific
implementation is provided, e.g., for spatial data an R-Tree (Guttman 1984)
may be more appropriate.
53 A null value is often used as result of the mapping function, if the descriptive value cannot
be mapped to a valid descriptor value. In addition, it may be possible to add records, which have no value for a specific descriptor (i.e., a null value is applied).
Listing 7.5: An excerpt of a configuration defining three descriptors and de-scriptor values for one of the descriptors.
<meta>
<descriptors>
<string id="BABY" idfactory="net.meisen.[...].UuIdsFactory" />
<integer id="DURATION" />
<ext:list id="TOYS" name="toy list" null="true" />
</descriptors>
<entries>
<entry descriptor="BABY" dataretriever="csvBabyNames">
<csv:selector column="firstname" />
</entry>
</entries>
</meta>
By default, the following commonly used types of descriptors are avail-
able: integer, double, long, and string. These descriptors provide a pre-de-
fined mapping function (i.e., using the identity function), a set of descriptor
values (which is configurable and extendable), and a pre-defined fact func-
tion (i.e., a constant function returning 1 for string descriptors and the iden-
tity function for numeric descriptors). In addition, the system provides strat-
egies as requested by the feature DI-02. These strategies define how to
handle unknown descriptor values occurring in data pushed into the sys-
tem. The strategy to apply is defined within the data tag specifying the at-
tribute metahandling with one of the values: handleAsNull, createDe-
scriptorValue, or fail. The supported strategies are:
– handleAsNull: the system will try to associate the data record to a null
value for the unknown descriptor (i.e., the descriptor must support null
values),
– createDescriptorValue: the system will create a new descriptor value
for the descriptor and refer to the newly created descriptor value, or
– fail: the system will throw an exception and the data record will not be
added.
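The three strategies above can be sketched in Java as follows; the enum and method names are illustrative only and do not reflect the system's internal classes.

```java
import java.util.Set;

// Sketch of the metahandling strategies for unknown descriptor values;
// the enum constants mirror the configuration values handleAsNull,
// createDescriptorValue, and fail.
public class MetaHandling {
    public enum Strategy { HANDLE_AS_NULL, CREATE_DESCRIPTOR_VALUE, FAIL }

    // Returns the descriptor value to associate with a record; may extend
    // the set of known values or throw, depending on the strategy.
    public static String resolve(Set<String> known, String value, Strategy s) {
        if (known.contains(value)) return value;
        switch (s) {
            case HANDLE_AS_NULL:
                return null; // the descriptor must support null values
            case CREATE_DESCRIPTOR_VALUE:
                known.add(value); // create the new descriptor value
                return value;
            default: // FAIL: the data record will not be added
                throw new IllegalArgumentException(
                        "Unknown descriptor value: " + value);
        }
    }
}
```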
If the mapping or fact function has to be modified, a new descriptor im-
plementation must be added, providing the mapping and fact function, e.g.,
as done with the list descriptor used in Listing 7.5. The extension of a de-
scriptor is similar to the extension of a data retriever, i.e., add a concrete
implementation, an XSD file for validation, and an XSLT file for transfor-
mation. In the case of descriptors, the system provides several base imple-
mentations, as well as useful base validations, and transformations.
Defining a Time Axis
The TIDAMODEL defines the time axis by a mapping function, a set of
chronons, and a granularity (cf. section 4.1). The definition of the
time axis is done using the timeline tag, which is a child of the time tag.
Listing 7.6 shows an example of a definition within the configuration. The
example shows that the definition of the granularity is done explicitly using
the granularity attribute. By default, the system provides the commonly used
and additional granularities, e.g., month, week, day, hour, minute, second, or
even attosecond. In addition, the configuration defines the start and the
end, which together with the granularity, defines the set of chronons.
Listing 7.6: An example of a configuration of the time axis.
<time>
<timeline start="01.01.2000 00:00:00" end="31.12.2020 23:59:00"
granularity="MINUTE" />
</time>
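Assuming that, at MINUTE granularity, the chronons simply enumerate the minutes from start to end inclusively, the size of the chronon set defined by a timeline can be computed as follows. This is an illustrative sketch with hypothetical names, not the system's implementation.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper computing the number of chronons of a timeline at
// MINUTE granularity: the minutes between start and end, both inclusive.
public class Timeline {
    public static long chrononCount(LocalDateTime start, LocalDateTime end) {
        return ChronoUnit.MINUTES.between(start, end) + 1; // inclusive bounds
    }
}
```

For Listing 7.6, the chronon set would accordingly contain one entry per minute between 01.01.2000 00:00:00 and 31.12.2020 23:59:00.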
Regarding the mapping function, the system provides three possibilities
to define one, i.e.:
– select a strategy on how to handle specific values,
– use a default mapping function, or
– provide a new mapping function.
The following strategies, as requested by feature DI-03, are realized to han-
dle missing endpoints of the defined interval:
– boundariesWhenNull: whenever a null value is found for an endpoint of
an interval, the system uses the time axis boundaries, i.e., the start (if
the start value of the interval is null) or the end value (if the end value of
the interval is null),
– useOther: if one of the endpoints is null, the other not null endpoint will
be used as value, i.e., a time point is used, or
– fail: the system will throw an exception and the data record will not be
added.
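These strategies can be sketched as follows, modeling chronons as plain long values; all names are illustrative and not taken from the system.

```java
// Sketch of the strategies for missing interval endpoints
// (boundariesWhenNull, useOther, fail).
public class EndpointHandling {
    public enum Strategy { BOUNDARIES_WHEN_NULL, USE_OTHER, FAIL }

    // Returns {start, end} after applying the strategy; axisStart and
    // axisEnd are the boundaries of the time axis.
    public static long[] resolve(Long start, Long end,
                                 long axisStart, long axisEnd, Strategy s) {
        if (start != null && end != null)
            return new long[] { start, end };
        switch (s) {
            case BOUNDARIES_WHEN_NULL:
                return new long[] { start == null ? axisStart : start,
                                    end == null ? axisEnd : end };
            case USE_OTHER:
                Long other = start == null ? end : start;
                if (other == null)
                    throw new IllegalArgumentException("both endpoints null");
                return new long[] { other, other }; // a time point
            default: // FAIL: the data record will not be added
                throw new IllegalArgumentException("missing endpoint");
        }
    }
}
```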
If the existing strategies are not sufficient, it is possible to implement a new
mapping function. In general, such an implementation maps an incoming
value to an integer value (i.e., a long data type). The extensions of the map-
ping functions works similar to all extensions in the system, i.e., provide a
concrete implementation of the abstract class BaseMapper or Base-
MapperFactory, as well as the validation and transformation files for the
configuration.
Defining Dimensions
The TIDAMODEL introduces and defines two different kinds of dimensions,
i.e., dimensions defined for a descriptor and a time dimension. In general,
it is not necessary to define any dimensions. In that case, roll-up or drill-
down operations are not available and data can only be retrieved at the lowest
granularity and aggregated on the defined descriptor values. Furthermore,
following the definition (cf. section 4.4), each dimension, independent of its
type (i.e., time or descriptor dimension), can have several hierarchies,
whereby each hierarchy has several levels. Finally, each level contains sev-
eral members. In the following, the configuration of the time and a de-
scriptor dimension is introduced.
The definition of a time dimension states several constraints which
have to be met by the dimension’s configuration, i.e.:
– the lowest level of a time hierarchy contains all chronons,
– each level of a time hierarchy forms a valid partition of the set of all
chronons, and
– a time hierarchy may be defined for a specific time zone.
The configuration of a time dimension is done within the timedimension
tag, which is a child of the dimensions tag. The configuration allows the
definition of at most one time dimension, which is configured by specifying
at least one hierarchy. A hierarchy is thereby defined by the different levels,
which are defined as partitions of the chronons of the time axis. The order
of the levels within the configuration defines the roll-up and drill-down order
(from top as top-level, to bottom as lowest-level). Listing 7.7 exemplifies a
hierarchy for the CET time zone. The hierarchy is defined from top to bot-
tom as: all (default) → Year → Month → Day → Half Day → Hour → 5-
Minutes → Minute.
Listing 7.7: A sample definition of a time hierarchy within the time dimension.
<timedimension id="TIME">
<hierarchy id="RASTER" all="Everytime" timezone="CET">
<level id="YEAR" template="YEARS" />
<level id="MONTH" template="MONTHS" />
<level id="DAY" template="DAYS" />
<level id="HALFDAY" template="RASTER_DAY_MINUTE_720" />
<level id="HOUR" template="RASTER_DAY_MINUTE_60" />
<level id="HALFHOUR" template="RASTER_DAY_MINUTE_30" />
<level id="MINUTE5" template="RASTER_DAY_MINUTE_5" />
<level id="MINUTE" template="RASTER_DAY_MINUTE_1" />
</hierarchy>
</timedimension>
The template design pattern provides an easy way to add new levels
to the system. A template has to define a valid partition of the time axis. In
addition, it has to fit into the current order, e.g., the DAYS template as-
sumes a predecessor template, which has a smaller granularity than days
(e.g., HALFDAY). Also, it expects the successor to have a granularity larger
than one day (e.g., MONTH). The raster template is a special template pro-
vided by the system. It is used to split a higher granularity into a partition
based on a smaller granularity, i.e., the RASTER_DAY_MINUTE_30 tem-
plate partitions each day into groups of 30 minute units. New templates for
a level can be easily added to the system by implementing the ITimeLev-
elTemplate interface54. Figure 7.5 illustrates the first three levels (from bot-
tom to top) of the defined hierarchy. The example shows the handling of the
time zone and DST.
Figure 7.5: Illustration of the first three levels (from bottom to top) of the hierar-chy defined in Listing 7.7.
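The raster logic described above can be sketched with simple integer arithmetic: RASTER_DAY_MINUTE_30 assigns a minute-of-day to group minuteOfDay / 30, yielding 48 groups per day. The following is a hypothetical helper, not an actual ITimeLevelTemplate implementation.

```java
// Hypothetical sketch of a raster computation: a day (1440 minutes) is
// partitioned into groups of rasterMinutes; the group index of a
// minute-of-day is obtained by integer division.
public class RasterTemplate {
    public static int groupOf(int minuteOfDay, int rasterMinutes) {
        if (1440 % rasterMinutes != 0)
            throw new IllegalArgumentException("raster must partition the day");
        return minuteOfDay / rasterMinutes;
    }
}
```

The divisibility check reflects the constraint that each level must form a valid partition of the set of chronons.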
The configuration of a descriptor’s hierarchy is also done within the di-
mensions tag. In contrast to the time dimension, the definition of a de-
scriptor’s hierarchy can be non-onto, non-covering, or non-strict. Because
of these differences, the configuration of a descriptor’s hierarchy differs
from the one of a time hierarchy. Furthermore, the configuration allows the
definition of several dimensions for descriptors, but at most one for each
descriptor. Listing 7.8 shows the configuration of a dimension for the de-
scriptor WORKAREA. The configuration contains one hierarchy with three
levels. The descriptor values are bound by regular expressions to the mem-
bers. If no regular expression is specified, the system assumes a member
of the hierarchy, i.e., the member is an element of V' (cf. section 4.4).
54 In addition, the system provides several helpful base implementations, e.g., BaseTimeLev-
elTemplate used to implement all templates provided.
Listing 7.8: A sample definition of a hierarchy of the descriptor WORKAREA.
<dimension id="DIM" descriptor="WORKAREA">
<hierarchy id="ROOMS" all="all rooms">
<level id="TYPE">
<member id="SUITE" rollUpTo="*" />
<member id="COMFORT" rollUpTo="*" />
<member id="STANDARD" rollUpTo="*" />
</level>
<level id="GUESTS">
<member id="BUSINESS" rollUpTo="*" />
<member id="PRIVATE" rollUpTo="*" />
</level>
<level id="FLOOR">
<member id="FLOOR1" reg="LVL1_.*" rollUpTo="PRIVATE, STANDARD" />
<member id="FLOOR2" reg="LVL2_.*" rollUpTo="BUSINESS, COMFORT" />
<member id="FLOOR3" reg="LVL3_.*" rollUpTo="PRIVATE, SUITE" />
</level>
</hierarchy>
</dimension>
The defined hierarchy is non-strict, because a member of the FLOOR level
rolls up to two different members, e.g., FLOOR1 rolls up to PRIVATE and
STANDARD. Figure 7.6 depicts the configured hierarchy of the dimensions
of the WORKAREA descriptor.
Figure 7.6: Illustration of the hierarchy defined in Listing 7.8.
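The non-strict roll-up relation of Listing 7.8 can be represented as a simple member-to-parents map; the data structures below are illustrative and not the system's model classes.

```java
import java.util.Map;
import java.util.Set;

// Roll-up relation of the FLOOR level from Listing 7.8: a hierarchy is
// non-strict if at least one member rolls up to more than one parent.
public class Hierarchy {
    static final Map<String, Set<String>> ROLL_UP = Map.of(
            "FLOOR1", Set.of("PRIVATE", "STANDARD"),
            "FLOOR2", Set.of("BUSINESS", "COMFORT"),
            "FLOOR3", Set.of("PRIVATE", "SUITE"));

    public static boolean isNonStrict() {
        return ROLL_UP.values().stream().anyMatch(p -> p.size() > 1);
    }
}
```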
Configuring the Pre-Processor, the Scheduler & Event Manager,
the Cache & Storage, and the Index Structure
Besides the configuration of the model, a model-definition-file can be used
to configure the behavior of the components of the system when handling
model-dependent data. All component-related configuration is done within
the config tag by adding the corresponding component’s tag as a child. In the
following the different tags and possibilities to configure a component are
introduced, starting with the pre-processor (cf. DI-04).
Configuring a pre-processor to be used for the transformation of the data
pushed into the model is done within the preprocessor tag. The pre-pro-
cessor can be defined as any class implementing the IPreProcessor inter-
face, using the implementation attribute. In addition, cut-points can be used
to extend the configuration and enable pre-processor related settings. By
default, the system provides a ScriptPreProcessor, useful to specify a
script55 transforming the incoming data, e.g., using JavaScript, Groovy, or
Python. Listing 7.9 shows an excerpt of a configuration, defining a pre-pro-
cessor using JavaScript. The script is used to trim the descriptive value of
the myString descriptor. All other descriptive values and time points are
kept untouched by the script.
Listing 7.9: A pre-processor configuration using the ScriptPreProcessor.
<preprocessor implementation="net.meisen.[...].ScriptPreProcessor">
<spp:script language="javascript">
var result = new net.meisen.[...].PreProcessedDataRecord(raw);
result.setValue('myString', raw.getValue('myString').trim());
</spp:script>
</preprocessor>
The scheduler and event manager can be used to define schedules fir-
ing specific queries, forwarding results, triggering events, and publishing
information to subscribed instances. The configuration supports the definition
55 In general, any scripting language which is supported by the Java Scripting API can be
used.
of different schedules, which may also be used to push events to the event
manager. In general, the system publishes several core events (e.g., when
a query is fired), which can be subscribed to through the JSON interface
or by a schedule. Listing 7.10 illustrates the configuration of three sample
schedules.
Listing 7.10: A configuration specifying three sample schedules.
<schedules>
<schedule cron="10 0 * * *"
implementation="net.meisen.[...].MyJob"/>
<schedule cron="*/15 4-16 * * 6,7"
implementation="net.meisen.[...].QueryJob">
<qj:query>SELECT COUNT(RECORDS) FROM myModel</qj:query>
<qj:handler>net.meisen.[...].QueryHandler</qj:handler>
</schedule>
<schedule event="core:query"
implementation="net.meisen.[...].EventJob" />
</schedules>
The first two schedules are based on a cron-expression56, whereas the third
one is assigned to a core event. The first schedule executes the mentioned
implementation every day ten minutes after midnight. The second schedule
fires the specified query every 15 minutes between 4am and 4pm on Sat-
urdays and Sundays. The result of the query is sent to the optionally spec-
ified handler, which, e.g., could create a report and send it to the manage-
ment or validate the result and notify a user via a message. The last sched-
ule is assigned to a core event, i.e., is triggered every time a query is fired.
The executed job retrieves event-specific information. In general, a job can
fire additional events, which then are handled by the event manager. As
already mentioned, the implementation is based on the Quartz Scheduler
and standard Java components.
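The semantics of the cron-expressions used above can be illustrated with a minimal matcher. This is a sketch for fixed values and the "*" wildcard of the first two fields (minute, hour) only; it is not the Quartz Scheduler's actual parser.

```java
// Minimal sketch illustrating the semantics of the cron-expressions used
// in Listing 7.10. Only fixed values and the "*" wildcard of the first two
// fields (minute, hour) are evaluated; this is NOT Quartz's parser.
public class CronSketch {

    public static boolean matches(String cron, int minute, int hour) {
        // fields: minute hour day-of-month month day-of-week
        String[] fields = cron.split("\\s+");
        return fieldMatches(fields[0], minute) && fieldMatches(fields[1], hour);
    }

    private static boolean fieldMatches(String field, int value) {
        return field.equals("*") || field.equals(String.valueOf(value));
    }
}
```

For instance, the expression "10 0 * * *" of the first schedule matches only times with minute 10 and hour 0, i.e., ten minutes after midnight.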
56 http://pubs.opengroup.org/onlinepubs/007904975/utilities/crontab.html
7.2 Configuration 143
The configuration of the Cache & Storage component allows the definition of the caches and storage implementations to be used for specific en-
tities of the system. These entities are raw records, record identifiers,
metadata, bitmaps, and sets of facts. To understand the configuration of
the caches and the storage, it is not important to understand these different
entities in detail. Nevertheless, a more detailed explanation of the entities
is given in the context of the implementation of indexes, i.e., in section
7.3.2. Listing 7.11 shows an example of a configuration, specifying the
cache and implicitly the storage to be used for the different entities. The
configuration defines a file-based storage for the record identifiers (identi-
fier), metadata (metadata), bitmaps (bitmap), and sets of facts (factsets).
For the storage of the raw records (records) a DBMS is utilized, using a
Hibernate57-based implementation. The extension and provision of new im-
plementations is done by implementing the provided interfaces (i.e.,
ICache) and specifying cut-points for the configuration as described before.
In addition, the example also shows the configuration of the caching algo-
rithm to be applied. In the example, the RandomCachingStrategy (cf. sec-
tion 3.3.3; cache algorithm RR) is explicitly used for the bitmap cache.
Whether the configuration of a caching algorithm is supported depends on the implementation used, e.g., some implementations may not support the modification of the caching algorithm. Other settings, like the cleaning factor or the maximal amount of cached entities, may be configurable. In the example, the default settings of the cache responsible for the sets of facts are overridden.
Listing 7.11: Example of a configuration of caches for all entities of the system.
<caches>
<identifier implementation="net.meisen.[...].FileIdentifierCache" />
<metadata implementation="net.meisen.[...].FileMetaDataCache" />
<bitmap implementation="net.meisen.[...].FileBitmapCache">
<bfile:config strategy="net.meisen.[...].RandomCachingStrategy" />
</bitmap>
<factsets implementation="net.meisen.[...].FileFactDescriptorCache">
<ffile:config cleaningFactor="0.2" size="1000000" />
</factsets>
<records implementation="net.meisen.[...].HibernateDataRecordCache">
<hib:config driver="org.hsqldb.jdbcDriver"
url="jdbc:hsqldb:hsql://localhost:7000/db"
username="SA" password="" />
</records>
</caches>
57 http://hibernate.org
Another component, which can be modified by configuration, is the In-
dex Structure. The configuration allows specifying a factory that decides which
index to use for specific use-cases. The default implementation of the fac-
tory, i.e., IndexFactory, permits determining the used indexes for specific
data types. In addition, the used bitmap implementation can be specified,
e.g., to change the used compression scheme (cf. section 3.3.1). By de-
fault, the system provides several indexes based on different high perfor-
mance collections useful for primitive data types, i.e., Trove58, FastUtil59, or
Hppc60. Several benchmarks were performed to set up the implemented
IndexFactory, and to ensure an overall best performance (cf. section 8.2.1).
Nevertheless, context-specific criteria may lead to better choices, which
can be applied via configuration or by providing a custom factory (cf. Listing 7.12).
Listing 7.12: An example configuration of the default IndexFactory, specifying the implementations used to index specific data types.
<indexes implementation="net.meisen.[...].IndexFactory">
<idx:config bitmap="net.meisen.[...].RoaringBitmap"
byte="net.meisen.[...].TroveByteIndexedCollection"
short="net.meisen.[...].TroveShortIndexedCollection"
int="net.meisen.[...].TroveIntIndexedCollection"
long="net.meisen.[...].TroveLongIndexedCollection"
default="java.util.HashMap" />
</indexes>
58 http://trove.starlight-systems.com
59 http://fastutil.di.unimi.it
60 http://labs.carrotsearch.com/hppc.html
7.2.2 System Configuration
In general, the configuration of a system should be as simple as possible
to increase the ease-of-use and help inexperienced users to get started.
Thus, the simplest configuration is one that is not needed at all61. Instead,
the system uses default settings, which can be overridden by providing a
configuration-file. A configuration-file can be used to define the default set-
tings for several components, replace an implementation or extend specific
features. The configuration-file is, like the model-definition-file, XML based
and has the skeleton shown in Listing 7.13.
Listing 7.13: The skeleton of a configuration-file of the information system.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
<!-- Configuration -->
</config>
The system’s configuration allows defining the implementation or the
settings of several components, i.e., the Authentication & Authorization, the
Service Handler, the Query Parser & Processor, the Index Structure, or the
Cache & Storage. In addition, the available templates for the time dimen-
sion, aggregation operators, analysis techniques, or granularities of time
can be defined. The structure of this section is as follows: First, the configuration possibilities of the different components are introduced. Special focus
is on the components that cannot be defined within a model-configuration-
file, because the configuration of other components is similar to the one
presented in the previous section. Afterwards, the configuration capabilities
regarding the templates, aggregation operators, analysis techniques, and
granularities are introduced.
61 Nevertheless, a sample of a complete configuration-file is presented in the appendix.
System Configuration of Components: Authentication & Authorization,
Service Handler, and Query Parser & Processor
The configuration of the Index Structure and the Cache & Storage compo-
nent is similar to the one presented in section 7.2.1 and therefore not fur-
ther discussed. A sample configuration of the Authentication & Authoriza-
tion component is shown in Listing 7.14. The sample illustrates the usage
of the AllAccessAuthManager, which is mainly used for testing purposes. The
implementation accepts any username and password combination and
grants all permissions available to the logged in user. The second imple-
mentation deployed with the default system is the ShiroAuthManager,
which is based on the already mentioned Apache Shiro framework. The
implementation is meant to be used in productive systems and allows the
creation and management of users, roles, and permissions. In general, the
component can be replaced via configuration and cut-points may be used
to extend the configuration capabilities.
Listing 7.14: A sample configuration of the Authentication & Authorization component.
<auth>
<manager implementation="net.meisen.[...].AllAccessAuthManager" />
</auth>
The settings of the Service Handler component, responsible for accepting and forwarding requests as well as delivering responses, can be
modified with regards to the ports, timeouts, and availability. Listing 7.15
shows an excerpt of a system configuration-file. Within the example, the
ports are specified for the three interfaces: http, tsql, and control. The con-
trol interface was not introduced so far. It can be enabled to shut down the
server remotely. In addition, the http interface offers the possibility of defin-
ing the document root directory, i.e., the directory to look for website files.
If the attribute is not specified, the system will just start the services to
retrieve data via http in JSON.
Listing 7.15: Example of the system configuration of the Service Handler component.
<server>
<http port="7000" timeout="30" enable="true" docroot="www" />
<tsql port="7001" timeout="1800000" enable="false" />
<control port="7002" enable="true" />
</server>
Last but not least, the configuration of the Query Parser & Processor is
shown. The configuration allows the user to replace the query language
with a custom, possibly domain-specific, language. The default implementation
supports the TIDAQL presented in chapter 5. The configuration is defined
as child of the factories tag using the queries tag. The implementation must
implement the IQueryFactory interface to be recognized by the system.
Listing 7.16 shows an excerpt defining the default QueryFactory to be used
by the system to parse and process incoming queries.
Listing 7.16: Example of the system configuration of the Query Parser & Processor component.
<factories>
<queries implementation="net.meisen.[...].QueryFactory" />
</factories>
Extending the Templates, Aggregation Operators, Analysis Techniques,
and Granularities
Instead of replacing complete implementations for specific components,
the system supports the capability to extend the functionality by configura-
tion. The extendable functionalities are:
– add new templates for the time dimension (cf. section 7.2.1),
– specify new aggregation operators useable within the query language
(cf. section 5.3.3),
– define new analysis techniques (cf. section 5.3.3), and
– allow additional granularities of the time axis (cf. section 4.1 and sec-
tion 7.2.1).
The integration of the different extensions differs regarding the configura-
tion. Nevertheless, all extensions have in common that an implementation
has to be provided implementing the corresponding interface, i.e.,
ITimeLevelTemplate, IAggregationFunction, IAnalysis, or ITimeGranularity.
Regarding the configuration, the different techniques are explained in the
following, starting with the extension of a template for the time dimension.
Templates can easily be added by adding the concrete implementation to
the configuration as shown in Listing 7.17.
Listing 7.17: Example of the system configuration to add an additional template.
<timetemplates>
<template implementation="net.meisen.[...].templates.WeekDays" />
</timetemplates>
Similarly, the extension of aggregation operators is defined (using the
aggregations tag instead of timetemplates tag and the function tag instead
of the template tag). Instead of providing a concrete implementation of a
template, a concrete implementation of an aggregation operator is provided. By default, the following operators are added: count, min, max, sum,
mean, median, and mode (cf. DA-01). In addition, temporal aggregation
operators are available, i.e., count started and count finished (cf. DA-02).
Depending on the form of aggregation (cf. section 2.1.2) the application of
the temporal operator may be possible or not. Thus, several extensions of
the IAggregationFunction interface are available to specify the utilization of an
operator (cf. section 7.3.4):
– ILowAggregationFunction (i.e., the operator must be applied to values
of the lowest granularity, e.g., SUM(DESC1)),
– IDimAggregationFunction (i.e., the operator aggregates results, e.g.,
SUM(MAX(DESC1, DESC2))), and
– IMathAggregationFunction (i.e., the operator is used to combine values
mathematically, e.g., SUM(5, 4, 7)).
Registering new analysis techniques to the system is also realized by
simply specifying the implementation (using analyses as parent tag and
analysis as child tag). The analyses manager collects the registered in-
stances and provides the implementation after resolving the name used,
e.g., within the query language. The configuration of the concrete analysis
may have additional configuration capabilities, which are defined using the
already presented cut-points technique.
Last but not least, the extension capabilities regarding the granularity of
the time axis are explained. The default implementation utilizes a time gran-
ularity factory, which applies several techniques to search for a granularity
definition on the class-path of the application. Thus, it is typically sufficient
to add the new implementation to the class-path and use the fully-qualified
name when referring to the granularity. It is also possible to place the con-
crete implementation in one of the pre-defined packages (e.g.,
net.meisen.dissertation.model.time.granularity). If none of these tech-
niques are sufficient, it is also possible to just replace the factory’s imple-
mentation and provide a custom factory instance.
7.3 Data Structures & Algorithms
This section deals with selected aspects of the realization, which were
challenging and are interesting regarding data structures and algorithms
used to create a performant, stable, and usable system. In section 7.3.1
selected features implemented to handle the configuration of models (i.e.,
validation and mapping) are introduced. In addition, the section presents
the internal handling of the time axis. Section 7.3.2 introduces the mainly
bitmap-based indexes used to process different query types. Several utili-
zations of the indexes are illustrated and discussed. The implementation of
the cache and storage interface is introduced in section 7.3.3. The presented implementation solves the handling of the garbage collection re-
garding cached items. In section 7.3.4, the algorithm to perform the ITA and
the TAT is introduced. The algorithm utilizes the different indexes to achieve
an excellent performance. The algorithm to calculate the distance and de-
termine the k-NN of an input query is introduced in section 7.3.5. The pre-
sented algorithm utilizes the provided indexes and introduces a pruning
technique to increase the performance.
7.3.1 Model Handling
A model is the heart of the information system. The handling of data pushed
into a concrete, model-specific structure is introduced and discussed in
sections 7.3.2 and 7.3.3. The utilization of the structures used to calculate ag-
gregations and distances is presented in section 7.3.4 and 7.3.5. However,
the internal representation of specific elements of the model is presented
in this section, i.e., the time axis and descriptors. In addition, this section
presents selected algorithms, i.e., processing of a raw data record, valida-
tion of descriptor’s dimensions, as well as mapping of descriptive values
and data time points.
TimeAxis Data Structure
The data associated with specific chronons is the most frequently requested
information within the system. As mentioned previously, internally a chro-
non represents, if the time axis is based on time, a time point in the UTC
time zone. Each chronon is thereby normalized, so that the start of the time
axis is represented by 0 and the end of the time axis is represented by the
amount of chronons - 1. If on the other hand the time axis is integer based,
i.e., the start and end values are specified by integers, no time zone is
applied. Figure 7.7 illustrates three configurations and the internal, normal-
ized representation. Assuming the definition of the time axis shown on top,
the value 2005-01-01 is mapped to the value 4. Regarding the time
2015-01-20 08:07:00 and the definition shown in the middle, the time is
normalized to 29,287. Using the definition of the time axis shown on the
bottom of Figure 7.7, the value 1981 is represented by 1931 (i.e.,
1981 - 50, because 50 is the defined start).
Figure 7.7: Three different time axis configurations and an illustration of the internal representation as array.
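The normalization illustrated above can be sketched as follows; the method names are illustrative, not the system's actual API.

```java
// Sketch of the chronon normalization described above: the start of the
// time axis maps to 0 and the end to the amount of chronons - 1. The
// method names are illustrative, not the system's actual API.
public class ChrononNormalizer {

    public static long normalize(long value, long axisStart) {
        return value - axisStart; // e.g., 2005 with start 2001 yields 4
    }

    public static long denormalize(long chronon, long axisStart) {
        return chronon + axisStart;
    }
}
```

Applied to the examples of Figure 7.7, the year 2005 with a start of 2001 normalizes to 4, and the value 1981 with a start of 50 normalizes to 1931.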
The data structure used to realize a time axis must be capable of han-
dling large amounts of chronons and performant when iterating over a time-
window or updating associated information. To use the best fitting structure,
an algorithm evaluates the defined time axis definition (cf. section 7.2.1) by
determining the amount of chronons to be handled. Based on the result of
the calculation and the available memory for a model (configuration de-
pendent), the structure chosen differs between
– a dynamic array (i.e., internally a list collection is used, which is ex-
tended if needed),
– a fixed array (i.e., a typical array), or
– an extended array (i.e., if the expected size exceeds the memory or the
maximal size of an array62, nested arrays are utilized).
Independent of the chosen type of array, the resulting structure is capable
of retrieving an element for a specific integer value (internally the primitive
data type long is used, which allows a maximum of 2⁶³ – 1 elements). Thus,
the retrieval of an element associated to a specific chronon is achieved,
independently of the chosen type, by simply calling the get(long) method.
62 Java can hold up to 2³¹ – 1 elements within one array, which needs a size of 8 GB main memory. Nowadays, this amount of memory is not a limit anymore.
Nevertheless, the runtime to retrieve a value from the internally used data
structure depends on the type of array and on whether the associated element is
cached or not. A (cached) value can be retrieved from a dynamic or an
extended array in O(1) and added in O(n). Regarding a fixed array, the per-
formance of retrieving and adding is O(1). Thus, the preferred type is the
fixed array, which is selected if enough memory is available and the amount
of chronons does not exceed the available size.
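The extended array mentioned above can be sketched with nested arrays addressed by a long index; the chunk size below is an assumption for illustration, the system's actual sizing depends on the available memory.

```java
// Sketch of the "extended array": nested arrays addressed by a long
// index, used when the chronon count exceeds a single array's limits.
// The chunk size is an assumption for illustration only.
public class ExtendedArray {

    private static final int CHUNK = 1 << 20; // assumed chunk size
    private final Object[][] chunks;

    public ExtendedArray(long size) {
        int amount = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new Object[amount][]; // inner arrays are created lazily
    }

    public Object get(long index) {
        Object[] chunk = chunks[(int) (index / CHUNK)];
        return chunk == null ? null : chunk[(int) (index % CHUNK)];
    }

    public void set(long index, Object value) {
        int outer = (int) (index / CHUNK);
        if (chunks[outer] == null) {
            chunks[outer] = new Object[CHUNK];
        }
        chunks[outer][(int) (index % CHUNK)] = value;
    }
}
```

The lazy creation of the inner arrays keeps the memory footprint proportional to the chunks actually used, which matches the cached retrieval behavior described above.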
Temporal Mapping Function
An important aspect is the handling of interval endpoints, which do not fit
to the time axis granularity or boundaries, e.g., assuming the top time axis
of Figure 7.7 and the value 2015-06-22. If a value does not fit neatly to a
specific granularity, the algorithm has to decide if the value has to be
mapped to the smaller or larger representative. Table 7.1 lists some results
of the mapping algorithm, introduced below, assuming the top time axis
definition of Figure 7.7. It should be stated that the presented results are
not showing the internal index values (i.e., the normalized value). Instead,
the actual year is shown (i.e., the de-normalized value).
Table 7.1: Results of the default temporal mapping algorithm, assuming the top time axis definition of Figure 7.7 (the visualization column of the original table is omitted).

# | Interval ([date, date]) | Result ([year, year])
1 | [2001-01-01, 2002-03-01] | [2001, 2003]
2 | [1981-01-20, 2081-01-20] | [2001, 2050]
3 | [2051-01-20, 2070-01-20] | discarded
4 | [2040-12-12, 2050-01-01] | [2040, 2050]
The mapping algorithm uses the following types of information to deter-
mine the mapped value:
– the normalized (or de-normalized) value, and
– the position of the value to be mapped regarding the interval (i.e., is
the value the start or the end endpoint of the interval).
If the value is the start value of an interval, it picks the smaller value, oth-
erwise the larger value is chosen (cf. Table 7.1, #1 and #4). Thus, looking
at the value 2015-06-22 and the top time axis of Figure 7.7, the mapping
algorithm would pick 2015 for the start value, and 2016 for the end value.
Another mismatch occurs, if the provided value exceeds the limits of the
time axis. In that case, the default mapping algorithm maps the value to the
boundary of the time axis, if and only if the other value of the interval does
not exceed the same boundary (cf. Table 7.1, #2). If both values exceed the
same boundary, the interval is discarded (cf. Table 7.1, #3). The last mis-
match that may occur addresses missing values. As already introduced in
section 7.2.1, three strategies are implemented, which can be picked by
configuration. By default, the algorithm applies the boundariesWhenNull
strategy for missing values.
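The endpoint mapping rules can be sketched as follows, representing dates as fractional years for simplicity. The boundaries correspond to the top time axis of Figure 7.7 (years 2001 to 2050); returning null models a discarded interval. This is an illustration of the rules, not the system's mapper.

```java
// Sketch of the default endpoint mapping rules, representing dates as
// fractional years for simplicity. The boundaries correspond to the top
// time axis of Figure 7.7; this is NOT the system's actual mapper.
public class TemporalMapping {

    private static final int MIN = 2001, MAX = 2050;

    public static int[] map(double start, double end) {
        // both endpoints exceed the same boundary: discard (Table 7.1, #3)
        if ((start < MIN && end < MIN) || (start > MAX && end > MAX)) {
            return null;
        }
        // a start value picks the smaller, an end value the larger year
        long s = (long) Math.floor(start);
        long e = (long) Math.ceil(end);
        // values exceeding the time axis are mapped to its boundary (#2)
        s = Math.max(s, MIN);
        e = Math.min(e, MAX);
        return new int[] { (int) s, (int) e };
    }
}
```

With these rules, the interval of Table 7.1, #1 maps its end 2002-03-01 up to 2003, while #3 (both endpoints beyond 2050) is discarded.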
Descriptor Data Structure
A descriptor is managed as a collection of descriptor values. The collection
is thereby optimized for the retrieval of descriptor values using an internally
used identifier, the value, or the unique string representation of the value
(cf. section 7.3.2 for an introduction of the used indexes). Whenever a new
descriptor value is added to the collection the following algorithm is applied:
the value is validated (i.e., according to the specifications defined, e.g., is
null allowed as value and is it unique), a unique identifier is generated using
the specified or default identifier factory (cf. section 7.2.1), and the value is
added to the indexes.
A descriptor value is also represented by a data structure63, which
63 The descriptor value is realized as a class, which is assumed to be a data structure as well
following Martin (2009, pp. 93–101).
provides the identifier, the value, the unique string representation, and the
fact function. The fact function is thereby optimized (i.e. the fact is only re-
trieved once, if the type is value- or record-invariant, cf. section 4.2 or 7.3.4)
to increase the performance.
Descriptive Mapping Function
The previous subsection described the data structure used to represent a
descriptor and its values. However, the creation of a new descriptor value
was not introduced. Whenever a descriptive value is pushed to the system,
the system picks up the descriptor the descriptive value belongs to, e.g.,
specified by the structure of the insert statement (cf. section 5.3.1). To de-
termine the descriptor values associated to the descriptive value, the de-
scriptor utilizes the defined mapping function (cf. section 4.2 and 7.2.1).
Figure 7.8 illustrates the handling of an insert statement and the utilization
of the descriptive mapping function to determine the involved descriptor
values.
Figure 7.8: Illustration of the algorithm used to map descriptive values, e.g., [flu, cold] to the descriptor values flu and cold.
Processing a Raw Data Record
Whenever a record is added to the system, the system validates the record
(i.e., by applying the different mapping functions and validation strategies)
and assigns a unique identifier to the record. Once assigned, the unique
identifier cannot be used again by any other record. Nevertheless, cleaning
procedures can be scheduled for any model to reset and reuse available
identifiers (e.g., if a record was deleted). The system is capable of creating
2⁶³ – 1 = 9,223,372,036,854,775,807 (i.e., in words more than nine quintil-
lion64) unique identifiers. However, it is worth mentioning, that the currently
available, different bitmap implementations only support the usage of int-
values as position, i.e., 2³¹ – 1 = 2,147,483,647 (in words more than two
billion). Because of the importance of bitmaps for the indexing (cf. section
7.3.2), the system is capable of handling 2 billion raw records with the valid
record index. The whole process of the assignment of a unique identifier is
thread-safe and thereby ensures that no identifier is used several times.
Figure 7.9 exemplifies the processing of a raw data record assuming the
specified time axis definition, the assignment of an identifier of 7 to the
descriptive value cleaning of the descriptor department, as well as the al-
location of a unique identifier of 5 to the record.
Figure 7.9: Example of a result of the processing of a raw data record.
Validating a Descriptor Dimension
The validation of a descriptor dimension is performed whenever a dimen-
sion is added to the system, e.g., by configuration (cf. section 7.2.1). The
64 Since Java 8 introduced support for unsigned int- and long-arithmetic, this number may be increased to 2⁶⁴ – 1 in the future, respectively 2³² – 1 for int-values.
algorithm checks every hierarchy of the dimension, by testing the criteria
specified in section 4.4, i.e.,
1. there is only one sink (a.k.a. root),
2. the sink is reachable from every node,
3. every source refers to a descriptive value, and
4. a partial order over a partition of all nodes is provided.
The validation of the first three criteria, i.e., 1 – 3, is performed by iteration
over the defined nodes. The algorithm starts by picking a node randomly. It
follows the paths to the sink and assigns the minimal and maximal distance
to the sink to each node. If an already assigned node is found, the algorithm
validates, if
– the node was assigned in the same iteration (if so an exception is
thrown, because a loop was found),
– the node is a sink (i.e., has no parents), in which case the algorithm stops, or
– the node cannot reach any sink (if so an exception is thrown, because
criterion (2) is not met).
Afterwards, the algorithm validates if exactly one sink was found (1) and if
every source refers to a descriptive value (3). In addition, the algo-
rithm checks if the partial order is provided, by checking the minimal and
maximal distances calculated (4).
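Criteria (1) and (2) can be sketched with a simple reachability check; the distance-based loop detection and criteria (3) and (4) are omitted for brevity, and the graph representation (a node mapped to its parents) is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of criteria (1) and (2): exactly one sink exists and the sink is
// reachable from every node. Edges point from a node to its parents
// (towards the sink); loop handling and criteria (3)-(4) are omitted.
public class HierarchyValidator {

    public static boolean validate(Map<String, List<String>> parents) {
        // a sink is a node without parents
        List<String> sinks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : parents.entrySet()) {
            if (e.getValue().isEmpty()) {
                sinks.add(e.getKey());
            }
        }
        if (sinks.size() != 1) {
            return false; // criterion (1) violated
        }
        String sink = sinks.get(0);
        for (String node : parents.keySet()) {
            if (!reaches(node, sink, parents, new HashSet<>())) {
                return false; // criterion (2) violated
            }
        }
        return true;
    }

    private static boolean reaches(String node, String sink,
            Map<String, List<String>> parents, Set<String> seen) {
        if (node.equals(sink)) {
            return true;
        }
        if (!seen.add(node)) {
            return false; // node already visited on this search
        }
        for (String parent : parents.get(node)) {
            if (reaches(parent, sink, parents, seen)) {
                return true;
            }
        }
        return false;
    }
}
```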
7.3.2 Indexes
The system utilizes several index structures to increase the performance
of filtered data, aggregation, and distance calculation. In this section, several indexes are introduced, which are held in main or secondary memory. The
decision regarding the type of memory and used index structure depends
on different aspects, i.e., the number of entities held within the index and
the type of data (e.g., descriptor values or data). Using secondary
memory typically implies the utilization of a cache to increase performance
(cf. 7.1.2 and 7.3.3). In the following, the index structure used for
descriptors, the bitmap-based index structure used to increase the perfor-
mance of data related tasks, and the indexing of raw data is introduced.
Indexing Descriptors
The collection of the different descriptor instances (i.e., the descriptors)
has to be searched by the unique identifier of a descriptor, which is typically
a string. The number of entities created within a model is, contrary to the
number of data records, expected to be small. Thus, a main memory index structure
is utilized. Several tests showed that a HashMap performs best in the case
of strings (cf. section 8.2.1) having on average a complexity of O(1) (cf.
Goodrich, Tamassia (2006, pp. 374–390)). Thus, the implementation of the
descriptors class is based on a hash map, to collect all the descriptor in-
stances and search for one using the unique identifier. Figure 7.10 depicts
the main memory index structure used by the implementation of the de-
scriptors.
Figure 7.10: Illustration of the index structure (HashMap) used by the descriptors index (cf. Goodrich, Tamassia (2006)).
In addition to the search for specific descriptors, it is also important to
be able to search for descriptor values. The different descriptor values are
managed and collected by a descriptor and to find a specific descriptor
value the following attributes are typically used:
– the internally used identifier (used internally by the indexes),
– the value, which might be an object, or a primitive value (used to detect
duplicates), or
– the unique string representation of the value (used when parsing que-
ries).
In general, a main memory index is created for all of these attributes using
the IndexFactory to select the best fitting index (cf. section 7.2.1 and 7.2.2).
In case of the indexes, utilized for the internal identifier and the value, high
performance collections are typically chosen. The index for the unique
string is, like the one for the descriptors and unless configured otherwise, a
HashMap.
Indexing Data for Filtering, Aggregation and Distance Calculation
When retrieving, aggregating, or calculating the distance between datasets, it is important that the selection of the dataset is performed fast. In
the field of data analysis, the dataset is typically filtered by several attrib-
utes and aggregated (Kimball, Ross 2002; Abdelouarit et al. 2013). In the
case of time (interval) data analysis, the dataset is additionally partitioned
over time prior to aggregation (Kline, Snodgrass 1995; Böhlen et al. 2008).
Figure 7.11 illustrates a typical processing of an analytical query. First, the
filter is applied to retrieve the subset of relevant data from the database.
The resulting subset is partitioned and the aggregation is applied for each
partition.
Figure 7.11: The different tasks (filtering, partitioning, and aggregating) to be performed to handle an analytical query.
It is a matter of common knowledge that bitmap indexes outperform typ-
ical tree-based index structures when the used filter addresses several at-
tributes (cf. section 3.3.1, Abdelouarit et al. (2013)). However, the usage of
bitmap indexes to apply different aggregation operators is, with the excep-
tion of count and some context specific operations (e.g., Kaser, Lemire
(2014)), not common. In this section, a bitmap-based index structure is pre-
sented which increases the performance of filtering, aggregation, and also
distance calculation (with regards to the introduced TIDADISTANCE, cf. chap-
ter 6).
The index structure consists of four indexes: valid record index, data
descriptor index, time axis index, and fact descriptor index. Each of the
indexes is motivated in detail in the following, starting with the valid record
index.
The valid record index is used to determine if a record is still valid, i.e.,
not deleted. It only consists of a bitmap (called the tombstone bitmap),
which contains a 1 at the position determined by the record’s unique iden-
tifier, if and only if the record is added correctly and not deleted. The index
is cached and stored, but typically resides in main memory, because of its
frequent usage.
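A sketch of the valid record index follows, using java.util.BitSet as a stand-in for the system's compressed bitmap implementation.

```java
import java.util.BitSet;

// Sketch of the valid record index; java.util.BitSet stands in for the
// compressed bitmap implementation actually used by the system.
public class ValidRecordIndex {

    private final BitSet tombstone = new BitSet();

    public void add(int recordId) {
        tombstone.set(recordId); // 1 iff the record is added and not deleted
    }

    public void delete(int recordId) {
        tombstone.clear(recordId);
    }

    public boolean isValid(int recordId) {
        return tombstone.get(recordId);
    }
}
```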
The second index to be introduced is the data descriptor index. It is used
to assign a record to its associated descriptor values. By default, the index
utilizes a HashMap to map a descriptor identifier (i.e., a string) to an array-
like index structure. The array-like index structure associates the internal
identifiers (typically primitives) to bitmaps. Normally, the array-like index
structure is realized by a high performance collection, i.e., by default one
of Trove’s array list implementations. Figure 7.12 depicts the data de-
scriptor index. The complexity of the retrieval of a bitmap, which may be
loaded from the secondary memory if not cached, is on average O(1)65.
65 The retrieval of the collection from the HashMap is O(1). In addition, the high performance collections typically utilize an array, which also has a search complexity of O(1). Furthermore, to determine the internal identifier of a specific descriptor value, the descriptors index may be utilized, which has an average search complexity of O(1).
When adding a new record to the system, the bitmaps of the descriptor values associated to the record are set to 1 at the position specified by the record's unique identifier.
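As a hedged sketch of this index, a plain ArrayList replaces Trove's primitive collections and java.util.BitSet replaces the compressed bitmaps; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the data descriptor index: descriptor identifier -> array-like
// structure mapping internal value identifiers to bitmaps.
class DataDescriptorIndex {
    private final Map<String, List<BitSet>> index = new HashMap<>();

    // Retrieve (or lazily create) the bitmap of a descriptor value; the
    // look-up is O(1) on average (HashMap get plus array access).
    BitSet bitmapOf(String descriptor, int valueId) {
        List<BitSet> values = index.computeIfAbsent(descriptor, d -> new ArrayList<>());
        while (values.size() <= valueId) {
            values.add(new BitSet());
        }
        return values.get(valueId);
    }

    // Adding a record sets, for each of its descriptor values, the bit at
    // the position given by the record's unique identifier.
    void indexRecord(int recordId, String descriptor, int... valueIds) {
        for (int valueId : valueIds) {
            bitmapOf(descriptor, valueId).set(recordId);
        }
    }
}
```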
Figure 7.12: The data descriptor index, using by default a HashMap and a high performance collection (Trove) to index bitmaps.
The third index used in the context of indexing a record is the time axis
index. The structure of the index, used to retrieve time related entities, is
presented in section 7.3.1. The used array structure ensures (in the fixed
form) a retrieval of the associated bitmap in O(1). The bitmap of a chronon
is set to 1 at the record’s identifier position, if and only if the interval of the
record contains the chronon.
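A rough sketch of this index, with java.util.BitSet in place of the compressed bitmaps and a plain array as the fixed structure (all names are illustrative):

```java
import java.util.BitSet;

// Sketch of the time axis index: one bitmap per chronon of a fixed time
// axis; the bitmap of a chronon contains a record iff the record's
// interval covers that chronon.
class TimeAxisIndex {
    private final BitSet[] chronons;

    TimeAxisIndex(int axisLength) {
        chronons = new BitSet[axisLength];
        for (int i = 0; i < axisLength; i++) {
            chronons[i] = new BitSet();
        }
    }

    // Index a record's interval [start, end] (chronon positions, both
    // inclusive) by setting its bit for every covered chronon.
    void indexInterval(int recordId, int start, int end) {
        for (int t = start; t <= end; t++) {
            chronons[t].set(recordId);
        }
    }

    // Fixed array structure: retrieval of a chronon's bitmap is O(1).
    BitSet bitmapOf(int chronon) {
        return chronons[chronon];
    }
}
```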
To ensure fast retrieval of the facts associated to a specific record, a
fourth index is created and maintained. The so-called fact descriptor index
retrieves all the facts associated to a descriptor if the facts are value- or record-invariant (cf. section 4.2). In addition, it provides a list of the descriptor values having the specified fact as a result of their fact function. More specifically, the index is used to retrieve all the facts for a specific descriptor and the corresponding descriptor values of each fact. The index is sorted ascending by fact and collects statistical values like the number of not-a-number and numeric facts. If the descriptor contains record-variant facts, the index returns a null-pointer. Because of the underlying TreeSet, the complexity of adding a value to the index is O(log n). The retrieval of specific values from the TreeSet is typically not performed. Instead, the minimum, the maximum, or an iterator is retrieved, whereby these operations have a complexity of
O(1). The index persists the sets and may have to load them from the sec-
ondary memory if not cached. Figure 7.13 illustrates an example of the
index structure. For each descriptor a reference to a tree-set like structure
is stored, which holds the value- or record-invariant facts associated to the
descriptor values.
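The fact-sorted structure can be sketched roughly as follows; a TreeMap keyed by fact stands in for the system's TreeSet-based structure, and a BitSet of descriptor-value identifiers replaces the descriptor-value list (all names are illustrative):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the fact descriptor index: per descriptor, a tree structure
// sorted ascending by fact, associating each value- or record-invariant
// fact with the descriptor values producing it.
class FactDescriptorIndex {
    private final Map<String, TreeMap<Double, BitSet>> index = new HashMap<>();

    void add(String descriptor, double fact, int descriptorValueId) {
        index.computeIfAbsent(descriptor, d -> new TreeMap<>())
             .computeIfAbsent(fact, f -> new BitSet())
             .set(descriptorValueId);
    }

    // The sorted structure makes the minimum and maximum fact directly
    // available, without scanning all facts of the descriptor.
    double minFact(String descriptor) { return index.get(descriptor).firstKey(); }
    double maxFact(String descriptor) { return index.get(descriptor).lastKey(); }
}
```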
Figure 7.13: Example of the structure of the fact descriptor index, associating facts with descriptor values.
Indexing Raw Data Records
The previously introduced indexes are used to retrieve specific information
about the stored records, i.e., the bitmaps are used to associate the record
to a specific value, whereby the fact descriptor index keeps track of the
facts used. In some situations the retrieval of raw records is necessary,
e.g., when dealing with record-variant facts or if requested by a query.
When retrieving a record, the system typically determines the unique identifiers first. Thus, the retrieval can easily be performed via the primary key (i.e., the unique identifier of the record). Modern DBMS are designed to
perform exactly these tasks. Thus, in a productive system, the information
system should outsource this task and utilize a DBMS. Nevertheless, for
non-productive systems the information system offers the functionality to
keep the records in-memory, use a map-based embedded database en-
gine66, or reconstruct a record from the known information available within
the bitmap indexes67.
Using the Indexes for Filtering and Grouping
This section describes the algorithm used to filter and group (i.e., creating
the subsets) the dataset. The algorithms used to aggregate, calculate the
distance, or apply analysis are based on this result. The process is shown
in Figure 7.11 for the case of aggregation. Figure 7.14 depicts an example
database and the state of the indexes (with the exception of the raw record index and the descriptors index). The time axis is assumed to have a minute
granularity, starting at 00:00, and ending at 23:59 (of some random day;
time zone UTC). In addition, two descriptors are defined: the type de-
scriptor using a record-invariant fact function (i.e., cleaning always is
mapped to the value 4 and fueling to 2) and the pos descriptor using a
value-invariant fact function (i.e., always returning 1). Furthermore, one of
the intervals is associated to two descriptor values of the type descriptor
(creating a many-to-many relationship, cf. the summarizability problem
mentioned in section 3.2.1).
66 http://www.mapdb.org
67 The reconstructed record does not reflect the raw record, but contains all data of the record known by the system, i.e., descriptor values, start and end time, and unique identifier.
Figure 7.14: An example database with data related indexes.
The following select statement is used to exemplify the filtering and group-
ing algorithm:
SELECT TIMESERIES OF SUM(type) FROM sampleModel IN [10:44, 10:45]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
After parsing the example query, the filtering and grouping algorithm is ap-
plied. First, the algorithm retrieves the bitmaps referred to in the WHERE-
part of the statement utilizing the data descriptor index, i.e., in the example
the fueling bitmap. The algorithm evaluates the specified logical conditions
(i.e., AND, OR, or NOT) by applying the equivalent logical bitmap operation
to the retrieved bitmaps. The result of these operations is called the filter
bitmap, in the example the filter bitmap is (0, 1, 1). In the next step, the
algorithm retrieves the tombstone bitmap from the valid record index and
AND-combines it with the filter bitmap, resulting in the valid-filter bitmap, in
the example the valid-filter bitmap is equal to the filter bitmap. Afterwards,
the different groups have to be determined. This is done using the de-
scriptors index, which is used to retrieve all descriptor value instances for
a specific descriptor. The algorithm combines the different descriptor val-
ues with each other, validates specified includes and excludes, and creates
for each group the resulting bitmap (using the data descriptor index) by
AND-combining all descriptor value bitmaps of a group. Table 7.2 shows
two examples (one as defined in the sample query) of resulting bitmaps
created for a group by expression.
Table 7.2: Examples of different group-bitmaps created for specific GROUP BY expressions based on the example database shown in Figure 7.14.
GROUP BY: type, pos EXCLUDE {('cleaning', '*')}
Groups: 1: (fueling, A32), 2: (fueling, B35); Bitmaps: 1: (0, 0, 1), 2: (0, 1, 0)
GROUP BY: pos, type INCLUDE {('B35', 'cleaning')}
Groups: 1: (cleaning, B35)
Thus, the final result of the algorithm returns two bitmaps, i.e., (0, 0, 1) for
the (fueling, A32) group and (0, 1, 0) for the (fueling, B35) group. Summarized, the algorithm performs the following steps:
1. evaluate filter condition (apply the descriptors index to retrieve the in-
ternally used identifiers) and create the filter bitmap (utilizing the data
descriptor index),
2. retrieve the tombstone bitmap (from the valid record index) and com-
bine it with the filter bitmap to retrieve the valid-filter bitmap,
3. determine the different groups (using the descriptors index to resolve
strings) and create a group-bitmap for each group entry, and
4. combine the valid-filter bitmap with each group-bitmap to create a set
of valid-filter-group bitmap instances for each specified group.
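These four steps can be sketched, for a single group, roughly as follows; java.util.BitSet stands in for the compressed bitmaps, the bit values replay the example from Figure 7.14, and all class and method names are illustrative:

```java
import java.util.BitSet;

// Sketch of the filtering and grouping steps, replayed on the example of
// Figure 7.14 (three records; bit i represents the record with id i).
class FilterGroupSketch {

    // Helper: build a bitmap from a tuple such as (0, 1, 1).
    static BitSet of(int... bits) {
        BitSet b = new BitSet();
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] == 1) b.set(i);
        }
        return b;
    }

    static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // Step 2: AND-combine the filter bitmap with the tombstone bitmap;
    // step 4: AND-combine the result with the group bitmap.
    static BitSet validFilterGroup(BitSet filter, BitSet tombstone, BitSet group) {
        return and(and(filter, tombstone), group);
    }
}
```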
If a level of a descriptor dimension is used within the group by expression, the algorithm performs the same steps. Instead of retrieving the bitmaps
for each descriptor value when creating the group bitmaps, the algorithm
fetches the bitmaps associated to each member of the level and creates a
group bitmap for each member. Figure 7.15 depicts the process assuming
that an additional descriptor value (i.e., B40) is added without having any
data associated.
Figure 7.15: Illustration of the group bitmap calculation, in the case of the usage of a dimension’s level within the group by expression.
To determine the final result of the query, the specified aggregation op-
erator has to be applied. The algorithm used to determine the final aggre-
gated results, based on the different valid-filter-group bitmaps, is presented
in section 7.3.4. The implementation of the frequently mentioned retrieval of data from the cache, regarding bitmaps (and the fact sets, which have not been utilized further so far), is introduced in the next section, i.e., section 7.3.3.
7.3.3 Caching & Storage
The caching technique and secondary memory utilization depends on the
configuration of the caches (cf. section 7.2). By default, caches are pro-
vided by, e.g., libraries like ehCache, any modern DBMS, or the object-
relational mapping framework Hibernate. Nevertheless, a concrete imple-
mentation of a cache for the information system should be independent
and may decide to use its own implementation or to utilize a caching library. The information system provides techniques enabling the usage of any cache with respect to the releasing of objects from the cache.
The important aspect is how a reference to a cached object is handled within the information system. In general, a reference (e.g., in Java) is a strong reference, i.e., the object referred to is not eligible for
garbage collection as long as the reference exists. Regarding caching,
such a strong reference is helpful as long as the entity is needed, i.e.,
whenever a query is processed. Nevertheless, keeping a strong reference to an object managed by an underlying cache may lead to memory problems, because the cache is not capable of removing the object from main memory as long as other instances hold a strong reference (Jones et al. 2012, pp. 11–15). If, on the other hand, the cache is capable of informing the instance keeping the strong reference, the instance is able to remove the reference. Thus, two different strategies have to be considered
by the information system: (1) a cache publishing the release of an object
to a listening instance or (2) a cache removing the reference to release
memory without any notification.
To support the different types of caches, the information system pro-
vides two interfaces, i.e., the IReleaseMechanismCache and the
IReferenceMechanismCache. The former is used by caches capable of informing another instance about the release. The interface forces the cache to offer a method to register an observer. The information system registers such an observer, and whenever the observer is informed, the strong reference is removed so that garbage collection can take place. The latter interface is used by caches which do not provide any information about removed objects. In that case, the information system holds an instance using a weak reference (Jones et al. 2012, pp. 221–226). Whenever an ob-
ject is requested (e.g., when processing a query), the weak reference is
validated and a strong reference is returned. As long as the object is
needed (e.g., by the query processor) a valid reference is available. When
the strong reference is removed (e.g., because the processing is finished),
the information system has only a weak reference left. Thus, the cache is
capable of managing the objects without publishing any information about
a release.
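The weak-reference handling for an IReferenceMechanismCache can be sketched as follows, using java.lang.ref.WeakReference; the class and method names are illustrative:

```java
import java.lang.ref.WeakReference;

// Sketch of the reference handling for a cache without release
// notifications: the system keeps only a weak reference; a strong
// reference is handed out on request and dropped again when the
// processing (e.g., of a query) is finished.
class WeakCacheEntry<T> {
    private final WeakReference<T> ref;

    WeakCacheEntry(T object) {
        this.ref = new WeakReference<>(object);
    }

    // Validates the weak reference and returns a strong reference, or
    // null if the underlying cache (i.e., the garbage collector) has
    // already released the object.
    T request() {
        return ref.get();
    }
}
```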
7.3.4 Aggregation Techniques
As mentioned, aggregating data is one of the predominant operations used in data analysis. The performance of the aggregation is crucial for any system. Several performance-increasing techniques have been introduced in recent years, as presented in section 3.3.2. In this section, the algorithms to calculate aggregates of the form STA and TAT, based on the indexes presented in section 7.3.2, are introduced. In particular, the array-based time axis index is of importance to quickly retrieve the bitmaps of the chronons.
Span Temporal Aggregation
The aggregation algorithm expects a set of valid-filter-group bitmap in-
stances to be passed, as well as the parsed query. Receiving these param-
eters, the algorithm determines the relevant chronons selected by the
statement. Furthermore, the algorithm checks if a partition, in the form of a dimension's level, is specified within the statement. Looking at the following, previously used, example statement, the algorithm determines the chronons representing 10:44 and 10:45, as well as the absence of a dimension's level:
SELECT TIMESERIES OF SUM(type) FROM sampleModel IN [10:44, 10:45]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
Based on this information and the passed parameters, the algorithm is ca-
pable of performing the aggregation for each single chronon, by applying,
for each bitmap associated to a chronon, a logical AND-operation with the
valid-filter-group bitmap of each group. The result is a list of final bitmaps, which can be used to calculate any aggregation using STA. Figure 7.16 illustrates the final bitmaps for the different chronons and groups, i.e., (fueling, A32, 10:44), (fueling, B35, 10:44), (fueling, A32, 10:45), and (fueling, B35, 10:45).
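The per-chronon combination can be sketched as follows (java.util.BitSet in place of compressed bitmaps; all names are illustrative):

```java
import java.util.BitSet;

// Sketch of the STA step: for each selected chronon, the chronon's bitmap
// is AND-combined with the valid-filter-group bitmap of a group, yielding
// one final bitmap per (group, chronon) combination.
class StaSketch {
    static BitSet finalBitmap(BitSet chronon, BitSet validFilterGroup) {
        BitSet r = (BitSet) chronon.clone();
        r.and(validFilterGroup);
        return r;
    }
}
```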
Figure 7.16: The four resulting bitmaps for the different chronons and groups.
Based on the final bitmap and the fact descriptor index, the algorithm
calculates the aggregated value for each chronon. Table 7.3 shows the bit-
map-based algorithms for each aggregation operator. Some operators uti-
lize the fact descriptor index (referred to as factDescIdx) to retrieve the facts
associated to the specified descriptor. The implementation provides the
possibility to iterate in ascending, descending, or random order. In addition,
the iterator returns descriptor values, which can easily retrieve their fact (if
record-variant, the raw data record index is used) and the associated bit-
map (using the descriptors index). The iterator retrieved from the fact de-
scriptor index is also bitmap-based and uses internally the final bitmap,
which is passed as a parameter when creating the iterator. The algorithm utilized for iteration combines the final bitmap with that of the current descriptor value (i.e., the one associated to the current fact) and returns the current fact as many times as indicated by the resulting bitmap's cardinality (i.e., its count). The complexity of the algorithms may be determined by considering that:
– the count-operator can be assumed to perform in O(1) ("computing the
cardinality of a Roaring bitmap can be done quickly: it suffices to sum
at most ceil(n/2^16) counters" (Chambi et al. 2015)),
– the iteration is done in O(m) (with m being the cardinality of the de-
scriptor), and
– the complexity of logical operations is "O(n1 + n2) time, where n1 and
n2 are the respective lengths of the two compared arrays" (Chambi et
al. 2015).
However, the latter statement depends on the data added to the system, i.e., the size of the arrays cannot be determined in advance. Thus, a simple average complexity cannot be provided. Nevertheless, Chambi et al. (2015) state that
"we can compute and write bitwise ORs at 700 million 64-bit words per
second", which sounds sufficient, even if they state further that if they "com-
pute the cardinality of the result as we produce it, our estimated speed falls
to about 500 million words per second".
Table 7.3: List of algorithms used to calculate the different aggregated values.
aggregation operator
Aggregation Algorithm bf ≙ final bitmap, bt ≙ bitmap of chronon
sum
it = factDescIdx.iterator(bf);
res = it.hasNext() ? 0 : NaN; // 0, not NaN: NaN would absorb every addition
while (dv = it.next())
  res += dv.fact ∙ count(dv.bitmap AND bf);
return res;
median it = factDescIdx.ascIterator(bf);
cnt = count(bf);
even = (cnt & 1) == 0;
firstPos = floor(cnt * 0.5) + (even ? -1 : 0);
curPos = 0;
while (curPos != firstPos) {
  it.next();
  curPos++;
}
if (even) {
  return 0.5 ∙ (it.next().fact + it.next().fact);
} else {
  return it.next().fact;
}
mode it = factDescIdx.ascIterator(bf);
lastFact = NaN;
mode = NaN;
maxAmount = 0;
counter = 0;
while (it.hasNext()) {
  fact = it.next().fact;
  if (lastFact == fact) {
    counter++;
    continue;
  } else if (counter > maxAmount) {
    maxAmount = counter;
    mode = lastFact;
  } else if (counter == maxAmount) {
    mode = NaN;
  }
  counter = 1;
  lastFact = fact;
}
if (counter > maxAmount) {
  return lastFact;
} else if (counter < maxAmount) {
  return mode;
} else {
  return NaN;
}
count return count(bf)
min it = factDescIdx.ascIterator(bf);
return it.getNextFact();
max it = factDescIdx.descIterator(bf);
return it.getNextFact();
mean return sum / count;
count finished return count((bf XOR b(t+1)) AND bf)
count started return count((b(t-1) XOR bf) AND bf)
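As a hedged, executable sketch of the sum algorithm from Table 7.3, a TreeMap sorted ascending by fact and java.util.BitSet stand in for the system's fact descriptor iterator and compressed bitmaps; all names are illustrative:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the bitmap-based sum: iterate the facts in ascending order
// and add each fact as often as the cardinality of its bitmap
// AND-combined with the final bitmap bf indicates.
class SumAggregator {
    // factToBitmap: fact -> bitmap of the records carrying that fact
    // (record bitmaps directly, for simplicity of the sketch).
    static double sum(TreeMap<Double, BitSet> factToBitmap, BitSet bf) {
        double res = 0;
        for (Map.Entry<Double, BitSet> e : factToBitmap.entrySet()) {
            BitSet b = (BitSet) e.getValue().clone();
            b.and(bf); // restrict the fact's bitmap to the final bitmap
            res += e.getKey() * b.cardinality();
        }
        return res;
    }
}
```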
Two-Step Aggregation Technique
As mentioned at the beginning of the previous section, the algorithm
checks the parsed query for the relevant chronons, as well as a defined
partition. If a partition is defined, the TAT can be applied to calculate the
aggregated value across the partitions. Modifying the previously used sample query by specifying a dimension's level (i.e., HOUR), a second aggre-
gation operator (i.e., MAX), and a different time window (i.e., [10:00,
12:00]), the following sample query is assumed:
SELECT TIMESERIES OF MAX(SUM(type)) ON TIME.PARTITION.HOUR
FROM sampleModel IN [10:00, 12:00]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
To determine the result of the query, the algorithm performs the same steps as previously explained in the case of the STA. After all values for a specific partition are retrieved, the algorithm simply applies the second operator to the set of retrieved numbers (which might have to be sorted, e.g., in the case of the median).
SELECT TIMESERIES OF SUM(type) ON TIME.PARTITION.HOUR
FROM sampleModel IN [10:00, 12:00]
WHERE type = 'fueling' GROUP BY type, pos EXCLUDE {('cleaning', '*')} .
In that case, i.e., if only a single aggregation operator is specified together with a partition, the chronons of each partition are OR-combined prior to being combined with the valid-filter-group bitmap. Figure 7.17 illustrates TAT
and STA and the differences when aggregating.
Figure 7.17: Illustration of TAT and STA.
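The OR-combination per partition can be sketched as follows (java.util.BitSet in place of compressed bitmaps; names illustrative):

```java
import java.util.BitSet;

// Sketch of the TAT with a single operator and a partition (e.g., HOUR):
// the chronon bitmaps of the partition are OR-combined before being
// AND-combined with the valid-filter-group bitmap.
class TatSketch {
    static BitSet partitionBitmap(BitSet[] chrononsOfPartition, BitSet validFilterGroup) {
        BitSet or = new BitSet();
        for (BitSet chronon : chrononsOfPartition) {
            or.or(chronon);       // OR-combine all chronons of the partition
        }
        or.and(validFilterGroup); // then apply the valid-filter-group bitmap
        return or;
    }
}
```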
7.3.5 Distance Calculation
In this section, the algorithms for the different distance measures, intro-
duced in chapter 6, are presented. Two of the three distance measures (i.e.,
the temporal order distance and the temporal measure distance) are based
on results already presented. Nevertheless, an important aspect to increase the performance when calculating distance measures are abort criteria, which define when the calculation can be terminated early. In general, these criteria are based on bounds which ensure that the final distance cannot be smaller than the value calculated so far. Thus, for the two mentioned distances, the focus lies on the definition of such a bound, whereas for the third distance, i.e., the temporal relational distance, the focus is on the algorithm itself. The section is divided into three subsections. The first one intro-
duces the abort criterion for the temporal order and measure distance, as
well as the algorithm. The second discusses the temporal relational dis-
tance and the third discusses how to combine the different algorithms effi-
ciently.
Temporal Order and Measure Distance
The temporal order and measure distance can be calculated by retrieving the time series, applying the algorithm used to process a query. The temporal order distance can be seen as a special case of the measure distance. In general, the measure distance is calculated based on a time window, a measure, and optionally a level of the time dimension. In the case of the temporal
order distance, the measure is count calculated on the lowest granularity
(i.e., no level of the time dimension is selected). As defined by Definition
17 and Definition 19, the distance is the sum of the difference between the
mapped time points. The calculation of a distance can thereby be aborted
as soon as the current distance is larger than the largest distance of the
found k-NN. Figure 7.18 illustrates the abort criterion. The current dataset
is compared to the source by calculating the distance for each time point.
As soon as the distance value Dist is larger than 11,386 the calculation is
aborted. If the calculation reaches the end, the dataset is added to the list of k-NNs and the previously last entry is removed.
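The abort criterion can be sketched as follows, accumulating the distance time point by time point; the absolute difference and all names are illustrative assumptions of the sketch:

```java
// Sketch of the abort criterion: the distance is accumulated time point
// by time point and the calculation stops as soon as the running sum
// exceeds the distance of the current k-th nearest neighbor.
class AbortCriterion {
    // Returns the accumulated distance, or -1 if the calculation was
    // aborted because the candidate cannot enter the k-NN list anymore.
    static double boundedDistance(double[] source, double[] candidate, double worstKnn) {
        double dist = 0;
        for (int t = 0; t < source.length; t++) {
            dist += Math.abs(source[t] - candidate[t]);
            if (dist > worstKnn) {
                return -1; // abort: no further time points are evaluated
            }
        }
        return dist;
    }
}
```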
Figure 7.18: Illustration of the abort criterion for the temporal order and measure distance.
It should be mentioned that the time series is calculated iteratively. Thus, aborting early also ensures that no further values of time points are calculated. Furthermore, the calculation works analogously if groups are specified. In that case, the distance between each group is calculated and
summed up. As soon as the current distance exceeds the distance of the
last entry of the list of the k-NN, the calculation is terminated and the next
subset is evaluated.
Temporal Relational Distance
To calculate the temporal relational distance, it is necessary to determine
the relation between each pair of intervals of the subset and assign it to a
time point as specified in Table 6.1. Within the next step, the distance be-
tween the ordered set of vectors is calculated as defined by Definition 18.
Because of these two steps, it is necessary to scan the complete subset
and calculate all relations, prior to applying any abort criterion. The relations can thereby be determined in O(n), with n being the number of chronons covered by the subset, and with a memory usage of O(n + m^2), with m being the number of intervals contained within the subset.
Figure 7.19 illustrates the bitmap-based algorithm, which determines
the relations between two intervals. The algorithm iterates over each time
point, determining the bitmap for the current group of the current time point
(as described in section 7.3.2: Using the Indexes for Filtering and Group-
ing). In the next step, the determined bitmap is combined with the bitmap
of the previous time point68 to create three bitmaps: (1) a bitmap with all the
intervals just started, (2) a bitmap with all the intervals finished, and (3) a
bitmap with all the intervals still being active. Each bitmap can be easily
determined by logically combining the previous bpre and the current bitmap
Figure 7.19: Illustration of the algorithm used to determine the relations between intervals.
68 If no previous bitmap is present, e.g., in the case of the first time point, the bitmaps are assumed to be empty.
bcur, i.e.: (1) (bpre ⊕ bcur) ∧ bcur, (2) (bpre ⊕ bcur) ∧ bpre, and (3) bpre ∧ bcur. Within
the last step, the algorithm collects new information for each current pair,
i.e., start-relation, end-relation, and start-end-relation. Whenever all three
relations of a pair are known69, the algorithm is capable of specifying the
relation of the pair and the referred time point. A pair is thereby referred to
by a unique identifier, which is determined by the pairing-function shown in
Listing 7.18.
Listing 7.18: The pairing function used to determine a unique identifier for a pair of intervals.
long uniqueId(long recId1, long recId2) {
    long x = Math.max(recId1, recId2);
    long y = Math.min(recId1, recId2);
    // x * (x + 1) is always even, so integer division is exact; this
    // avoids the precision loss of 0.5 * (x * (x + 1)) for large ids
    return x * (x + 1) / 2 + y;
}
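The three bitmap combinations given above, i.e., started, finished, and still active, can be sketched as follows (java.util.BitSet in place of compressed bitmaps; names illustrative):

```java
import java.util.BitSet;

// Sketch of the three bitmap combinations per time point:
//   started  = (b_pre XOR b_cur) AND b_cur,
//   finished = (b_pre XOR b_cur) AND b_pre,
//   active   =  b_pre AND b_cur.
class RelationBitmaps {
    static BitSet started(BitSet pre, BitSet cur)  { return xorAnd(pre, cur, cur); }
    static BitSet finished(BitSet pre, BitSet cur) { return xorAnd(pre, cur, pre); }

    static BitSet active(BitSet pre, BitSet cur) {
        BitSet r = (BitSet) pre.clone();
        r.and(cur);
        return r;
    }

    private static BitSet xorAnd(BitSet pre, BitSet cur, BitSet mask) {
        BitSet r = (BitSet) pre.clone();
        r.xor(cur);
        r.and(mask);
        return r;
    }
}
```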
Having all relations and the endpoints of the pairs, it is easy to calculate a time series based on the formulas shown in Table 6.1. After the calculation of the time series, the distance is calculated by determining the distance between each pair of mapped time points. The abort criterion can be applied as described for the order and measure distance. Nevertheless, the termination has only a small impact on the performance, because the expensive detection of relations has to be performed anyway.
Calculating the Temporal Similarity Measure
The temporal similarity measure can be applied by calculating the different
distances using the algorithms presented in this section. A distance is
thereby only calculated if the weighting factor is larger than 0. In addition,
to increase performance, the temporal relational distance is only calculated
69 Figure 7.19 shows that it is not always necessary to know all three relations. In the case of,
e.g., ends-by it is enough to know the start and start-end relation.
if the weighted distance resulting from the temporal order and measure distance is not larger than the distance of the currently last k-NN.
7.4 User Interfaces
The GUI was implemented separately from the information system’s back
end. The aim of this separation was to ensure that the information system’s
interfaces, as well as the provided information, are sufficient regarding the
requirements of a user interface (cf. VIS-05). The implementation of the
GUI is web-based, i.e., utilizing HTML5, JavaScript, and CSS, and is based
on the Bootstrap framework70. In addition, the Highcharts library71 was uti-
lized to visualize time series and a Gantt-chart widget was implemented
using SVG and the jQuery library72.
Figure 7.20 illustrates different screenshots of the user console of the
GUI. The figure shows the login screen, the model, data, and user man-
agement (cf. VIS-03), as well as the UI for analytical tasks (cf. VIS-04). The
different sections of the user interface shown in the figure are described in the following:
– The model management is used to add new models to or delete models from the system. In addition, a model can be loaded and unloaded. The latter is useful if the model is not needed anymore and can thereby be removed from memory. An unloaded model is not available anymore, e.g., for querying.
– The data management enables the user to insert time interval data into or remove it from a model. It is possible to add data from the model itself, a CSV file, a DB query, or as a single record using the UI.
– The user management allows the creation of new, deletion of old, and
editing of available users or roles. A user is defined by a name, a pass-
word, the assigned roles, and the granted permissions. A role, on the
other hand, is specified by its name and the permissions.
70 http://getbootstrap.com
71 http://www.highcharts.com
72 https://jquery.com
– Last but not least, the interface for analytical tasks provides the possi-
bility to fire SELECT statements against the information system, which
are, depending on the statement, illustrated as time series or as a
Gantt-chart.
Figure 7.20: Overview of the user console of the implemented UI: top-left shows the login screen, top-right is a screenshot of the model management, middle-left is a picture of the data management, middle-right illustrates the user management, and the screenshots on the bottom show the time series visualization (left) and the Gantt-chart (right).
The UI provides additional sections which are not shown within the figure,
e.g., documentation (provides tutorials and explanations about the different
interfaces, the query language, services, and configuration) or home
(shows some general information on how to get in contact or participate).
Besides the actual GUI, a JDBC driver is provided to enable the usage
of the data within third party tools (cf. DC-02, PR-02, VIS-01). These tools
can be utilized to load data into the different models (e.g., a data integrator
firing INSERT statements), visualize information on a dashboard (e.g., us-
ing a modern web-framework), or create reports (e.g., by applying a BI
tool). The JDBC driver can be used to fire any query (cf. chapter 5) against
the information system.
7.5 Summary
In this chapter, the system architecture and its components were pre-
sented, which are motivated by multiple feature requests (e.g., DA-01, PD-
01, PD-02, MA-01, or MA-02) and performance considerations. Further-
more, the possibilities of configuring the model and system were intro-
duced. The configuration allows extending or replacing several components of the system, so that the information system is highly adaptable to domain-specific needs.
Several aspects of the realization were introduced, e.g., data structures,
indexes, and algorithms. The introduced indexes can be used in a holistic
way, i.e., the indexes can be used for different techniques like filtering,
grouping, aggregation, pre-aggregations, distance measures, and distrib-
uted calculations and are the answer to RQ4: "Which indexing techniques
can be used to process user queries and how should data be cached, as
well as persisted". In general, an optimized solution may be considered for
each individual task. Nevertheless, from a system perspective it is im-
portant to utilize an index, which is capable of supporting as many features
as needed. Furthermore, the performance measures, depicted in the next
chapter, show that the implementation outperforms state of the art propri-
etary software, which, e.g., does not support as many aggregation forms
and operators, needs expensive integration processes, and does not pro-
vide as many time related features.
Within section 7.4, a minimal but functional GUI was presented (cf. VIS-03 and VIS-04). The GUI can be used to analyze time interval data and
visualize requested results. The presented implementation is developed in-
dependently from the information system and utilizes the provided web-
services. Thus, it can be understood as a prototypical implementation of a
domain-specific GUI using the information system to enable users to ana-
lyze time interval data (cf. VIS-05). Furthermore, the JDBC driver was in-
troduced as another UI enabling the usage of the information system within
third party tools (cf. VIS-01).
Summarized, this chapter provides the answer to RQ6: "How should the
architecture of an information system for time interval data analysis be re-
alized, how should the system be configured, and which interfaces have to
be provided to support the analyzing process". As stated, the architecture
and the components are presented, the configuration capabilities are intro-
duced, and the different interfaces are shown.
8 Results & Evaluation
The realization of the feature requests (cf. section 2.2) in a performant way is an important criterion for the evaluation of the implementation of the information system. In addition, user acceptance and usability are criteria to be validated. In the first section of this chapter, i.e., section 8.1, the fulfill-
ment of the features requested is validated. Furthermore, the feedback of
current users regarding the features and the processing performance is
discussed. In section 8.2, available high performance collection libraries,
the performance and memory usage of the system, and selected algo-
rithms are evaluated. For comparison of the query language processing, a
main memory based version of the IntervalTree (cf. Edelsbrunner, Maurer
(1981), Kriegel et al. (2001)) was implemented. In addition, proprietary
tools were utilized to test the information system's performance against icCube and the Oracle DBMS (following Song et al. (2001), Mazón et al.
(2008), and Niemi et al. (2014) to detect and solve (if possible) occurring
summarizability problems).
8.1 Requirements & Features
Besides the performance of the system, the fulfillment of the features requested and the requirements formulated in section 2.2 is an important quality criterion of the implemented information system. Within the chapters,
the different features were addressed and used as motivation for specific
decisions. Table 8.1 shows the specified features, explains their realization, and states the degree of support. The comments presented for some features are
given by users of the information system. In general, these comments can
be understood as enhancement requests or new feature requests.
© Springer Fachmedien Wiesbaden GmbH 2016. P. Meisen, Analyzing Time Interval Data, DOI 10.1007/978-3-658-15728-9_8
Table 8.1: Overview of the different features requested, the realization of the feature, as well as comments of the users (if available), and the degree of realization.
DA-01, DA-02 (Aggregation of time interval data): The different aggregation operators for time interval data are supported by the query language (cf. section 5.3.3), and the bitmap-based implementation is introduced in section 7.3.4 (cf. Table 7.3). In addition, the extensibility of operators is presented in section 7.2.2.

DA-03 (Temporal operators to retrieve raw records): The introduced query language (cf. section 5.3.3) allows the specification of a time window using temporal operators (cf. Figure 5.1).

DA-04, DA-05 (Definition of dimensions, hierarchies, and levels): Descriptor dimensions and the time dimension are specified in detail in section 4.4 (cf. Definition 14 and Definition 15). The requested roll-up and drill-down operations are supported by the query language (cf. section 5.3.3). The implementation of the operations is aggregation-based and introduced in section 7.3.4.
Comments: The members of a level are specified by regular expressions. Regular expressions are sometimes difficult to formalize (especially for number ranges); an alternative, more user-friendly expression language is desired. In addition, it would be helpful to load dimensions, e.g., from a database table.

DA-06 (Support of different time zones): The support of different time zones is achieved by the presented dimensional model (cf. section 4.4). To retrieve the data, a time-zone-dependent view is utilized within a SELECT statement (cf. section 5.3.3).
Comments: To combine data from different time zones within one query, a UNION statement should be available.

DA-07 (Similarity measure): The comparison of different sets of time interval data is achieved by the presented similarity measure (cf. chapter 6). The measure can be used by firing an analytical query (cf. section 5.3.3). The implementation of the calculation is shown in section 7.3.5.

DA-08 (Query language): The query language is introduced in detail in chapter 5. It is divided into a DCL, DDL, and DML, covering all the requested features.
Comments: When a model is modified, it has to be removed and added as a new model. The language should be extended to support the update of models.

PD-01 (Notification system): The system is capable of triggering a job whenever a specific event occurs within the system (cf. Figure 7.1). The system itself supports the assignment of jobs to core events or user-defined events (cf. section 7.2.1), e.g., triggered as the result of an analysis.

PD-02 (Analytical algorithms): To allow the execution of analytical algorithms, the system provides an analysis manager (cf. Figure 7.2). In addition, the query language (cf. section 5.3.3), as well as the configuration (cf. section 7.2.2), support the usage and binding of analytical algorithms (e.g., pattern or association rule mining).

PR-01 (Prediction of upcoming situations): The prediction of an upcoming situation is an analytical task requiring a suitable algorithm. Thus, the analysis manager (cf. Figure 7.2) can be used once the model for the prediction is known. The concrete implementation can then fire an event, which is observed by a schedule and triggers the notification. The requirement is therefore not fulfilled, because no generic algorithm to predict an upcoming situation is available. Nevertheless, the system provides the functionality to notify a user once such a model is implemented.

PR-02 (Usage of third-party prescriptive analytics tools): The available JDBC driver (cf. section 7.4) and the defined query language (cf. section 5.3.3) allow the retrieval of data and other analytical results.

DC-01 (Data sources: CSV, XML, DBMS, or JSON): The system supports the definition of so-called data retrievers (cf. sections 7.1 and 7.2.1). By default, data retrievers for CSV files and DBMS queries are provided. Additional retrievers can easily be implemented against the provided interface.

DC-02 (JDBC driver, query language, and bulk loading): The feature requests statements to insert or delete records. Both statements are supported by the DML of the query language presented in section 5.3.1. In addition, bulk loading is described in that section, and the JDBC driver is shortly introduced in section 7.4.
Comments: The UPDATE and DELETE statements require the user to specify a record identifier. The identifier can be retrieved from the result set of an INSERT statement or using a SELECT RECORDS statement. It would be convenient to update or delete records by specifying criteria based on the descriptor values. Furthermore, the presented query language and its processing do not support any type of transaction. An inserted, updated, or deleted record is processed by the system as an atomic operation; rollbacks needed after several operations have to be performed manually. Thus, it would be better if the system supported transactions.

DC-03 (Pre-aggregates): The definition of pre-aggregates is currently not supported by the system. Performance tests showed that there is, so far, no need for such support. Nevertheless, in the future pre-aggregates may be needed to increase the performance. Thanks to the bitmap-based data representation (cf. section 7.3.2), as well as the caching and storage implementation (cf. section 7.3.3), such pre-aggregates can easily be added.

DI-01 (Complex data structures and many-to-many relationships): The support of complex data structures is achieved by several functionalities. First of all, a complex data structure can be pre-processed using scripts as defined in section 7.2.1. In addition, it is possible to extend the system and provide domain-specific descriptors (cf. section 4.2). The support of many-to-many relationships is achieved by allowing a descriptive mapping function to map a value to several descriptor values.

DI-02 (Validation of descriptive values): The system provides several possibilities to validate a descriptive value (cf. sections 4.2 and 7.3.1). The feature specifies several strategies used to validate descriptive values. These strategies are implemented and selectable by configuration (cf. section 7.2.1).

DI-03 (Validation of intervals): The validation of intervals is possible by applying available strategies (cf. section 7.2.1) or by implementing own time axis handlers (cf. section 7.3.1).

DI-04 (Pre-processing using scripts): Scripts can be utilized to pre-process raw data prior to the integration of the records into the system. A pre-processor can be defined via configuration as described in section 7.3.1.

MA-01, MA-02 (Apply models and schedule analysis): The scheduler and event manager introduced in section 7.1 are used to apply models or schedule an analysis. The configuration of the component is described in section 7.2.1.

VIS-01 (JDBC driver for third-party BI tools or visualizations): As stated in section 7.4, a JDBC driver is available and tested regarding its usage within BI tools.

VIS-02 (Subscribe to alerts): The feature requests a GUI to define schedules. Schedules are currently only definable within the configuration (cf. section 7.1). The feature was not realized, because the benefit does not justify the implementation effort.

VIS-03 (User management): The GUI provides a user management as illustrated in Figure 7.20.

VIS-04 (Minimal GUI to request query results): The GUI utilizes a line chart and a Gantt chart to visualize the results of time series or record queries (cf. Figure 7.20).

VIS-05 (JSON interface): The development of the GUI presented in section 7.4 was performed separately from the implementation of the back end. Nevertheless, the available JSON interface is utilized by the GUI.
The table shows that two of the features requested, DC-03 and VIS-02, were not realized within the presented information system. As already mentioned, the feature DC-03 is currently not needed, but may be added in the future. Thanks to the bitmap-based index, pre-aggregates can easily be calculated for often-used filters or selected members of a level of a hierarchy of a dimension. To keep the pre-aggregates up to date, it is necessary to add a mechanism allowing for the determination of the pre-aggregates to be updated when a change occurs (e.g., an insert or update is performed73). VIS-02 is not realized because of the effort needed to implement such a GUI element (within a research project).
8.2 Performance
In this section, several performance tests regarding runtime and memory
usage are presented. All tests were performed on an Intel Core i7-4810MQ
with a CPU clock rate of 2.80 GHz, 32 GB of main memory, an SSD, and
running 64-bit Windows 8.1 Pro. As Java implementation, a 64-bit JRE 1.6.45 was used, with -Xmx set to 4,096 MB and -Xms set to 512 MB. The tests were performed on the datasets listed in Table 8.2. Some tests used additional datasets not shown in the table; these are introduced in the context of the respective test.
Table 8.2: Overview of the datasets used for the performance tests, including the features of the model and the dataset.
Dataset (type): gh74 (real-world)
  raw records (∅ interval length): 1,122,097 (∅ 42 min)
  time axis (granularity & amount of granules): one year; minutes; 525,600
  descriptors (cardinality, i.e., amount of descriptor values): person (713), task-type (4), work-area (31)

Dataset (type): phone-calls (real-world)
  raw records (∅ interval length): 63,825 (∅ 47 min)
  time axis (granularity & amount of granules): two years; minutes; 1,051,200
  descriptors (cardinality, i.e., amount of descriptor values): caller (77), recipient (981), origin (50), destination (246)
The performed tests are organized in the following sections: tests regarding the performance of high performance collections are presented in section 8.2.1, tests measuring the loading performance of the system are summarized in section 8.2.2, results retrieved from tests concerning the selection performance are shown in section 8.2.3, the results of the distance performance measurements are outlined in section 8.2.4, and the tests evaluating the performance of the system in comparison to other proprietary systems are presented in section 8.2.5.

73 A delete may not have to be considered if the tombstone index is applied to the pre-aggregates.
74 Available online at: https://www.researchgate.net/publication/267979679
8.2.1 High Performance Collections
As mentioned in section 7.3.2, the system utilizes high performance collections to increase performance when retrieving indexed descriptors or bitmaps. The implemented default factory has to decide which index to provide for a specific setting. Thus, several tests using high performance collections and the Java default collections were performed to pick the best suited collection for a specific primitive data type and operation (i.e., retrieve, insert, or check containment). The tests added, retrieved, or checked the containment of 1,000,000 created descriptors. The descriptors were indexed by different primitive data types, i.e., int or long75. Each test was performed ten times and the average values are presented in Figure 8.1.
Figure 8.1: The results of the tests regarding the high performance collections for int and long data types.

75 The test was also performed for byte and short; the results were similar and are therefore not presented.
As illustrated, the Trove implementation outperforms all other high performance collections, as well as the Java collections. Thus, the factory selects Trove's high performance collections when indexing byte, short, int, or long values. Whenever a string value is used as key, a default HashMap is picked.
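The advantage of such collections stems mainly from avoiding the boxing of primitive keys. The following micro-benchmark sketch illustrates this; the Trove library itself is not shown here. Instead, a minimal open-addressing map with primitive int keys (the hypothetical IntObjectMap below, standing in for what a library class such as Trove's TIntObjectHashMap provides) is compared against the boxing java.util.HashMap. All names are illustrative, and the sketch assumes a fixed capacity without resizing.

```java
import java.util.HashMap;

// Minimal open-addressing map with primitive int keys: no Integer
// boxing on put/get, keys stored in a plain int[] array.
final class IntObjectMap<V> {
    private final int[] keys;
    private final Object[] vals;
    private final boolean[] used;

    IntObjectMap(int expected) {
        // power-of-two capacity with load factor <= 0.5; no resizing in this sketch
        int cap = Integer.highestOneBit(Math.max(16, expected * 2) - 1) << 1;
        keys = new int[cap];
        vals = new Object[cap];
        used = new boolean[cap];
    }

    private int indexOf(int key) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask; // Fibonacci-style hash, linear probing
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    void put(int key, V value) {
        int i = indexOf(key);
        used[i] = true;
        keys[i] = key;
        vals[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
        int i = indexOf(key);
        return used[i] ? (V) vals[i] : null;
    }
}

public class CollectionBench {
    public static void main(String[] args) {
        int n = 1_000_000;

        long t0 = System.nanoTime();
        IntObjectMap<String> primitive = new IntObjectMap<>(n);
        for (int i = 0; i < n; i++) primitive.put(i, "d" + (i & 1023));
        long tPrimitive = System.nanoTime() - t0;

        t0 = System.nanoTime();
        HashMap<Integer, String> boxed = new HashMap<>(2 * n);
        for (int i = 0; i < n; i++) boxed.put(i, "d" + (i & 1023));
        long tBoxed = System.nanoTime() - t0;

        System.out.printf("primitive keys: %d ms, boxed keys: %d ms%n",
                tPrimitive / 1_000_000, tBoxed / 1_000_000);
        System.out.println(primitive.get(42).equals(boxed.get(42))); // sanity check
    }
}
```

The boxed variant has to allocate an Integer object per key and hash through object indirection, which is the effect the measured tests expose.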
8.2.2 Load Performance
To measure the load performance, the gh dataset was used. The performance of the insertion of data is highly dependent on the caching used. Thus, the tests were performed using:
– no cache, i.e., everything was inserted in main memory,
– a file-based storage with a cache configured to use an RR cache algorithm76, a 20 % cleaning factor if the cache overflows, and a maximum size of 100,000 objects per cache, and
– a file-based storage with a cache configured to use an RR cache algorithm, an 80 % cleaning factor if the cache overflows, and a maximum size of 500,000 objects per cache.
Furthermore, the performance was measured by adding 1,000,000 records using several bulk loads of different sizes, i.e., the tests were performed using 100 chunks of 10,000, 20 chunks of 50,000, 10 chunks of 100,000, 5 chunks of 200,000, 2 chunks of 500,000, and finally 1 chunk of 1,000,000 records. Figure 8.2 illustrates the results of the load performance tests. The runtime performance of the memory and of the large file-based cache with high clean-up rate (i.e., File, 80 %, RR, 500k) stays almost constant in all scenarios (i.e., 0.046 ms per record in the case of the memory and 0.061 ms using the file-based cache). The smaller cache with a 20 % clean-up rate leads to many write operations on the hard drive, because the cache overflows more often. The small clean-up rate leads to several overflows within one bulk load. In addition, the random replacement removes entities which might be needed within the same load. However, even the worst average runtime performance (i.e., around 1 ms per record) enables the system to process 1,000 records per second.

76 Other tests performed showed that the RR algorithm is the best choice. All statistics-based algorithms were most of the time busy updating their statistics. Nevertheless, depending on the scenario these algorithms may be better suited, e.g., when retrieval is more important than loading.
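The cache behavior described above can be sketched as follows. RrCache, the cleaning factor parameter, and the eviction logic are illustrative names for this sketch, not the system's actual API; in the real system, evicted entries are written to the file-based storage rather than dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Random-replacement (RR) cache with a cleaning factor: when the cache
// is full, a fraction of randomly chosen entries is evicted at once.
final class RrCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final int maxSize;
    private final double cleaningFactor;
    private final Random rnd = new Random(42); // fixed seed for reproducibility

    RrCache(int maxSize, double cleaningFactor) {
        this.maxSize = maxSize;
        this.cleaningFactor = cleaningFactor;
    }

    void put(K key, V value) {
        if (entries.size() >= maxSize && !entries.containsKey(key))
            evictRandom((int) (maxSize * cleaningFactor));
        entries.put(key, value);
    }

    V get(K key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }

    private void evictRandom(int count) {
        List<K> keys = new ArrayList<>(entries.keySet());
        Collections.shuffle(keys, rnd);
        // in the real system, evicted entries would be written to the
        // file-based storage before being dropped from main memory
        for (int i = 0; i < count && i < keys.size(); i++)
            entries.remove(keys.get(i));
    }
}

public class RrCacheDemo {
    public static void main(String[] args) {
        // 20 % cleaning factor, 100,000 objects, as in the second test setup
        RrCache<Integer, String> cache = new RrCache<>(100_000, 0.2);
        for (int i = 0; i < 250_000; i++) cache.put(i, "record-" + i);
        System.out.println(cache.size() <= 100_000); // never exceeds the maximum
    }
}
```

The sketch makes the trade-off visible: a small cleaning factor triggers evictions (and thus storage writes) more often within a single bulk load, while a large factor clears room for many inserts at once.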
Figure 8.2: The results of the load performance tests.
Regarding the memory usage, the tests show that the caches perform as expected. Considering the garbage collector of Java and the settings used for the tests, the results have to be interpreted with care: whether the garbage collector runs or not is decided by the Java Runtime Environment. Thus, elements may stay in main memory because garbage collection has not removed them yet (cf. Jones et al. (2012)).
8.2.3 Selection Performance
To evaluate the runtime performance of the bitmap-based algorithm for the processing of select statements, three additional algorithms were implemented. These algorithms do not support group by, multiple measures, storage, multi-threading, or generic dimensions (i.e., the used dimensions are hard-coded):
– a naïve algorithm (performing a sequential scan), which is used as baseline,
– a main memory IntervalTree-based algorithm, which fills the tree once with the raw records, and
– another main memory IntervalTree-based algorithm, whereby this implementation creates an intermediate filtered tree.
The naïve algorithm, named Naïve, is shown in Listing 8.1. The algorithm filters the records according to the specified filter criteria of the query, i.e., the time window and the logical expression defined in the WHERE clause (cf. line 04). Next, the algorithm determines the ranges defined by the dimension and the time window and iterates over each partition (cf. line 06). In each iteration the algorithm determines the records for the current range by filtering the previously created set of records (cf. line 08). The algorithm calculates the value of the measure based on the set of records and the current range; the calculated value is assigned to the time series (cf. line 12).
Listing 8.1: The naïve algorithm.
01: TimeSeries naive(Query q, Set r) {
02: TimeSeries ts = new TimeSeries(q);
03: // filter time defined by IN [a, b] and WHERE expression
04: r = filter(r, q.time(), q.where());
05: // iterate over the ranges defined by IN and ON
06: for (TimeRange i : q.ranges()) {
07: // filter records for the range
08: r’ = filter(r, i);
11: // determine measures defined by OF
12:     ts.set(i, calc(i, r’, q.meas()));
13: }
14: return ts;
15: }
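As a complement to the pseudocode, the following is a runnable toy version of the naïve algorithm for a COUNT(1) measure on unit-length ranges. Interval, the single hard-coded descriptor, and the flattened query parameters are simplified stand-ins for the system's actual Query, TimeRange, and record types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveDemo {
    record Interval(int start, int end, String taskType) {} // half-open [start, end)

    // COUNT(1) per unit-length range over [from, to), restricted to one task-type
    static int[] naive(List<Interval> records, int from, int to, String taskType) {
        // filter: time window and WHERE expression (cf. line 04 of Listing 8.1)
        List<Interval> r = new ArrayList<>();
        for (Interval rec : records)
            if (rec.end() > from && rec.start() < to && rec.taskType().equals(taskType))
                r.add(rec);
        int[] ts = new int[to - from];
        for (int t = from; t < to; t++)          // iterate over the ranges (line 06)
            for (Interval rec : r)               // filter records per range (line 08)
                if (rec.start() <= t && t < rec.end()) ts[t - from]++;
        return ts;
    }

    public static void main(String[] args) {
        List<Interval> recs = List.of(
                new Interval(0, 4, "short"),
                new Interval(2, 6, "short"),
                new Interval(1, 5, "long"));
        System.out.println(Arrays.toString(naive(recs, 0, 6, "short")));
    }
}
```

The nested loop over ranges and records is exactly what makes the sequential scan a useful baseline: its cost grows with the product of the number of granules and the number of filtered records.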
The first IntervalTree-based algorithm, named IntTreeA, works similarly to the naïve algorithm. Instead of retrieving a set of records (cf. line 01 of the Naïve algorithm), the algorithm receives an IntervalTree. The tree is used to apply the filter criteria defined by the time window. The resulting set of records is filtered by the logical expression defined by the WHERE clause and processed as in the Naïve algorithm (cf. lines 06 - 15). The second IntervalTree-based algorithm, named IntTreeB, is shown in Listing 8.2. The algorithm receives the records organized in the IntervalTree and creates an intermediate tree containing all the filtered records (cf. line 04). Afterwards, the algorithm iterates over the defined ranges and uses the intermediate tree to pick the records valid for the specified range (cf. line 08). Within each iteration the measure for the selected subset is calculated and set (cf. line 12). After the iteration is finished, the algorithm returns the created time series.
Listing 8.2: The IntTreeB algorithm.
01: TimeSeries intTreeB(Query q, IntervalTree tree) {
02: TimeSeries ts = new TimeSeries(q);
03: // filter time defined by IN [a, b] and WHERE expression
04: IntervalTree interTree = filter(tree, q.time(), q.where());
05: // iterate over the ranges defined by IN and ON
06: for (TimeRange i : q.ranges()) {
07: // use the tree to get the filtered records within the range
08: Set records = filter(interTree, i);
11: // determine measures defined by OF
12:     ts.set(i, calc(i, records, q.meas()));
13: }
14: return ts;
15: }
All four algorithms were tested with the same Java settings, using the gh and phone-calls datasets. To assess the performance for different dataset sizes, the gh dataset was used to create several subsets of 10,000, 100,000, and 1,000,000 records. In addition, several query types, differing in selectivity, filter criteria, and measure complexity, were processed. Table 8.3 gives an overview of the different queries fired against the different datasets, showing the characteristics of each query and dataset combination. The categories simple, average, and high used for the complexity are indicators that help to classify the results.
Table 8.3: Overview of the different tests performed to validate the runtime performance.
nr. | dataset | #selected records | #total records | measure complexity | filter complexity

Query: COUNT(1); -; [01.01.2008, 01.02.2008); WORKAREA.LOC.TYPE='Gate'
#1a | gh | 147 | 10,000 | simple | average
#1b | gh | 1,572 | 100,000 | simple | average
#1c | gh | 15,391 | 1,000,000 | simple | average

Query: MAX(COUNT(1)); TIME.DEF.DAY; [01.01.2008, 01.02.2008); TASKTYPE='short'
#2a | gh | 503 | 10,000 | average | simple
#2b | gh | 5,058 | 100,000 | average | simple
#2c | gh | 51,461 | 1,000,000 | average | simple

Query: MAX(SUM(PERSON) / COUNT(1)); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); TASKTYPE='short'
#3a | gh | 102 | 10,000 | high | simple
#3b | gh | 995 | 100,000 | high | simple
#3c | gh | 9,727 | 1,000,000 | high | simple

Query: MIN(TASKTYPE); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); WORKAREA.LOC.TYPE='Ramp' OR PERSON='*9'
#4a | gh | 99 | 10,000 | average | average
#4b | gh | 1,002 | 100,000 | average | average
#4c | gh | 9,912 | 1,000,000 | average | average

Query: SUM(COUNT(1)); TIME.DEF.MIN5; [01.01.2008, 01.02.2008); (WORKAREA.LOC.TYPE='Ramp' OR PERSON='*9') AND (TASKTYPE='long' OR TASKTYPE='short')
#5a | gh | 79 | 10,000 | high | high
#5b | gh | 846 | 100,000 | high | high
#5c | gh | 8,375 | 1,000,000 | high | high

Query: SUM(TASKTYPE); TIME.DEF.DAY; [01.01.2008, 01.02.2008); WORKAREA='BIE.W03' AND (TASKTYPE='long' OR TASKTYPE='very long')
#6a | gh | 79 | 10,000 | simple | high
#6b | gh | 846 | 100,000 | simple | high
#6c | gh | 8,375 | 1,000,000 | simple | high

Query: COUNT(1); -; [01.01.2014, 01.02.2013); -
#7 | phone-calls | 10,583 | 63,825 | simple | simple

Query: MAX(SUM(CALLER)); TIME.DEF.DAY; [01.01.2014 00:00:00, 01.02.2013); ORIGIN='Kansas'
#8 | phone-calls | 493 | 63,825 | average | simple

Query: SUM(COUNT(1)); TIME.DEF.DAY; [01.08.2013, 01.08.2013); CALLER='L*' AND (RECIPIENT='A*' OR RECIPIENT='M*')
#9 | phone-calls | 2,877 | 63,825 | average | high
The results of the runtime performance tests are depicted in Figure 8.3, and a detailed list showing all measured runtimes can be found in the appendix. The results show that the implementation presented in this book (i.e., chapter 7) outperforms the other implementations in all tests, except for #4a, #4b, #5a, #6a, and #6b. The reasons lie above all in the low selectivity of these queries (i.e., the ratio between the amount of selected and total records is small) and the fact that the records remained in main memory during the test. If a storage were utilized, the IntTreeB algorithm would have to retrieve each record from secondary memory and validate the filter. The TIDA algorithm, on the other hand, can apply the filters using bitmaps, which also may have to be retrieved from secondary memory. Nevertheless, the retrieval of a bitmap (including information about all records for the filtered attribute) is performed faster than any record retrieval (cf. Abdelouarit et al. (2013)). In addition, applying the TIDA algorithm to larger datasets (e.g., #1-6c, #7, #8, and #9) shows the superiority of the bitmap-based implementation.
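The principle behind the bitmap-based filtering can be sketched with java.util.BitSet, which here stands in for the compressed bitmap implementations used by the system; the records and descriptor values below are toy data. One bitmap per descriptor value marks all records carrying that value, so a WHERE clause becomes bitwise logic and COUNT(1) a cardinality count.

```java
import java.util.BitSet;

public class BitmapFilterDemo {
    // helper: build a bitmap over `size` records with the given bits set
    private static BitSet bits(int size, int... set) {
        BitSet b = new BitSet(size);
        for (int i : set) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        int records = 8;
        // one bitmap per descriptor value: bit i is set iff record i carries the value
        BitSet taskShort = bits(records, 0, 2, 3, 5, 7); // TASKTYPE='short'
        BitSet areaRamp  = bits(records, 1, 2, 3, 6, 7); // WORKAREA='Ramp'
        BitSet inRange   = bits(records, 0, 1, 2, 3, 4); // records valid in the time range

        BitSet filter = (BitSet) taskShort.clone();
        filter.and(areaRamp); // WHERE TASKTYPE='short' AND WORKAREA='Ramp'
        filter.and(inRange);  // restrict to the queried time window
        System.out.println(filter.cardinality()); // COUNT(1) for the range
    }
}
```

Each AND touches one bitmap per predicate instead of one record per candidate, which is why retrieving a single bitmap from secondary memory can replace many record retrievals.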
Figure 8.3: The results of the selection tests for the different queries shown in Table 8.3.
8.2.4 Distance Performance
Within this section, the performance of the algorithms used to determine the temporal order, relational, and measure distance is evaluated. The temporal order distance is compared to the IBSM algorithm introduced by Kotsifakos et al. (2013) (cf. section 3.5). The performance of the measure and relational distance is not compared, because an implementation of the ARTEMIS algorithm was not available when requested from Kostakis et al. Within the tests, the gh dataset was used, searching for the 1-NN, 3-NN, 5-NN, and 10-NN. The search was performed against the whole dataset. The source (i.e., the subset) for each test was selected per type of distance: for the temporal order distance, a day was randomly picked; for the measure distance, the following measures were calculated: (1) SUM on lowest granularity for a day, searching for similar days, and (2) MAX-COUNT on day level for a month, searching for similar months; for the relational distance, a day was randomly picked and filtered for a specific task-type and work-area. In addition, the results of IBSM and the presented bitmap-based algorithm to determine the temporal order distance were compared; the resulting nearest neighbors were equal in all tests. Figure 8.4 illustrates the results of the tests, i.e., the measured runtime, as well as the 3-NN for the temporal order and the two temporal measure similarities. A visualization of the 3-NN of the temporal relational similarity can be found in the appendix (cf. 3-NN of the Temporal Relational Similarity). The bitmap-based implementation outperforms the IBSM implementation. The figure also shows that the abort criterion increases the performance slightly.
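The idea behind the compared distance can be illustrated by a minimal sketch of an IBSM-style computation: rasterize each set of intervals into a per-chronon count vector and compare the vectors via the Euclidean distance. This is a deliberate simplification (IBSM operates on per-label matrices rather than a single count vector), and the toy intervals and names are illustrative.

```java
public class OrderDistanceDemo {
    // rasterize: per-chronon count of active intervals; iv = [start, end)
    static int[] rasterize(int axisLength, int[][] intervals) {
        int[] counts = new int[axisLength];
        for (int[] iv : intervals)
            for (int t = iv[0]; t < iv[1]; t++) counts[t]++;
        return counts;
    }

    // Euclidean distance between two count vectors of equal length
    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            double d = a[t] - b[t];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[][] day1 = {{0, 3}, {2, 5}}; // toy interval sets over a 6-granule axis
        int[][] day2 = {{0, 3}, {3, 5}};
        int[] c1 = rasterize(6, day1); // [1, 1, 2, 1, 1, 0]
        int[] c2 = rasterize(6, day2); // [1, 1, 1, 1, 1, 0]
        System.out.println(distance(c1, c2)); // vectors differ only at t = 2
    }
}
```

The bitmap-based variant evaluated above obtains the same per-chronon counts directly from the time bitmaps instead of rasterizing each raw record, which is where its runtime advantage originates.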
Figure 8.4: Illustration of the performance tests regarding the distance calculation, as well as the results of the temporal order and measure similarity; a visualization of the relational similarity can be found in the appendix.
8.2.5 Proprietary Solutions vs. TIDAIS
In this section, the runtime performance of the presented TIDAIS and of other proprietary solutions regarding the provision of answers to questions is measured. The following proprietary solutions are validated:
– icCube 5.1 Community Edition77,
– Oracle DBMS 12c Enterprise Edition with the OLAP option installed78, and
– TimeDB 2.279.

77 http://www.iccube.com
78 https://www.oracle.com/database/index.html
In the following, the different solutions are shortly introduced and their selection is motivated:
– icCube is a high performance, real-time analytical engine. It can be used to combine and transform data from multiple sources and to answer MDX queries. It supports non-strict, non-onto, and non-covering hierarchies, and the usage of time intervals for dimensions. The lowest selectable granularities are half-hour and hour. However, the lowest supported and working granularity using the test datasets was day (i.e., using hour or half-hour led to crashes). icCube is chosen as representative of MOLAP-based analytical systems.
– The Oracle DBMS 12c is one of the most sophisticated DBMSs. It supports many-to-many relationships (i.e., non-strict hierarchies) by solving summarizability issues on a physical and logical level (e.g., following ideas presented by Song et al. (2001)). The Oracle DBMS is selected because of its ROLAP-based data storage, as well as the possibility to analyze data using PL/SQL.
– The TimeDB solution is a product of the temporal database community supporting ATSQL2 as query language. It does not support dimensional models. Nevertheless, several questions can be answered using the tool, which is backed by an Oracle DBMS. Thus, the performance of the tool is mainly influenced by the underlying Oracle DBMS.
To ensure fairness among the different systems, all Java-based implementations (i.e., icCube, TimeDB, and TIDAIS) used a maximum of 512 MB of main memory and secondary memory for data storage. In addition, the kernel time used by each tool to parse a query, determine the results, and return them was measured80. Furthermore, the caches were reset prior to each test series, which consists of 20 query requests with other queries mixed in between. The Oracle DBMS was set up with indexes and partitions to increase the query performance. Several Oracle-based results (i.e., #1, #3, and #5) were confirmed by an Oracle specialist81, and further optimization strategies were discussed and applied if possible, e.g., data pre-processing, hybrid columnar compression (HCC), in-memory tables, flash cache, and external tables. Nevertheless, most of these optimizations were not applicable within the tests, because an Oracle Exadata solution82 would be needed, which is out of scope regarding an applicable solution to analyze time interval data.

79 http://www.timeconsult.com/Software/Software.html
80 To achieve that with the Java implementations, the kernel time of the different threads was measured using ThreadMXBean. In Oracle, the session was altered using "ALTER SESSION SET EVENT '10046 trace name context forever, level 12';" and evaluated using TKPROF.
Table 8.4 shows the performed tests and the possibility to run each test using the different solutions. The table shows two Oracle solutions, one using the available OLAP option, the other one using a specifically written PL/SQL statement queried against the raw dataset. Each test utilized the introduced ground-handling dataset gh (cf. section 8.2). In addition, a third dataset, based on the gh dataset, was created to test the icCube solution utilizing day granularity. The created dataset, named ghday, contains the same amount of intervals and the same descriptive values as the gh dataset. The minute granularity was resolved to a day granularity, keeping the average duration. Thereby, the amount of chronons is reduced to 366 (2008 was a leap year).

81 Eric Emrick worked as an Oracle DBA for several years, focusing on data analysis with Oracle technologies. After performing and analyzing several tests using the Oracle DBMS and the provided datasets, he stated that the "Tida results are all the more impressive" regarding the fast data retrieval.
82 The Exadata solution is the premium Oracle solution, not available under $300,000 (basic version, hardware included; cf. Ronald Weiss (2012) and Oracle Corporation (2015)).
Table 8.4: List of tests performed in the category "Proprietary Solutions vs. TIDAIS" (solutions considered: Oracle OLAP, Oracle PL/SQL, icCube, TimeDB, and TIDAIS).

#1 (ghday83): How many tasks were performed on each day of the year 2008 per task-type?
#2 (ghday83): How many tasks did each person execute per day in March?
#3 (gh): How many resources are needed within each hour on the 2008-12-05?
#4 (gh): How many hours are worked per day in January?
#5 (gh): What was the maximal amount of active resources between 2008-01-20 and 2008-01-25?
The results of the tests are shown in Figure 8.5. As illustrated, the implementation presented in this book outperforms the other proprietary tools. In general, the Oracle (OLAP) solution performs best among the proprietary tools when compared to the TIDAIS. The performance regarding query #4 is explained by the amount of records to be evaluated. The poor performance on the raw records using PL/SQL scripts is explained by the use of a pipelined table, which is used to generate the chronons and to ensure that all time points are covered within the result. Joining data with the virtual pipelined table is expensive and thereby slow. Nevertheless, the used PL/SQL queries only need the existence of the PL/SQL function and data types presented in the appendix (cf. Pipelined Table Function (PL/SQL Oracle)) and no additional pre-processing.

83 The queries were created and combined (i.e., appended to each other) programmatically.
Figure 8.5: Performance results of the queries used to answer the questions shown in Table 8.4.
8.3 Summary
In this chapter the TIDAIS, which is based on the presented TIDAMODEL and the introduced bitmap-based indexes, was evaluated. The requested features were checked regarding their fulfillment, including the usability of the TIDAQL. The performance of the system was tested with respect to memory usage and runtime. The results show that the system outperforms current state-of-the-art solutions. In general, the evaluation also shows that the presented holistic solution based on bitmap-based indexes is, in the majority of cases, faster than specialized data structures and algorithms.
9 Summary and Outlook
Time interval data is ubiquitous, and the need to analyze such data arises more and more frequently. In this book, an information system was introduced which enables the user to analyze time interval data using known techniques like OLAP, data mining, or similarity searches. At the heart of the presented system are bitmap-based index structures, enabling the system to process queries formulated in the presented query language. It is the first system focusing on this type of data, and it outperforms other data analysis tools. Furthermore, the evaluations have shown that the bitmap-based algorithms outperform techniques like the IntervalTree or IBSM.
The RQs mentioned at the beginning of the book are answered within the different sections. The needed features of an information system (cf. RQ1) are presented in section 2.2. The aspects to be covered by a model for time interval data analysis (cf. RQ2) are introduced in chapter 4 and motivated by the already mentioned features, as well as an extended literature review (cf. chapter 3). Chapter 5 introduces the answer to RQ3, which deals with the definition of a query language to enable time interval data analysis. The indexing techniques presented in section 7.3.2 enable the system to process the formulated queries fast, as shown in chapter 8. In addition, the book extensively discusses the possibilities to utilize caches and storage (cf. section 7.3.3). These aspects, i.e., indexing, caching, and persistency, are addressed by RQ4, which is answered by the presented results. The similarity search mentioned in RQ5 is enabled by the distance measures introduced in chapter 6 and section 7.3.5. The last question, RQ6, is answered within sections 7.1 (architecture), 7.2 (configuration), as well as 7.3 and 7.4 (interfaces).
Nevertheless, the presented solution has some limitations regarding the
processing of queries: (1) the concurrent processing of a query as an
atomic instance, (2) the processing of data at the lowest granularity, and
(3) the constraints of the used bitmap implementations. In the following,
each of these limitations is explained briefly, and the next paragraph lists
additional future research topics based on them. Regarding the pro-
cessing of a query (cf. (1)), the system is capable of processing queries in
parallel, but it cannot split the processing of a single query, e.g., to enable
distributed single-query processing. The system is also limited in that it
processes each query at the lowest granularity (cf. (2)). Whenever a query
is processed, the algorithm retrieves and combines the bitmaps at the
lowest granularity; the usage and intelligent creation of pre-aggregations
is not supported. Finally, the used bitmap implementations (cf. (3)) are
generally designed for the purpose of logically combining bitmaps. A spe-
cific implementation for the presented use case is neither discussed nor
introduced; thus, performance gains through specifically designed imple-
mentations may be possible.
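Limitation (2) can be made concrete with a small, hypothetical sketch (again using java.util.BitSet for illustration; this is not a feature of the presented system): if the 60 minute-level bitmaps of an hour were OR-combined once into a pre-aggregated hour-level bitmap, a query spanning full hours would touch one bitmap instead of sixty.

```java
import java.util.BitSet;

// Hypothetical sketch of pre-aggregating minute-level bitmaps to hours,
// i.e., the feature the presented system currently lacks.
class PreAggregation {
    // OR-combine the 60 minute-slices of the given hour into one bitmap
    static BitSet aggregateHour(BitSet[] minuteSlices, int hour) {
        BitSet hourSlice = new BitSet();
        for (int m = hour * 60; m < (hour + 1) * 60; m++) {
            hourSlice.or(minuteSlices[m]);
        }
        return hourSlice;
    }
}
```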
Based on these limitations, user feedback (cf. section 8.1), and current
research topics, several research questions remain open and should be
addressed in future work. An important question is how the system can
utilize techniques like load balancing or distributed query processing. The
partitioning of the time axis and the distribution of the different bitmaps
within a cluster should be investigated. Furthermore, research regarding
the visualization of time interval data should be pursued. Another topic is
the enhancement of the algorithm used for similarity search; regarding
the temporal relational distance, there is potential for optimization. Last
but not least, the development of special mining techniques using, e.g.,
OLAM techniques, should be investigated. The usage of time-dependent
information like vacation periods, global events, or local events may re-
veal new patterns aligned to these temporal artifacts. In addition, the
mentioned, unfulfilled feature regarding the calculation and provision of
pre-aggregates should be implemented and discussed in the near future.
The automatic generation of such pre-aggregates by the system, i.e., by
learning from the users' queries, should be researched.
Appendix
Pipelined Table Functions (PL/SQL Oracle)
DROP TYPE T_DATE;
DROP TYPE T_DATE_ROW;
CREATE TYPE T_DATE_ROW AS OBJECT (start_date DATE, end_date DATE);
/
CREATE TYPE T_DATE IS TABLE OF T_DATE_ROW;
/
-- Splits the interval [start_date, end_date] into one-minute granules,
-- usable e.g. as: SELECT * FROM TABLE(F_DATES(:start_date, :end_date))
CREATE OR REPLACE FUNCTION
F_DATES(start_date IN DATE, end_date IN DATE) RETURN T_DATE PIPELINED AS
  diff pls_integer;
  cur  DATE;
  nxt  DATE;
BEGIN
  -- number of minutes covered by the interval
  diff := (end_date - start_date) * 24 * 60;
  cur  := start_date;
  FOR i IN 1 .. diff LOOP
    nxt := start_date + (i / 24 / 60);
    PIPE ROW(T_DATE_ROW(cur, nxt));
    cur := nxt;
  END LOOP;
  RETURN;
END;
/
A Complete Sample Model-Configuration-File
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- offlinemode (i.e. what should happen if a data retriever is not available)
is optional and can be one of the following values (case-insensitive):
+ true, y, yes
+ false, n, no
+ auto
-->
<model xmlns="http://dev.meisen.net/xsd/dissertation/model"
xmlns:advDes="http://dev.meisen.net/xsd/dissertation/model/advancedDescriptors"
xmlns:idx="http://dev.meisen.net/xsd/dissertation/model/indexes"
xmlns:map="http://dev.meisen.net/xsd/dissertation/model/mapper"
xmlns:dim="http://dev.meisen.net/xsd/dissertation/dimension"
xmlns:spp="http://dev.meisen.net/xsd/dissertation/preprocessor/script"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://dev.meisen.net/xsd/dissertation/model
http://dev.meisen.net/xsd/dissertation/tidaModel.xsd
http://dev.meisen.net/xsd/dissertation/model/indexes
http://dev.meisen.net/xsd/dissertation/tidaIndexFactory.xsd
http://dev.meisen.net/xsd/dissertation/model/advancedDescriptors
http://dev.meisen.net/xsd/dissertation/tidaAdvancedDescriptors.xsd
http://dev.meisen.net/xsd/dissertation/model/mapper
http://dev.meisen.net/xsd/dissertation/tidaMapperFactory.xsd
http://dev.meisen.net/xsd/dissertation/preprocessor/script
http://dev.meisen.net/xsd/dissertation/tidaScriptPreProcessor.xsd
http://dev.meisen.net/xsd/dissertation/dimension
http://dev.meisen.net/xsd/dissertation/tidaDimension.xsd"
offlinemode="false" folder="_data/fullModel"
id="fullModel" name="My wonderful Model">
<config>
<caches>
<!-- Define the cache to be used for identifiers.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryIdentifierCache
+ net.meisen.dissertation.impl.cache.FileIdentifierCache
-->
<identifier
implementation="net.meisen.dissertation.impl.cache.MemoryIdentifierCache" />
<!-- Define the cache to be used for meta-information (i.e. the descriptors).
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryMetaDataCache
+ net.meisen.dissertation.impl.cache.FileMetaDataCache
-->
<metadata
implementation="net.meisen.dissertation.model.descriptors.mock.MockMetaDataCache" />
<!-- Define the cache to be used for bitmaps.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryBitmapCache
+ net.meisen.dissertation.impl.cache.MapDbBitmapCache
+ net.meisen.dissertation.impl.cache.FileBitmapCache
-->
<bitmap implementation="net.meisen.dissertation.impl.cache.MemoryBitmapCache" />
<!-- Define the cache to be used for fact-sets.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.FileFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.MapDbFactDescriptorModelSetCache
-->
<factsets
implementation="net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache" />
<!-- Define the cache to be used for records.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache
+ net.meisen.dissertation.impl.cache.MemoryDataRecordCache
+ net.meisen.dissertation.impl.cache.MapDbDataRecordCache
-->
<records
implementation="net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache" />
</caches>
<factories>
<!-- Define the factory to be used to determine which IndexFactory to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.indexes.IndexFactory
-->
<indexes implementation="net.meisen.dissertation.impl.indexes.IndexFactory">
<!-- Define the different indexes to be used.
The following bitmap-indexes are available by default:
+ net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap
+ net.meisen.dissertation.impl.indexes.datarecord.slices.RoaringBitmap
The following implementations are by default available for specific
primitive data types:
+ net.meisen.dissertation.impl.indexes.FastUtilIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.FastUtilLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection
-->
<idx:config
bitmap="net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap"
byte="net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection"
short="net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection"
int="net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection"
long="net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection" />
</indexes>
<!-- Define the factory to be used to determine which MapperFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.mapper.MapperFactory
-->
<mappers implementation="net.meisen.dissertation.impl.time.mapper.MapperFactory">
<!-- The inheritance of the default mappers (default means in this case the ones
defined in the global configuration) can be true or false.
-->
<map:config inheritDefault="true">
<!--
Adds mappers to the default ones. The default ones cannot be removed and are:
+ net.meisen.dissertation.impl.time.mapper.DateMapper
+ net.meisen.dissertation.impl.time.mapper.LongMapper
-->
<map:mapper
implementation="net.meisen.dissertation.impl.time.mapper.DateMapper" />
<map:mapper
implementation="net.meisen.dissertation.impl.time.mapper.LongMapper" />
</map:config>
</mappers>
<!--
Adds schedules executed when the model is loaded. By default, the following
jobs are available:
+ net.meisen.dissertation.impl.scheduler.QueryJob
+ net.meisen.dissertation.impl.scheduler.SendEmailJob
-->
<schedules>
<schedule cron="*/15 4-16 * * 6,7"
implementation="net.meisen.dissertation.impl.scheduler.QueryJob">
<qj:query>SELECT COUNT(RECORDS) FROM myModel</qj:query>
</schedule>
<schedule event="core:query"
implementation="net.meisen.dissertation.impl.scheduler.SendEmailJob" />
</schedules>
</factories>
<!-- Define the pre-processor used to modify the incoming record.
The following implementations are available by default:
+ net.meisen.dissertation.impl.dataintegration.IdentityPreProcessor
+ net.meisen.dissertation.impl.dataintegration.ScriptPreProcessor
-->
<preprocessor
implementation="net.meisen.dissertation.impl.dataintegration.ScriptPreProcessor">
<spp:script language="javascript">
/*
 * Here is my script:
 * - the script gets the raw record injected as "raw"
 * - the script must set a result with an IDataRecord instance
 * - the script should not modify the raw record
 */
var result = raw;
</spp:script>
</preprocessor>
</config>
<time>
<timeline start="20.01.1981" duration="100" granularity="YEAR" />
</time>
<meta>
<!-- As identifier-factory the following implementations are available:
+ net.meisen.dissertation.impl.idfactories.IntegerIdsFactory
+ net.meisen.dissertation.impl.idfactories.LongIdsFactory
+ net.meisen.dissertation.impl.idfactories.UuIdsFactory
The null-attribute (true or false) defines whether null values are allowed
within the model.
The failonduplicates-attribute (true or false) specifies whether duplicates
are simply ignored or an exception is thrown.
-->
<descriptors>
<string id="R1" failonduplicates="true" null="false" name="person"
idfactory="net.meisen.dissertation.impl.idfactories.UuIdsFactory" />
<string id="R2" null="false" name="toy"
idfactory="net.meisen.dissertation.impl.idfactories.UuIdsFactory" />
<string id="R3" null="true" />
<string id="D1" name="funFactor" />
<integer id="D2" name="smiles" />
<string id="D3" />
<advDes:list id="D4"
idfactory="net.meisen.dissertation.impl.idfactories.LongIdsFactory" />
</descriptors>
<entries>
<entry descriptor="R1" value="Philipp" />
<entry descriptor="R1" value="Debbie" />
<entry descriptor="R1" value="Edison" />
<entry descriptor="R2" value="rattle" />
<entry descriptor="R2" value="teddy" />
<entry descriptor="R2" value="cup" />
<entry descriptor="R2" value="doll" />
<entry descriptor="D1" value="no" />
<entry descriptor="D1" value="low" />
<entry descriptor="D1" value="average" />
<entry descriptor="D1" value="high" />
<entry descriptor="D1" value="very high" />
<entry descriptor="D2" value="1" />
<entry descriptor="D2" value="2" />
<entry descriptor="D2" value="3" />
<entry descriptor="D2" value="4" />
<entry descriptor="D2" value="5" />
<entry descriptor="D3" value="Some Value" />
<entry descriptor="D4" value="A,B,C" />
<entry descriptor="D4" value="D,E,F,G,H" />
<entry descriptor="D4" value="I" />
</entries>
</meta>
<dim:dimensions>
<dim:dimension id="PERSON" descriptor="R1">
<dim:hierarchy id="GENDER" all="All Persons">
<dim:level id="GENDER">
<dim:member id="MALE" reg="Philipp|Edison" rollUpTo="*" />
<dim:member id="FEMALE" reg="Debbie" rollUpTo="*" />
</dim:level>
</dim:hierarchy>
</dim:dimension>
<dim:dimension id="TOY" descriptor="R2">
<dim:hierarchy id="TYPE" all="All Types">
<dim:level id="TYPE">
<dim:member id="WOOD" reg="rattle" rollUpTo="*" />
<dim:member id="STUFF" reg="teddy|doll" rollUpTo="*" />
<dim:member id="MISC" reg="cup" rollUpTo="*" />
</dim:level>
</dim:hierarchy>
</dim:dimension>
</dim:dimensions>
<!-- no data is added, so we don't need any structure -->
<structure />
<!-- MetaDataHandling (i.e. what has to be done if no Descriptor is available so far)
is optional and can be one of the following values (case-insensitive):
+ handleAsNull, null
+ createDescriptor, create, add
+ failOnError, fail
IntervalDataHandling (i.e. what has to be done if an interval-value is null):
+ boundariesWhenNull, boundaries
+ useOther, other, others
+ failOnNull, fail
-->
<data metahandling="create" intervalhandling="boundariesWhenNull" />
</model>
A Complete Sample Configuration-File
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config xmlns="http://dev.meisen.net/xsd/dissertation/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:idx="http://dev.meisen.net/xsd/dissertation/model/indexes"
xmlns:map="http://dev.meisen.net/xsd/dissertation/model/mapper"
xsi:schemaLocation="http://dev.meisen.net/xsd/dissertation/config
http://dev.meisen.net/xsd/dissertation/tidaConfig.xsd
http://dev.meisen.net/xsd/dissertation/model/indexes
http://dev.meisen.net/xsd/dissertation/tidaIndexFactory.xsd
http://dev.meisen.net/xsd/dissertation/model/mapper
http://dev.meisen.net/xsd/dissertation/tidaMapperFactory.xsd">
<!-- Define the location where the server stores its data -->
<location folder="_data" />
<auth>
<!-- Define the manager used for authentication.
The following manager implementations are available:
+ net.meisen.dissertation.impl.auth.AllAccessAuthManager
+ net.meisen.dissertation.impl.auth.shiro.ShiroAuthManager
-->
<manager implementation="net.meisen.dissertation.impl.auth.AllAccessAuthManager" />
</auth>
<caches>
<!-- Define the cache to be used for identifiers.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryIdentifierCache
+ net.meisen.dissertation.impl.cache.FileIdentifierCache
-->
<identifier implementation="net.meisen.dissertation.impl.cache.MemoryIdentifierCache" />
<!-- Define the cache to be used for metadata.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryMetaDataCache
+ net.meisen.dissertation.impl.cache.FileMetaDataCache
-->
<metadata implementation="net.meisen.dissertation.impl.cache.MemoryMetaDataCache" />
<!-- Define the cache to be used for bitmaps.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryBitmapCache
+ net.meisen.dissertation.impl.cache.MapDbBitmapCache
+ net.meisen.dissertation.impl.cache.FileBitmapCache
-->
<bitmap implementation="net.meisen.dissertation.impl.cache.MemoryBitmapCache" />
<!-- Define the cache to be used for fact-sets.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache
+ net.meisen.dissertation.impl.cache.FileFactDescriptorModelSetCache
-->
<factsets
implementation="net.meisen.dissertation.impl.cache.MemoryFactDescriptorModelSetCache" />
<!-- Define the cache to be used for records.
The following cache implementations are available:
+ net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache
+ net.meisen.dissertation.impl.cache.MemoryDataRecordCache
+ net.meisen.dissertation.impl.cache.MapDbDataRecordCache
-->
<records implementation="net.meisen.dissertation.impl.cache.IdsOnlyDataRecordCache" />
</caches>
<factories>
<!-- Define the factory to be used to determine which IndexFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.indexes.IndexFactory
-->
<indexes implementation="net.meisen.dissertation.impl.indexes.IndexFactory">
<!-- Define the different indexes to be used.
The following bitmap-indexes are available by default:
+ net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap
+ net.meisen.dissertation.impl.indexes.datarecord.slices.RoaringBitmap
The following implementations are by default available for specific
primitive data types:
+ net.meisen.dissertation.impl.indexes.FastUtilIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.FastUtilLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.HppcLongIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection
+ net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection
-->
<idx:config bitmap="net.meisen.dissertation.impl.indexes.datarecord.slices.EWAHBitmap"
byte="net.meisen.dissertation.impl.indexes.TroveByteIndexedCollection"
short="net.meisen.dissertation.impl.indexes.TroveShortIndexedCollection"
int="net.meisen.dissertation.impl.indexes.TroveIntIndexedCollection"
long="net.meisen.dissertation.impl.indexes.TroveLongIndexedCollection" />
</indexes>
<!-- Define the factory to be used to determine which MapperFactory
to be used.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.mapper.MapperFactory
-->
<mappers implementation="net.meisen.dissertation.impl.time.mapper.MapperFactory">
<map:config>
<!--
Adds mappers to the default ones. The default ones cannot be removed and are:
+ net.meisen.dissertation.impl.time.mapper.DateMapper
+ net.meisen.dissertation.impl.time.mapper.LongMapper
-->
<map:mapper implementation="net.meisen.dissertation.impl.time.mapper.DateMapper" />
<map:mapper implementation="net.meisen.dissertation.impl.time.mapper.LongMapper" />
</map:config>
</mappers>
<!-- Define the factory to be used to determine the granularity-factory.
The following factories are available by default:
+ net.meisen.dissertation.impl.time.granularity.TimeGranularityFactory
-->
<granularities
implementation="net.meisen.dissertation.impl.time.granularity.TimeGranularityFactory" />
<!-- Define the factory to be used to create queries.
The following factories are available by default:
+ net.meisen.dissertation.impl.parser.query.QueryFactory
-->
<queries implementation="net.meisen.dissertation.impl.parser.query.QueryFactory" />
</factories>
<!-- Adds additional aggregation-functions or overrides the ones available
by default.
The following aggregation-functions are added by default:
+ net.meisen.dissertation.impl.measures.Count
+ net.meisen.dissertation.impl.measures.Min
+ net.meisen.dissertation.impl.measures.Max
+ net.meisen.dissertation.impl.measures.Sum
+ net.meisen.dissertation.impl.measures.Mean
+ net.meisen.dissertation.impl.measures.Median
+ net.meisen.dissertation.impl.measures.Mode
-->
<aggregations>
<function implementation="net.meisen.dissertation.impl.measures.Count" />
</aggregations>
<!-- Adds additional templates. By default, the following are added (and
cannot be overridden):
+ net.meisen.dissertation.model.dimensions.templates.All
+ net.meisen.dissertation.model.dimensions.templates.Years
+ net.meisen.dissertation.model.dimensions.templates.Months
+ net.meisen.dissertation.model.dimensions.templates.Days
+ net.meisen.dissertation.model.dimensions.templates.Hours
+ net.meisen.dissertation.model.dimensions.templates.Minutes
+ net.meisen.dissertation.model.dimensions.templates.Seconds
+ net.meisen.dissertation.model.dimensions.templates.Rasters
-->
<timetemplates>
<template implementation="net.meisen.dissertation.model.dimensions.templates.Minutes" />
</timetemplates>
<!-- Specify the analysis techniques available -->
<analyses>
<analysis id="knn" implementation="net.meisen.dissertation.similarity.TidaKnnAnalysis" />
</analyses>
<!-- Server settings -->
<server>
<!-- the timeout is defined in minutes -->
<http port="7000" timeout="30" enable="true" docroot="docroot" />
<!-- the timeout is defined in milliseconds -->
<tsql port="7001" timeout="1800000" enable="true" />
<control port="7002" enable="false" />
</server>
</config>
Detailed Overview of the Runtime Performance
query  algorithm  avg [s]
#1a    TIDA       0.088
#1a    IntTreeB   0.093
#1a    Naive      3.297
#1a    IntTreeA   3.453
#1b    TIDA       0.095
#1b    IntTreeB   0.143
#1b    Naive      39.266
#1b    IntTreeA   39.453
#1c    TIDA       0.139
#1c    IntTreeB   0.812
#1c    Naive      633.422
#1c    IntTreeA   651.781
#2a    TIDA       0.011
#2a    IntTreeB   0.019
#2a    Naive      0.375
#2a    IntTreeA   0.375
#2b    TIDA       0.026
#2b    IntTreeB   0.059
#2b    IntTreeA   3.813
#2b    Naive      4.016
#2c    TIDA       0.063
#2c    IntTreeB   0.498
#2c    IntTreeA   42.328
#2c    Naive      43.234
#3a    TIDA       0.091
#3a    IntTreeB   0.093
#3a    IntTreeA   0.547
#3a    Naive      0.594
#3b    TIDA       0.103
#3b    IntTreeB   0.137
#3b    IntTreeA   5.109
#3b    Naive      5.125
#3c    TIDA       0.149
#3c    IntTreeB   0.549
#3c    IntTreeA   76.141
#3c    Naive      77.953
#4a    IntTreeB   0.074
#4a    TIDA       0.094
#4a    IntTreeA   0.516
#4a    Naive      0.547
#4b    IntTreeB   0.095
#4b    TIDA       0.105
#4b    IntTreeA   4.922
#4b    Naive      5.078
#4c    TIDA       0.169
#4c    IntTreeB   0.345
#4c    Naive      67.359
#4c    IntTreeA   70.891
#5a    IntTreeB   0.084
#5a    TIDA       0.087
#5a    IntTreeA   0.422
#5a    Naive      0.453
#5b    TIDA       0.099
#5b    IntTreeB   0.115
#5b    IntTreeA   4.172
#5b    Naive      4.234
#5c    TIDA       0.135
#5c    IntTreeB   0.384
#5c    IntTreeA   57.531
#5c    Naive      58.609
#6a    IntTreeB   0.002
#6a    Naive      0.016
#6a    IntTreeA   0.016
#6a    TIDA       0.020
#6b    IntTreeB   0.017
#6b    IntTreeA   0.031
#6b    TIDA       0.041
#6b    Naive      0.125
#6c    TIDA       0.163
#6c    IntTreeB   0.179
#6c    IntTreeA   0.359
#6c    Naive      1.203
#7     TIDA       0.088
#7     IntTreeB   0.267
#7     IntTreeA   353.813
#7     Naive      368.953
#8     TIDA       0.031
#8     IntTreeB   0.053
#8     IntTreeA   0.406
#8     Naive      0.469
#9     TIDA       0.272
#9     IntTreeB   0.473
#9     Naive      3.172
#9     IntTreeA   3.219
3-NN of the Temporal Relational Similarity
Bibliography
Abdelouarit, E.; El Merouani, M.; Medouri, A. (2013): Data Warehouse Tuning. The Supremacy
of Bitmap Index. In IJCA 79 (7), pp. 7–10. DOI: 10.5120/13751-1573.
Agarwal, S.; Agrawal, R.; Deshpande, P.; Gupta, A.; Naughton, J. F.; Ramakrishnan, R.; Sara-
wagi, S. (1996): On the Computation of Multidimensional Aggregates. In: Proceedings of the
22th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc (VLDB ’96), pp. 506–521.
Aigner, W.; Federico, P.; Gschwandtner, T.; Miksch, S.; Rind, A. (Eds.) (2012): Chal-
lenges of Time-oriented Data in Visual Analytics for Healthcare. IEEE VisWeek Workshop on
Visual Analytics in Healthcare (VAHC). Seattle, USA: IEEE.
Aigner, W.; Miksch, S.; Müller, W.; Schumann, H.; Tominski, C. (2007): Visualizing time-ori-
ented data - A systematic view. In Computers & Graphics 31 (3), pp. 401–409. DOI:
10.1016/j.cag.2007.01.030.
Aigner, W.; Miksch, S.; Schumann, H.; Tominski, C. (2011): Visualization of Time-Oriented
Data. Guildford, Surrey: Springer London (Human-Computer Interaction Series).
Allen, J. F. (1983): Maintaining Knowledge about Temporal Intervals. In Communications of
the ACM 26 (11), pp. 832–843. DOI: 10.1145/182.358434.
Alur, R.; Henzinger, T. A. (1992): Logics and models of real time: A survey. In J. W. de Bakker,
C. Huizing, W. P. de Roever, G. Rozenberg (Eds.): Real-Time: Theory in Practice, vol. 600:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 74–106.
Al-Zoubi, H.; Milenkovic, A.; Milenkovic, M. (2004): Performance evaluation of Cache Replace-
ment Policies for the SPEC CPU2000 Benchmark Suite. In S.-M. Yoo, L. H. Etzkorn (Eds.):
42nd annual Southeast Regional Conference. Huntsville, Alabama, pp. 267–272.
Apache Shiro Group (2015): Apache Shiro. Java Security Framework. Available online at
http://shiro.apache.org/, updated on 6/12/2015, checked on 6/12/2015.
Arroyo, J.; González-Rivera, G.; Maté, C. (2010): Forecasting with Interval and Histogram
Data. In A. Ullah, D. Giles (Eds.): Handbook of Empirical Economics and Finance, vol.
20103666: Chapman and Hall/CRC (Statistics: A Series of Textbooks and Monographs),
pp. 247–279.
Batal, I.; Valizadegan, H.; Cooper, G. F.; Hauskrecht, M. (2011): A Pattern Mining Approach
for Classifying Multivariate Temporal Data. In F.-X. Wu (Ed.): IEEE International Conference
on Bioinformatics and Biomedicine (BIBM 2011), vol. 2011. IEEE International Conference on
Bioinformatics and Biomedicine (BIBM). Atlanta, Georgia, USA, 12 - 15 Nov. 2011. Pisca-
taway, NJ: IEEE, pp. 358–365.
Bayer, R.; McCreight, E. M. (1972): Organization and Maintenance of Large Ordered Indexes.
In Acta Informatica 1 (3), pp. 173–189. DOI: 10.1007/BF00288683.
Bębel, B.; Morzy, M.; Morzy, T.; Królikowski, Z.; Wrembel, R. (2012): OLAP-Like Analysis of
Time Point-Based Sequential Data. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F.
Mattern, J. C. Mitchell et al. (Eds.): Advances in Conceptual Modeling, vol. 7518. Berlin, Hei-
delberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 153–161.
Bentley, J. L. (1977): Solutions to Klee’s Rectangle Problems. In Unpublished manuscript,
Dept of Computer Science, Carnegie-Mellon University, Pittsburgh PA.
Berendt, B. (1996): Explaining Preferred Mental Models in Allen Inferences with a Metrical
Model of Imagery. In G. W. Cottrell (Ed.): Proceedings of the 18th Annual Conference of the
Cognitive Science Society. 18th Annual Conference of the Cognitive Science Society: Law-
rence Erlbaum, pp. 489–494.
Berg, M. de; Cheong, O.; van Kreveld, M.; Overmars, M. (2008): More Geometric Data Struc-
tures. In M. de Berg, O. Cheong, M. van Kreveld, M. Overmars (Eds.): Computational Geom-
etry. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 219–241.
Bergeron, M.; Conklin, D. (2011): Subsumption of Vertical Viewpoint Patterns. In C. Agon, M.
Andreatta, G. Assayag, E. Amiot, J. Bresson, J. Mandereau (Eds.): Mathematics and Compu-
tation in Music, vol. 6726: Springer Berlin Heidelberg (Lecture Notes in Computer Science),
pp. 1–12.
Böhlen, M.; Gamper, J.; Jensen, C. S. (2006): Multi-dimensional Aggregation for Temporal
Data. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al.
(Eds.): Advances in Database Technology - EDBT 2006, vol. 3896. Berlin, Heidelberg:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 257–275.
Böhlen, M. H.; Gamper, J.; Jensen, C. S. (2008): Towards General Temporal Aggregation. In
A. Gray, K. Jeffery, J. Shao (Eds.): Sharing Data, Information and Knowledge, vol. 5071. Berlin,
Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 257–269.
Böhlen, M. H.; Jensen, C. S.; Snodgrass, R. T. (1995): A Seamless Integration of Time
into SQL. Technical Report. Aalborg University (TRR-96-2049).
Böhlen, M. H.; Jensen, C. S.; Snodgrass, R. T. (2000): Temporal statement modifiers. In ACM
Trans. Database Syst. 25 (4), pp. 407–456. DOI: 10.1145/377674.377665.
Moon, B.; Vega Lopez, I. F.; Immanuel, V. (2003): Efficient Algorithms for large-scale Tem-
poral Aggregation. In IEEE Trans. Knowl. Data Eng. 15 (3), pp. 744–759. DOI:
10.1109/TKDE.2003.1198403.
Boonstra-Hörwein, K.; Punzengruber, D.; Gärtner, J. (2011): Reducing understaffing and shift
work with Temporal Profile Optimization (TPO). In Applied Ergonomics 42 (2), pp. 233–237.
DOI: 10.1016/j.apergo.2010.06.008.
Wooldridge, B. (2015): JMH benchmarks for JDBC Connection Pools. GitHub.com. Availa-
ble online at https://github.com/brettwooldridge/HikariCP-benchmark, updated on 5/26/2015,
checked on 6/12/2015.
Carmel, E. (1999): Global software teams. Collaborating across borders and time zones. Up-
per Saddle River, NJ: Prentice Hall.
Catarci, T.; Santucci, G. (1995): Diagrammatic vs Textual Query Languages: A Comparative
Experiment. In S. Spaccapietra, R. Jain (Eds.): Visual Database Systems 3. Boston, MA:
Springer US, pp. 69–83.
Celko, J. (2006): Joe Celko's analytics and OLAP in SQL. San Francisco, Calif., Oxford: Mor-
gan Kaufmann; Elsevier Science [distributor] (The Morgan Kaufmann series in data manage-
ment systems).
Chamberlin, D. D.; Boyce, R. F. (1976): SEQUEL: A Structured English Query Language. In G.
Altshuler, R. Rustin, B. Plagman (Eds.): ACM SIGFIDET (now SIGMOD) workshop. Not
Known, pp. 249–264.
Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. (2015): Better bitmap performance with Roaring
bitmaps. In Softw. Pract. Exper. DOI: 10.1002/spe.2325.
Chan, C.-Y.; Ioannidis, Y. E. (1998): Bitmap Index Design and Evaluation. In L. Haas, P. Drew,
A. Tiwary, M. Franklin (Eds.): ACM SIGMOD International Conference. Seattle, Washington,
United States, pp. 355–366.
Chan, C.-Y.; Ioannidis, Y. E. (1999): An Efficient Bitmap Encoding Scheme for Selection Que-
ries. In S. B. Davidson, C. Faloutsos (Eds.): ACM SIGMOD International Conference. Phila-
delphia, Pennsylvania, United States, pp. 215–226.
Chaudhuri, S.; Dayal, U. (1997): An overview of data warehousing and OLAP technology. In
SIGMOD Rec. 26 (1), pp. 65–74. DOI: 10.1145/248603.248616.
Chen, Y.-C.; Peng, W.-C.; Lee, S.-Y. (2011): CEMiner - An Efficient Algorithm for Mining Closed
Patterns from Time Interval-Based Data. In: IEEE 11th International Conference on Data Min-
ing (ICDM 2011). Vancouver, BC, Canada, pp. 121–130.
Christie, R. D. (2003): Statistical classification of major event days in distribution system reli-
ability. In IEEE Trans. Power Delivery 18 (4), pp. 1336–1341. DOI:
10.1109/TPWRD.2003.810491.
Chui, C. K.; Kao, B.; Lo, E.; Cheung, D. (2010): S-OLAP: An OLAP system for analyzing se-
quence data. In A. Elmagarmid, D. Agrawal (Eds.): Proceedings of the 2010 ACM SIGMOD
International Conference on Management of Data. Indianapolis, Indiana, USA, pp. 1131–
1134.
Colantonio, A.; Di Pietro, R. (2010): Concise: Compressed ‘n’ Composable Integer Set. In In-
formation Processing Letters 110 (16), pp. 644–650. DOI: 10.1016/j.ipl.2010.05.018.
Combi, C.; Gozzi, M.; Juarez, J. M.; Marin, R.; Oliboni, B. (2007): Querying Clinical Workflows
by Temporal Similarity. In R. Bellazzi, A. Abu-Hanna, J. Hunter (Eds.): Artificial Intelligence in
Medicine, vol. 4594: Springer Berlin Heidelberg (Lecture Notes in Computer Science),
pp. 469–478.
Codd, E. F.; Codd, S. B.; Salley, C. T. (1993): Providing OLAP (On-line Analytical Processing)
to User-analysts: An IT Mandate. (White Paper). Available online at http://www.minet.uni-
jena.de/dbis/lehre/ss2005/sem_dwh/lit/Cod93.pdf, checked on 5/12/2015.
Cuzzocrea, A. (2011): Retrieving Accurate Estimates to OLAP Queries over Uncertain and
Imprecise Multidimensional Data Streams. In D. Hutchison, T. Kanade, J. Kittler, J. M. Klein-
berg, F. Mattern, J. C. Mitchell et al. (Eds.): Scientific and Statistical Database Management,
vol. 6809. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Sci-
ence), pp. 575–576.
Deliège, F.; Pedersen, T. B. (2010): Position List Word Aligned Hybrid: Optimizing Space and
Performance for Compressed Bitmaps. In I. Manolescu, S. Spaccapietra, J. Teubner, M.
Kitsuregawa, A. Leger, F. Naumann et al. (Eds.): 13th International Conference. Lausanne,
Switzerland, pp. 228–239.
DeWitt, D. J.; Katz, R. H.; Olken, F.; Shapiro, L. D.; Stonebraker, M. R.; Wood, D. (1984): Im-
plementation Techniques for Main Memory Database Systems. In D. Smith, B. Yormark (Eds.):
ACM SIGMOD International Conference. Boston, Massachusetts, p. 1.
Dignös, A.; Böhlen, M. H.; Gamper, J. (2014): Overlap Interval Partition Join. In C. Dyreson, F.
Li, M. T. Özsu (Eds.): ACM SIGMOD International Conference. Snowbird, Utah, USA,
pp. 1459–1470.
Dodge, Y.; Marriott, F. H. C. (2006): The Oxford dictionary of statistical terms. 6th ed. Oxford,
New York: Oxford University Press.
Dyreson, C.; Grandi, F.; Käfer, W.; Kline, N.; Lorentzos, N.; Mitsopoulos, Y. et al. (1994): A
Consensus Glossary of Temporal Database Concepts. In SIGMOD Rec 23 (1), pp. 52–64.
DOI: 10.1145/181550.181560.
Edelsbrunner, H.; Maurer, H. A. (1981): On the Intersection of Orthogonal Objects. In Infor-
mation Processing Letters 13 (4, 5), pp. 177–181.
Enderle, J.; Hampel, M.; Seidl, T. (2004): Joining Interval Data in Relational Databases. In P.
Valduriez, G. Weikum, A. C. König, S. Dessloch (Eds.): ACM SIGMOD International Con-
ference. Paris, France, pp. 683–694.
Espinosa, J. A.; Nan, N.; Carmel, E. (2007): Do Gradations of Time Zone Separation Make
a Difference in Performance? A First Laboratory Study. In: Second IEEE International
Conference on Global Software Engineering (ICGSE 2007), pp. 12–22.
Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. (1996): The KDD process for extracting useful
knowledge from volumes of data. In Communications of the ACM 39 (11), pp. 27–34. DOI:
10.1145/240455.240464.
Fricker, D.; Zhang, H.; Yu, C. (2011): Sequential Pattern Mining of Multimodal Data Streams in
Dyadic Interactions. In: 2011 IEEE Joint International Conference on Development and Learn-
ing and Epigenetic Robotics. Frankfurt am Main, Germany, 24-27 August 2011. Piscataway:
IEEE, pp. 1–6.
Frühwirth, T. (1996): Temporal Annotated Constraint Logic Programming. In Journal of Sym-
bolic Computation 22 (5–6), pp. 555–583. DOI: 10.1006/jsco.1996.0066.
Garcia-Molina, H.; Ullman, J. D.; Widom, J. (2014): Database Systems. The Complete Book.
Second edition. Harlow: Pearson Education Limited.
Goodchild, M. F. (1987): A Spatial Analytical Perspective on Geographical Information Sys-
tems. In International journal of geographical information systems 1 (4), pp. 327–334. DOI:
10.1080/02693798708927820.
Goodrich, M. T.; Tamassia, R. (2006): Data Structures and Algorithms in Java. 4th ed. Hobo-
ken, NJ: Wiley.
Gordevicius, J.; Gamper, J.; Böhlen, M. H. (2012): Parsimonious Temporal Aggregation. In The
VLDB Journal 21 (3), pp. 309–332. DOI: 10.1007/s00778-011-0243-9.
Grabbe, S. R.; Sridhar, B.; Mukherjee, A. (2014): Clustering Days with Similar Airport Weather
Conditions. In: 14th AIAA Aviation Technology, Integration, and Operations Conference. At-
lanta, GA.
Gui, H.; Au, G.; Bouloy, C. (2011): Aggregate Join Index Utilization in Query Processing:
Google Patents. Available online at https://www.google.com/patents/US7912833.
Guo, H.; Tang, Y.; Yang, X.; Ye, X. (2010): Improvement and Extension to ATSQL2. In Y. Tang,
X. Ye, N. Tang (Eds.): Temporal Information Processing Technology and Its Application. Berlin,
Heidelberg: Springer Berlin Heidelberg, pp. 245–259.
Gupta, A.; Mumick, I. S. (Eds.) (1999): Materialized Views. Cambridge, MA, USA: MIT Press.
Guttman, A. (1984): R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD
Rec. 14 (2), pp. 47–57. DOI: 10.1145/971697.602266.
Guyet, T.; Quiniou, R. (2008): Mining Temporal Patterns with Quantitative Intervals. In: 2008
IEEE International Conference on Data Mining Workshops (ICDMW). Pisa, Italy, pp. 218–227.
Han, J.; Lakshmanan, L.; Ng, R. T. (1999): Constraint-based, multidimensional data mining. In
Computer 32 (8), pp. 46–50. DOI: 10.1109/2.781634.
Handy, J. (1998): The Cache Memory Book. 2nd ed. San Diego: Academic Press.
Hashemi, A. H.; Kaeli, D. R.; Calder, B. (1997): Efficient procedure mapping using cache line
coloring. In M. Chen, R. K. Cytron, A. M. Berman (Eds.): ACM SIGPLAN 1997 conference. Las
Vegas, Nevada, United States, pp. 171–182.
Heuer, R. J., Jr.; Pherson, R. H. (2014): Structured Analytic Techniques for Intelligence Anal-
ysis. CQ Press.
Hu, Y.-H.; Cheng, C.; Wu, F.; Yang, C.-I. (2010): Mining Multi-level Time-interval Sequential
Patterns in Sequence Databases. In G. Kou (Ed.): 2nd International Conference on Software
Engineering and Data Mining (SEDM 2010). Chengdu, China, 23-25 June 2010. Piscataway,
NJ: IEEE, pp. 416–421.
Hudry, J. L. (2004): Is Time in Physics Discrete, Dense, or Continuous? In: Proceedings of the
First International Conference on the Ontology of Spacetime. Montréal, Canada, May 11-14,
2004.
Hutchison, D.; Kanade, T.; Kittler, J.; Kleinberg, J. M.; Mattern, F.; Mitchell, J. C. et al. (Eds.)
(2006): Data Warehousing and Knowledge Discovery. Berlin, Heidelberg: Springer Berlin Hei-
delberg (Lecture Notes in Computer Science).
IBM Corporation (2013): Descriptive, predictive, prescriptive: Transforming asset and facilities
management with analytics. Choose the right data analytics solutions to boost service quality,
reduce operating costs and build ROI. (White paper (external)-USEN). IBM Corporation. Avail-
able online at http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=TIW14162USEN,
updated on October 2013, checked on 5/7/2015.
Jones, R.; Hosking, A.; Moss, E. (2012): The Garbage Collection Handbook. The Art of Auto-
matic Memory Management. Boca Raton, FL: CRC Press (Chapman & Hall/CRC applied al-
gorithms and data structures series).
Kamaev, V.; Finogeev, A.; Finogeev, A.; Shevchenko, S. (2014): Knowledge Discovery in the
SCADA Databases Used for the Municipal Power Supply System. In A. Kravets, M. Shcher-
bakov, M. Kultsova, T. Iijima (Eds.): Knowledge-Based Software Engineering, vol. 466. Cham:
Springer International Publishing (Communications in Computer and Information Science),
pp. 1–14.
Kaser, O.; Lemire, D. (2014): Compressed Bitmap Indexes: Beyond Unions and Intersections.
In CoRR abs/1402.4466.
Keim, D. (2010): Mastering the information age. Solving problems with visual analytics. Goslar:
Eurographics Association.
Keogh, E.; Ratanamahatana, C. A. (2005): Exact Indexing of Dynamic Time Warping. In
Knowledge and Information Systems 7 (3), pp. 358–386. DOI: 10.1007/s10115-004-0154-9.
Kimball, R.; Ross, M. (2002): The Data Warehouse Toolkit. The complete guide to dimensional
modeling. 2nd ed. New York: Wiley.
Kline, N.; Snodgrass, R. T. (1995): Computing Temporal Aggregates. In: Eleventh International
Conference on Data Engineering. Taipei, Taiwan, 6-10 March 1995, pp. 222–231.
Koncilia, C.; Morzy, T.; Wrembel, R.; Eder, J. (2014): Interval OLAP: Analyzing Interval Data.
In L. Bellatreche, M. K. Mohania (Eds.): Data Warehousing and Knowledge Discovery, vol.
8646. Cham: Springer International Publishing (Lecture Notes in Computer Science),
pp. 233–244.
Kostakis, O.; Papapetrou, P.; Hollmén, J. (2011): ARTEMIS: Assessing the Similarity of Event-
Interval Sequences. In D. Gunopulos, T. Hofmann, D. Malerba, M. Vazirgiannis (Eds.): Machine
Learning and Knowledge Discovery in Databases, vol. 6912: Springer Berlin Heidelberg (Lec-
ture Notes in Computer Science), pp. 229–244.
Kotsifakos, A.; Papapetrou, P.; Athitsos, V. (2013): IBSM: Interval-Based Sequence Matching.
In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 596–604.
Kranjec, A.; Chatterjee, A. (2010): Are temporal concepts embodied? A challenge for cognitive
neuroscience. In Front Psychol 1, p. 240. DOI: 10.3389/fpsyg.2010.00240.
Kriegel, H.-P.; Pötke, M.; Seidl, T. (2001): Object-Relational Indexing for General Interval Re-
lationships. In G. Goos, J. Hartmanis, J. van Leeuwen, C. S. Jensen, M. Schneider, B. Seeger,
V. J. Tsotras (Eds.): Advances in Spatial and Temporal Databases, vol. 2121. Berlin, Heidel-
berg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 522–542.
Kuhn, H. W. (1955): The Hungarian method for the assignment problem. In Naval Research
Logistics Quarterly 2 (1-2), pp. 83–97. DOI: 10.1002/nav.3800020109.
Lammarsch, T.; Aigner, W.; Bertone, A.; Gärtner, J.; Mayr, E.; Miksch, S.; Smuc, M. (2009):
Hierarchical Temporal Patterns and Interactive Aggregated Views for Pixel-Based Visualiza-
tions. In: 13th International Conference Information Visualisation. Barcelona, Spain, pp. 44–
50.
Laxman, S.; Sastry, P. S. (2006): A survey of temporal data mining. In SADHANA - Academy
Proceedings in Engineering Sciences 31 (2), pp. 173–198.
Laxman, S.; Sastry, P. S.; Unnikrishnan, K. P. (2007): A fast algorithm for finding frequent epi-
sodes in event streams. In P. Berkhin, R. Caruana, X. Wu (Eds.): 13th ACM SIGKDD Interna-
tional Conference. San Jose, California, USA, pp. 410–419.
Lemire, D.; Kaser, O. (2011): Reordering Columns for Smaller Indexes. In Information Sci-
ences 181 (12), pp. 2550–2570. DOI: 10.1016/j.ins.2011.02.002.
Lemire, D.; Kaser, O.; Aouiche, K. (2010): Sorting improves word-aligned bitmap indexes. In
Data & Knowledge Engineering 69 (1), pp. 3–28. DOI: 10.1016/j.datak.2009.08.006.
Lenz, H.-J.; Shoshani, A. (1997): Summarizability in OLAP and statistical data bases. In: Pro-
ceedings of the 9th International Conference on Scientific and Statistical Database Manage-
ment, pp. 132–143.
Liu, H.; Hussain, F.; Tan, C.; Dash, M. (2002): Discretization: An Enabling Technique. In Data
Mining and Knowledge Discovery 6 (4), pp. 393–423. DOI: 10.1023/A:1016304305535.
Liu, M.; Rundensteiner, E.; Greenfield, K.; Gupta, C.; Wang, S.; Ari, I.; Mehta, A. (2011): E-
Cube: multi-dimensional event sequence analysis using hierarchical pattern query sharing. In
T. Sellis, R. J. Miller (Eds.): Proceedings of the 2011 ACM SIGMOD International Conference
on Management of data. Athens, Greece, pp. 889–900.
Liu, Z.; Jiang, B.; Heer, J. (2013): imMens. Real-time Visual Querying of Big Data. In Computer
Graphics Forum 32 (3pt4), pp. 421–430. DOI: 10.1111/cgf.12129.
Lorentzos, N. A.; Mitsopoulos, Y. G. (1997): SQL Extension for Interval Data. In IEEE Trans.
Knowl. Data Eng. 9 (3), pp. 480–499. DOI: 10.1109/69.599935.
Mansmann, S.; Scholl, M. H. (2006): Extending Visual OLAP for Handling Irregular Dimen-
sional Hierarchies. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mit-
chell et al. (Eds.): Data Warehousing and Knowledge Discovery, vol. 4081. Berlin, Heidelberg:
Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp. 95–105.
Martin, R. C. (2009): Clean Code. A Handbook of Agile Software Craftsmanship. Upper Saddle
River, NJ: Prentice Hall (Robert C. Martin series).
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2008): Solving Summarizability Problems in Fact-
Dimension Relationships for Multidimensional Models. In: Proceedings of the ACM 11th Inter-
national Workshop on Data Warehousing and OLAP. Napa Valley, California, USA. New York,
NY, USA: ACM (DOLAP ’08), pp. 57–64.
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2009): A survey on summarizability issues in mul-
tidimensional modeling. In Data & Knowledge Engineering 68 (12), pp. 1452–1469. DOI:
10.1016/j.datak.2009.07.010.
Mazón, J.-N.; Lechtenbörger, J.; Trujillo, J. (2011): A Model-Driven Approach for Enforcing
Summarizability in Multidimensional Modeling. In D. Hutchison, T. Kanade, J. Kittler, J. M.
Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.): Advances in Conceptual Modeling. Recent
Developments and New Directions, vol. 6999. Berlin, Heidelberg: Springer Berlin Heidelberg
(Lecture Notes in Computer Science), pp. 65–74.
Meisen, P.; Keng, D.; Meisen, T.; Recchioni, M. (2015a): TIDAQL: A Query Language enabling
On-line Analytical Processing of Time Interval Data. In: Proceedings of the 17th International
Conference on Enterprise Information Systems (ICEIS 2015). Barcelona, Spain, 27-30 April
2015. INSTICC.
Meisen, P.; Keng, D.; Meisen, T.; Recchioni, M.; Jeschke, S. (2015b): Bitmap-Based On-Line
Analytical Processing of Time Interval Data. In Shahram Latifi (Ed.): Proceedings of the 12th
International Conference on Information Technology: New Generations (ITNG), 2015.
Meisen, P.; Recchioni, M.; Meisen, T.; Schilberg, D.; Jeschke, S. (2014): Modeling and pro-
cessing of time interval data for data-driven decision support. In: 2014 IEEE International
Conference on Systems, Man and Cybernetics (SMC). San Diego, CA, USA, October 5-8,
2014, pp. 2946–2953.
Meisen, T.; Meisen, P.; Schilberg, D.; Jeschke, S. (2012): Adaptive Information Integration:
Bridging the Semantic Gap between Numerical Simulations. In W. van der Aalst, J. Mylopou-
los, M. Rosemann, M. J. Shaw, C. Szyperski, R. Zhang et al. (Eds.): Enterprise Information
Systems, vol. 102. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture notes in business
information processing), pp. 51–65.
Mendoza, M.; Alegría, E.; Maca, M.; Cobos, C.; León, E. (2015): Multidimensional analysis
model for a document warehouse that includes textual measures. In Decision Support Sys-
tems 72, pp. 44–59. DOI: 10.1016/j.dss.2015.02.008.
Merriam-Webster (2015): Analysis - merriam-webster.com. Available online at
http://www.merriam-webster.com/dictionary/analysis, checked on 4/19/2015.
Moerchen, F. (2006a): Algorithms for Time Series Knowledge Mining. In: Proceedings of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New
York, NY, USA: ACM (KDD ’06), pp. 668–673.
Moerchen, F. (2006b): Time series knowledge mining. Marburg: Görich & Weiershäuser (Wis-
senschaft in Dissertationen, 813).
Moerchen, F. (2009): Tutorial CIDM-T: Temporal pattern mining in symbolic time point and time
interval data. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09),
p. xiv.
Moerchen, F.; Fradkin, D. (2010): Robust mining of time intervals with semi-interval partial
order patterns. In S. Parthasarathy, B. Liu, B. Goethals, J. Pei, C. Kamath (Eds.): Proceedings
of the 2010 SIAM International Conference on Data Mining. Philadelphia, PA: Society for In-
dustrial and Applied Mathematics, pp. 315–326.
Niemi, T.; Niinimäki, M.; Thanisch, P.; Nummenmaa, J. (2014): Detecting summarizability in
OLAP. In Data & Knowledge Engineering 89, pp. 1–20. DOI: 10.1016/j.datak.2013.11.001.
Oracle Corporation (2015): Oracle Technology Global Price List. Software Investment Guide.
Oracle Corporation. Available online at http://www.oracle.com/us/corporate/pricing/
technology-price-list-070617.pdf, updated on April 2015, checked on 6/14/2015.
Ossimitz, G.; Mrotzek, M. (2008): The Basics of System Dynamics: Discrete vs. Continuous
Modelling of Time. In B. G. Dangerfield (Ed.): Proceedings of the 2008 International Confer-
ence of the System Dynamics Society.
Wong, P. C.; Thomas, J. (2004): Visual Analytics. In IEEE Computer Graphics and Applications
24 (5), pp. 20–21. DOI: 10.1109/MCG.2004.39.
Papapetrou, P.; Kollios, G.; Sclaroff, S. (2005): Discovering frequent arrangements of temporal
intervals. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM’05):
IEEE Press, pp. 354–361.
Papapetrou, P.; Kollios, G.; Sclaroff, S.; Gunopulos, D. (2009): Mining frequent arrangements
of temporal intervals. In Knowledge and Information Systems 21 (2), pp. 133–171. DOI:
10.1007/s10115-009-0196-0.
Paramonov, V.; Fedorov, R.; Ruzhnikov, G.; Shumilov, A. (2013): Web-Based Analytical Infor-
mation System for Spatial Data Processing. In T. Skersys, R. Butleris, R. Butkiene (Eds.): In-
formation and Software Technologies, vol. 403. Berlin, Heidelberg: Springer Berlin Heidelberg
(Communications in Computer and Information Science), pp. 93–101.
Pascoe, C. (2011): Time, technology and leaping seconds. Available online at
http://googleblog.blogspot.de/2011/09/time-technology-and-leaping-seconds.html, updated
on 9/15/2011, checked on 4/24/2015.
Pedersen, T. B.; Jensen, C. S.; Dyreson, C. E. (1999): Extending Practical Pre-Aggregation in
On-Line Analytical Processing. In M. Atkinson (Ed.): Proceedings of the Twenty-fifth Interna-
tional Conference on Very Large Databases, Edinburgh, Scotland, UK, 7-10 September, 1999.
Orlando, FL: Morgan Kaufmann, pp. 663–674.
Pedersen, T. B. (2000): Aspects of Data Modeling and Query Processing for Complex Multidi-
mensional Data. Ph.D. thesis, 4 volumes. Aalborg: Aalborg Universitetsforlag, Department of
Computer Science.
Peter, S.; Höppner, F. (2010): Finding Temporal Patterns Using Constraints on (Partial) Ab-
sence, Presence and Duration. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mat-
tern, J. C. Mitchell et al. (Eds.): Knowledge-Based and Intelligent Information and Engineering
Systems, vol. 6276. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Compu-
ter Science), pp. 442–451.
Power, D. (2001): What are Analytical Information Systems? In DSS News 2 (22).
Power, D. (2012): What is analytics? In DSS News 13 (18).
Rego, H.; Mendes, A. B.; Guerra, H. (2015): A Decision Support System for Municipal Budget
Plan Decisions. In A. Rocha, A. M. Correia, S. Costanzo, L. P. Reis (Eds.): New Contributions
in Information Systems and Technologies, vol. 354. Cham: Springer International Publishing
(Advances in Intelligent Systems and Computing), pp. 129–139.
Rind, A.; Lammarsch, T.; Aigner, W.; Alsallakh, B.; Miksch, S. (2013): TimeBench: A Data
Model and Software Library for Visual Analytics of Time-Oriented Data. In IEEE Trans Vis
Comput Graph 19 (12), pp. 2247–2256. DOI: 10.1109/TVCG.2013.206.
Roddick, J. F.; Mooney, C. H. (2005): Linear temporal sequences and their interpretation using
midpoint relationships. In IEEE Transactions on Knowledge and Data Engineering 17 (1),
pp. 133–135. DOI: 10.1109/TKDE.2005.12.
Roh, J.-w.; Hwang, S.-w.; Yi, B.-K. (2012): Efficient bitmap-based Indexing of time-based In-
terval Sequences. In Information Sciences 194, pp. 38–56. DOI: 10.1016/j.ins.2011.08.013.
Weiss, R. (2012): A Technical Overview of the Oracle Exadata Database Machine and
Exadata Storage Server. (White Paper). Oracle Corporation. Available online at
http://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-
134575.pdf, updated on June 2012, checked on 6/14/2015.
Russo, M.; Ferrari, A. (2011): The Many-to-Many Revolution 2.0. (White Paper) (Version 2.0,
Revision 1). Available online at http://www.sqlbi.com/wp-content/uploads/The_Many-to-
Many_Revolution_2.0.pdf, updated on 10/10/2011, checked on 5/11/2015.
Sadasivam, R.; Duraiswamy, K. (2013): Efficient approach to discover interval-based sequen-
tial patterns. In Journal of Computer Science 9 (2), pp. 225–234. DOI:
10.3844/jcssp.2013.225.234.
Schutt, R.; O'Neil, C. (2014): Doing data science. First edition. Sebastopol, CA: O'Reilly Media,
Inc.
Shneiderman, B. (1996): The Eyes Have It: A Task by Data Type Taxonomy for Information
Visualizations. In: IEEE Symposium on Visual Languages. Boulder, CO, USA, 3-6 Sept. 1996,
pp. 336–343.
Snodgrass, R. T. (1995): The TSQL2 Temporal Query Language. New York: Springer.
Song, I.-Y.; Medsker, C.; Ewen, E.; Rowen, W. (2001): An Analysis of Many-to-Many Relation-
ships Between Fact and Dimension Tables in Dimensional Modeling. In: Proceedings of the
Int’l Workshop on Design and Management of Data Warehouses, p. 6.
Sorin, D. J.; Hill, M. D.; Wood, D. A. (2011): A Primer on Memory Consistency and Cache
Coherence. San Rafael, CA: Morgan & Claypool Publishers (Synthesis Lectures on Computer
Architecture, 16).
Spaccapietra, S.; Zimányi, E.; Song, I.-Y. (2009): Journal on data semantics XIII. Berlin:
Springer-Verlag (Lecture Notes in Computer Science, 5530).
Spofford, G. (2006): MDX Solutions with Microsoft SQL Server Analysis Services 2005 and
Hyperion Essbase. 2nd ed. Indianapolis, IN: Wiley Pub.
Stockinger, K.; Wu, K.; Shoshani, A. (2004): Evaluation Strategies for Bitmap Indices with Bin-
ning. In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.):
Database and Expert Systems Applications, vol. 3180. Berlin, Heidelberg: Springer Berlin
Heidelberg (Lecture Notes in Computer Science), pp. 120–129.
Stroh, F.; Winter, R.; Wortmann, F. (2011): Method Support of Information Requirements Anal-
ysis for Analytical Information Systems. In Bus Inf Syst Eng 3 (1), pp. 33–43. DOI:
10.1007/s12599-010-0138-0.
Tao, Y.; Papadias, D.; Faloutsos, C. (2004): Approximate temporal aggregation. In: Proceed-
ings. 20th International Conference on Data Engineering. Boston, MA, USA, 30 March-2 April
2004, pp. 190–201.
Teiken, Y. (2012): Automatic model driven analytical information systems. Berlin: Logos-Verl.
Thomas, J. J.; Cook, K. A. (2005): Illuminating the Path. The Research and Development
Agenda for Visual Analytics. Los Alamitos, CA: IEEE Computer Society.
Toman, D. (2000): SQL/TP: A Temporal Extension of SQL. In G. Kuper, L. Libkin, J. Paredaens
(Eds.): Constraint Databases. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 391–399.
van Schaik, S. J.; Moor, O. de (2011): A memory efficient reachability data structure through
bit vector compression. In T. Sellis, R. J. Miller (Eds.): Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data. Athens, Greece, pp. 913–924.
van Wijk, J. J.; van Selow, E. R. (1999): Cluster and calendar based visualization of time series
data. In: IEEE Symposium on Information Visualization (InfoVis'99). San Francisco, CA, USA,
24-29 Oct. 1999, pp. 4–9.
Wang, Y.; Ye, X. (2014): Index-Based OLAP Aggregation for In-Memory Cluster Computing.
In: 2014 International Conference on Cloud Computing and Big Data (CCBD). Wuhan, China,
pp. 148–151.
Whibberley, P. B.; Davis, J. A.; Shemar, S. L. (2011): Local representations of UTC in national
laboratories. In Metrologia 48 (4), pp. S154–S164. DOI: 10.1088/0026-1394/48/4/S05.
White, C. (2005): Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated
Enterprise. TDWI (November). Available online at http://download.101com.com/tdwi/
research_report/DIRR_Report.pdf, updated on November 2005, checked on 5/12/2015.
Winarko, E.; Roddick, J. F. (2007): ARMADA – An algorithm for discovering richer relative
temporal association rules from interval-based data. In Data & Knowledge Engineering 63 (1),
pp. 76–90. DOI: 10.1016/j.datak.2006.10.009.
Wu, K.; Ahern, S.; Bethel, E. W.; Chen, J.; Childs, H.; Cormier-Michel, E. et al. (2009): FastBit.
Interactively Searching Massive Data. In J. Phys.: Conf. Ser. 180, p. 12053. DOI:
10.1088/1742-6596/180/1/012053.
Yang, J.; Widom, J. (2003): Incremental computation and maintenance of temporal aggre-
gates. In The VLDB Journal 12 (3), pp. 262–283. DOI: 10.1007/s00778-003-0107-z.
Zhang, D.; Markowetz, A.; Tsotras, V.; Gunopulos, D.; Seeger, B. (2001): Efficient computation
of temporal aggregates with range predicates. In P. Buneman (Ed.): The 12th ACM SIGMOD-
SIGACT-SIGART Symposium. Santa Barbara, California, United States, pp. 237–245.
Zhang, D.; Markowetz, A.; Tsotras, V. J.; Gunopulos, D.; Seeger, B. (2008): On computing tem-
poral aggregates with range predicates. In ACM Trans. Database Syst. 33 (2), pp. 1–39. DOI:
10.1145/1366102.1366109.
Zhou, S. (2010): An Efficient Simulation Algorithm for Cache of Random Replacement Policy.
In D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell et al. (Eds.):
Network and Parallel Computing, vol. 6289. Berlin, Heidelberg: Springer Berlin Heidelberg
(Lecture Notes in Computer Science), pp. 144–154.
Zimányi, E. (2006): Temporal aggregates and temporal universal quantification in standard
SQL. In SIGMOD Rec. 35 (2), pp. 16–21. DOI: 10.1145/1147376.1147379.