Fakultät für Elektrotechnik und Informatik
Institut für Praktische Informatik
Fachgebiet Datenbanken und Informationssysteme

Optimizing parameters for similarity-based spatial matching

Master's thesis in Computer Science

Joseph Eid
Matriculation number: 2813160

First examiner: Prof. Dr. Udo Lipeck
Second examiner: Dr. Hans Hermann Brüggemann
Supervisor: M. Sc. Michael Schäfers

October 22, 2013



Summary

As part of a larger Ph.D. project on spatial data matching, a similarity-based matching algorithm was developed that can be configured through several parameters. These are the weights of the similarity measures used and a threshold specifying the similarity value above which a match can be confirmed. Until now, a human expert has determined these parameters empirically; in this thesis they are to be optimized automatically, also for previously unseen data.

Machine learning, in particular classification, is used in this thesis to learn a matcher that meets these requirements. The classifier is to compute weights and a threshold for use in the similarity-based algorithm. The technique of “active learning” is employed, in which the learning procedure lets a human teacher make the most interesting decisions. In this way a training data set can be built up iteratively that is usually considerably smaller than a training data set created completely in advance.

The optimizer and the matching procedure are integrated. Particularities of spatial data are taken into account to improve the learned classifier: not only one-to-one (1:1) matches but also matches of higher cardinality (n:m) are used. The results show that attribute-based similarity measures can be optimized well. Relational similarity measures have to be treated specially because of the prerequisites they need, so that they can be learned efficiently.


Abstract

In a larger Ph.D. project about spatial data matching, a similarity-based matching algorithm has several parameters. These are the weights of the similarity measures used and a similarity threshold separating matches from non-matches. The parameters need to be optimized for each previously unknown data set being matched. Usually, an experienced human performs this parameter optimization empirically. The purpose of this thesis is to optimize these parameters automatically.

Machine learning, in particular classification, is used to learn a matcher that meets this need. The classifier shall calculate the weights and the threshold to be used in the similarity-based algorithm. “Active learning” is employed to let the learning algorithm choose the data from which it learns. The experienced human becomes a teacher answering the queries of the machine. Active learning makes it possible to learn the classifier iteratively from a relatively small training data set.

The optimizer project and the larger similarity-based project are integrated. Spatial data matching issues are considered in order to improve the learned classifier: not only 1:1 matches but also m:n matches are utilized. The results show that attribute-based parameters can be optimized with active learning. Relational parameters need special treatment in order to be learned efficiently.


Contents

1 Introduction

2 Foundations
  2.1 Machine learning
    2.1.1 Machine learning settings
    2.1.2 Binary classification
    2.1.3 Decision trees
    2.1.4 Support vector machines (SVM)
    2.1.5 Soft margin SVM
  2.2 Active learning
    2.2.1 Data access scenarios
    2.2.2 Query selection frameworks
  2.3 Data matching
    2.3.1 Spatial data matching
    2.3.2 SimMatching

3 System Design
  3.1 The intuition
  3.2 The comparison vectors
  3.3 The optimizer as SVM
  3.4 The active optimizer
  3.5 Integrating the optimizer and SimMatching
  3.6 The optimizer algorithm
  3.7 Seed generation
  3.8 Querying

4 Implementation
  4.1 System requirements
  4.2 Architecture
  4.3 Detailed system design
  4.4 Implementation
    4.4.1 Database
    4.4.2 Querying
    4.4.3 Seed generation
    4.4.4 SVM learning using RapidMiner
    4.4.5 Active learning

5 Results
  5.1 The experiments on Link 1
    5.1.1 Equal initial weights for seed generation
    5.1.2 Empirical initial weights for seed generation
  5.2 The experiment on Link 2
  5.3 The experiment on Link 3

6 Conclusion

A RapidMiner Application GUI

B SimMatching Related Classes


Chapter 1

Introduction

Various organizations and companies have been gathering increasing amounts of information over the years. To take full advantage of the stored data, it needs to be processed, merged and analysed. At the same time, efficient ways are required to deal with the huge volumes of available data. Thus, many information management projects have been launched to integrate data from multiple sources, and data integration has been gaining more and more interest in recent years. Data matching [Chr12] is an essential step in data integration. It is about finding those records in the data that represent the same real-world entity. The matched records are linked with each other for the next step, such as data fusion. Usually data matching is applied to records from two different databases to identify common entities. The entities (records) can represent patients, employees, products, etc. Records of one single database can also be matched to identify duplicates (deduplication). A pair of matched records (or duplicates) is called a match. The task of automated data matching is very challenging. The decision whether or not a pair of records is a match should be based not only on their attribute values but also on the “semantics” behind them. For a “knowledgeable” human, however, it may be easy to identify matches from their attribute values. Computational complexity makes the problem even harder, because theoretically each record has to be compared to all other records in order to identify matches. The goal is to achieve the best matching accuracy efficiently. Scaling to very large databases is another important goal, but achieving high accuracy is the most critical.

Spatial data matching [LM04] is a branch of data matching where the data being matched is spatial. Spatial data has spatial attributes in addition to the usual non-spatial attributes. These spatial attributes represent the geographical and geometrical nature of the spatial object. Different spatial object types have different spatial attributes; geographical coordinates, for instance, are traditional spatial attributes. Here, only one type of spatial object is considered to illustrate spatial data matching: linear spatial objects representing roads of two maps stemming from two different sources are matched. Spatial data matching has particular requirements that need to be considered. Usually data matching matches just one record to one other record (1:1). The same real-world road, however, might be stored as m different line objects (segments) in one data set but as n different line objects in another data set. As a result, spatial data matching should utilize m:n matches in addition to 1:1 matches. The topology of the spatial objects plays an important role in spatial data matching as well. Graph-based spatial data matching [TL07] utilizes the topology to improve matching accuracy. For instance, the matching results of the side roads can greatly help in deciding whether or not two roads being compared are a match. Compared spatial objects which have matched neighbors are more likely to be matches.

Researchers from various fields in computer science and statistics have proposed many techniques to conduct data matching. A traditional, well-known technique is to apply “distance measures” to the records being compared. The resulting “distance values” between records can be utilized to find matches: records very close to each other are considered matches. Because the distance between a pair of records can be converted into the “similarity” of the pair, called its similarity score, this technique is known as similarity-based data matching [Chr12]. Many similarity measures [CMZ09] can be applied to attribute values. A similarity measure expresses the similarity between two attribute values as a normalized number between 0 and 1. One way to calculate the overall similarity score of a pair of records is to use a weighted sum of the attribute values’ similarity scores. If the overall similarity score is higher than a certain similarity threshold, the pair of records is considered a match. The weights and the threshold in similarity-based matching can be thought of as parameters of the data matching algorithm. In order to achieve high matching accuracy, the parameters may need to be optimized for every application domain or for different input data.
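The weighted-sum decision rule described above can be sketched in a few lines of Python. This is only a minimal illustration and not the SimMatching implementation: the function names, the three weights, the threshold of 0.7 and the similarity scores are assumed values chosen for the example.

```python
def overall_similarity(scores, weights):
    """Weighted sum of per-attribute similarity scores (each in [0, 1])."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def is_match(scores, weights, threshold):
    """A pair is considered a match if its overall score exceeds the threshold."""
    return overall_similarity(scores, weights) > threshold


# Hypothetical parameters: three similarity measures whose weights sum to 1.
weights = [0.5, 0.3, 0.2]
threshold = 0.7

pair_a = [0.9, 0.8, 0.6]  # similar in every attribute
pair_b = [0.4, 0.9, 0.5]  # similar in one attribute only

decision_a = is_match(pair_a, weights, threshold)
decision_b = is_match(pair_b, weights, threshold)
```

If the weights are non-negative and sum to 1, the overall score also stays between 0 and 1, so the threshold can be read on the same scale as the individual similarity scores.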

Machine learning [MRT12] offers many techniques to solve the data matching problem, e.g., classification [Chr12]. Classification-based learners are systems that use training data to build, or learn, a model called the learned (base) classifier. A classifier is a model that takes a record (entity) as input and assigns a label to it from a predefined set of labels (classes). In data matching each pair of compared records can be represented as a comparison vector of similarity scores. The classifier, in this case, takes a pair of records (a comparison vector) as input and returns whether or not the pair is a match. Thus, the output of the learner system is a classifier that can label any unlabeled pair of records (any comparison vector) as match or non-match. The records of the training data, on the other hand, are already labeled by a human annotator. This is one of the traditional ways to use machine learning to match data. Traditionally, no human intervenes in the learning process while the classifier is being trained. The human annotator only has to label the training data in advance and feed it to the learner. The learner is passive in this case. Once the data matching problem is viewed as a classification problem, the accuracy of the data matching is measured by the accuracy of the classification. Decision trees [Qui86] and support vector machines (SVM) [Bur98] are well-known base classifiers which are widely used for data matching.
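To make the notion of a comparison vector concrete, the following sketch turns a pair of road records into a vector of similarity scores. The record fields (`name`, `length_m`) and both similarity measures are assumptions made for illustration; they are not the measures used in the Ph.D. project. The string measure relies on `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def length_similarity(len_a, len_b):
    """Ratio of the shorter to the longer length, in [0, 1]."""
    longest = max(len_a, len_b)
    return 1.0 if longest == 0 else min(len_a, len_b) / longest


def comparison_vector(rec_a, rec_b):
    """Represent a pair of records as a vector of similarity scores."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        length_similarity(rec_a["length_m"], rec_b["length_m"]),
    ]


# Two hypothetical road records from different sources.
road_a = {"name": "Hauptstrasse", "length_m": 500.0}
road_b = {"name": "Hauptstr.", "length_m": 450.0}
vector = comparison_vector(road_a, road_b)
```

A classifier for data matching then only sees such vectors, never the original records.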


Newer machine learning approaches insert a teacher into the learning process by allowing the learner to be active [Set12] and to ask the teacher questions. The teacher is usually an experienced human capable of giving the right answer. The intuition is to allow the learner to query the most useful pairs of records to learn from. Thus, the learned classifier can achieve high accuracy while requiring a minimum amount of training data. Actually, the problem with passive classifiers is not the size of the training data itself but the usually considerable effort required to obtain it.
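One common way to pick “the most useful” pair is uncertainty sampling: the learner asks about the instance its current model is least sure of. The sketch below assumes that heuristic and a simple weighted-sum model; all names, weights and data are hypothetical, and the `teacher` function merely stands in for the human.

```python
def score(vector, weights):
    """Overall similarity score as a weighted sum."""
    return sum(w * s for w, s in zip(weights, vector))


def most_uncertain_index(pool, weights, threshold):
    """Index of the pool vector whose score lies closest to the threshold."""
    return min(range(len(pool)),
               key=lambda i: abs(score(pool[i], weights) - threshold))


def query_teacher(pool, labeled, weights, threshold, teacher):
    """Move the most uncertain vector from the pool to the labeled set."""
    i = most_uncertain_index(pool, weights, threshold)
    vector = pool.pop(i)
    labeled.append((vector, teacher(vector)))
    return vector


def teacher(vector):
    """Stand-in for the human teacher (assumption for this example)."""
    return vector[0] >= 0.5


pool = [[0.9, 0.9], [0.5, 0.6], [0.1, 0.2]]  # unlabeled comparison vectors
labeled = []                                  # grows with every query
weights, threshold = [0.5, 0.5], 0.5

queried = query_teacher(pool, labeled, weights, threshold, teacher)
```

In a full active learner the model would be retrained on `labeled` after each answer before the next query is selected.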

This thesis is part of a larger Ph.D. project [Sch13] on similarity-based spatial data matching. In order to achieve high matching accuracy, the parameters of the matching algorithm currently have to be “tuned” manually for every new set of spatial data sources (maps) being matched. Here, a base classifier is actively learned in order to automatically calculate the weights and the threshold to be used in the larger matching algorithm. The optimized parameters shall help the larger matching algorithm achieve higher accuracy efficiently.

This work is divided into six chapters. After this introduction, chapter 2 presents the foundations of machine learning and data matching. The concepts of supervised and unsupervised machine learning are explained. Active learning is introduced together with two major approaches to conducting it. The chapter ends with a description of data matching, spatial data matching and the larger similarity-based algorithm. Chapter 3 explains the design of our newly developed active learner system in detail. Design decisions are clarified and alternatives are discussed. Chapter 4 covers the system implementation and explains the libraries used. The experiments and the results are presented in chapter 5. Chapter 6 contains a summary, the conclusion and possible future work.


Chapter 2

Foundations

This chapter provides the background knowledge required to understand the solution presented in chapter 3. First, machine learning is introduced, drawing mainly on Peter Flach’s book “Machine Learning” [Fla12]. The settings in which machine learning may be applied, supervised and unsupervised, are described and compared. The binary classification task in machine learning is defined. Two classification models, decision trees and support vector machines, are introduced. Next, active learning is presented based on Burr Settles’ book “Active Learning” [Set12]. Active learning is a special kind of supervised machine learning that can be very advantageous in some cases. Three scenarios are distinguished in which an active learning algorithm may access the data to learn from. Major active learning approaches are investigated. Finally, data matching is presented as a task consisting of several steps. One step, the classification, is given special attention. Particular requirements of spatial data matching are discussed and the larger similarity-based algorithm is described.

2.1 Machine learning

Machine learning [Fla12] is a branch of Artificial Intelligence (AI) concerned with studying and building systems that can automatically learn from data and improve with experience. Learning in this context is not learning by heart but recognizing patterns and making “sensible” decisions based on data. Experience, here, refers to past information available to the learner. The difficulty lies in the complexity of describing all possible decisions for all possible inputs [Kap]. To address this problem, machine learning infers from data based on predefined computational principles. The machines extract knowledge from the data presented to them in order to make predictions for unseen future data. Prediction is the main objective of machine learning and it has to be done using efficient and robust algorithms.


This section introduces traditional machine learning, in which the learning algorithms learn by themselves from data. They need no human interaction at run time. Even in situations where a human is needed, as we will see later, she or he is only involved before the “learning process” starts. The learning algorithm is not permitted to interrupt the “learning process”. So one can say that the algorithm learns just by “listening”. It is not allowed to “ask questions” of the human (if there is one). This kind of learning is called passive and, as a result, traditional machine learning is called passive learning. Active learning, as we will see in section 2.2, gives the learning algorithm the opportunity to ask questions and hence to be active¹.

Flach [Fla12] emphasizes three “ingredients” of machine learning: tasks, models and features. These come in many different forms and should be chosen carefully in order to apply machine learning successfully. According to Flach [Fla12], “Machine learning is concerned with using the right features to build the right models that achieve the right tasks”. Tasks are the problems that can be solved using machine learning, e.g., classification, regression and clustering, to name only a few. Classification assigns data instances to one of several classes or categories. Each data instance is assigned a “value” called its label, representing the class of the instance. For example, incoming e-mails can be classified into politics, business, sport or culture. Regression assigns a real value to each instance, which can be close to or far from its true value. Clustering partitions instances into homogeneous groups of “similar” instances. Each group is called a cluster and should be “dissimilar” to the other clusters.

Models are what is being learned from data in order to solve a given task. They are the output of machine learning algorithms. One should distinguish here between learning problems and tasks. Learning problems are solved by learning algorithms that produce models. Tasks, on the other hand, are solved by models. Support vector machines (section 2.1.4) and decision trees (section 2.1.3) are popular models which can solve the classification task. Models are mappings from features to solutions of the tasks (the output). Based on the output, two types of models can be distinguished: predictive and descriptive. Predictive models, as the name suggests, are used for prediction. A base classifier can predict the class of an instance. Descriptive models, in contrast, uncover hidden structures in the data which may be useful. Clustering e-mails can reveal, for example, that business e-mails are more “similar” to politics e-mails than to sport e-mails.

Models are built on features, which describe the aspects of the data instances that are relevant to a particular machine learning application. The models are only as good as their features. Features do not always come with the data and often need to be constructed or transformed. Feature construction is crucial for the success of a machine learning application. When classifying e-mails, for example, each e-mail may be represented as one feature storing its text. A better alternative, however, is to represent an e-mail as a large number of features, each storing the frequency of one word in the e-mail. The set of all possible data instances is called the instance space, denoted by X. Mathematically, a feature f is a function mapping from the instance space to the set of feature values, called the feature domain F. The feature domain can be the set of real numbers, integers, booleans or any finite set, e.g., a set of colors. The frequency of a word in an e-mail message, for example, is an integer feature. If the instance space is described by a fixed number d of features, then we have X = F1 × F2 × ... × Fd. Hence each data instance is represented as a feature vector consisting of d feature values. From now on, the terms data instance, feature vector and record are used as synonyms in this thesis.

¹ In educational theory, active learning means that learners must not only listen but also read, write, discuss and take responsibility in order to learn.
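The word-frequency representation of an e-mail mentioned above can be sketched as follows. The vocabulary, the tokenization rule and the example text are assumptions made for illustration; each vocabulary word corresponds to one integer feature, so the result is a feature vector with d = 3 values.

```python
import re
from collections import Counter


def word_frequency_features(text, vocabulary):
    """Map a text to a feature vector of word counts, one per vocabulary word."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: each word defines one integer feature.
vocabulary = ["market", "election", "goal"]
email = "The market rallied; the market closed higher after the election."
features = word_frequency_features(email, vocabulary)
```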

2.1.1 Machine learning settings

Machine learning can be applied in two main settings: supervised and unsupervised. Usually prediction is done using supervised machine learning, whereas unsupervised learning is used for knowledge discovery. There are, however, unsupervised algorithms which are able to learn predictive models and supervised ones which learn descriptive models. Semi-supervised machine learning has been introduced as a hybrid of the two settings, but it is not covered in this thesis. Active learning is a relatively new form of machine learning which has been proposed as a special kind of supervised learning (see section 2.2).

Supervised machine learning

This is the most common setting in machine learning, and probably what is meant most of the time when machine learning is referred to. Supervised machine learning algorithms are usually used to solve the problem of learning predictive models. A predictive model is a function ŷ : X → Y from the instance space X to the output space Y. It predicts an output value ŷ(x) for each input feature vector x ∈ X. The true output value of an instance x, which should be predicted, is denoted by y(x). The function ŷ is an approximation of the unknown output function y. The models are learned using a set of labeled data instances called the training data, denoted by D_train. One says that the learning algorithm trains the model. Supervised learning models generalise from training data to unseen future data. Each training instance x ∈ D_train is assigned a label l(x) by the labeling function l. The labeling function l : X → L maps the instance space X to the label space L. Thus training instances can be seen as pairs (x, l(x)). Training instances are also referred to as examples in the literature.

The training data is usually labeled by a human annotator who has “enough” knowledge to give the true label of each training record. First, however, the training data has to be sampled (selected) from the whole unlabeled data that needs to be labeled later. The intuition behind this is that the training data should represent the distribution of the original data. The human annotator acts as a supervisor or teacher supplying the learning process with the true labels of the selected instances. Another set of labeled records, called the testing data² D_test, should be used to test the prediction accuracy of the learned model. Alternatively, a subset of the training data can be set aside and not used in training, so that it can be used for testing later on. The goal of supervised learning, however, is to train a model whose prediction accuracy is high on previously unseen instances in general, and not only on the testing instances. Figure 2.1 illustrates supervised machine learning of predictive models.

Figure 2.1: Supervised machine learning; source [Fla12] updated
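The alternative mentioned above, setting part of the labeled data aside for testing, is often called a hold-out split. A minimal sketch under assumed values (a test fraction of 25 %, a fixed random seed, and eight made-up labeled instances) could look like this:

```python
import random


def holdout_split(labeled_instances, test_fraction=0.25, seed=0):
    """Shuffle labeled instances and set a fraction aside for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    shuffled = list(labeled_instances)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled pairs (x, l(x)): a feature vector and its label.
labeled = [([i / 10.0], +1 if i >= 4 else -1) for i in range(8)]
d_train, d_test = holdout_split(labeled)
```

The two returned lists are disjoint, so the accuracy measured on `d_test` is not influenced by instances the model has already seen during training.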

Obtaining good training data is very challenging when applying supervised machine learning. Good training data yields a model with high prediction accuracy. To do this, the training data should represent the characteristics of the data that needs to be labeled as closely as possible. That is why the training data is sampled from the data which needs to be labeled later. The size of the training data |D_train| = n is relatively small (hundreds of records) compared to the data from which it is sampled. Sampling and labeling the training instances, however, is tedious for humans, time-consuming and possibly expensive as well. High-quality training data is crucial to the accuracy and thus the success of supervised machine learning. The active learning approaches introduced in section 2.2 might help to solve this problem.

² The testing data is also labeled by a human annotator.


Unsupervised machine learning

Unsupervised machine learning algorithms deal with unlabeled data only. Neither a training nor a testing data set exists in this setting. The whole unlabeled data is made available to the learning algorithms to learn from. Since there are no labels, there is no supervisor and thus the learning is unsupervised. Unsupervised learning algorithms are usually used to solve the problem of learning descriptive models, i.e., describing data. This is usually the setting referred to when talking about data mining. Descriptive models discover hidden knowledge in the data. The learned model is a description of the unlabeled data. Describing the data is thus both the output of the task and the “output” of the descriptive model. Hence the task and the model coincide in (unsupervised) learning of descriptive models. Figure 2.2 illustrates unsupervised learning of descriptive models according to Flach [Fla12].

Figure 2.2: Unsupervised machine learning; source [Fla12]

Clustering is one task, among many others, that can be solved with unsupervised learning. The output is a partitioning of the instance space X into subsets {X1, ..., XK}, where each subset Xk ⊆ X is called a cluster. Clustering can then be defined as an equivalence relation q ⊆ X × X in which two instances are related exactly when they belong to the same cluster. A good clustering results in “coherent” and “decoupled” clusters. If this is the case, any instance is more similar to the other instances in its own cluster than to any instance in another cluster.

This criterion requires defining similarity measures for instances. A similarity measure expresses the similarity of two data instances as a similarity score. A more useful measure may be the dissimilarity of data instances. Dissimilarity can be expressed with a distance measure between the instances. If the instance space can be represented geometrically, Euclidean distance is the most obvious example of a distance measure. Since active learning (section 2.2) is a supervised learning approach, unsupervised learning is not of interest here and will not be discussed further.
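As a small illustration of the relation between distance and similarity, the sketch below computes the Euclidean distance between two feature vectors and converts it into a similarity score. The conversion 1 / (1 + d) is only one of several possible choices and is an assumption of this example, as are the function names.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity score in (0, 1]."""
    return 1.0 / (1.0 + distance)


d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
s = distance_to_similarity(d)
```

Identical instances have distance 0 and therefore similarity 1; the similarity decreases towards 0 as the distance grows.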


2.1.2 Binary classification

Classification is probably the most common task in machine learning. It is solved by a predictive model learned from training data with a supervised learning algorithm. The learned classification model is called the learned (base) classifier. In the classification task, the output space is equal to the label space, Y = L. As a result, the output of the classifier is a label from the set of class labels L = {C1, C2, ..., Ck}. We say that the classifier labels data instances. The classes are usually distinct and |L| is small. In classification, and in the rest of this thesis, labeling and classification shall be understood as synonyms.³

The classifier is a function ŷ : X → L mapping instances to labels. The true label of an instance, however, is given by the function y : X → L. As a result, the training data is a set of pairs (x, y(x)), each containing an instance and its true label.

If only two classes are available, the problem is called binary classification, e.g., classifying received e-mails as spam or not-spam. The two classes in binary classification may also be referred to as the positive class (y1 = +1) and the negative class (y2 = −1). Sometimes the symbol ⊕ is used to refer to positives and ⊖ to negatives. Two popular base classifiers, which can be learned with supervised learning algorithms, are decision trees (section 2.1.3) and support vector machines (section 2.1.4).

The performance of binary classifiers is measured on the testing set D_test. The labels assigned to the testing instances by a binary classifier are compared with the true, already known labels. The results are usually arranged in a 2×2 table called a contingency table or confusion matrix. The contingency table can be thought of as a cross-tabulation of the true and the predicted labels. The rows refer to the true labels of the testing instances, while the columns refer to the predicted labels. The example contingency table 2.1 contains 30 correctly classified positives and 40 correctly classified negatives on one diagonal. The other diagonal shows the numbers of incorrect predictions.

            Predicted ⊕    Predicted ⊖      ∑
True ⊕          30             20           50
True ⊖          10             40           50
∑               40             60          100

Table 2.1: A contingency table obtained after testing a binary classifier

With the help of the indicator function I, many performance measures can be defined from a contingency table. The indicator function returns 1 if its condition evaluates to true and 0 otherwise. The prediction accuracy acc is defined as the proportion of correctly classified testing instances:

\[ \mathrm{acc} = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} I\big(\hat{y}(x) = y(x)\big) \]

³ In other tasks, the output of the model is not necessarily a label.


The error rate is the proportion of incorrectly classified instances, i.e., those with ŷ(x) ≠ y(x). Classification techniques often aim at reducing the error rate. Accuracy and error rate are two sides of the same coin: err = 1 − acc. The error rate is given by

\[ \mathrm{err} = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} I\big(\hat{y}(x) \neq y(x)\big) \]

Another measure is the true positive rate tpr, which is the proportion of positives that are correctly classified. The true positive rate is sometimes called the recall, because it represents how many true positives the classifier has “captured”:

\[ \mathrm{tpr} = \frac{\sum_{x \in D_{test}} I\big(\hat{y}(x) = y(x) = +1\big)}{\sum_{x \in D_{test}} I\big(y(x) = +1\big)} \]

The true positive rate estimates the probability that an arbitrary positive is classified correctly. The true negative rate tnr is defined in the same way. Together, tpr and tnr describe the performance of a binary classifier on each class separately. Let us now look at the precision prec, a measure used in combination with the recall. The precision, regarding the positive class, is the proportion of true positives among the predicted positives. Usually two classifiers are compared based on their precision at the same predefined recall value.

\[ \mathrm{prec} = \frac{\sum_{x \in D_{test}} I\big(\hat{y}(x) = y(x) = +1\big)}{\sum_{x \in D_{test}} I\big(\hat{y}(x) = +1\big)} \]

If the predictions of a binary classifier on a test set result in the contingency table 2.1, then the performance of the classifier can be measured by the measures defined earlier. The classification (prediction) accuracy is acc = (30 + 40)/100 = 0.7. The classification error rate is err = (20 + 10)/100 = 0.3. At a recall of tpr = 30/(30 + 20) = 0.6, the classifier reaches a precision of prec = 30/(30 + 10) = 0.75. A better classifier would reach a precision higher than 0.75 while achieving the same recall of 0.6.
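The arithmetic above can be retraced with a few lines of Python. This is only an illustrative sketch: the function name binary_measures and the cell names tp, fn, fp, tn are chosen here for convenience and do not appear in the thesis; the counts are those of table 2.1.

```python
def binary_measures(tp, fn, fp, tn):
    """Performance measures from the four cells of a 2x2 contingency table."""
    total = tp + fn + fp + tn
    return {
        "acc": (tp + tn) / total,   # correctly classified / all instances
        "err": (fp + fn) / total,   # incorrectly classified / all instances
        "tpr": tp / (tp + fn),      # recall: captured positives / true positives
        "tnr": tn / (tn + fp),      # captured negatives / true negatives
        "prec": tp / (tp + fp),     # true positives / predicted positives
    }


# Cells of table 2.1: rows are true labels, columns are predicted labels.
measures = binary_measures(tp=30, fn=20, fp=10, tn=40)
for name, value in measures.items():
    print(f"{name} = {value:.2f}")
```

The true negative rate, which the text defines only by analogy, comes out as 40/(40 + 10) = 0.8 for this table.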

Cross-validation

The amount of labeled data in practice may be insufficient to split into training and testing data sets. Cross-validation is a widely used method to overcome this problem by using the labeled data for both training and testing. It is also called n-fold cross-validation because it divides the labeled data set into n random folds (subsets) of the same size mi, where 1 ≤ i ≤ n. Then, for each i, a model is trained from all folds except the i-th fold, which is used to test the learned model. The reported performance of the model is usually the average accuracy (or error) over the folds, known as the cross-validation accuracy:

\[ \mathrm{acc}_{CV} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{acc}_i \]


However, one important use of cross-validation is model selection [MRT12]. If an algorithm has free parameters to be set from a small number of possible value combinations, n-fold cross-validation can find the best one to use. In machine learning applications, n is typically chosen to be 5 or 10 [MRT12]. Let λ denote the free parameters of the algorithm. For each value combination of λ, cross-validation is applied and the cross-validation accuracy accCV is reported. The value λ0 for which accCV is the largest (or errCV is the smallest) is chosen.

Suppose that we already have a training and a testing data set. In this case, n-fold cross-validation is applied to the training data only. The learning algorithm is then started with the parameter setting λ0 in order to train an “optimized” model. The trained model is tested on the testing data set and its performance is reported as usual. We will see later in section 2.1.5 that cross-validation is very suitable for finding the best value of the parameter C of the soft margin SVM algorithm.
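The procedure can be sketched in plain Python as follows. Everything in this block is an assumption made for illustration: the helper names, the toy one-dimensional data set and the k-nearest-neighbour learner (whose k plays the role of the free parameter λ) are not taken from the thesis.

```python
import random


def n_fold_indices(m, n, seed=0):
    """Split the indices 0..m-1 into n random folds (equal size if n divides m)."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return [idx[i::n] for i in range(n)]


def cv_accuracy(data, n, train, lam, seed=0):
    """Average accuracy over n folds for the parameter setting lam."""
    accs = []
    for test_idx in n_fold_indices(len(data), n, seed):
        held_out = set(test_idx)
        train_set = [data[j] for j in range(len(data)) if j not in held_out]
        model = train(train_set, lam)
        correct = sum(1 for j in test_idx if model(data[j][0]) == data[j][1])
        accs.append(correct / len(test_idx))
    return sum(accs) / n


def select_parameter(data, n, train, candidates):
    """Return the candidate with the highest cross-validation accuracy."""
    return max(candidates, key=lambda lam: cv_accuracy(data, n, train, lam))


def train_knn(train_set, k):
    """Hypothetical learner: 1-D k-nearest-neighbour majority vote."""
    def predict(x):
        nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
        return +1 if sum(label for _, label in nearest) >= 0 else -1
    return predict


# Toy data: instances below 10 are negatives, the others positives.
data = [(float(x), -1 if x < 10 else +1) for x in range(20)]
best_k = select_parameter(data, 5, train_knn, candidates=[1, 3, 5])
print("selected k:", best_k)
```

When a separate testing set exists, `select_parameter` would be called on the training data only, and the final model would be trained with the selected value before being evaluated on the testing set.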

2.1.3 Decision trees

Tree models are widely used in machine learning to solve different tasks. Decision trees, or tree-based classifiers, are one predictive variant of them used for classification. Decision trees are expressive and can be presented nicely, which makes them interpretable by non-experts in the machine learning field (see figure 2.3). In order to predict, the decision tree has to be traversed top-down, starting from the root until a leaf is reached. The classes are found at the leaves of the tree, with each leaf representing only one of the classes. The same class, however, may be found in more than one leaf. Each internal node, on the other hand, represents a decision based on the value of one feature, determining which child node to visit next. The class of the reached leaf is the label predicted for the given unlabeled record.

Figure 2.3 is an example of a decision tree with two classes: single and married. Please note that (income > 70000) is a binary feature constructed from the income attribute. This feature may be interpreted as a binary virtual attribute called rich, which indicates whether or not the person is rich. A decision tree can easily be converted into a set of rules labeling each instance based on its feature values. The decision tree in figure 2.3 can be translated into the following rules:

• If the person is 35 or older, predict that she/he is married;

• If the person is 24 or younger, predict that she/he is single;

• If the age is between 24 and 35 and the income is more than 70000, predict the person to be married;

• If the age is between 24 and 35 and the income is 70000 or less, predict the person to be single.
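The four rules above translate directly into nested conditions. The following sketch assumes a hypothetical function name predict_marital_status and takes the age and income thresholds literally from the rules as they are stated in the list.

```python
def predict_marital_status(age, income):
    """Apply the rules read off the decision tree in figure 2.3."""
    if age >= 35:                 # 35 or older
        return "married"
    if age <= 24:                 # 24 or younger
        return "single"
    # Age between 24 and 35: decide on the binary feature (income > 70000).
    return "married" if income > 70000 else "single"


print(predict_marital_status(age=30, income=80000))
print(predict_marital_status(age=30, income=70000))
```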


Figure 2.3: A decision tree with two classes: married and single

Learning decision trees from training data is called tree induction. Decision trees can easily be induced using a divide-and-conquer algorithm. A divide-and-conquer algorithm divides the original problem into smaller problems, which are divided in turn until simple, easy-to-solve problems are reached. If the smaller problems are of the same form as the original one, these algorithms can be implemented recursively. The InduceTree algorithm 2.1 induces a decision tree recursively in a top-down fashion. The training data set consisting of labeled data instances should be passed in the first call as the first argument of the algorithm, i.e., InduceTree(Dtrain, F).

Three functions are assumed to be predefined and available to the InduceTree algorithm. Homogeneous(D) returns true only if the training instances in set D are “homogeneous enough” to be classified in the same class. This is the case if (almost) all training instances are assigned the same class label. Label(D) returns the (majority) label assigned to the training instances in set D. ClassBestSplit(D, F) splits the set D into subsets according to one feature f ∈ F.

Algorithm 2.2 illustrates the ClassBestSplit(D, F) function. The best split obviously is one in which each leaf contains only instances of the same class. The leaf is described as pure in this case. Thus, impurity measures Imp(D) of a (labeled) data set D are defined, e.g., the Gini index. The impurity of a set of mutually exclusive (labeled) data sets {D1, D2, ..., Dl} is defined as a weighted average of the impurities of the individual sets. This formula works with all impurity measures. The best split of a data set D is then the one with the lowest impurity of the resulting subsets Di.

\[ \mathrm{Imp}(\{D_1, D_2, \ldots, D_l\}) = \sum_{j=1}^{l} \frac{|D_j|}{|D|}\, \mathrm{Imp}(D_j), \quad \text{where } D = D_1 \cup D_2 \cup \cdots \cup D_l \]
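As a small numeric illustration of the weighted average, the sketch below uses the Gini index in the common form 1 − Σ p_k², where p_k is the relative frequency of class k. The thesis names the Gini index but does not spell out its formula, so this form, the function names and the toy label lists are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (assumed form: 1 - sum of p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(subsets):
    """Weighted average impurity of mutually exclusive subsets D1..Dl."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets if s)


# Two candidate splits of the same eight instances (4 married, 4 single).
mixed_split = [["married"] * 3 + ["single"], ["single"] * 3 + ["married"]]
pure_split = [["married"] * 4, ["single"] * 4]

print(split_impurity(mixed_split))  # each subset is 3:1, so the result is 0.375
print(split_impurity(pure_split))   # both subsets are pure, so the result is 0.0
```

The pure split has the lower impurity and would therefore be preferred when searching for the best split.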


Algorithm 2.1: InduceTree(D, F); source [Fla12] updated

input : labeled data D, set of features F
output: decision tree T

if Homogeneous(D) then
    return Label(D)
end
Sf ← ClassBestSplit(D, F)
foreach subset Di in Sf do
    if Di ≠ ∅ then
        Ti ← InduceTree(Di, F)
    else
        make new leaf Ti
        Ti ← Label(D)
    end
end
create a root node labeled with f
set each Ti as a child of the root
return tree T with this root

Algorithm 2.2: ClassBestSplit(D, F); source [Fla12] updated

input : labeled data D, set of features F
output: D split into subsets Di according to the best feature fbest

Imin ← 1
foreach feature f ∈ F do
    split D into subsets D1, D2, ..., Dl according to the values of f
    if Imp({D1, D2, ..., Dl}) < Imin then
        Imin ← Imp({D1, D2, ..., Dl})
        fbest ← f
    end
end
split D into subsets D1, D2, ..., Dl according to the values of fbest
Sfbest ← {D1, D2, ..., Dl}
return Sfbest
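The two algorithms can be combined into a short runnable sketch. It is not the implementation used in the thesis but a simplified reading under stated assumptions: features are categorical, impurity is measured with the Gini index (assumed form 1 − Σ p_k²), Homogeneous(D) is taken to mean "all labels are equal", and, as a deviation from algorithm 2.1, a feature is removed after it has been used so that the recursion always terminates. The training records and feature names are invented for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def homogeneous(data):
    """True if all instances in data carry the same label."""
    return len({y for _, y in data}) <= 1


def label(data):
    """Majority label of the instances in data."""
    return Counter(y for _, y in data).most_common(1)[0][0]


def split(data, feature):
    """Partition data by the observed values of one feature."""
    subsets = {}
    for x, y in data:
        subsets.setdefault(x[feature], []).append((x, y))
    return subsets


def class_best_split(data, features):
    """Return the feature with the lowest weighted impurity and its split."""
    def impurity(feature):
        return sum(len(s) / len(data) * gini([y for _, y in s])
                   for s in split(data, feature).values())
    best = min(features, key=impurity)
    return best, split(data, best)


def induce_tree(data, features):
    """Leaves are labels; inner nodes are (feature, {value: subtree}) tuples."""
    if homogeneous(data) or not features:
        return label(data)
    best, subsets = class_best_split(data, features)
    remaining = [f for f in features if f != best]  # deviation, see lead-in
    return (best, {value: induce_tree(subset, remaining)
                   for value, subset in subsets.items()})


def predict(tree, x):
    """Traverse the tree top-down until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[x[feature]]
    return tree


train_data = [
    ({"age_group": "young", "rich": False}, "single"),
    ({"age_group": "young", "rich": True}, "single"),
    ({"age_group": "old", "rich": False}, "married"),
    ({"age_group": "old", "rich": True}, "married"),
    ({"age_group": "middle", "rich": True}, "married"),
    ({"age_group": "middle", "rich": False}, "single"),
]
tree = induce_tree(train_data, ["age_group", "rich"])
print(tree)
```

Because the split is built from the feature values actually observed in D, empty subsets never arise here; the else branch of algorithm 2.1 matters only when a feature's full value domain is enumerated. For the same reason, `predict` raises a KeyError for a feature value that was never seen during induction.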


The induced decision tree may be post-processed by an operation called pruning. This would help to produce more expressive decision trees. A pruning algorithm merges two or more leaves (which may not necessarily have the same label) into their parent. The parent in this case becomes a new leaf labeled with the (majority) class of its previous children. Some pruning algorithms even use separate training data, not used while inducing the tree, to complete this task.

2.1.4 Support vector machines (SVM)

Assume that each data instance is described by d features, whose feature domains F are the real numbers R. The instance space now is X = R^d and can be represented geometrically as a d-dimensional space. Data instances are thus vectors in the input space, x = (x1, x2, ..., xd) ∈ X. A classifier can be represented as a hyperplane (a line in a 2-dimensional space) called the decision boundary⁴. Hyperplane functions are called linear classifiers. They have a linear equation of the form

\[ f(\mathbf{x}) = a + b_1 x_1 + b_2 x_2 + \cdots + b_d x_d = a + \mathbf{b} \cdot \mathbf{x} = 0 \]

and they are orthogonal to the normal vector b = (b1, b2, ..., bd). By using homogeneous coordinates, a + b · x can be written as b° · x°, where b° = (a, b1, ..., bd) and x° = (1, x1, ..., xd). Linear models, in contrast to tree models, have a fixed structure with a number of parameters (vector b). To learn the linear model, these parameters need to be learned from training data. In addition, variations in the training data have less impact on the learned linear model.

Training instances xi ∈ Dtrain are linearly separable if they can be separated (classified) with a hyperplane decision boundary. An infinite number of hyperplanes may exist that separate the training data. However, some separate the data better than others. The optimal linear classifier should have the same distance to the nearest training instances of each class. This optimal hyperplane creates the maximum margin to the training instances of each class. The training instance(s) nearest to the decision boundary are called support vectors. A support-vector classifier is the optimal linear classifier, see figure 2.4. This section introduces support vector machines for binary classification.

⁴ The decision boundary can also be non-linear.

Figure 2.4: Support vector machine in a two-dimensional space; source [Fla12]

The decision boundary of support vector machines has the equation w · x = t and it maximises the margin between training instances. In this binary classification problem, we mark training instances xi with w · xi > t as positives (yi = +1) and instances with w · xi < t as negatives (yi = −1). The parallel hyperplane through the nearest positive training instance(s) is w · x = t + m. The parallel hyperplane through the nearest negative training instance(s) is w · x = t − m. The margin, which should be maximised, is then given by 2m/‖w‖. Suppose m = 1; then the two inequalities for positives (w · xi − t ≥ 1) and negatives (w · xi − t ≤ −1) can be combined into one:

\[ y_i(\mathbf{w} \cdot \mathbf{x}_i - t) - 1 \ge 0, \quad 1 \le i \le n, \quad |D_{train}| = n \]

Thus, in order to find the support vector classifier, we need to maximise 2/‖w‖. Maximising 2/‖w‖ corresponds to minimising ‖w‖, or ½‖w‖², which is more suitable as we will see later. Finding the optimal classifier means solving the optimization problem of finding the w and t which minimise ½‖w‖², denoted by

\[ \mathbf{w}^*, t^* = \arg\min_{\mathbf{w}, t} \left\{ \tfrac{1}{2} \|\mathbf{w}\|^2 \right\} \quad \text{subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i - t) \ge 1, \ 1 \le i \le n \tag{2.1} \]

This is a quadratic constrained optimization problem that can be solved using the method of Lagrange multipliers. Adding the constraints with multipliers αi ≥ 0, one for each training instance i, gives the Lagrange function

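\[
\begin{aligned}
\Lambda(\mathbf{w}, t, \alpha_1, \ldots, \alpha_n)
  &= \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(\mathbf{w} \cdot \mathbf{x}_i - t) - 1 \big) \\
  &= \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i y_i (\mathbf{w} \cdot \mathbf{x}_i) + \sum_{i=1}^{n} \alpha_i y_i t + \sum_{i=1}^{n} \alpha_i \\
  &= \tfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} - \mathbf{w} \cdot \left( \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right) + t \left( \sum_{i=1}^{n} \alpha_i y_i \right) + \sum_{i=1}^{n} \alpha_i
\end{aligned}
\tag{2.2}
\]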

In order to minimise the function Λ with respect to t, we take the partial derivative of (2.2) with respect to t and require it to be zero, which yields a condition on the multipliers:

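\[ \frac{\partial}{\partial t} \Lambda(\mathbf{w}, t, \alpha_1, \ldots, \alpha_n) = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{2.3} \]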

The optimal w is found in the same way: the partial derivative of Λ with respect to w is set to zero.

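\[
\begin{aligned}
\frac{\partial}{\partial \mathbf{w}} \Lambda(\mathbf{w}, t, \alpha_1, \ldots, \alpha_n) &= 0 \\
\mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i &= 0 \\
\mathbf{w} &= \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i
\end{aligned}
\tag{2.4}
\]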

The vector w is called the weight vector and it is, according to (2.4), a linear combination of the training instances. If αi = 0 for an instance xi, the instance can be removed from the training data without affecting the learned decision boundary. Support vectors, which are the instances nearest to the decision boundary, are the only ones with αi > 0. Hence, support vectors completely determine the decision boundary. Substituting the two optimality conditions (2.3) and (2.4) into the Lagrange function (2.2) gives

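\[
\begin{aligned}
\Lambda(\alpha_1, \ldots, \alpha_n)
  &= -\tfrac{1}{2} \left( \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right) \cdot \left( \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \right) + \sum_{i=1}^{n} \alpha_i \\
  &= -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{n} \alpha_i
\end{aligned}
\tag{2.5}
\]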


The quadratic dual optimization problem [Fla12] for support vector machines is (2.6). It is defined by maximising the function Λ defined in (2.5) under the positivity constraints αi ≥ 0 and the equality constraint Σi αi yi = 0. Under the Karush-Kuhn-Tucker (KKT) conditions, the solution to the dual optimization problem is also the solution to the original optimization problem (2.1).

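\[
\alpha_1^*, \ldots, \alpha_n^* = \arg\max_{\alpha_1, \ldots, \alpha_n} \left\{ -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{n} \alpha_i \right\}
\tag{2.6}
\]
\[ \text{subject to } \alpha_i \ge 0, \ 1 \le i \le n, \ \text{ and } \ \sum_{i=1}^{n} \alpha_i y_i = 0 \]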

Equation (2.6) shows that the optimization problem of support vector machines is entirely defined by the pairwise dot products between training instances xi · xj. The pairwise dot products can be pre-calculated and combined in a matrix M called the Gram matrix, M = XX^T, where X (of size n × d) is the matrix of training instances. The class labels of the training instances can also be incorporated by calculating X′ and M′ = X′X′^T, where the i-th row of X′ is yi xi for 1 ≤ i ≤ n. Now we can replace each yi yj xi · xj in (2.6) with the corresponding element mij of M′. Dedicated quadratic optimization solvers can solve (2.6).
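To make the role of the label-weighted Gram matrix concrete, the following sketch builds M′ for a toy data set and evaluates the dual objective of (2.6) for given multipliers. The two training instances, the function names and the tested α values are assumptions chosen for illustration; no quadratic solver is involved, the objective is merely evaluated.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def labelled_gram_matrix(X, y):
    """M' with elements m_ij = y_i * y_j * (x_i . x_j)."""
    n = len(X)
    return [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, M):
    """Value of the function maximised in (2.6) for given multipliers."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * M[i][j] for i in range(n) for j in range(n))
    return -0.5 * quad + sum(alpha)


# Toy data: one positive and one negative instance in two dimensions.
X = [(3.0, 1.0), (1.0, 1.0)]
y = [+1, -1]
M = labelled_gram_matrix(X, y)

# The equality constraint forces alpha_1 = alpha_2 here; compare a few values.
for a in (0.4, 0.5, 0.6):
    print(a, dual_objective([a, a], M))
```

For this toy set the objective along α1 = α2 = a equals −2a² + 2a, which peaks at a = 0.5 with value 0.5. This matches ½‖w‖² for w = (1, 0), the maximum-margin solution of the corresponding primal problem (2.1).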

After obtaining the (linear) decision boundary function, the classification function can be defined easily, just like for any usual linear classifier ŷ : X → {−1, +1}. Each future data instance x ∈ X is classified by checking whether or not its dot product with the weight vector w is higher than the threshold t:

\[ \hat{y}(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} - t) \]

The distance between a data instance x and the decision boundary is denoted by d. It can be positive or negative, depending on the side of the decision boundary on which the instance lies. The distance d can be thought of as a confidence measure provided with the prediction ŷ(x). Instances far from the decision boundary have a high confidence (probability) that their prediction is correct. Instances close to the decision boundary, however, are more confusing.

\[ d = \frac{\mathbf{w} \cdot \mathbf{x} - t}{\|\mathbf{w}\|} \tag{2.7} \]
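A minimal sketch of the classification function and the signed distance (2.7) is given below. The weight vector, the threshold and the function names are arbitrary illustrative assumptions, and an instance lying exactly on the boundary is assigned to the negative class by convention here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def classify(w, t, x):
    """Sign of w . x - t; instances exactly on the boundary become -1 here."""
    return +1 if dot(w, x) - t > 0 else -1


def signed_distance(w, t, x):
    """Signed distance d of x from the decision boundary, equation (2.7)."""
    return (dot(w, x) - t) / math.sqrt(dot(w, w))


w, t = (3.0, 4.0), 5.0   # assumed weight vector and threshold, ||w|| = 5
for x in [(3.0, 4.0), (0.0, 0.0), (1.0, 0.5)]:
    print(x, classify(w, t, x), signed_distance(w, t, x))
```

The third instance lies exactly on the boundary (d = 0), which is the situation in which the prediction carries the least confidence.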

2.1.5 Soft margin SVM

If the training data is not linearly separable, then the optimization problem needs to be adapted in order to accept outliers. Outliers are training instances which are inside the margin (also known as margin errors [Fla12]) or misclassified, i.e., on the wrong side of the margin. The idea is to introduce a slack variable ξi ≥ 0 for each training instance (see figure 2.5). Thus, the constraints and the function to be minimised are changed to form the following soft margin optimization problem:

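\[
\mathbf{w}^*, t^*, \xi_i^* = \arg\min_{\mathbf{w}, t, \xi_i} \left\{ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \right\}
\quad \text{subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i - t) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0, \ 1 \le i \le n
\]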

Figure 2.5: A separating hyperplane with two outliers; source [MRT12] updated

The parameter C trades off margin maximisation against slack variable minimisation. A high value of C penalises margin errors heavily and therefore results in fewer margin errors and misclassified training instances, at the price of a narrower margin. A lower value permits more margin errors in order to achieve a larger margin, so that more training instances lie on or inside the margin. Thus, the parameter C controls the complexity of the SVM and is referred to as the complexity parameter. The complexity parameter of a soft margin SVM algorithm is usually set using n-fold cross-validation (see cross-validation in section 2.1.2). The Lagrange function is then given as

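\[
\begin{aligned}
\Lambda(\mathbf{w}, t, \xi_i, \alpha_i, \beta_i)
  &= \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \big( y_i(\mathbf{w} \cdot \mathbf{x}_i - t) - (1 - \xi_i) \big) - \sum_{i=1}^{n} \beta_i \xi_i \\
  &= \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i y_i (\mathbf{w} \cdot \mathbf{x}_i) + \sum_{i=1}^{n} \alpha_i y_i t + \sum_{i=1}^{n} \alpha_i + \sum_{i=1}^{n} (C - \alpha_i - \beta_i)\, \xi_i \\
  &= \Lambda(\mathbf{w}, t, \alpha_i) + \sum_{i=1}^{n} (C - \alpha_i - \beta_i)\, \xi_i
\end{aligned}
\]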


The βi are non-negative multipliers added to the Lagrange function, one for each slack variable constraint ξi ≥ 0. At the optimum, every partial derivative with respect to ξi must be zero. This results in C − αi − βi = 0 for 1 ≤ i ≤ n. Since every αi and βi is non-negative, C becomes an upper bound on αi. This upper bound appears as an additional constraint in the soft margin dual optimization problem given by

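\[
\alpha_1^*, \ldots, \alpha_n^* = \arg\max_{\alpha_1, \ldots, \alpha_n} \left\{ -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{n} \alpha_i \right\}
\]
\[ \text{subject to } C \ge \alpha_i \ge 0, \ 1 \le i \le n, \ \text{ and } \ \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{2.8} \]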

The beta multipliers behave similarly to the alphas. As before, αi = 0 implies that yi(w · xi − t) ≥ 1 and thus xi is not a support vector. Conversely, ξi > 0 is only possible when βi = 0, i.e., when αi = C, which means that xi is an outlier. A solution to the soft margin dual optimization problem divides the training instances into three groups [Fla12]:

• αi = 0 : these instances are outside or on the margin (but not support vectors);

• 0 < αi < C : these instances are support vectors on the margin;

• αi = C : these instances are on or inside the margin (maybe misclassified).
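The three groups can be read off a dual solution with a simple comparison against 0 and C. The sketch below is illustrative only; the function name alpha_group and the numerical tolerance tol are assumptions introduced because solvers return floating-point multipliers.

```python
def alpha_group(alpha_i, C, tol=1e-9):
    """Assign a training instance to one of the three groups by its multiplier."""
    if alpha_i <= tol:
        return "outside or on the margin (not a support vector)"
    if alpha_i >= C - tol:
        return "on or inside the margin (maybe misclassified)"
    return "support vector on the margin"


C = 1.0
for a in (0.0, 0.3, 1.0):
    print(a, "->", alpha_group(a, C))
```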

The weight vector w is given as in the “hard margin” SVM as

\[ \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i \]

The threshold t of the boundary equation can be calculated from any support vector on the margin, i.e., from any training instance xj with C > αj > 0. For such a vector we have

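\[
\begin{aligned}
y_j(\mathbf{w} \cdot \mathbf{x}_j - t) &= 1 \\
\mathbf{w} \cdot \mathbf{x}_j - t &= y_j \\
t &= -y_j + \mathbf{w} \cdot \mathbf{x}_j \\
t &= -y_j + \sum_{i=1}^{n} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}_j)
\end{aligned}
\tag{2.9}
\]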
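The following sketch recovers the weight vector and the threshold from given multipliers, following the expression for w and equation (2.9). The toy data set, the multipliers α = (0.5, 0.5), the value of C and the function names are assumptions; the multipliers were derived by hand for these two instances and are not the output of a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alpha, y, X):
    """w as the alpha- and label-weighted sum of the training instances."""
    d = len(X[0])
    return [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
            for k in range(d)]


def threshold(alpha, y, X, C, tol=1e-9):
    """t from the first support vector with 0 < alpha_j < C, equation (2.9)."""
    for j, a_j in enumerate(alpha):
        if tol < a_j < C - tol:
            return -y[j] + sum(a * yi * dot(xi, X[j])
                               for a, yi, xi in zip(alpha, y, X))
    raise ValueError("no support vector with 0 < alpha_j < C found")


X = [(3.0, 1.0), (1.0, 1.0)]   # assumed toy instances
y = [+1, -1]
alpha = [0.5, 0.5]             # hand-derived multipliers for this toy set
C = 10.0

w = weight_vector(alpha, y, X)
t = threshold(alpha, y, X, C)
print("w =", w, "t =", t)
```

With w = (1, 0) and t = 2, both instances satisfy yj(w · xj − t) = 1, i.e., they lie exactly on the margin, as expected for support vectors with 0 < αj < C.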


2.2 Active learning

Active learning is a subfield of machine learning and artificial intelligence. It is concerned with studying machine learning systems that improve by asking questions. “The key idea is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns” [Set12]. This implies that active learning utilizes the supervised machine learning setting (see supervised machine learning in section 2.1.1). The teacher, however, should be present not only before but also during the run time of the active (supervised) learning algorithm. In the literature about active learning, the teacher is also called the oracle. The active learning algorithm (learner) does not stay passive during the learning process. “It may pose queries usually in the form of unlabeled data instances to be labeled by the oracle” [Set12]. Active learning in this sense is called query learning. It is also closely related to “optimal experimental design” or “sequential design” in statistics. Active learning is introduced in this thesis with the concepts of binary classification in mind (see section 2.1.2).

An active learning algorithm works iteratively. It begins with only a few labeled instances (fewer than ten) as training data, which is called the seed. In each round the active learning algorithm greedily chooses, according to a utility measure, the “best” data instance(s) for querying. After getting their true labels from the oracle, the newly labeled instances are added to the training data (the former seed). The model(s) are re-trained from the updated training set, and then the next round starts. The active learning algorithm cycles until a stopping criterion is met. The active learning algorithm thus chooses its own training data, denoted by L. At the same time, it utilizes passive machine learning algorithms to train a (predictive) model θ. The learned model is usually the output of the active learning algorithm. Alternatively, the training data itself may be the output of the active learning algorithm. A new model then has to be trained passively from the actively obtained training data. Usually, this model should be of the same kind as the model(s) used in the active learning algorithm.

The utility measures decide which query the learning algorithm should ask next. Query selection frameworks and the utility measures they use are presented in section 2.2.2. Utility measures usually choose the most informative instance as the best data instance x∗ for querying. The most informative instance is the one that helps the learned model most to resolve its uncertainty about the training data. The active learning algorithm would not query instances “similar” to what it already knows, because they are not informative. As a result, the size of the training set |L| is expected to be smaller than that of a randomly sampled training set. The actively trained (predictive) model, however, would have the same prediction accuracy as (or an even better one than) the model trained passively from the larger, randomly sampled training set. Many papers utilizing active learning have reported this result, e.g., Tejada [TKM02], who compared their Active Atlas system with their old passive one.


Active learning is appropriate in many real-world problems, e.g., speech recognition, information extraction and filtering. In these cases, unlabeled data are easy to get, but their labels are difficult, time-consuming or expensive to obtain. Less training would largely reduce the expenses in such cases, which motivates active learning from an economical viewpoint. Smaller training data also means that the human annotation effort is reduced. The teacher only needs to label a small number of instances as a seed. In every round, she or he has to label just a few instances picked by the active learning algorithm. Nevertheless, if active learning is appropriate, its algorithm may access the unlabeled data and ask queries in different ways. These are called data access scenarios and will be discussed in section 2.2.1.

We should be careful before applying active learning principles to our learning algorithms. Learning a spam e-mail filter, for example, is more appropriately done using passive supervised learning. This is because training data instances are numerous and easy to get. Everyone wants her or his inbox to stay clean and voluntarily flags unwanted e-mails as spam. Labeling an e-mail as spam comes at no cost and can be done with a single click. The expense of applying an active learning query framework in this case is much greater. In addition, the model and the active learning algorithm used to train it should already be decided upon. This presupposes that the problem domain is well studied and that the best model to solve the learning task is already known. One more thing to make sure of is that the oracle is reliable, i.e., noise-free. If the oracle is unreliable, noise should be taken into account as well. The labeled training instances in this case might be re-queried based on adapted utility measures.

2.2.1 Data access scenarios

As mentioned earlier, the costs of obtaining training data mainly decide whether or not active learning is appropriate. If the answer is yes, then several ways exist in which the active learning algorithm may generate queries. We will have a look at these ways and then present query selection frameworks which decide the best instance(s) to query. Settles [Set12] differentiates in his book between three main scenarios found in the literature to access the data in active learning. These are query synthesis, stream-based selective sampling, and pool-based sampling.

Query synthesis

This scenario assumes that the active learning algorithm has only a definition of the instance space. It may request labels for any instance in the input space. The active learning algorithm synthesizes data instances and asks the oracle for their labels. The synthesized data instances may not originally exist in the data set on which we want to predict later. As a result, the algorithm queries without necessarily considering the underlying natural distribution of the unlabeled data. This simple scenario may also result in many repeated queries, which can be awkward for the human annotator. In addition, the synthesized queries, despite their informativeness, may not be recognizable by humans. For example, let us consider an active learning system to recognize handwritten characters. A synthesized “character” may have the highest informativeness based on the used utility measure. However, if it is not identifiable by humans, it would not be useful for learning. Query synthesis may be promising in domains where labels do not come from humans, but from experiments or something else.

Stream-based selective sampling

Since active learning is assumed to be appropriate, obtaining an unlabeled instance is inexpensive. One unlabeled instance is sampled from the underlying real distribution, i.e., from the unlabeled data. The active learning algorithm then decides whether to query or discard it. When the input data has a uniform distribution, selective sampling may offer no advantages over query synthesis. The advantages only become clear when the input distribution is non-uniform and unknown. The algorithm draws or scans unlabeled instances sequentially, one at a time, from the streaming data source. The decision for each instance is made individually, based on the utility measures presented later. This scenario may be the most appropriate for mobile and embedded devices, where memory or processing power may be limited. One more case would be when the data is too large for memory and must be scanned sequentially from disk.

Pool-based sampling

In this scenario, a large (non-changing) amount of unlabeled instances is gathered. Pool-based active learning assumes a small set of labeled instances L and a large pool of unlabeled instances U, which is available as a set. All instances in the pool are evaluated and ranked according to a utility measure. The learning algorithm then chooses the “best” instances to query. A subset of U can be used as the pool if the size of the pool |U| is very large. This is probably the most popular scenario in many real-world applications, e.g., text classification and information extraction. We can often assume pool-based active learning if no access scenario is explicitly mentioned. One should be careful not to confuse the pool-based and stream-based scenarios in the literature. Some authors use the term “selective sampling” to refer to the pool-based scenario and not the stream-based one. Their use of the term can be interpreted as making queries with a selected subset which is sampled from the available pool of data.


2.2.2 Query selection frameworks

Uncertainty sampling

The first query selection framework to be investigated is uncertainty sampling or “uncertainty reduction”. Uncertainty sampling active learning is also called “query by uncertainty” (Olsson [Ols09]). The idea is to sample from uncertain areas of the data which are found confusing by the trained model. Unlabeled instances for which the model cannot predict confidently are considered confusing. These instances are the ones queried to be labeled by the oracle and added to the training set L. Therefore, the active learning algorithm utilizes uncertainty measures to measure the uncertainty of candidate queries. The instance from the pool x ∈ U with the most uncertainty according to an uncertainty measure M is denoted by x∗M. It is the best instance x∗, which will be queried. Algorithm 2.3 is a generic uncertainty sampling algorithm that can be used with any of the uncertainty measures presented later.

Algorithm 2.3: Generic pool-based uncertainty sampling algorithm; source [Set12] updated

input : pool of unlabeled instances U = {xi}, seed of labeled instances L = {(xi, yi)}
output: predictive model θ

repeat
    θ ← train(L)
    x∗ ← x∗θ,M ∈ U
    y∗ ← query(x∗)
    L ← L ∪ {(x∗, y∗)}
    U ← U \ {x∗}
until stoppingCriterion
return model θ
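A runnable sketch of this loop is shown below. It is a simplified reading of algorithm 2.3 under stated assumptions: the stopping criterion is a fixed query budget, the final model is retrained once after the last query, and the one-dimensional threshold learner, the uncertainty function and the oracle are toy stand-ins invented for illustration.

```python
def uncertainty_sampling(pool, seed, train, uncertainty, query, max_queries):
    """Pool-based uncertainty sampling with a fixed query budget."""
    labeled = list(seed)      # the training set L
    unlabeled = list(pool)    # the pool U
    for _ in range(max_queries):
        if not unlabeled:
            break
        model = train(labeled)
        # Most uncertain instance in the pool under the current model.
        best = max(unlabeled, key=lambda x: uncertainty(model, x))
        labeled.append((best, query(best)))
        unlabeled.remove(best)
    return train(labeled), labeled


def train_threshold(labeled):
    """Toy model: midpoint between the largest negative and smallest positive."""
    largest_negative = max(x for x, y in labeled if y == -1)
    smallest_positive = min(x for x, y in labeled if y == +1)
    return (largest_negative + smallest_positive) / 2


def closeness_to_boundary(model, x):
    """Instances closer to the threshold are more uncertain."""
    return -abs(x - model)


def oracle(x):
    """Stand-in for the human teacher: the true boundary lies at 5."""
    return +1 if x >= 5 else -1


pool = [1, 2, 3, 4, 5, 6, 7, 8, 9]
seed = [(0, -1), (10, +1)]
model, labeled = uncertainty_sampling(
    pool, seed, train_threshold, closeness_to_boundary, oracle, max_queries=4)
print("threshold:", model)
print("queried:", [x for x, _ in labeled[len(seed):]])
```

In this toy run all four queries fall in the region between 2 and 5, close to the true boundary, while the instances 6 to 9 are never sent to the oracle.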

The active learning algorithm stops when stoppingCriterion is met. A passive supervised learning algorithm train() is used. It returns a model trained from the labeled instances passed to it as a parameter. The helper method query() asks the oracle to label the instance passed to it and returns the label back to the algorithm. The algorithm makes clear how intuitive uncertainty sampling is. Uncertainty-based utility measures are easy to use, as we will see later. Uncertainty sampling is simple and easy to implement, which makes it very popular in practice.

In terms of binary classification using support vector machines, the best queries are found near the decision boundary. The data instances closest to the decision boundary are the most confusing, because the confidence of their prediction is d ≈ 0, see equation (2.7). However, more general confidence measures are needed for models which cannot be expressed geometrically. Probabilities can be utilized nicely for this purpose, although a probabilistic interpretation is not strictly necessary for active learning. We use Pθ(y|x) to denote the probability that a classifier θ assigns the class label y to the instance x. This probability can be interpreted as how confident (unconfused) the model is about its prediction. In addition, ŷ denotes the most probable label to be assigned to x. Written mathematically, ŷ = argmax_y Pθ(y|x) is the most confident prediction made by model θ for instance x. Now we can define three uncertainty measures based on probability as a confidence measure.

The first utility measure that can be used with uncertainty sampling is least confident. This measure results in querying the instance whose most confident prediction is the least confident among all instances in the pool. This measure only takes the information about the best prediction for each instance into account.

$$x^*_{LC} = \arg\min_x P_\theta(\hat{y} \mid x) = \arg\max_x \bigl(1 - P_\theta(\hat{y} \mid x)\bigr) \tag{2.10}$$

The margin measure considers not only the most probable prediction ŷ1 but also the second most probable prediction, denoted by ŷ2. This measure suggests that a big margin between the first and second prediction means the prediction is confident. Otherwise there is confusion and the instance should be queried. The true label then helps the model best to differentiate between the two most probable labels.

$$x^*_{margin} = \arg\min_x \bigl(P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)\bigr) \tag{2.11}$$

The third measure is the entropy, which takes the probabilities of all labels into account. Entropy stems originally from the field of information theory and was introduced by Shannon [Sha48]. The entropy of a source, denoted by H, represents the minimum average number of bits required to uniquely encode a single source symbol. It is a measure of the average information contained in each symbol. It is utilized in machine learning as an uncertainty measure or impurity measure (recall decision trees in 2.1.3). Given the set of labels Y, the instance which contains the most information should be queried.

$$x^*_{H} = \arg\max_x \Bigl\{ -\sum_{y \in Y} P_\theta(y \mid x) \log P_\theta(y \mid x) \Bigr\} \tag{2.12}$$

Entropy is the most common measure used in uncertainty sampling active learning. However, any of the three measures can be employed in algorithm 2.3. One interesting point is that all three measures rate an instance as most uncertain when the probability of predicting each label is equal. The most informative instance has a prediction probability equal to 1/|Y| for each label in Y. As a result, in a binary classification problem with y ∈ Y = {+1, −1}, the instance with Pθ(y|x) ≈ 0.5 will be queried. This reduces to querying the instances closest to the decision boundary when using an SVM. In fact, any confidence score (probability, distance, etc.) provided along with the model predictions can be employed in the uncertainty measures presented above.
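The three measures of equations (2.10) to (2.12) can be compared on a small example. The following sketch assumes that the model's predictions are available as plain lists of class probabilities; the function names and the pool layout are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List


def least_confident(p: List[float]) -> float:
    """Uncertainty 1 - P(y_hat | x); larger means more uncertain."""
    return 1.0 - max(p)


def margin(p: List[float]) -> float:
    """Gap between the two most probable labels; smaller means more uncertain."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]


def entropy(p: List[float]) -> float:
    """Shannon entropy (natural log); terms with probability 0 contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def select(pool: Dict[str, List[float]]) -> Dict[str, str]:
    """Return the instance each measure would query from the pool."""
    return {
        "least_confident": max(pool, key=lambda k: least_confident(pool[k])),
        "margin": min(pool, key=lambda k: margin(pool[k])),
        "entropy": max(pool, key=lambda k: entropy(pool[k])),
    }
```

For the pool `{"x1": [0.5, 0.45, 0.05], "x2": [0.4, 0.3, 0.3]}`, least confident and entropy both select `x2`, while the margin measure selects `x1`, because its two best labels are almost tied. The measures therefore need not agree once there are more than two labels.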


Query by committee

A hypothesis is a predictive model which generalizes from training data to make predictions on unseen data instances (see supervised learning in 2.1.1). The hypothesis space is the set of all hypotheses, denoted by H. The version space V ⊆ H is the subset of hypotheses which are consistent with the training data L. Consistent means that the hypotheses in V make correct predictions for all (labeled) training instances in L. Now, suppose one of the hypotheses in the version space represents the target prediction (labeling) function that needs to be learned. Obtaining more (labeled) training instances will reduce the number of candidate hypotheses |V|. The remaining candidate hypotheses in V will thus approximate the target function more accurately. As a result, active learning should query the instances which reduce the size of the version space the most.

When using an SVM for binary classification, H is the set of all possible hyperplanes in the input space R^d. The version space V is the subset of hyperplanes separating the linearly separable training instances. Figure 2.6 shows an example decision boundary with four support vectors. The SVM (left) achieves the largest margin of separation of the training instances and is here the target function. The highlighted area (right) represents the version space V ⊆ H. Any hyperplane inside the highlighted area is a possible candidate for the target function (decision boundary). Knowing the label of any data instance inside this region would shrink the version space, because not all candidate hyperplanes would stay consistent with the new training set. As a result, the highlighted area defines the region of disagreement among the candidate hypotheses. We should search through this region in order to find the instance on which the candidate hypotheses disagree the most.

Figure 2.6: The version space and the target function; source [Set12]

The query by committee algorithm, QBC for short, maintains a so-called committee of hypotheses C = {h1, h2, ..., hk}. The committee serves as an approximation of the set of candidate hypotheses, C ≈ V. This implies that all hypotheses of the committee should be consistent with the current training data L. The reason behind the committee is the difficulty of measuring the disagreement of all candidate hypotheses in practice. The region of disagreement is now given by disagreement measures among committee members. The most informative instance to query is the one with the highest disagreement score among committee members. Knowing its label should reduce the version space the most. The query by committee approach is presented in algorithm 2.4.

Algorithm 2.4: Pool-based query by committee algorithm; source [Set12] updated

input : pool of unlabeled instances U = {xi},
        seed of labeled instances L = {(xi, yi)}

output: training data L

repeat
    C ← generateCommittee(L);
    x∗ ← x∗C,M ∈ U;
    y∗ ← query(x∗);
    L ← L ∪ {(x∗, y∗)};
    U ← U \ {x∗};
until stoppingCriterion;
return set L

The active learning algorithm stops when stoppingCriterion is met. The algorithm needs a function generateCommittee() which can train hypotheses consistent with the labeled instances passed to it as a parameter. It uses a passive supervised learning algorithm internally. The instance from the pool x ∈ U on which the committee members disagree the most according to measure M is x∗C,M. This instance is the most informative one, x∗, and it should be queried. The helper method query() asks the oracle to label the instance passed to it and returns the label to the algorithm. It is now clear that two things are required to implement the QBC strategy:

(1) have a way to construct a committee of hypotheses which are consistent with the version space;

(2) have a measure of disagreement among committee members.

Many alternatives can be found in the literature to solve (1). Committee hypotheses may be randomly sampled from the version space, for example by randomly choosing parameters for the linear hyperplane equation. In practice, however, randomized sampling tends to be computationally expensive and it does not work for noisy data. Alternatively, Abe and Mamitsuka [AM98] have proposed query-by-boosting (QBoost) and query-by-bagging (QBag). Originally, boosting and bagging are techniques to enhance the performance of an existing learning algorithm. The idea of both is to run the learning algorithm many times on a set of re-sampled data. One more point related to (1) is the number of committee members. There is no general agreement in the literature on the appropriate committee size to use. The number may vary by model or application from five to fifteen. However, even small committee sizes (two or three) have been shown to work well in practice [Set12].

Regarding (2), two main measures of disagreement have been proposed. The first one is vote entropy, presented by Dagan and Engelson [DE95], which can be thought of as a QBC generalization of the entropy-based uncertainty measure. The prediction a hypothesis h makes for an instance is its vote. We define the number of votes a label y receives for an instance x among the hypotheses of committee C as $V_C(y, x) = \sum_{h \in C} 1_{\{h(x) = y\}}$. Given the set of labels Y, vote entropy is defined as

$$x^*_{VE} = \arg\max_x \Bigl\{ -\sum_{y \in Y} \frac{V_C(y, x)}{|C|} \log \frac{V_C(y, x)}{|C|} \Bigr\}$$

This is called the “hard” entropy measure. A “soft” entropy measure can be defined as well. Suppose each hypothesis provides a confidence measure with its prediction, e.g., the probability Ph(y|x). We can define a committee confidence for each prediction, $P_C(y \mid x) = \frac{1}{|C|} \sum_{h \in C} P_h(y \mid x)$. This is now the “consensus” confidence [Set12] of the committee C that y is the prediction for x. Soft vote entropy is defined as

$$x^*_{SVE} = \arg\max_x \Bigl\{ -\sum_{y \in Y} P_C(y \mid x) \log P_C(y \mid x) \Bigr\}$$

We can see that both measures are variants of the entropy uncertainty measure defined in equation (2.12). Instead of a single-hypothesis confidence measure, a committee-based generalization is applied: the committee confidences $V_C(y, x)/|C|$ and $P_C(y \mid x)$ are used instead of the confidence of a single hypothesis. Such generalizations can also be made for the least confident measure in equation (2.10) and the margin measure in equation (2.11). In binary classification, maximizing the vote entropy reduces to minimizing the absolute difference between the numbers of votes for the two classes, |VC(+1, x) − VC(−1, x)|.
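Both vote entropy variants are easy to compute once the committee's outputs are available. The sketch below assumes that hard votes are given as a list of predicted labels and that soft predictions are given as one label-to-probability dictionary per committee member; these data layouts and the function names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def vote_entropy(votes: List[int]) -> float:
    """Hard vote entropy from the label predicted by each committee member."""
    size = len(votes)
    return -sum((v / size) * math.log(v / size)
                for v in Counter(votes).values())


def consensus(member_probs: List[Dict[int, float]]) -> Dict[int, float]:
    """Committee confidence P_C(y|x): mean of the members' probabilities."""
    return {y: sum(p[y] for p in member_probs) / len(member_probs)
            for y in member_probs[0]}


def soft_vote_entropy(member_probs: List[Dict[int, float]]) -> float:
    """Entropy of the consensus distribution P_C(y|x)."""
    return -sum(q * math.log(q)
                for q in consensus(member_probs).values() if q > 0.0)
```

A unanimous committee such as `[+1, +1, +1, +1]` has vote entropy 0, while an evenly split committee `[+1, -1, +1, -1]` reaches the binary maximum log 2.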

Another disagreement measure has been proposed based on the Kullback-Leibler (KL) divergence (Kullback and Leibler [KL51]). The KL divergence, which comes from the field of information theory, is a measure of the difference between two probability distributions. The disagreement among committee members is defined as the average divergence between the prediction of each committee member and the “consensus” prediction of the committee.

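$$x^*_{KL} = \arg\max_x \Bigl\{ \frac{1}{|C|} \sum_{h \in C} KL\bigl(P_h(y \mid x) \,\|\, P_C(y \mid x)\bigr) \Bigr\}$$

$$\text{where } KL\bigl(P_h(y \mid x) \,\|\, P_C(y \mid x)\bigr) = \sum_{y \in Y} P_h(y \mid x) \log \frac{P_h(y \mid x)}{P_C(y \mid x)}$$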


The KL measure represents the spirit of the QBC strategy more than vote entropy. Suppose we have an instance whose “consensus” prediction is uncertain for the whole committee. Soft vote entropy would suggest this instance for querying even if it is uncertain for each one of the committee members, i.e., the members agree. KL divergence, on the other hand, would not suggest querying the instance in this case. It would prefer another instance whose “consensus” prediction is also uncertain for the committee, but on whose label the members disagree more.
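The difference between the two measures can be reproduced with a two-member committee. In this sketch each member's prediction is assumed to be a label-to-probability dictionary; the helper names are hypothetical.

```python
import math
from typing import Dict, List

Dist = Dict[int, float]  # label -> probability


def consensus(members: List[Dist]) -> Dist:
    """Committee confidence P_C(y|x): mean of the members' probabilities."""
    return {y: sum(p[y] for p in members) / len(members) for y in members[0]}


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q); labels with p(y) = 0 contribute nothing."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0.0)


def kl_disagreement(members: List[Dist]) -> float:
    """Average divergence of each member from the consensus."""
    pc = consensus(members)
    return sum(kl(p, pc) for p in members) / len(members)


def soft_vote_entropy(members: List[Dist]) -> float:
    """Entropy of the consensus distribution."""
    return -sum(q * math.log(q) for q in consensus(members).values() if q > 0.0)


# Both committees have the same consensus (0.5, 0.5) ...
agree = [{+1: 0.5, -1: 0.5}, {+1: 0.5, -1: 0.5}]
disagree = [{+1: 0.9, -1: 0.1}, {+1: 0.1, -1: 0.9}]
```

Soft vote entropy is log 2 for both committees, so it cannot tell them apart. The KL disagreement is 0 for `agree` and clearly positive for `disagree`, so only the second instance would be preferred for querying.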

2.3 Data matching

Data matching can be defined as “the process of comparing records from two (or more) databases in order to identify pairs or groups of records that refer to the same real-world entity. These pairs or groups of records are known as matches” [Chr12]. Data matching is also referred to as record linkage, object identification, duplicate detection and entity resolution. In a single database, however, data matching is called deduplication. Data from different kinds of sources may be matched, but usually data matching is applied to databases. Each database record represents one real-world entity. The data matching process, according to the book “Data Matching” by Peter Christen [Chr12], is composed of five major steps, illustrated in figure 2.7.

Figure 2.7: Data matching steps of two databases; source [Chr12] updated


The data used for data matching can vary in format, structure and content. Thus, the first step is data pre-processing, which ensures that the data from both sources are in the same format. The second step, indexing, reduces the quadratic complexity of the data matching process. Quadratic complexity comes from the fact that each record in one database must, theoretically, be compared with all records in the other database to detect matches. Next, candidate record pairs are generated from the indexing data structures and compared using a variety of similarity measures. The similarity scores between the pair of compared records are collected in comparison vectors. In the classification step, a matcher (matching function) performs the actual matching. It classifies each comparison vector as “match” or “non-match”. However, an additional class, called “potential match”, can be used. Finally, the precision and recall of the matcher are evaluated in the evaluation step. Matching precision refers to how many of the predicted matches correspond to true ones. Recall, on the other hand, indicates how many of the true matches were captured.
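The two evaluation figures can be computed directly from the sets of predicted and true matches. The sketch below assumes that a match is identified by a pair of record identifiers and that an empty set yields a value of 0; both are illustrative conventions, not taken from [Chr12].

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (record id in database A, record id in database B)


def precision_recall(predicted: Set[Pair], true: Set[Pair]) -> Tuple[float, float]:
    """Precision: share of predicted matches that are true matches.
    Recall: share of true matches that were predicted."""
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall
```

For example, if three matches are predicted, two of them are correct and four true matches exist, precision is 2/3 and recall is 1/2.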

The classification step can be fulfilled in several ways. Supervised and unsupervised machine learning (see section 2.1) may be utilized in this step. Hence comparison vectors become the data instances of the machine learning approaches. Unsupervised learning is distance-based and uses weights and thresholds to learn a matching function. Supervised learning is classification-based and utilizes training data (labeled comparison vectors) to learn a matcher (classifier). Intuitively, the classification step can be thought of as a binary classification task with the two classes “match” (positive) and “non-match” (negative), as seen in section 2.1.2. Active learning approaches (see section 2.2) may be utilized in supervised data matching as well.

2.3.1 Spatial data matching

Data matching is traditionally applied to textual objects by calculating similarity scores between strings. Spatial data matching needs special attention, because spatial objects are more complex. A spatial object has, in addition to normal (non-spatial) attributes, spatial attributes holding the geographical and geometric information about the real-world object it represents. Spatial objects may have thematic attributes as well, such as object classes, statistics, etc. Appropriate similarity measures are required for these spatial attributes in addition to the commonly known similarity measures. The similarity scores calculated by the similarity measures are called attribute similarities.

The topology of the spatial objects can be utilized in the data matching process. Topology can be used in order to limit the “search region” for matches. In two pre-processed maps with unified coordinates, the candidate matches for a spatial object can only be in the region around it. Dividing data into partitions reduces the number of candidate matches available for each record. As a result, candidate matches can be generated more efficiently. Relations between neighboring spatial objects may also be employed to define additional similarity measures. The similarity scores calculated by such measures express the relational similarity. A pair of spatial objects which has matched neighbors all around it is more likely to be a match. Attribute and relational similarity scores can be combined to produce a total similarity score.

Different spatial sources may have varied models for the same real-world maps. This results in various representations of the same real-world objects. One representation may be formed as a single complex spatial object, while another may be aggregated from many simpler spatial objects. Therefore, spatial data matching may need to allow m:n matches in addition to 1:1 matches. Alternatively, the spatial data matching process can capture only 1:1 matches, including matches between aggregated objects. This approach combines several objects into one and finds matches between these newly composed objects as well. Criteria should be defined, however, to determine and control how aggregations are built.

2.3.2 SimMatching

This thesis is part of Michael Schäfers’ project [Sch13], which aims at matching road maps from different sources. The roads are stored in a database as spatial objects. The data matching algorithm is called SimMatching and it is similarity-based. It relies on distances or similarities between spatial objects to detect matches, i.e., it is unsupervised. After a pre-processing phase, SimMatching divides data into partitions and works iteratively to find matches. SimMatching uses rules for matching objects that can eliminate or confirm a match. More detailed information can be found in [Sch13].

SimMatching uses the spatial data matching concepts presented in section 2.3.1 to fulfil its task. The road objects are divided into segments before the matching starts. Segments and aggregations of segments are matched in order to capture potential m:n matches. Attribute and relational similarity measures are employed in SimMatching to calculate similarity scores. The scores are combined as a weighted sum to calculate the total similarity score between two (aggregated) spatial objects.

$$\mathrm{sim}(a, b) = \sum_i w_i \cdot \mathrm{sim}_i(a, b) \tag{2.13}$$

where $\sum_i w_i = 1$ and $\mathrm{sim}_i$ is a similarity measure

The weight of a similarity measure gives the proportion its similarity contributes to the total similarity. Hence the weights determine the influence of the similarity measures on the matching process. The similarity scores are normalized so that they fall into the range [0, 1]. If the total similarity score reaches or exceeds a certain predefined similarity threshold, the two objects are considered a match. The weights wi and the threshold thr used in SimMatching are parameters which need to be configured.

$$\mathrm{sim}(a, b) \geq \mathrm{thr} \iff (a, b) \text{ is a match} \tag{2.14}$$
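Equations (2.13) and (2.14) translate into a few lines of code. The sketch below is a minimal illustration, not SimMatching's implementation; the function names and the tolerance used to check that the weights sum to 1 are assumptions.

```python
from typing import Sequence


def total_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of normalized similarity scores, equation (2.13)."""
    if len(scores) != len(weights):
        raise ValueError("one weight per similarity measure is required")
    if abs(sum(weights) - 1.0) > 1e-9:  # tolerance is an assumption
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


def is_match(scores: Sequence[float], weights: Sequence[float], thr: float) -> bool:
    """Threshold decision, equation (2.14)."""
    return total_similarity(scores, weights) >= thr
```

With the scores (0.9, 0.8, 0.5) and the weights (0.5, 0.3, 0.2), the total similarity is 0.45 + 0.24 + 0.10 = 0.79, so the pair is a match for thr = 0.75 but not for thr = 0.8.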


SimMatching parameters need to be configured for every new dataset being matched. One reason is that, for example, the road names may be incomplete, which makes name similarity measures less useful, so their weight should be adjusted (in this case lowered) accordingly. The parameters are usually set and updated manually until SimMatching achieves satisfactory matching results. However, alternative ways may exist to set SimMatching parameters automatically. We call the automatically found parameters optimized, because they are meant to help SimMatching match better. This process is what we call parameter optimization, and it is the problem to be solved in this thesis. From now on, the word parameters refers to the value of the similarity threshold and the weights of the similarity measures of SimMatching.

Similarity measures

Similarity measures can be based on attribute or relational similarities. The attribute similarity measures calculate similarity scores between objects based on their attribute values. In spatial data the attributes can be semantic (non-spatial) or geometric (spatial). Relational similarity measures compare the “context” in which the objects are found. In a road map, for example, the context of a road object is the part of the road network connected to it. All the measures calculate normalized similarity scores in the interval [0, 1]. Here are some of the many possible similarity measures that can be used [SL12]:

• The name similarity measure: a semantic attribute measure which compares the names of two spatial objects based on a string similarity measure between the two name strings;

• The length similarity measure: a geometric attribute measure which compares the real length of two spatial line objects;

• The angle similarity measure: a geometric attribute measure which compares the angle between two spatial line objects;

• The start point similarity: a geometric attribute measure which compares the distance between the start points of two spatial line objects;

• The end point similarity: a geometric attribute measure which compares the distance between the end points of two spatial line objects;

• The object class similarity: a semantic attribute measure which returns the similarity between the “object type” of each of the two spatial objects based on manually defined similarity scores;

• The direct neighborhood similarity (DNS): a relational measure which returns the context similarity between two spatial objects as a ratio. The ratio is the proportion of the confirmed neighboring pairs to the confirmed and possible neighboring pairs in the vicinity around the compared pair.
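Two of the measures listed above can be sketched as follows. The DNS ratio follows the description in the list; returning 0 when there are no neighboring pairs at all is an assumption. The length similarity formula (shorter length divided by longer length) is purely hypothetical, since the exact formulas are defined in [SL12] and not reproduced here.

```python
def direct_neighborhood_similarity(confirmed: int, possible: int) -> float:
    """Ratio of confirmed neighboring pairs to confirmed plus possible ones."""
    if confirmed < 0 or possible < 0:
        raise ValueError("counts must be non-negative")
    total = confirmed + possible
    return confirmed / total if total else 0.0  # empty vicinity: assumed 0


def length_similarity(length_a: float, length_b: float) -> float:
    """Hypothetical normalized length comparison in [0, 1]."""
    if length_a < 0 or length_b < 0:
        raise ValueError("lengths must be non-negative")
    longer = max(length_a, length_b)
    return min(length_a, length_b) / longer if longer else 1.0
```

Both functions return values in [0, 1], as required of all measures. For instance, three confirmed and one possible neighboring pair give a DNS of 0.75.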


The matching objects

SimMatching uses the term matching to refer to identified matches and non-matches. We use the term matching objects to refer to these matchings. “The term matching is used to describe a single link between two objects representing the same real world object... Every matching holds references to the involved objects. A similarity value can be declared... Additionally every matching has an explicit status label” [SL12]. The object references may refer to line segments or aggregations. The similarity value is calculated according to equation (2.13).

Each matching object must have one of four match statuses: candidate, possible, confirmed or rejected. Candidate matching objects do not have a similarity value yet. Possible means the similarity value has been calculated but the matching has not been classified. Similarity values are updated during the matching process until a decision can be made and each matching object is confirmed as being a match or rejected. Once a matching has been labeled (as confirmed or rejected), its similarity score no longer changes. From then on, labeled matching objects are considered only by relational similarity measures (see direct neighborhood similarity above).

SimMatching uses a similarity threshold to classify and thereby label matching objects, i.e., to confirm or reject them as shown in equation (2.14). However, constraints may be utilized to confirm or reject possible matching objects regardless of their total similarity. As a result, these constraints must also be examined before labeling. For example, a constraint may be added based on object type in order to prevent lines representing railways from being matched with lines representing rivers.
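The interplay of statuses, constraints and threshold can be sketched as follows. The class layout, the constraint signature and the rule that the first constraint with a verdict wins are assumptions made for illustration; they are not SimMatching's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    CANDIDATE = "candidate"  # no similarity value yet
    POSSIBLE = "possible"    # similarity calculated, not yet classified
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Matching:
    class_a: str                        # object type of the object from A
    class_b: str                        # object type of the object from B
    similarity: Optional[float] = None
    status: Status = Status.CANDIDATE


# A constraint returns a forced status, or None if it does not apply.
Constraint = Callable[[Matching], Optional[Status]]


def no_railway_river(m: Matching) -> Optional[Status]:
    """Never match a railway with a river, whatever the similarity."""
    if {m.class_a, m.class_b} == {"railway", "river"}:
        return Status.REJECTED
    return None


def classify(m: Matching, thr: float, constraints: List[Constraint]) -> Status:
    """Examine constraints first, then apply the similarity threshold."""
    if m.similarity is None:
        return m.status  # still a candidate, nothing to decide
    for constraint in constraints:
        verdict = constraint(m)
        if verdict is not None:
            m.status = verdict
            return m.status
    m.status = Status.CONFIRMED if m.similarity >= thr else Status.REJECTED
    return m.status
```

A railway/river pair with similarity 0.95 is rejected by the constraint even though it exceeds a threshold of 0.7, while a road/road pair with similarity 0.8 is confirmed.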

The SimMatching algorithm

SimMatching compares and matches two road maps in each run. The pair of maps is represented as a link object which has two database components: the two maps A and B. The similarity measures used with each link may differ depending on the properties of the compared maps. SimMatching parameters may be different for each link as well. The link object provides access to all the spatial objects of the compared maps. SimMatching takes a link object and its suitable parameters as input, and it outputs the matching objects as matches or non-matches.

The SimMatching algorithm can be divided into two phases: a first initialization phase and a second iterative phase which conducts the real matching process, i.e., detects the matches. One of the most important things done in the first phase is the preprocessing, in which line segments are constructed and the neighbors of each segment are calculated. The next step after preprocessing is generating initial matching objects. The number of possible matching objects is reduced efficiently by generating these matchings only from the vicinity around each spatial object. Additionally, a candidate threshold and candidate constraints are utilized to eliminate matching objects that have no chance of being matches. At the end of the first phase the similarity values are calculated for each initial matching object so that the second phase can start.

The second phase iterates over all matching objects until all of them have been classified. At each iteration the matching object with the highest similarity value is processed. Aggregations are constructed separately for both spatial objects to which the matching object refers. New matching objects are constructed for the aggregations that cause the similarity value to increase (compared with the similarity value of the matching object being processed). The matching object with the highest similarity may have changed as a result. Next, the matching object with the highest similarity is confirmed (classified as a match) or rejected (classified as a non-match). A clean-up operation follows to make sure that each spatial object is involved in at most one matching object, which guarantees a unique and conflict-free matching result. Finally, the similarity values of the neighboring matching objects are recalculated, because their relational similarity values (see direct neighborhood similarity above) may have changed.
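The core of the iterative phase can be sketched as a greedy loop. This is a strongly simplified illustration: aggregation and the recalculation of relational similarities are left out, the similarity values are assumed to be fixed, and conflicting matchings removed by the clean-up are assumed to end up as rejected. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (object id from map A, object id from map B)


def greedy_matching(similarities: Dict[Pair, float],
                    thr: float) -> Tuple[List[Pair], List[Pair]]:
    """Process matchings by descending similarity until all are classified."""
    open_matchings = dict(similarities)  # not yet classified
    confirmed: List[Pair] = []
    rejected: List[Pair] = []
    while open_matchings:
        pair = max(open_matchings, key=open_matchings.get)
        similarity = open_matchings.pop(pair)
        if similarity >= thr:
            confirmed.append(pair)
            a, b = pair
            # Clean-up: each object may take part in at most one match.
            conflicts = [p for p in open_matchings if p[0] == a or p[1] == b]
            for other in conflicts:
                del open_matchings[other]
                rejected.append(other)
        else:
            rejected.append(pair)
    return confirmed, rejected
```

In the example `{("a1","b1"): 0.9, ("a1","b2"): 0.85, ("a2","b2"): 0.8, ("a3","b3"): 0.4}` with thr = 0.7, the pair (a1, b1) is confirmed first, which removes (a1, b2) even though its similarity is above the threshold; (a2, b2) is then confirmed and (a3, b3) rejected.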

Apart from the algorithm phases, SimMatching relies on partitioning in order to be able to process huge data sets of road maps. This is due to SimMatching’s in-memory processing approach and limited memory resources on client machines [SL12]. The spatial data is partitioned into disjoint rectangular partitions. Since the data in each partition is disjoint and independent of the data in other partitions, the partitions can be processed in parallel. This noticeably improves the matching performance. For each partition from one database, SimMatching defines an expanded rectangular area around the corresponding partition from the other database (and vice versa) in order to include enough context information for spatial objects near partition borders [SL12].


Chapter 3

System Design

This chapter explains the core work done in this thesis. It describes the proposed solution to the parameter optimization problem introduced earlier in section 2.3.2. The system is designed based on the concepts which have already been presented in chapter 2. The output of the system is the set of parameters which can be used to configure SimMatching. The system is meant to find the parameters that optimize SimMatching’s performance and is hence referred to as the optimizer. First, the intuition behind the solution is explained and justified. Then, we look more thoroughly at the components which are required to build the desired system. Finally, we introduce the optimizer algorithm and how it uses the functionality of SimMatching.

3.1 The intuition

We have seen in section 2.3.2 that the SimMatching algorithm utilizes similarities between spatial objects to detect matches. The compared spatial objects are line segments representing roads from two different databases. We will use simj to refer to the j-th similarity measure from the set of similarity measures {sim1, ..., simd}. We will use simj(a, b) to denote the similarity score between two spatial objects, a from database A and b from database B, according to the j-th similarity measure. The total similarity score between a pair of compared spatial objects, denoted by sim(a, b), is calculated as a weighted sum of the attribute and relational similarities as given in equation (2.13). Hence the total score is a linear combination of the similarity measures used. The similarity threshold thr used in equation (2.14) is the boundary between the “matches” and “non-matches”. Since each similarity score is a real number, simj(a, b) ∈ R, the d scores of a pair form a point in R^d, and the equation of the decision boundary in the space R^d is given by

$$\mathrm{sim}(a, b) = \sum_{j=1}^{d} w_j\,\mathrm{sim}_j = w_1\,\mathrm{sim}_1 + \dots + w_d\,\mathrm{sim}_d = \mathrm{thr} \tag{3.1}$$

Any point x = (sim1(a, b), ..., simd(a, b)) ∈ R^d is considered positive (+1) if and only if the left-hand side of equation (3.1), evaluated at x, is greater than thr. Positive in this context means that if x represents the comparison vector of a pair of objects (a, b), the pair is classified as “match”. On the other hand, the point x is considered negative (−1) if the left-hand side of equation (3.1), evaluated at x, is less than thr. In that case, the pair (a, b) is classified as “non-match”. The weights of SimMatching and its similarity threshold can be configured from the parameters of the linear equation (3.1). Finding concrete values for these parameters automatically would provide a solution to our parameter optimization problem.

The key idea of the solution is to obtain a training set of labeled comparison vectors. The comparison vectors contain similarity scores between pairs of spatial line objects. An SVM learning algorithm is then used to train an optimal classifier. The optimal classifier is a linear decision boundary with maximum margin from the training comparison vectors. The linear equation of the optimal decision boundary has optimal parameters, which are used to configure the SimMatching algorithm in order to “optimize its decision boundary”. The intuition of the solution becomes clear when comparing the boundary equation of SimMatching (3.1) with the SVM decision boundary equation, given as

$$w \cdot x = w_1 x_1 + \dots + w_d x_d = t$$
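One detail is worth spelling out: equation (2.13) requires the SimMatching weights to sum to 1, whereas the coefficients of a learned hyperplane generally do not. A possible way to bridge this, shown here only as an assumption and not necessarily the procedure used by the optimizer, is to divide both sides of the boundary equation by s = w1 + ... + wd. If s > 0, then w · x ≥ t holds exactly when (w/s) · x ≥ t/s, so the classification of every comparison vector is unchanged.

```python
from typing import List, Sequence, Tuple


def to_simmatching_parameters(w: Sequence[float],
                              t: float) -> Tuple[List[float], float]:
    """Rescale hyperplane coefficients so that the weights sum to 1.

    Assumes sum(w) > 0; otherwise dividing would flip or break the
    inequality. Negative individual coefficients are not handled here.
    """
    s = sum(w)
    if s <= 0:
        raise ValueError("coefficients must have a positive sum")
    return [wj / s for wj in w], t / s
```

For example, the hyperplane with w = (2, 1, 1) and t = 2.4 becomes the weights (0.5, 0.25, 0.25) with threshold 0.6. The comparison vector (0.9, 0.5, 0.4) lies on the positive side in both forms: 2.7 ≥ 2.4 and 0.675 ≥ 0.6.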

3.2 The comparison vectors

A comparison vector is used to compare two spatial objects a (from database A) and b (from database B). Each comparison vector contains references to the compared objects’ IDs: ida and idb. In addition, a comparison vector collects the similarities between the two objects as similarity scores. The similarity scores are calculated by similarity measures which assess the similarities between the compared objects according to certain aspect(s) [SL12]. Since we want to learn the weights of SimMatching’s similarity measures, the same measures are used inside the optimizer (see similarity measures in 2.3.2). The optimizer obtains the similarity measures to use with each link object (see algorithm in 2.3.2) from SimMatching automatically at start. A label, which can be “match” or “non-match”, is associated with each comparison vector as well.

Comparison vectors can be easily generated from the matching objects of SimMatching (see the matching objects in 2.3.2). A matching object contains all the information required to construct a comparison vector, i.e., the spatial objects’ IDs and their similarity scores. The method buildInitialMatchings() in algorithm 3.1 builds matching objects comparing pairs of spatial objects (a, b). It builds a zone around each spatial object a from database A. The spatial object from database B which matches a (if it exists) can only be found in the direct vicinity of a. One matching object is created from a and each object from B which lies inside this zone. SimMatching has a buffer parameter, which is not optimized, to control the size of the zone. More about building matching objects can be found in [SL12].

Algorithm 3.1: buildInitialMatchings

input : SimMatching parameters parameters

output: Set of Matching objects matchings

Set matchings;
foreach SpatialObject spatialObjectA in DatabaseA do
    foreach SpatialObject spatialObjectB in spatialObjectA.zone do
        Matching matching ← buildMatching(spatialObjectA, spatialObjectB, parameters);
        matchings.add(matching);
    end
end
return matchings;

The buildMatching() method creates a new matching object and calculates the similarity scores between the two compared spatial objects. The variable parameters holds the similarity threshold and the weight of each similarity measure. The weights are used by buildMatching() to calculate the total similarity score between the compared objects as a linear combination of the similarity scores returned by the applied similarity measures. The DNS scores calculated by buildInitialMatchings() are all zero, because no matching objects have been confirmed yet.
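The structure of algorithm 3.1 can be sketched in Python as follows. The sketch makes several simplifying assumptions: every spatial object is reduced to one representative point, the zone is a square of half-width `buffer` around that point, and candidates are found with a naive nested loop, where a real system would more likely use a spatial index. The class, the helper names and the `build_matching` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

M = TypeVar("M")  # type of the matching object produced by the callback


@dataclass(frozen=True)
class SpatialObject:
    id: str
    x: float  # representative point, a simplification of a line segment
    y: float


def in_zone(a: SpatialObject, b: SpatialObject, buffer: float) -> bool:
    """True if b lies inside the square zone of half-width `buffer` around a."""
    return abs(a.x - b.x) <= buffer and abs(a.y - b.y) <= buffer


def build_initial_matchings(database_a: Sequence[SpatialObject],
                            database_b: Sequence[SpatialObject],
                            buffer: float,
                            build_matching: Callable[[SpatialObject, SpatialObject], M]
                            ) -> List[M]:
    """Create one matching per pair (a, b) where b lies in the zone around a."""
    matchings: List[M] = []
    for a in database_a:
        for b in database_b:
            if in_zone(a, b, buffer):
                matchings.append(build_matching(a, b))
    return matchings
```

The `buffer` argument plays the role of SimMatching's buffer parameter: a larger value produces more initial matchings and thus more comparison vectors.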

The SimMatching algorithm finds 1:1 matches between aggregations in order to capture m:n matches between line segments. Hence, the learned parameters should be optimal for those aggregations as well, and comparison vectors between aggregated spatial objects must also be generated. The optimizer algorithm should construct aggregations in the same way SimMatching does. All possible aggregations are constructed from each object of the compared pair. An aggregation is added as a new spatial object if the total similarity score of the pair involving the aggregation is higher than that of the pair involving the original object [SL12]. This aggregation technique therefore only allows aggregations that increase similarity. The comparison vectors of the aggregations have the same structure as the ones for segments. The identifiers, however, may now refer to aggregations or segments. The buildAggregatedMatchings() method builds aggregations and the matching objects which contain them as specified in algorithm 3.2.


Algorithm 3.2: buildAggregatedMatchings; source [SL12] updated

input : matching object matching, SimMatching parameters parameters
output: set of Matching objects aggSet

Set aggSet;
SpatialObject spatialObjectA ← matching.spatialObjectA;
SpatialObject spatialObjectB ← matching.spatialObjectB;
foreach SpatialObject neighbor in spatialObjectA.neighbors do
    SpatialObject aggObject ← aggregate(spatialObjectA, neighbor);
    Matching aggMatching ← buildMatching(aggObject, spatialObjectB, parameters);
    if aggMatching.totalSimilarity > matching.totalSimilarity then
        aggSet.add(aggMatching);
    end
end
foreach SpatialObject neighbor in spatialObjectB.neighbors do
    SpatialObject aggObject ← aggregate(spatialObjectB, neighbor);
    Matching aggMatching ← buildMatching(spatialObjectA, aggObject, parameters);
    if aggMatching.totalSimilarity > matching.totalSimilarity then
        aggSet.add(aggMatching);
    end
end
return aggSet;

A spatial object can only be aggregated with neighbors that share a common end point with it. The aggregate() method creates a new spatial object by merging the geometries of the original objects (which may themselves be aggregations). This can be done, however, only if certain conditions are not violated, e.g., regarding road names, types, etc. The DNS scores calculated by the buildMatching() method in buildAggregatedMatchings() can be greater than zero, since the SimMatching algorithm (see section 2.3.2) aggregates and detects matches iteratively in its second phase.

3.3 The optimizer as SVM

The above illustration recalls the binary classification task using SVM described in section 2.1.4. The optimal decision boundary with maximum margin can be learned by SVM. The dimensionality of the space is d, the number of similarity measures used. According to equation (2.4), the optimal weight vector $w^* = (w_1^*, \dots, w_d^*)$ is given as a linear combination of the training instances:

$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i \qquad (3.2)$$

where $y_i \in \{+1, -1\}$ is the label of the comparison vector $x_i = (sim_1(a, b), \dots, sim_d(a, b))$ and the $\alpha_i^*$ are the solution to the quadratic dual optimization problem presented in equation (2.6). The set of labeled comparison vectors $x_i$ is the training set $L$, where $|L| = n$. The threshold of the boundary equation can be calculated from any support vector, i.e., from any training instance $x_j$ with $\alpha_j^* > 0$, as

$$t^* = -y_j + \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x_j) \qquad (3.3)$$

The SimMatching decision boundary (3.1) can be seen as striving for the SVM decision boundary. The weight vector $w^*$ and the threshold $t^*$ are the parameters of the optimal decision boundary, which shall be used to configure SimMatching. However, the learned SVM parameters need to be normalized in order to conform to the SimMatching parameters: all values must lie in the range [0, 1] and the weights must sum to 1. The learnSimMatchingParameters() method presented in algorithm 3.3 learns the optimal parameters of SimMatching.

Algorithm 3.3: learnSimMatchingParameters

input : training data set L
output: the optimal parameters parameters

SimMatchingParameters parameters;
$\alpha_1^*, \dots, \alpha_n^*$ ← quadraticSolver(L);
$w^*$ ← $\sum_{i=1}^{n} \alpha_i^* y_i x_i$;
$t^*$ ← $-y_j + \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x_j)$;
weightsSum ← $\sum_{i=1}^{d} w_i^*$;
for i ← 1 to d do
    parameters.weights.put($sim_i$, $w_i^*$ / weightsSum);
end
parameters.threshold ← $t^*$ / weightsSum;
return parameters;

The method quadraticSolver() solves the dual optimization problem in equation (2.6) by finding the optimal values of the alphas. The parameters variable stores the SimMatching parameters that have been learned, i.e., optimized. These parameters are the weights of the similarity measures used and the similarity threshold, which decides whether a matching object is confirmed as a match or rejected.
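The normalization step of algorithm 3.3 can be illustrated with a short Java sketch. It assumes that the weight vector and threshold have already been learned; the class SvmNormalizer and its method names are hypothetical and only serve this example.

```java
// Hypothetical sketch of the normalization step in algorithm 3.3.
final class SvmNormalizer {

    private SvmNormalizer() { }

    /** Returns the weights divided by their sum, so that they add up to 1. */
    static double[] normalizeWeights(double[] w) {
        double sum = sum(w);
        double[] normalized = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            normalized[i] = w[i] / sum;
        }
        return normalized;
    }

    /** Scales the threshold by the same factor as the weights. */
    static double normalizeThreshold(double t, double[] w) {
        return t / sum(w);
    }

    // Sum over the d weight components; a zero sum cannot be normalized.
    private static double sum(double[] w) {
        double sum = 0.0;
        for (double wi : w) {
            sum += wi;
        }
        if (sum == 0.0) {
            throw new IllegalArgumentException("weights sum to zero");
        }
        return sum;
    }
}
```

Dividing the weights and the threshold by the same positive factor leaves the boundary $w \cdot x - t = 0$ unchanged, which is why the normalized parameters describe the same classifier. Note that the result lies in [0, 1] only if the learned weights are non-negative; the sketch does not enforce this.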


However, unlike the SimMatching algorithm, the SVM algorithm is supervised and requires a training set. No reference data set is available, which raises the challenge of obtaining a training set of labeled comparison vectors. Comparison vectors on their own may not be enough to decide what their true labels should be. The spatial objects may have to be examined based on their names, lengths, neighbors (topology), etc. Thus, labeling comparison vectors is time-consuming and may require considerable effort. Unlabeled data, on the other hand, is available and easy to obtain. Active learning, described in section 2.2, is appropriate for such cases in machine learning.

3.4 The active optimizer

Our optimizer is, as described above, an SVM learning algorithm which learns an optimal decision boundary with maximum margin from training comparison vectors. Designing the optimizer as an active learning algorithm allows it to learn the optimal decision boundary from a relatively small training set. We are interested in the weights and the threshold of the linear equation of the boundary rather than in using the boundary to classify previously unseen data. The quality of the learned parameters can only be assessed with respect to the SimMatching algorithm: high-quality weights should enable SimMatching to achieve high matching accuracy. These requirements should be considered when choosing the appropriate settings of our active learning algorithm.

First of all, we have to choose one of the data access scenarios presented in section 2.2.1. The comparison vectors should be generated from the original objects, which are stored in the database. However, the matching objects of SimMatching are already constructed and stored in memory at run-time. These objects can be utilized to generate the comparison vectors required to learn the boundary without accessing the database again. Thus, all candidate comparison vectors can be generated and stored in memory in order to choose the best query among them. As a result, the pool-based scenario seems more convenient than the stream-based one.

Next, we should choose a query selection framework from the ones presented in section 2.2.2. Uncertainty sampling is simple to implement and seems closer to our problem than the query-by-committee approach. Uncertainty sampling involves learning only one SVM boundary, which holds the parameters of SimMatching. Generating a committee of decision boundaries is more complex, because it requires calculating the distance between a comparison vector and each of the committee members. Moreover, committee members may learn different values for the same parameter, which requires finding a way to combine these values into a single one.

The uncertainty measure we use in order to find the best query is the entropy as seen in equation (2.12). Since the matching (classification) problem is binary, comparison vectors close to the decision boundary have almost the same probability of being classified as positive or negative. Thus, their entropy values are high and they are more likely to be queried. As a result, using the distance to the decision boundary d, given in equation (2.7), is enough to find the best query.
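The relation between class probability and entropy in the binary case can be checked with a small Java sketch. It assumes that equation (2.12) is the standard Shannon entropy and uses a base-2 logarithm; the class name BinaryEntropy is hypothetical.

```java
// Hypothetical helper illustrating binary entropy; assumes Shannon entropy with log base 2.
final class BinaryEntropy {

    private BinaryEntropy() { }

    /** Entropy of a binary decision where p is the probability of the positive class. */
    static double of(double p) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must lie in [0, 1]");
        }
        if (p == 0.0 || p == 1.0) {
            return 0.0; // by convention 0 * log(0) = 0
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```

The entropy is largest at p = 0.5 and decreases as p moves toward 0 or 1. Under the assumption that the class probability moves away from 0.5 as the distance to the boundary grows, ranking candidates by small distance and ranking them by high entropy select the same vectors.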

A set $X^*$ of k comparison vectors close enough to the decision boundary is queried, which reduces the risk of querying only outliers. A threshold value $d_{max}$ is used as the maximum distance to the decision boundary at which a comparison vector can still be added to the best query. For each comparison vector $x^* \in X^*$ to be queried, we have

$$\frac{|w \cdot x^* - t|}{\|w\|} \le d_{max} \qquad (3.4)$$

The matching objects are already created before the comparison vectors are generated. Each matching object calculates the total similarity score between the compared pair of objects as a linear combination of the similarity scores of the measures. As a result, the total similarity score can be utilized in calculating the distance of the comparison vectors to the optimizer decision boundary. The selectBestQuery() method in algorithm 3.4 finds the best query among the matching objects and returns it as a set of unlabeled comparison vectors. The buildComparisonVector() method builds a comparison vector from a matching object simply by copying the attribute values of the matching object to the comparison vector (the label is left unset).

Algorithm 3.4: selectBestQuery

input : set of matching objects matchings, SimMatching parameters parameters
output: set of unlabeled comparison vectors bestQuery

Set bestQuery;
foreach Matching matching in matchings do
    double distance ← (matching.totalSimilarity - parameters.threshold) / parameters.weightsNorm;
    if bestQuery.size < maxSize then
        if | distance | < maxDistance then
            ComparisonVector compVec ← buildComparisonVector(matching);
            bestQuery.add(compVec);
        end
    else
        break;
    end
end
return bestQuery;
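The selection logic of algorithm 3.4 can be sketched in Java as follows. To keep the example self-contained, matching objects are reduced to their total similarity scores and the result is a list of candidate indices; the class BestQuerySelector and this simplification are assumptions, not the optimizer's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the selection in algorithm 3.4 on plain arrays.
final class BestQuerySelector {

    private BestQuerySelector() { }

    /** Euclidean norm of the weight vector (weightsNorm in algorithm 3.4). */
    static double norm(double[] w) {
        double sumOfSquares = 0.0;
        for (double wi : w) {
            sumOfSquares += wi * wi;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Signed distance to the boundary; totalSimilarity plays the role of w · x. */
    static double distance(double totalSimilarity, double threshold, double weightsNorm) {
        return (totalSimilarity - threshold) / weightsNorm;
    }

    /**
     * Returns the indices of at most maxSize candidates whose absolute distance
     * to the boundary is below maxDistance, in iteration order.
     */
    static List<Integer> select(double[] totalSimilarities, double threshold,
                                double weightsNorm, int maxSize, double maxDistance) {
        List<Integer> bestQuery = new ArrayList<Integer>();
        for (int i = 0; i < totalSimilarities.length && bestQuery.size() < maxSize; i++) {
            double d = distance(totalSimilarities[i], threshold, weightsNorm);
            if (Math.abs(d) < maxDistance) {
                bestQuery.add(i);
            }
        }
        return bestQuery;
    }
}
```

As in the pseudocode, the loop stops as soon as maxSize candidates have been collected, so the result depends on the iteration order of the pool rather than on a global ranking by distance.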


3.5 Integrating the optimizer and SimMatching

The optimizer is an active learning algorithm which iterates maxIterations times and learns new parameters (an SVM boundary) in each iteration. The optimizer algorithm utilizes SimMatching functionality in order to construct the comparison vectors and calculate their distances to the learned boundary. The method SimMatchingFunction() is mainly responsible for creating comparison vectors, as it aggregates spatial objects and constructs the required matching objects. Algorithm 3.5 presents the SimMatchingFunction() method, which takes the learned parameters as input. The parameters variable holds the threshold and the weights of the learned SVM boundary.

The method SimMatchingFunction() calculates the total similarity score of each matching object based on the parameters (weights). The total similarity scores are compared to the threshold in order to confirm or reject matching objects, so that DNS scores can be greater than 0. The algorithm stores the matching objects inside the matchings variable, which is a sorted list. The getNext() and removeNext() methods return the matching object with the highest total similarity score. The add() method adds a set of matching objects and re-sorts the list according to the total similarity score.
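The highest-first access pattern of the matchings list can be illustrated with a small Java sketch. It deliberately swaps the sorted list for java.util.PriorityQueue, which gives the same access to the best element without re-sorting the whole collection; the classes MatchingQueue and Matching below are simplified stand-ins and not the SimMatching classes.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for the sorted matchings list, backed by a binary heap.
final class MatchingQueue {

    /** Minimal stand-in for a matching object; only the total similarity is modeled. */
    static final class Matching {
        final String id;
        final double totalSimilarity;

        Matching(String id, double totalSimilarity) {
            this.id = id;
            this.totalSimilarity = totalSimilarity;
        }
    }

    // Orders matchings by descending total similarity.
    private final PriorityQueue<Matching> queue =
        new PriorityQueue<Matching>(11, new Comparator<Matching>() {
            @Override
            public int compare(Matching a, Matching b) {
                return Double.compare(b.totalSimilarity, a.totalSimilarity);
            }
        });

    void add(Matching matching) {
        queue.add(matching);
    }

    /** Returns the matching with the highest total similarity without removing it. */
    Matching getNext() {
        return queue.peek();
    }

    /** Returns and removes the matching with the highest total similarity. */
    Matching removeNext() {
        return queue.poll();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A heap does not notice when the score of a stored element changes, so a step such as recalculateSimilarities() would require removing and re-adding the affected matchings; this is one reason an explicitly re-sorted list may be the simpler choice in practice.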

Algorithm 3.5: SimMatchingFunction; source [SL12] updated

input : SimMatching parameters parameters
output: set of matching objects matchingsReturned

SortedList matchings ← buildInitialMatchings(parameters);
Set matchingsReturned;
while matchings.hasChanged() do
    Set aggSet ← buildAggregatedMatchings(matchings.getNext(), parameters);
    matchings.add(aggSet);
    Matching possibleMatch ← matchings.removeNext();
    if possibleMatch.totalSimilarity > parameters.threshold then
        possibleMatch.confirm();
        matchings.cleanup();
        matchings.recalculateSimilarities(possibleMatch, parameters);
    end
    matchingsReturned.add(possibleMatch);
end
return matchingsReturned;

The method buildInitialMatchings() is described in algorithm 3.1. It calculates the similarities of the constructed matching objects according to the passed parameters. Since there are initially no confirmed matching objects, all initial DNS scores are equal to 0. The method buildInitialMatchings() does not build aggregations, because it is used just once at start in order to build the initial matching objects comparing line segments. The spatial objects are aggregated and the matching objects of the aggregations are constructed by the method buildAggregatedMatchings(), which is described in algorithm 3.2.

The method recalculateSimilarities() recalculates the direct neighborhood similarity (DNS) scores and the total similarity scores of the neighbors of the possibleMatch object. This is necessary because confirming possibleMatch as a match changes the relational similarity of its neighbors. If possibleMatch is not confirmed, the relational similarity of the neighbors remains unchanged.

The SimMatchingFunction() algorithm combines the last step of the initialization phase with the whole iterative phase of the SimMatching algorithm (section 2.3.2). It reads the preprocessed data from the database in order to construct the matching objects. This implies that SimMatching must have already preprocessed the compared data sets and stored them in the database. Algorithm 3.5 hence relies on the link objects which have already been created and preprocessed by SimMatching. Since each link may have a different set of similarity measures associated with it, the optimizer obtains the similarity measures from the link at start.

3.6 The optimizer algorithm

Algorithm 3.6 introduces the optimizer. The generateSeed() method is presented in algorithm 3.7. Steps 1 - 4 of algorithm 3.6 are discussed in detail in section 3.7. The matching objects which SimMatchingFunction() returns are the pool from which the best query is selected. The labels given by SimMatchingFunction() to the matching objects are ignored, since they are assigned based on non-optimal parameters.

The selectBestQuery() method in algorithm 3.4 chooses the best k matching objects from the pool. It generates and returns the corresponding unlabeled comparison vectors, which are queried next. The query() method assigns to the comparison vectors the true labels provided by the oracle (see section 3.8). The labeled comparison vectors are added to the training set L and a new SVM boundary is trained using the learnSimMatchingParameters() method, described in algorithm 3.3. The newly learned parameters are passed to SimMatchingFunction() in the next iteration.

Algorithm 3.6: The optimizer algorithm

 1 parameters ← initializeParameters();
 2 L ← generateSeed(parameters);
 3 parameters ← learnSimMatchingParameters(L);
 4 balanceDNSWeight(parameters);
 5 for counter ← 2 to maxIterations do
 6     matchings ← SimMatchingFunction(parameters);
 7     X∗ ← selectBestQuery(matchings, parameters);
 8     query(X∗);
 9     L ← L ∪ X∗;
10     parameters ← learnSimMatchingParameters(L);
11 end
12 return parameters;

The design discussed above results in a major difference from the pool-based active learning described in section 2.2.1. In usual pool-based active learning the pool does not change, which is not the case here. Using the method SimMatchingFunction() to construct the matching objects in each iteration implies that: (1) a subset of the matching objects is updated after every iteration because of the new similarity scores calculated by the direct neighborhood similarity measure; and (2) neighboring spatial objects might be aggregated differently in every iteration, because the learned weights of the similarity measures differ between iterations. Hence, the total similarity scores of the matching objects involving the aggregations might differ as well. This may result in adding new matching objects and/or removing existing ones.

3.7 Seed generation

As shown in algorithm 3.6, an initial training set is required before active learning (the iterative part) can start. A set of labeled comparison vectors must be generated as the seed from which the initial SVM boundary is trained. This initial model is improved in every iteration of active learning. We therefore generate a simple seed of small size that trains a model capable of distinguishing between “obvious” matches and non-matches. The method generateSeed() in algorithm 3.7 builds the matching objects in a very similar way to SimMatchingFunction(), except that it does not construct aggregations. This is because high- and low-similarity matching objects comparing only line segments are enough for training the initial boundary.

The total similarity scores, however, must be calculated in order to find obvious-to-label comparison vectors. This implies that the weights of the similarity measures must be initialized so that seed generation (and the optimizer) can start. The initializeParameters() method in algorithm 3.6 assigns equal weights to all similarity measures except DNS. The weight of DNS is set to zero, because there are no confirmed matches or non-matches at this early stage.

Algorithm 3.7: generateSeed

input : parameters which contain equal weights for all similarity measures except for DNS, which has a zero weight
output: the initial training set L

Set L ← ∅;
Set highSimilarityMatchings;
Set lowSimilarityMatchings;
Set matchings ← buildInitialMatchings(parameters);
foreach Matching matching in matchings do
    if highSimilarityMatchings.size < maxSize then
        if matching.totalSimilarity > 0.8 then
            ComparisonVector compVec ← buildComparisonVector(matching);
            highSimilarityMatchings.add(compVec);
        end
    else if lowSimilarityMatchings.size < maxSize then
        if matching.totalSimilarity < 0.4 then
            ComparisonVector compVec ← buildComparisonVector(matching);
            lowSimilarityMatchings.add(compVec);
        end
    else
        break;
    end
end
L ← L ∪ highSimilarityMatchings;
L ← L ∪ lowSimilarityMatchings;
query(L);
return L;

DNS is not treated fairly that way, because its weight will still be zero after training the initial model. This means that SimMatchingFunction() in algorithm 3.6 would aggregate and classify (in its first run) without considering DNS at all. To balance this, DNS is given an initial weight after the first model has been trained. The balanceDNSWeight() method in algorithm 3.6 gives DNS an initial weight of 0.1 and adapts the initially learned weights of the other measures accordingly.
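One way to adapt the other weights is sketched below in Java. The text does not specify the adaptation rule, so the proportional rescaling used here (the remaining weights are scaled so that all weights still sum to 1) is an assumption, as are the class name DnsBalancer and the array representation.

```java
// Hypothetical sketch: proportional rescaling is an assumed rule, not taken from the thesis.
final class DnsBalancer {

    private DnsBalancer() { }

    /**
     * Returns a copy of weights in which the entry at dnsIndex is set to dnsWeight
     * and all other entries are scaled so that the weights sum to 1.
     */
    static double[] balance(double[] weights, int dnsIndex, double dnsWeight) {
        double othersSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (i != dnsIndex) {
                othersSum += weights[i];
            }
        }
        if (othersSum == 0.0) {
            throw new IllegalArgumentException("the other weights sum to zero");
        }
        double scale = (1.0 - dnsWeight) / othersSum;
        double[] balanced = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            balanced[i] = (i == dnsIndex) ? dnsWeight : weights[i] * scale;
        }
        return balanced;
    }
}
```

With learned weights (0.5, 0.3, 0.2, 0.0), where the last entry is DNS, and an initial DNS weight of 0.1, this rule yields (0.45, 0.27, 0.18, 0.1). The ratios between the other measures are preserved.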

There is another point to discuss before concluding seed generation: a different parameter initialization is possible for generating the seed. Being neutral and giving equal weights to all measures does not allow building aggregations or using DNS, since the “neutral” parameters may be far from optimal. If previous experience exists with the data sets being matched, empirical parameters which yield good matching quality may already be available. Since these parameters have proven to be good, they can be used for building aggregations and detecting matches, i.e., calculating the DNS scores, in the seed generation step (in algorithm 3.6).

This second initialization requires a couple of changes in algorithm 3.6. The generateSeed() method is replaced with the SimMatchingFunction() method in line 2. DNS is now learned, i.e., its weight is greater than 0 in the first trained model, which makes step 4 unnecessary. However, since the seed is generated only from high- and low-similarity matching objects, we believe that this second initialization brings no significant improvement to the first trained model or to the general learning process. We choose to generate the seed using the generateSeed() method, which is more efficient in this case because it does not aggregate, i.e., does no extra processing. Both initializations are inspected in the experiments conducted on real data sets in chapter 5.

3.8 Querying

Each query $X^*$ consists of k comparison vectors to be labeled. Each vector refers to two spatial objects (possibly aggregations) representing roads. The oracle should be able to provide the true label of the comparison vector. To do so, the oracle might need to examine the pair based on their neighbors, names, etc. The similarity scores calculated by each similarity measure can be useful as well. Therefore, this information must be made available to the oracle. On the other hand, the oracle may not be sure what the true label of a vector is, e.g., because of incomplete information. In this case, the oracle might be unable or might prefer not to label the vector, in order to avoid the risk of adding noisy training data.

The query() method takes charge of the querying process and labels the comparison vectors of the best query passed to it. The k comparison vectors of the best query are labeled one after the other and added to the training data set. The two spatial objects referred to by the comparison vector are displayed to the oracle, together with each similarity measure and its score. One of three actions can be taken: label as match, label as non-match, or skip. In fact, 2 · (max query size) comparison vectors are chosen for the best query, which allows the oracle to skip many times and provide only labels she/he is sure of.


Chapter 4

Implementation

This chapter describes the detailed design and the implementation of the optimizer system. First, the system requirements and the chosen architecture for the optimizer are explained. Next, we present the detailed system design, which complies with the design decisions made in chapter 3. The main class diagram is presented as well. The implementation comes next, including the libraries used. The optimizer classes are introduced with explanations of how each (important) class gets its job done.

We use the Java (version 1.7) programming language to implement our system. In order to query, both compared road maps must be displayed to the oracle. We use an open source Geographic Information System (GIS) called OpenJUMP (version 1.6.3) for the visualization. The optimizer is added to OpenJUMP as a plug-in which can be launched from the OpenJUMP GUI. The maps are stored in a PostgreSQL database (version 9.2) with the PostGIS extension. The SimMatching project is added as a library to the optimizer project so that the required SimMatching functionality can be utilized. We use RapidMiner (version 5.3), which is a free and powerful data mining and machine learning tool, in order to learn the SVM boundary.

4.1 System requirements

Chapter 3 has introduced a solution to the parameter optimization problem defined in section 2.3.2. From this solution, we can extract the following use cases of the optimizer system, i.e., functional requirements:

(uc1) Generate seed: the system selects a set of compared spatial objects which are very similar or dissimilar, builds the corresponding comparison vectors and queries the oracle for their true labels. The labeled comparison vectors are the first training set from which to start learning;

(uc2) Learn parameters: the system repeats the following steps a certain number of times: it trains a model from the current training set; then it builds and selects the most informative comparison vectors (according to the learned model) to be queried. The labeled comparison vectors are added to the training set and the next iteration starts. The learned model in the last iteration is the optimizer output;

(uc3) Query comparison vectors: the system shows the compared pair of spatial objects clearly on screen and allows the oracle to examine their neighbors and attribute values, e.g., street names. The oracle can decide whether to label (with “match” or “non-match”) or to skip. The system adds the labeled comparison vectors to the training data. The system ignores skipped, i.e., unlabeled comparison vectors and does not add them to the training data.

Non-functional requirements can also be extracted. These are specified as quality constraints on how the system fulfils the functional requirements:

(q1) The system must automatically figure out the parameters to be optimized from the SimMatching configuration;

(q2) The system must learn the parameters from a small set of labeled vectors;

(q3) The system should calculate comparison vectors efficiently;

(q4) The system should query one single comparison vector at a time using the GUI;

(q5) The query user interface must show each similarity measure together with its score for the compared pair.

4.2 Architecture

The Model-View-Controller (MVC) pattern is chosen for our optimizer. The MVC pattern is well suited to systems with user interfaces, because it separates the presentation from the domain models. This separation allows for flexible and easy-to-change system components. The MVC pattern, as the name suggests, divides a system into three components: model, view and controller. Each component is made as independent as possible from the others. The model is the application data, which includes domain entities, application logic and functions. The view is the presentation, i.e., the components of the user interface. The controller defines the interaction between user input and the view. The controller sends commands to the view to change the view's presentation of the model. It also sends commands to the model to update the model's state. The model may also notify the view when its state has changed.


Two Java packages in the optimizer contain the model: model and learner. The view and the controller are contained inside the package gui. Table 4.1 introduces all packages of the optimizer.

Package       Description
model         Contains the optimizer entities: the comparison vectors and the SVM model.
learner       Contains the active learning algorithm of the optimizer and a passive algorithm which learns an SVM model from a set of labeled comparison vectors.
gui           Contains the optimizer user interfaces: the application main interface and the query dialogue box.
util          Contains helper tools for other classes, especially for dealing with SimMatching parameters.
jump          Contains the classes required to combine the optimizer with OpenJUMP, which makes it possible to clearly show the compared objects of the queried comparison vectors on the screen.
manager       Contains classes managing the database access and the application in general.
simmatching   Provides the SimMatching functionality required by the optimizer.

Table 4.1: The optimizer packages

4.3 Detailed system design

Figure 4.1 is a class diagram of the main classes of the optimizer. The ComparisonVector class defines the structure of the comparison vectors as explained in section 3.2. Each object of this class compares two spatial objects from two different maps. The SVMModel class represents an SVM model, which consists of the weight vector and the threshold of the learned linear classifier. The class SVMLearner implements an SVM passive learning algorithm which returns the learned model as an object of type SVMModel. The class SVMActiveLearner implements an SVM active learning algorithm which uses pool-based uncertainty sampling (as described in 3.6). The SVMActiveLearner class utilizes the SVMLearner class to train an SVMModel object in each iteration.


An object of the QueryGUI class is a dialogue box shown on the screen asking the oracle to label one object of type ComparisonVector. The oracle can give the true label or skip by clicking one of the three buttons available: match, non-match and skip. The OptimizerGUI class is the application main interface. It displays the learned parameters in addition to log messages describing the operations performed by various classes of the optimizer. The OptimizerManager class adheres to the singleton design pattern and its only instance mediates the interaction between the OptimizerGUI and other classes. Another singleton class, the TrainingDataManager, controls adding labeled ComparisonVector objects to the training set and obtaining them from it.

Figure 4.1: Class diagram of important classes.


4.4 Implementation

4.4.1 Database

The optimizer is used each time to learn the optimal parameters for one SimMatching Link object, i.e., two road maps which have been preprocessed. Our ComparisonVector objects can be easily constructed from the SimMatching Matching objects. The ComparisonVector class (see figure 4.2) has only one extra attribute, the label, which is specified by the oracle if the object is chosen for querying. The label attribute is of type int and takes the value +1 (match) or -1 (non-match); the value 0 indicates that no label has been given to this object (yet). The ID references to the compared objects are used to select their (preprocessed) features and show them on screen so that the oracle can take the proper action.
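The label convention described above can be sketched as a small Java class. This is an illustrative stand-in and not the ComparisonVector class of figure 4.2: the class name ComparisonVectorSketch, the use of long for the IDs and the accessor names are assumptions.

```java
// Illustrative stand-in for a labeled comparison vector; not the thesis class itself.
final class ComparisonVectorSketch {

    static final int MATCH = +1;
    static final int NON_MATCH = -1;
    static final int UNLABELED = 0;

    private final long idA;                    // ID of the object from map A (type assumed)
    private final long idB;                    // ID of the object from map B (type assumed)
    private final double[] similarityScores;   // one score per similarity measure
    private int label = UNLABELED;

    ComparisonVectorSketch(long idA, long idB, double[] similarityScores) {
        this.idA = idA;
        this.idB = idB;
        this.similarityScores = similarityScores.clone();
    }

    long getIdA() { return idA; }

    long getIdB() { return idB; }

    double[] getSimilarityScores() { return similarityScores.clone(); }

    int getLabel() { return label; }

    boolean isLabeled() { return label != UNLABELED; }

    /** Accepts only +1 (match) or -1 (non-match); a skipped vector simply stays at 0. */
    void setLabel(int label) {
        if (label != MATCH && label != NON_MATCH) {
            throw new IllegalArgumentException("label must be +1 or -1");
        }
        this.label = label;
    }
}
```

Using +1 and -1 directly as label values matches the labels $y_i \in \{+1, -1\}$ expected by the SVM, so labeled vectors can be passed to training without conversion.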

Figure 4.2: The ComparisonVector class.

The TrainingDataManager maintains the ComparisonVector objects in the database. The training set is stored in a database table called linkName_training_instance. The columns of the training table correspond to the attributes of the ComparisonVector class. There is one training table for each SimMatching Link object to be optimized. The TrainingDataManager class provides all required operations for this table: create, drop, insert and select. Objects of the SVMLearner class (and the SVMActiveLearner class) use all table data to train the desired SVMModel objects. The table is created before the seed is generated (an existing old table is dropped first). After querying, only the labeled comparison vectors are inserted into this table.

Figure 4.3: The TrainingDataManager class (public view only).
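The four table operations can be sketched as follows. This is an illustration only: the use of SQLite, the column names and the text encoding of the similarity scores are assumptions, not the schema of the actual optimizer.

```python
import sqlite3


class TrainingDataManager:
    """Sketch of the four table operations: create, drop, insert, select.

    Column names and the use of SQLite are assumptions for illustration.
    """

    def __init__(self, connection: sqlite3.Connection, link_name: str):
        if not link_name.isidentifier():
            # The table name is interpolated into SQL, so restrict it.
            raise ValueError("link name must be a plain identifier")
        self.conn = connection
        self.table = f"{link_name}_training_instance"

    def drop(self) -> None:
        self.conn.execute(f"DROP TABLE IF EXISTS {self.table}")

    def create(self) -> None:
        # An old table is dropped first, as described in the text.
        self.drop()
        self.conn.execute(
            f"CREATE TABLE {self.table} ("
            "id_a INTEGER, id_b INTEGER, similarities TEXT, label INTEGER)"
        )

    def insert(self, id_a: int, id_b: int, similarities, label: int) -> None:
        if label not in (1, -1):
            raise ValueError("only labeled vectors belong in the training set")
        encoded = ",".join(repr(float(s)) for s in similarities)
        self.conn.execute(
            f"INSERT INTO {self.table} VALUES (?, ?, ?, ?)",
            (id_a, id_b, encoded, label),
        )

    def select_all(self):
        rows = self.conn.execute(
            f"SELECT id_a, id_b, similarities, label FROM {self.table}"
        ).fetchall()
        return [
            (a, b, tuple(float(s) for s in sims.split(",") if s), label)
            for a, b, sims, label in rows
        ]
```

One table per link keeps the training sets of different links separate, so a learner can simply read the whole table.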

4.4.2 Querying

In order to show the compared spatial objects on screen, we use OpenJUMP to visualize both preprocessed maps. The package jump has two classes which cause OpenJUMP to run when the optimizer starts. The optimizer can then be launched as a plug-in from the OpenJUMP GUI menu bar. The constructor of OptimizerGUI uses SimMatching functionality in order to show the maps which are compared in the SimMatching link object being optimized. OpenJUMP offers a variety of tools for examining any spatial object thoroughly. One particularly useful tool is the info tool, which lets the user select spatial objects and then pops up an info box displaying information about the selected objects. Figure 4.4 shows an example of querying one comparison vector. Figure 4.5 shows the info tool for the compared spatial objects.

Each query is a set of 2 ∗ k ComparisonVector objects to be labeled by the oracle, where k is the maximum number of objects to be labeled in the query. The ComparisonVector objects of the query are shown to the oracle one after the other. The map is zoomed onto the compared objects, which are automatically selected and highlighted for the oracle (see figure 4.4). The query dialogue box is not modal, so the parent OpenJUMP GUI remains fully usable for examining the spatial objects. If a label is given, the ComparisonVector object is added to the training data set using the TrainingDataManager class. The query ends when k ComparisonVector objects have been labeled or when there are no more unlabeled ComparisonVector objects to query, as shown in activity diagram 4.6.


Figure 4.4: Querying with OpenJUMP.

Figure 4.5: Querying with OpenJUMP - the info tool.


Figure 4.6: Activity diagram query(): querying a set of ComparisonVector objects.
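The stopping rule of a query (at most k labels out of up to 2 ∗ k candidates) can be sketched as below. The Candidate type and the ask_oracle callback are hypothetical stand-ins for the ComparisonVector objects and the OpenJUMP dialogue; the sketch does not reproduce the actual query() method.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Candidate = Tuple[int, int]  # hypothetical: IDs of the two compared objects


def run_query(
    candidates: Iterable[Candidate],
    ask_oracle: Callable[[Candidate], Optional[int]],
    k: int,
) -> List[Tuple[Candidate, int]]:
    """Show candidates one by one until k of them have been labeled.

    ask_oracle returns +1 (match), -1 (non-match) or None (skipped).
    """
    labeled: List[Tuple[Candidate, int]] = []
    for candidate in candidates:
        if len(labeled) >= k:
            break  # the query ends after k labels
        answer = ask_oracle(candidate)
        if answer in (1, -1):
            labeled.append((candidate, answer))
        # Skipped candidates are simply not added to the training data.
    return labeled
```

Offering twice as many candidates as labels are needed is what allows the oracle to skip unclear pairs without the query running short.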

4.4.3 Seed generation

A seed must be generated before active learning can start. The seed generation process adds the first labeled ComparisonVector objects to the training data, from which the first SVMModel is trained in the first iteration of active learning. At this point, as explained in section 3.7, SimMatching functionality is not used yet. Thus, there are no already created Matching objects and we have to construct the comparison vectors "manually". We choose to label k ≈ 40 ComparisonVector objects for our seed. The partitions defined by SimMatching functionality are utilized here. We choose s partitions from the maps randomly and obtain (2 ∗ k)/s ComparisonVector objects from each one of them. This results in 2 ∗ k objects to be queried, i.e., extra instances which allow the oracle to skip some if needed.


The seed generation process for each partition is in fact inherited from the matching process of SimMatching. The class PartitionSeedQuery utilizes SimMatching functionality to obtain the ComparisonVector objects from one partition. It builds SimMatching Matching objects in order to benefit from their Similarity attribute. Only (2 ∗ k)/s objects are picked: half of them with high total similarity scores (higher than 0.8) and the other half with low total similarity scores (between 0.2 and 0.4). No more Matching objects are built after that, and (2 ∗ k)/s ComparisonVector objects are constructed and returned. Figure 4.7 is an activity diagram illustrating the seed generation process and the work done by PartitionSeedQuery objects.

Figure 4.7: Activity diagram generateSeed(): creates a set of labeled ComparisonVector objects from which to start active learning. This diagram is very similar to the selectBestQuery() diagram 4.17.
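The selection of seed candidates can be sketched as follows, assuming each compared pair is available as a (pair id, total similarity) tuple. The Pair type, the function names and the inclusive bounds of the low band are assumptions; the quota (2 ∗ k)/s and the score bands (above 0.8, between 0.2 and 0.4) come from the text.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, float]  # hypothetical: (pair id, total similarity score)


def seed_candidates_for_partition(pairs: Sequence[Pair], quota: int) -> List[Pair]:
    """Pick `quota` pairs: half clearly similar, half clearly dissimilar."""
    half = quota // 2
    high: List[Pair] = []
    low: List[Pair] = []
    for pair in pairs:
        score = pair[1]
        if score > 0.8 and len(high) < half:
            high.append(pair)
        elif 0.2 <= score <= 0.4 and len(low) < quota - half:
            low.append(pair)
        if len(high) + len(low) == quota:
            break  # stop early: no further pairs need to be examined
    return high + low


def generate_seed_candidates(
    partitions: Dict[int, Sequence[Pair]], k: int, s: int, rng: random.Random
) -> List[Pair]:
    """Collect up to 2 * k candidates from s randomly chosen partitions."""
    chosen = rng.sample(sorted(partitions), s)
    quota = (2 * k) // s
    result: List[Pair] = []
    for partition_id in chosen:
        result.extend(seed_candidates_for_partition(partitions[partition_id], quota))
    return result
```

Restricting the seed to clearly similar and clearly dissimilar pairs keeps the first labels easy for the oracle, at the cost of a first model that has seen no borderline cases.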


4.4.4 SVM learning using RapidMiner

The train() method of the SVMLearner class trains the SVM decision boundary of the optimizer. It is implemented with the help of RapidMiner. RapidMiner is open-source software implemented in Java which can be used through:

(1) the RapidMiner application, i.e., GUI;

(2) the RapidMiner API.

For (1) we need to install the RapidMiner application and create a RapidMiner process. RapidMiner offers many operators to achieve a wide range of tasks, e.g., SVM learning (see appendix A). For (2) the RapidMiner_Unuk project must be checked out from the RapidMiner repository and added to the build path of our project. The Unuk project has the most recent implementation of the RapidMiner application. In the optimizer, both ways are used.

The RapidMiner GUI is used first to create a RapidMiner process which learns and returns an SVM decision boundary. Figure 4.8 illustrates the main process which was created to learn the SVM decision boundary. The operators are executed sequentially from left to right. The last three operators are the ones which learn the SVM boundary. The "Read Database" operator is used to load the training comparison vectors from the linkName_training_instance table. The "SVM" operator, which is based on the internal Java implementation of mySVM by Stefan Rueping, is used with the linear kernel in order to train a linear decision boundary.

Figure 4.8: The RapidMiner process which learns the SVM decision boundary.

Since this operator learns a soft-margin decision boundary in general, the complexity parameter C needs to be set. Finding an optimal value for C is crucial for SVM training. Thus, we implement a cross-validation process (the first four operators to the left in figure 4.8) before actually learning the model. This process finds the best value for C and assigns it to the "SVM" operator named "SVM Weights Learner". The "Optimize Parameters (Grid)" operator iterates over manually defined values for the complexity parameter. According to the RapidMiner implementation, the "Optimize Parameters (Grid)" operator has a sub process which should return one or more performance scores at the end of each iteration. Figure 4.9 illustrates ours.


Figure 4.9: The sub process of the “Optimize Parameters (Grid)” operator.

The "Validation" operator in figure 4.9 performs 5-fold cross-validation (see cross-validation in section 2.1.2). RapidMiner specifies that the "Validation" operator has two sub processes: one for training the model on n − 1 folds (subsets of the training data), and one for testing the learned model on the remaining fold. Figure 4.10 illustrates the training sub process, which simply trains an SVM model based on one of the C values specified in the "Optimize Parameters (Grid)" operator. The testing sub process in figure 4.11 assesses the performance of the model which was learned in the training sub process. It returns the accuracy on the fold which was not used for training.

Figure 4.10: The training sub process of the “Validation” operator.

Figure 4.11: The testing sub process of the “Validation” operator.


The "Validation" operator is executed once for each value of C specified inside the "Optimize Parameters (Grid)" operator. In each execution it trains 5 models and reports their cross-validation accuracy to the parent "Optimize Parameters (Grid)" operator. The grid operator returns the value of C which results in the highest accuracy as its output. The "Set Parameters" operator sets the C value of the "SVM Weights Learner" to the optimal one returned by the "Optimize Parameters (Grid)" operator.
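The interplay of grid search and 5-fold cross-validation can be sketched independently of RapidMiner. In the sketch below the learner is passed in as a function; the names Trainer, k_fold_accuracy and grid_search_c are hypothetical, the folds are formed by simple striding rather than by RapidMiner's sampling, and the data set is assumed to contain at least as many examples as folds.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]          # (feature vector, label in {+1, -1})
Classifier = Callable[[Sequence[float]], int]  # maps a feature vector to a label
Trainer = Callable[[List[Example], float], Classifier]


def k_fold_accuracy(data: List[Example], train: Trainer, c: float, folds: int = 5) -> float:
    """Average accuracy over `folds` held-out folds for one value of C."""
    accuracies = []
    for i in range(folds):
        test_fold = data[i::folds]  # every folds-th example, starting at i
        train_part = [ex for j, ex in enumerate(data) if j % folds != i]
        model = train(train_part, c)
        correct = sum(1 for x, y in test_fold if model(x) == y)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / folds


def grid_search_c(data: List[Example], train: Trainer, c_values: Sequence[float]) -> float:
    """Return the C value with the highest cross-validation accuracy."""
    return max(c_values, key=lambda c: k_fold_accuracy(data, train, c))
```

Any function that trains a linear soft-margin SVM for a given C can be plugged in as the Trainer; the final model is then trained once more on all data with the selected C, which corresponds to the "Set Parameters" step.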

RapidMiner automatically generates an XML file describing each process created using the GUI. According to the RapidMiner (version 5.3) documentation, this file should be copied to the project and passed to a com.rapidminer.Process object in order to create the same process using the RapidMiner API. Calling the run() method of this object runs the process and returns the outputs inside an IOContainer object. The boundary threshold and the weights are extracted using the ParameterExtractor class of the optimizer. The extracted parameters are wrapped in an SVMModel object and returned. Figure 4.12 illustrates learning a single SVMModel using the SVMLearner.

Figure 4.12: Activity diagram trainSVM(): learns the SVM decision boundary as an SVMModel object.


4.4.5 Active learning

The pool-based uncertainty sampling active learning used in the optimizer is conducted by the SVMActiveLearner class. An SVMActiveLearner object constructs and maintains one SVMModel object in each iteration. This object is returned by the train() method of the SVMLearner class. An SVMModel object, after normalization, represents the learned SimMatching parameters. Figure 4.13 illustrates the SVMModel, SVMLearner and SVMActiveLearner classes.

Figure 4.13: Class diagram: SVMModel, SVMLearner and SVMActiveLearner.

At the start of each iteration, the optimizer learns SimMatching parameters as an SVMModel object using the SVMLearner and the ParameterNormalizer classes. The optimizer's ParameterNormalizer class normalizes the parameters of the SVMModel object to conform with the SimMatching standard, i.e., all values lie in [0,1] and the weights sum to 1. It balances the weight of DNS only after training the first SVMModel object, as described in section 3.7. Figure 4.14 illustrates the active learning process.
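One simple way to bring a linear decision boundary into this form is to divide it by the sum of its weights. The sketch below assumes a boundary of the form w·x + b = 0 with non-negative weights; it is not the ParameterNormalizer of the optimizer, and it neither handles negative weights nor performs the DNS balancing of section 3.7.

```python
from typing import List, Sequence, Tuple


def normalize_parameters(weights: Sequence[float], bias: float) -> Tuple[List[float], float]:
    """Rescale a linear boundary w.x + b = 0 to weights summing to 1.

    Dividing w.x >= -b by sum(w) > 0 leaves the classification of every
    vector unchanged and yields the threshold -b / sum(w).
    """
    if any(w < 0 for w in weights):
        raise ValueError("this sketch only covers non-negative weights")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights], -bias / total
```

Because the division is by a positive constant, a comparison vector lies on the same side of the boundary before and after normalization.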


Figure 4.14: Activity diagram: active learning of SimMatching parameters.

The optimizer relies on SimMatching functionality to build the comparison vectors. The SimMatchingWorker class of SimMatching, presented in figure 4.15, is used for this purpose. We have implemented an "optimize mode" in it which does not affect its original functionality. When called from the optimizer, it constructs aggregations and builds the Matching objects (as specified in section 3.5). The PartitionSeedQuery class in the optimizer has almost the same structure but omits any functionality regarding aggregations (see figure 4.16).


Figure 4.15: The SimMatchingWorker class of SimMatching.

Figure 4.16: The PartitionSeedQuery class of the optimizer.


The Similarity attribute of the created Matching objects is utilized once again to find the best query to present to the oracle (section 3.4). If the similarity value is close to the boundary threshold, a ComparisonVector object is constructed from the Matching object and added to the best query. This is done for r randomly chosen partitions, obtaining (2 ∗ k)/r ComparisonVector objects from each one of them, where k is the maximum number of ComparisonVector objects to be labeled in one query. Figure 4.17 illustrates how the best query is selected.

Figure 4.17: Activity diagram selectBestQuery(): finding the best query.
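Uncertainty sampling over partitions can be sketched as follows. The Pair type, the function name and in particular the margin that defines "close to the boundary threshold" are assumptions made for illustration; the quota (2 ∗ k)/r per partition comes from the text.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, float]  # hypothetical: (pair id, total similarity score)


def select_best_query(
    partitions: Dict[int, Sequence[Pair]],
    threshold: float,
    k: int,
    r: int,
    rng: random.Random,
    margin: float = 0.05,
) -> List[Pair]:
    """Collect up to 2 * k pairs whose similarity lies near the threshold.

    `margin` defines "close to the boundary" and is an assumed value.
    """
    quota = (2 * k) // r
    query: List[Pair] = []
    for partition_id in rng.sample(sorted(partitions), r):
        near = [p for p in partitions[partition_id] if abs(p[1] - threshold) <= margin]
        # Most uncertain pairs first: smallest distance to the threshold.
        near.sort(key=lambda p: abs(p[1] - threshold))
        query.extend(near[:quota])
    return query
```

Pairs near the threshold are the ones the current model is least sure about, so their labels are expected to move the decision boundary the most.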

At the end of each active learning iteration, SimMatching is reconfigured using the learned SVMModel object. This reconfiguration is required since it is the SimMatchingWorker class, and not the SVMActiveLearner, which builds the ComparisonVector objects. The SimMatchingConfig class (see figure 4.18) from SimMatching is used to configure SimMatching in each iteration, so that the SimMatchingWorker objects calculate the distances of the Matching objects to the most recently learned decision boundary. This class is also used to retrieve the similarity measures associated with each link before the optimization starts.

Figure 4.18: The SimMatchingConfig class of SimMatching.


Chapter 5

Results

This chapter explains the experiments conducted on the optimizer and evaluates the results, i.e., the learned parameters, with the help of SimMatching. First of all, the data sets of the experiments are introduced. Three geographical data sets (maps) of Munich are matched. The maps stem from three different sources: ATKIS, D20 and OSM. The data sets have been obtained by preprocessing the original map files using SimMatching. Three links have been created for testing:

• Link 1: matches ATKIS and D20 preprocessed data sets of Munich;

• Link 2: matches D20 and OSM preprocessed data sets of Munich;

• Link 3: matches ATKIS and OSM preprocessed data sets of Munich.

The SimMatchingFunction method (see section 3.5) has other parameters (neither learned nor optimized) which need to be configured. Most of these parameters are assigned their default values, listed in table 5.1. For example, the candidate threshold parameter is used to immediately discard matching objects that have no chance of becoming matches.

Parameter                 Value
ThreadCount               4
CandidateBuffer           25
CandidateThreshold        0.1
HorizontalPartitionCount  15
VerticalPartitionCount    15

Table 5.1: Other SimMatching parameters and the values assigned to them in our tests. These parameters are neither learned nor optimized.


In addition to these parameters, four constraints have been added to SimMatching which define the minimum length, start point, end point and angle of the candidate matches. One rule is also configured which checks the sum of the similarities of a matching object. More information about these parameters is presented in [SL12]. Only the similarity threshold and the weights of the similarity measures are optimized through learning.

The parameters learned by the optimizer after each experiment are evaluated in two ways: (1) the learned parameters (threshold and weights) are compared with the empirical ones which have already proven to bring good results. Table 5.2 shows the similarity measures used and their empirically set values for Link 1 and Link 2. The parameters learned from each experiment are compared with the ones in this table and discussed. For Link 3 the object type similarity is not used and the start point similarity gets a weight of 0.2 instead of 0.1.

Parameter                           Value
SimilarityThreshold                 0.5
AngleSimilarityWeight               0.1
EndPointSimilarityWeight            0.1
LengthSimilarityWeight              0.25
NameSimilarityWeight                0.1
ObjectTypeSimilarityWeight          0.1
StartPointSimilarityWeight          0.1
DirectNeighborhoodSimilarityWeight  0.25

Table 5.2: The empirical SimMatching parameters which have proven to be the best ones yet, because they make SimMatching achieve high quality matching results.

In table 5.2, the direct neighborhood similarity measure, which is the only relational similarity measure, has a weight of 0.25. The attribute similarity measures have a weight of 0.1 each, except for the length similarity measure which has a weight of 0.25. The similarity threshold parameter has a value of 0.5. Since the length and the direct neighborhood weights together already sum to 0.5, these two measures, based on previous experience, have a decisive role in confirming the matching objects.
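The role of the two heavy weights can be checked with a small calculation. The sketch assumes that the total similarity of a matching object is the weighted sum of its individual similarity scores; the dictionary keys are shorthand names introduced here, and the values are those of table 5.2.

```python
EMPIRICAL_WEIGHTS = {
    "angle": 0.1,
    "end_point": 0.1,
    "length": 0.25,
    "name": 0.1,
    "object_type": 0.1,
    "start_point": 0.1,
    "direct_neighborhood": 0.25,
}
SIMILARITY_THRESHOLD = 0.5


def total_similarity(scores: dict, weights: dict = EMPIRICAL_WEIGHTS) -> float:
    """Weighted sum of the individual similarity scores (each in [0, 1])."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


# A pair that is perfect on length and direct neighborhood only.
only_two = {"length": 1.0, "direct_neighborhood": 1.0}
reaches_threshold = total_similarity(only_two) >= SIMILARITY_THRESHOLD
```

Under this assumption, perfect length and direct neighborhood scores alone reach the threshold, while the other five measures reach it only if all of them are perfect at once.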

The second way of evaluation (2) is to configure SimMatching with the learned parameters (and the ones in table 5.1) and run it in order to evaluate its matching results. The matching results (matches and non-matches) of SimMatching are examined visually, because a reference test set is not available. Certain areas of the maps are manually checked and compared against the matching results of the empirically set parameters. The number of detected matches and non-matches is also compared.


5.1 The experiments on Link 1

5.1.1 Equal initial weights for seed generation

Ten models are learned and up to 100 comparison vectors are labeled and added to the training data in each round of active learning. The initial weights used in seed generation are equal, as shown in table 5.3. The weight of the direct neighborhood similarity is zero, as specified in section 3.7.

Parameter                           Value
SimilarityThreshold                 0.5
AngleSimilarityWeight               0.167
EndPointSimilarityWeight            0.167
LengthSimilarityWeight              0.167
NameSimilarityWeight                0.166
ObjectTypeSimilarityWeight          0.166
StartPointSimilarityWeight          0.167
DirectNeighborhoodSimilarityWeight  0.0

Table 5.3: The initial parameters used in seed generation.

The seed generation step labels 40 comparison vectors to be the initial training set, i.e., the seed. The first SVM model learned from this seed is given in table 5.4. The weight of the direct neighborhood similarity is set to 0.1 (see section 3.7). We can see that the weight of the length similarity is very low and far from the empirical one in table 5.2. The object type measure has the highest weight.

The fifth learned SVM model is given in table 5.4 as well. The threshold increases to 0.6, which in theory means that fewer matches are confirmed. This may explain why the direct neighborhood similarity is still just above its start value. The object type similarity drops to 0.09. The name similarity decreases dramatically, while the weight of the length similarity is almost equal to its empirical value. The angle, the start point and the end point similarity measures have almost equal weights which are above their empirically set ones.

The parameters learned in the last iteration of active learning are given in the last column of table 5.4. These have been trained from 796 training comparison vectors: 535 labeled as "match" and 261 labeled as "non-match". The length similarity measure gets the highest weight, which is close to its empirical weight. It is not very different from the weight learned in the fifth iteration, which means that the importance of this measure is significant and can already be seen in early trained models.


Parameter                           Value (1)  Value (5)  Value (10)
SimilarityThreshold                 0.55       0.6        0.49
AngleSimilarityWeight               0.07       0.16       0.16
EndPointSimilarityWeight            0.15       0.18       0.12
LengthSimilarityWeight              0.09       0.24       0.22
NameSimilarityWeight                0.14       0.03       0.06
ObjectTypeSimilarityWeight          0.31       0.09       0.11
StartPointSimilarityWeight          0.14       0.19       0.13
DirectNeighborhoodSimilarityWeight  0.1        0.11       0.2

Table 5.4: The parameters learned in the first iteration (from the seed), in the fifth iteration and in the last iteration of active learning for Link 1.

The name similarity has the smallest weight of all, 0.06. The reason for this may be the string similarity measure used, which returns 1 if and only if both strings are exactly equal. A single character difference between the two results in a similarity score of zero. A different string similarity measure may contribute more to the matching process and might even be decisive in detecting matches.
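The difference between an exact and a graded string similarity can be illustrated with Python's standard library. The first function mirrors the behaviour described above; the second, based on difflib.SequenceMatcher, is a hypothetical alternative chosen here for illustration and was not evaluated in this work.

```python
from difflib import SequenceMatcher


def exact_name_similarity(a: str, b: str) -> float:
    """Behaviour described in the text: 1 only for identical strings."""
    return 1.0 if a == b else 0.0


def graded_name_similarity(a: str, b: str) -> float:
    """Hypothetical alternative: ratio of matching characters in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()
```

With the graded variant, two spellings of the same street name such as "Hauptstraße" and "Hauptstrasse" would still receive a high score instead of zero.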

The object type similarity ends up with almost its empirical weight, although it was considered very important after the first iteration. The weights of the angle, the start point and the end point similarity measures have not changed much since the fifth iteration. However, they are still above their empirical values. Together, the attribute similarity measures have a total weight of 0.8. The threshold gets a value of 0.49 at the end, which is almost the same as its empirically set one.

The direct neighborhood similarity (DNS) ends up with a weight of 0.2. Its weight increases gradually over the last five iterations: the weights learned for DNS from the sixth to the tenth iteration are 0.12, 0.13, 0.15, 0.16 and 0.2, respectively. The reason may be that the learned weights of the other similarity measures improve significantly in the last iterations. As a result, more matching objects are confirmed, which increases the DNS scores of their neighbors. When the comparison vectors built from such matching objects (with very high DNS scores) are queried and labeled, they are most likely to be classified as "match". This leads to more training comparison vectors with high DNS scores which are labeled as "match", and thus DNS plays an increasingly important role in detecting matches.

SimMatching, configured with the parameters shown in the last column of table 5.4, has confirmed 52693 matching objects and rejected 37785. Figure 5.1 shows an example of these results where the learned parameters lead to high quality. Figure 5.2 shows the same example matched using the empirically set parameters presented in table 5.2, where SimMatching has confirmed 52002 objects and rejected 40002.


Figure 5.1: An example of SimMatching results for Link 1 using the learned parameters from table 5.4. The matches are in green, ATKIS objects with no match found are in red and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

Figure 5.2: An example of SimMatching results for Link 1 using the empirical parameters from table 5.2. The matches are in green, ATKIS objects with no match found are in red and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.


5.1.2 Empirical initial weights for seed generation

In section 3.7 we discussed two ways to generate the seed, using two different settings of the initial weights of the similarity measures. The first way is to give each similarity measure the same weight. The second way is to use empirically set initial weights which have proven to bring good matching results. We argued that there is no difference, since we generate only "obvious" match and non-match vectors to be labeled and added to the seed. In this test, we investigate generating the seed from the empirically set parameters in table 5.2.

As in the previous test, the seed consists of 40 labeled comparison vectors and up to 100 are labeled in each iteration of active learning. Table 5.5 shows the first, the fifth and the last learned parameters. DNS has a higher weight in the first model than the fixed one given to it previously, because it is now learned. The last model has a length similarity weight (0.26) which is closer to its empirical value than in the previous test (0.22), but its DNS weight (0.15) is further from the empirical value. The threshold value is very high in the first learned model.

Parameter                           Value (1)  Value (5)  Value (10)
SimilarityThreshold                 0.62       0.59       0.54
AngleSimilarityWeight               0.15       0.33       0.24
EndPointSimilarityWeight            0.18       0.1        0.11
LengthSimilarityWeight              0.22       0.19       0.26
NameSimilarityWeight                0.08       0.03       0.05
ObjectTypeSimilarityWeight          0.08       0.06       0.05
StartPointSimilarityWeight          0.14       0.17       0.14
DirectNeighborhoodSimilarityWeight  0.15       0.12       0.15

Table 5.5: The parameters learned in the first iteration (from the seed), in the fifth iteration and in the last iteration of active learning for Link 1 when using empirical weights for seed generation.

Nevertheless, the queries do not show any remarkable difference between the learned models with regard to how "difficult" the queries are. The first learned model is still "naive" and does not seem to be better than the one learned in section 5.1.1. This is due to the fact that compared pairs which are very similar or very dissimilar have high and low total similarity values, respectively, regardless of the weight combinations of the similarity measures or the constructed aggregations. Thus, the comparison vectors which are added to the seed in both ways have very much in common and result in first trained models of almost the same quality.


5.2 The experiment on Link 2

Ten models are learned and up to 100 comparison vectors are labeled and added to the training data in each round of active learning. The initial weights used in seed generation are equal, as shown in table 5.3. The weight of the direct neighborhood similarity is zero, as specified in section 3.7. The SVM models learned in the first, fifth and tenth iteration of active learning are presented in table 5.6. In the first model, which was learned from 40 training instances, the direct neighborhood similarity (DNS) has a weight of 0.1 and the similarity threshold increases to 0.61. The name similarity has the highest weight among all measures.

Parameter                           Value (1)  Value (5)  Value (10)
SimilarityThreshold                 0.61       0.64       0.57
AngleSimilarityWeight               0.13       0.4        0.22
EndPointSimilarityWeight            0.14       0.06       0.15
LengthSimilarityWeight              0.14       0.22       0.23
NameSimilarityWeight                0.22       0.04       0.06
ObjectTypeSimilarityWeight          0.11       0.06       0.06
StartPointSimilarityWeight          0.16       0.11       0.12
DirectNeighborhoodSimilarityWeight  0.1        0.11       0.16

Table 5.6: The parameters learned in the first iteration (from the seed), in the fifth iteration and in the last iteration of active learning for Link 2.

The fifth learned model has much in common with the one for link 1. The weights of the name and object type similarities drop once again and the threshold increases. The length similarity weight is high and close to the empirically set one, while the DNS weight is around its initial value. The highest weight, however, is assigned to the angle similarity measure, whereas low weights are given to the start and end point measures. This model is clearly still relatively far from reality, since two roads far from each other may be exactly parallel and have the same length and still be a non-match.

The parameters learned for link 2 in the last iteration of active learning are presented in the last column of table 5.6. The parameters have been learned from 940 training comparison vectors: 557 labeled as "match" and 383 labeled as "non-match". The length similarity weight is very close to its empirically set value (see table 5.2) and close to its weight in the fifth iteration. Once again this result conforms with the result of the first experiment on link 1 (see section 5.1.1) and stresses the importance of the length measure in the matching process.

The weight of the name similarity is low, as explained in test 5.1.1. The object type similarity measure has a weight of 0.06, whereas the weight reported in test 5.1.1 was around the empirical value. This might be due to the nature of the object type similarity measure, which only considers the type of the compared objects regardless of their positioning on the map. Matches and non-matches which are very clear to identify may have the same object type similarity score, which makes this measure less useful to the matching process.

The angle similarity has a high weight this time. The start point and end point similarities get weights above the empirical ones. These three measures have a total weight of 0.49 and thus play a major role in detecting matches. The DNS weight increases in the last five iterations to 0.16, which conforms with the observation made in section 5.1.1. The DNS weight, however, is lower than and relatively far from the empirically set one. This may be due to the high threshold value of 0.57 causing fewer matching objects to be confirmed.

SimMatching, configured with the parameters shown in the last column of table 5.6, has confirmed 53515 matching objects and rejected 71253. Figure 5.3 shows an example of these results where the learned parameters lead to high quality. Figure 5.4 shows the same example matched using the empirically set parameters presented in table 5.2, where SimMatching has confirmed 53324 objects and rejected 77152.

Figure 5.3: An example of SimMatching results for Link 2 using the learned parameters from table 5.6. The matches are in green, OSM objects with no match found are in orange and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.


Figure 5.4: An example of SimMatching results for Link 2 using the empirical parameters from table 5.2. The matches are in green, OSM objects with no match found are in orange and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

5.3 The experiment on Link 3

Similar to the previous tests, ten models are learned and up to 100 comparison vectors are labeled and added to the training data in each round of active learning. However, the object type similarity measure cannot be used with this kind of link. The initial weights used in seed generation are equal, as shown in table 5.7. The seed generation step labels 40 vectors to be the initial training set from which the first model is learned.

Parameter                           Value
SimilarityThreshold                 0.5
AngleSimilarityWeight               0.2
EndPointSimilarityWeight            0.2
LengthSimilarityWeight              0.2
NameSimilarityWeight                0.2
StartPointSimilarityWeight          0.2
DirectNeighborhoodSimilarityWeight  0.0

Table 5.7: The initial parameters used in seed generation for link 3.


The first, fifth and last learned SVM models are given in table 5.8. The first model has an almost equal weight for each similarity measure, with the name measure getting the highest weight. In the fifth model, however, the name similarity weight drops to 0.1. The length similarity measure, as in previous tests, gets a weight around its empirically set value. The start point and the end point similarity measures have almost the same weight. Their weights are relatively high, which might explain why the weight of the angle similarity measure is low. The threshold drops from 0.64 to 0.53, causing more matches to be confirmed, and thus, as discussed in 5.1.1, the DNS weight is high here.

Parameter                             Value (1)   Value (5)   Value (10)
SimilarityThreshold                   0.64        0.53        0.52
AngleSimilarityWeight                 0.16        0.07        0.16
EndPointSimilarityWeight              0.17        0.20        0.19
LengthSimilarityWeight                0.19        0.22        0.18
NameSimilarityWeight                  0.21        0.1         0.1
StartPointSimilarityWeight            0.17        0.21        0.19
DirectNeighborhoodSimilarityWeight    0.1         0.2         0.18

Table 5.8: The parameters learned in the first iteration (from the seed), in the fifth iteration and in the last iteration of active learning for Link 3.
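
A linear SVM yields a decision function of the form w·x + b. Dividing by the sum of the coefficients gives weights that add up to one and a threshold on the weighted sum, which is the shape of the parameters in table 5.8. The sketch below shows this rescaling. It is only one plausible reading of how a linear decision boundary relates to such parameters, not necessarily the conversion used by the optimizer. The coefficient values are invented, and the sketch assumes that all coefficients are non-negative with a positive sum.

```python
def hyperplane_to_parameters(coefficients, bias):
    """Rescale w.x + b >= 0 into sum(weight_i * x_i) >= threshold.

    Assumes non-negative coefficients whose sum is positive, so that
    dividing by the sum does not flip the inequality.
    """
    total = sum(coefficients.values())
    if total <= 0 or any(c < 0 for c in coefficients.values()):
        raise ValueError("sketch only covers non-negative coefficients")
    weights = {name: c / total for name, c in coefficients.items()}
    threshold = -bias / total
    return weights, threshold


# Invented hyperplane, not a model from the experiments.
coefficients = {"angle": 0.8, "end_point": 1.0, "length": 1.2, "name": 1.0}
bias = -2.0

weights, threshold = hyperplane_to_parameters(coefficients, bias)
print(weights, threshold)
```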

The parameters learned for link 3 in the last iteration of active learning are presented in the last column of table 5.8. 912 training comparison vectors were used for learning: 568 labeled as “match” and 344 labeled as “non-match”. The model learned in the last iteration has a threshold value close to the empirically set one, and the same holds for the weight of the name similarity measure. All other similarity measures now get almost the same weight.

This remarkable result may come from the following observation made on the data set. The ATKIS and OSM data sets follow similar modelling rules in order to represent roads as close to reality as possible. In addition, the ATKIS and OSM maps of Munich largely overlap. The matched spatial objects are thus similar to each other in several respects. As a result, no similarity measure is more useful than the others in detecting matches, i.e., all the similarity measures are equally useful.

SimMatching, configured with the parameters shown in the last column of table 5.8, has confirmed 77806 matching objects and rejected 13759. Figure 5.5 shows an example of these results where the learned parameters lead to high quality. Figure 5.6 shows the same example matched using the empirically set parameters presented in table 5.2, where SimMatching has confirmed 77162 objects and rejected 14825.


Figure 5.5: An example of SimMatching results for Link 3 using the learned parameters from table 5.8. The matches are in green, ATKIS objects with no match found are in red and non-matched OSM objects are in orange. The small circles indicate the end points of line segments.

Figure 5.6: An example of SimMatching results for Link 3 using the empirical parameters from table 5.2. The matches are in green, ATKIS objects with no match found are in red and non-matched OSM objects are in orange. The small circles indicate the end points of line segments.


Chapter 6

Conclusion

Active learning of SVMs is used to learn the parameters of SimMatching. The optimizer chooses the best comparison vectors for training according to their total similarity score. The optimizer constructs comparison vectors which compare 1:1 aggregations of line segments. The results of the experiments show that the weights of the length, angle, start point and end point similarity measures can be learned efficiently. However, the optimizer should be adapted to the properties of the direct neighborhood similarity in order to learn its weight properly.

The results show clearly that the length measure is the most useful similarity measure in general. Some data sets (maps), e.g., D20, are adjusted so that they print well on paper and display nicely on screen. When these data sets are compared with maps representing the real road network exactly as it is, similar areas of the maps may not overlap properly. The matched objects may be shifted or rotated relative to each other, which makes the angle, start point and end point similarity measures less useful. The lengths of the roads remain unchanged, however, which makes the length similarity very useful in identifying matched objects that are relatively far from or not parallel to each other.

The learned weights of the direct neighborhood similarity (DNS) are below the empirical expectations, especially in the second test in section 5.2. This is due to the applied learning approach, which does not have enough “context” for learning the weight of DNS. More precisely, the reason lies in the way in which comparison vectors are constructed. Using the SimMatching functionality as explained in chapter 3 allows a line segment to be part of at most one comparison vector. This does not allow the training data set to reflect the nature of DNS: DNS can identify the true match among many match candidates which are equally similar.

In order to learn the weight of DNS more efficiently, context information should be included in the training set. Many comparison vectors which refer to the same line segment should be constructed, queried and added. If there are enough matched neighbors, the comparison vector of the true match is the only one with a high DNS score. This vector would hence be labeled as “match”, and the other comparison vectors, which have low (or even zero) DNS scores, would be labeled as “non-match”. Consequently, a bigger proportion of the training comparison vectors labeled as “match” would have high DNS scores, while more comparison vectors labeled as “non-match” would have low DNS scores. As a result, the trained model should have a high DNS weight.

We must make sure, however, that the active learner queries the comparison vectors constructed above so that they can be added to the training data. The active learner might not consider them useful training instances. In addition, human oracles may not consider such queries helpful either. This can be solved, for example, by automatically labeling and adding these comparison vectors after the oracle has confirmed a comparison vector with a high DNS score as a “match”.
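
The proposed automatic labeling can be sketched as follows. All names (the `Candidate` record, the function `label_with_context` and the DNS cut-off of 0.5) are hypothetical and only illustrate the idea; nothing here is part of the existing optimizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One comparison vector for a fixed line segment (hypothetical structure)."""
    candidate_id: str
    dns_score: float
    label: str = "unlabeled"


def label_with_context(candidates, confirmed_id, dns_cutoff=0.5):
    """Label all candidates of one segment after the oracle confirmed one.

    The confirmed candidate becomes a "match". The remaining candidates are
    labeled "non-match" automatically, but only if the confirmed candidate
    really had a high DNS score (the cut-off value is an assumption).
    """
    confirmed = next(c for c in candidates if c.candidate_id == confirmed_id)
    confirmed.label = "match"
    if confirmed.dns_score >= dns_cutoff:
        for c in candidates:
            if c is not confirmed:
                c.label = "non-match"
    return candidates


# Three equally similar candidates for the same segment; only one of them
# has matched neighbors and therefore a high DNS score.
candidates = [
    Candidate("a", dns_score=0.9),
    Candidate("b", dns_score=0.0),
    Candidate("c", dns_score=0.1),
]
labeled = label_with_context(candidates, confirmed_id="a")
print([(c.candidate_id, c.label) for c in labeled])
```

In this sketch the oracle is asked only once per line segment, and the remaining candidates enter the training data without further queries.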

Two automatic ways to generate the seed have been investigated: using equal initial weights and exploiting empirical weights which have previously proven to be good. We did not notice any big difference between them in the experiments we made (see section 5.1.2). However, the initial training set can also be obtained by constructing and labeling a small set of comparison vectors manually. Since plenty of highly similar and dissimilar line segments can be found visually inside the compared subsets, this can be done with limited effort. The important thing to remember here is to provide both matches and non-matches.

Future work can investigate, in addition to DNS weight learning, issues related to active learning, for example, what effect the number of iterations has on the quality of the learned models. The number of labeled comparison vectors in each iteration of active learning can be tested as well. The proportion of matches to non-matches in the training data can be examined. Better ways may exist to stop active learning based on the quality of the learned model instead of defining a maximum number of iterations. For instance, the learning algorithm may compare the newly learned parameters to the previous ones and stop if no big differences exist.
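
The stopping rule mentioned last could look like the following sketch. The function names, the tolerance of 0.01 and the sequence of models are assumptions made for illustration; `learn_next_model` stands for one complete round of querying, labeling and training.

```python
def has_converged(previous, current, tolerance=0.01):
    """True if no parameter changed by more than the tolerance."""
    return all(abs(current[k] - previous[k]) <= tolerance for k in current)


def run_active_learning(learn_next_model, max_iterations=10, tolerance=0.01):
    """Stop early once two consecutive models are almost identical.

    learn_next_model(i) must return the parameters learned in round i
    as a dict. The maximum number of iterations remains as a safeguard.
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        current = learn_next_model(iteration)
        if previous is not None and has_converged(previous, current, tolerance):
            return current, iteration
        previous = current
    return previous, max_iterations


# Invented sequence of learned parameters, one entry per round.
models = [
    {"threshold": 0.64, "length": 0.19},
    {"threshold": 0.55, "length": 0.21},
    {"threshold": 0.53, "length": 0.22},
    {"threshold": 0.53, "length": 0.22},
]
params, stopped_at = run_active_learning(
    lambda i: models[min(i, len(models)) - 1]
)
print(params, stopped_at)
```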


Appendix A

RapidMiner Application GUI

RapidMiner processes are used to fulfil data mining and machine learning tasks (and other tasks, e.g., data preprocessing). Each process consists of operators which are connected to each other. Processes can easily be created using the RapidMiner application GUI, but they are internally represented in XML. RapidMiner allows defining breakpoints to examine processes in detail. It displays error messages and suggests solutions to problems at process creation time. That is why creating processes with RapidMiner can be seen as a sort of programming using a GUI.

Figure A.1: RapidMiner GUI


The RapidMiner GUI is divided into perspectives and views. Processes are created inside the design perspective and their results are shown in the result perspective. The two most important views inside the design perspective are the operators view and the process view. Operators can be added to processes by dragging and dropping them from the operators view to the process view. Figure A.1 shows the design perspective. The XML describing the designed process is generated automatically inside the XML view.

RapidMiner contains a wide range of operators for almost every task, e.g., operators for input, output, modelling, etc. The data flows between operators through the connections defined between them. RapidMiner introduces the concept of ports for connecting operators. Operators have input port(s) and output port(s). Each operator usually performs one specific task, e.g., learning a model on the inputs in order to provide its output. The way the task is done can be configured through the parameters of the operator. Figure A.2 shows the “SVM” operator (left).

Figure A.2: The SVM operator (left) and its parameters (right)

The parameters of the “SVM” operator (right) are displayed in the parameters view. The kernel type parameter is set to dot, which specifies that the trained SVM boundary should be linear. The C parameter is the complexity parameter, which is set inside our optimizer through cross-validation.

Operators can be nested as well. Some operators have one or more sub-processes. Operators can be added to these sub-processes as usual. The input of the operator is provided to the sub-process, which in turn can pass its output data to the output of the parent operator (processes have input and output ports, too). A small icon at the bottom of the operator box indicates that the operator has a sub-process which should be implemented. Double clicking the operator opens the sub-process in the process view. Figure A.3 shows the “Validation” operator.


Figure A.3: The cross validation operator

Figure A.4 illustrates the two sub-processes of the “Validation” operator. The first one is called the training sub-process, which should train a model on n − 1 folds. The trained model can be passed to the testing sub-process through the model port (mod). The testing sub-process should return the performance of the previously trained model measured on the remaining fold. These steps are repeated n times, where n is the number of folds specified inside the parameters of the “Validation” operator. The average of the n performances is usually returned as the output of the operator.

Figure A.4: The two sub-processes of the cross validation operator
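
The n-fold procedure described above can be written down independently of RapidMiner. The following Python sketch is an illustration only: the way the folds are formed (interleaved), the toy threshold learner and the data are assumptions, and the two callables `train` and `evaluate` play the roles of the training and testing sub-processes.

```python
def cross_validate(examples, n_folds, train, evaluate):
    """Average performance over n folds.

    train(training_examples) returns a model; evaluate(model, test_examples)
    returns a performance value for the fold that was left out.
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]
    performances = []
    for i, test_fold in enumerate(folds):
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)
        performances.append(evaluate(model, test_fold))
    return sum(performances) / n_folds


def train_threshold(training):
    """Toy learner: midpoint between the mean scores of both classes."""
    matches = [s for s, is_match in training if is_match]
    non_matches = [s for s, is_match in training if not is_match]
    return (sum(matches) / len(matches) + sum(non_matches) / len(non_matches)) / 2


def accuracy(threshold, test):
    """Share of test examples classified correctly by the threshold."""
    correct = sum((s >= threshold) == is_match for s, is_match in test)
    return correct / len(test)


# Made-up (total similarity score, is_match) pairs.
data = [(0.9, True), (0.2, False), (0.8, True), (0.3, False),
        (0.7, True), (0.1, False), (0.6, True), (0.4, False)]
result = cross_validate(data, n_folds=4, train=train_threshold, evaluate=accuracy)
print(result)
```

The toy learner requires that every training set contains both classes, which holds for the data above.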


Appendix B

SimMatching Related Classes

We present here the classes of SimMatching which are highly relevant to the optimizer but were not presented in the implementation in chapter 4. Two terms, which correspond to two classes, have been inherited from SimMatching and used inside the optimizer: the matching objects and the link objects. The CDBLink class (figure B.1) has two ComponentDB objects from which to obtain the preprocessed spatial objects (features) of the databases being matched. The Matching class (figure B.2) has two Feature attributes which are the compared spatial objects. The calculateSimilarity() method calculates the total similarity score of the object as a weighted sum of the similarity measures associated with the link.

Figure B.1: The CDBLink and ComponentDB classes of SimMatching.


Figure B.2: The Matching class of SimMatching.

We present next three other classes: the Visualizer, the PreprocessManager and the DataProvider. We have slightly changed these classes by adding methods which are required for the optimizer. In the Visualizer, we have implemented selecting, zooming and highlighting the two compared spatial objects on the map in order to query them. A queried spatial object can be a line segment, so the DataProvider is used to obtain it. If the spatial object is aggregated from more than one segment, the PreprocessManager class is used to get it.


Figure B.3: The Visualizer class of SimMatching.

Figure B.4: The PreprocessManager class of SimMatching.


Figure B.5: The DataProvider class of SimMatching.


List of Figures

2.1 Supervised machine learning; source [Fla12], updated

2.2 Unsupervised machine learning; source [Fla12]

2.3 A decision tree with two classes: married and single

2.4 Support vector machine in a two-dimensional space; source [Fla12]

2.5 A separating hyperplane with two outliers; source [MRT12]

2.6 The version space and the target function; source [Set12]

2.7 Data matching steps of two databases; source [Chr12], updated

4.1 Class diagram of important classes.

4.2 The ComparisonVector class.

4.3 The TrainingDataManager class (public view only).

4.4 Querying with OpenJUMP.

4.5 Querying with OpenJUMP - the info tool.

4.6 Activity diagram query(): querying a set of ComparisonVector objects.

4.7 Activity diagram generateSeed(): creates a set of labeled ComparisonVector objects from which to start active learning. This diagram is very similar to the selectBestQuery() diagram 4.17.

4.8 The RapidMiner process which learns the SVM decision boundary.

4.9 The sub process of the “Optimize Parameters (Grid)” operator.

4.10 The training sub process of the “Validation” operator.

4.11 The testing sub process of the “Validation” operator.

4.12 Activity diagram trainSVM(): learns the SVM decision boundary as a SVMModel object.

4.13 Class diagram: SVMModel, SVMLearner and SVMActiveLearner.

4.14 Activity diagram: active learning of SimMatching parameters.


4.15 The SimMatchingWorker class of SimMatching.

4.16 The PartitionSeedQuery class of the optimizer.

4.17 Activity diagram selectBestQuery(): finding the best query.

4.18 The SimMatchingConfig class of SimMatching.

5.1 An example of SimMatching results for Link 1 using the learned parameters from table 5.4. The matches are in green, ATKIS objects with no match found are in red and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

5.2 An example of SimMatching results for Link 1 using the empirical parameters from table 5.2. The matches are in green, ATKIS objects with no match found are in red and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

5.3 An example of SimMatching results for Link 2 using the learned parameters from table 5.6. The matches are in green, OSM objects with no match found are in orange and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

5.4 An example of SimMatching results for Link 2 using the empirical parameters from table 5.2. The matches are in green, OSM objects with no match found are in orange and non-matched D20 objects are in blue. The small circles indicate the end points of line segments.

5.5 An example of SimMatching results for Link 3 using the learned parameters from table 5.8. The matches are in green, ATKIS objects with no match found are in red and non-matched OSM objects are in orange. The small circles indicate the end points of line segments.

5.6 An example of SimMatching results for Link 3 using the empirical parameters from table 5.2. The matches are in green, ATKIS objects with no match found are in red and non-matched OSM objects are in orange. The small circles indicate the end points of line segments.

A.1 RapidMiner GUI

A.2 The SVM operator (left) and its parameters (right)

A.3 The cross validation operator

A.4 The two sub-processes of the cross validation operator

B.1 The CDBLink and ComponentDB classes of SimMatching.

B.2 The Matching class of SimMatching.

B.3 The Visualizer class of SimMatching.


B.4 The PreprocessManager class of SimMatching.

B.5 The DataProvider class of SimMatching.


Bibliography

[AM98] N. Abe, H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. In ICML. 1998, 1–9.

[Bur98] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov., 2(2), 1998, 121–167.

[Chr12] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-centric systems and applications. Springer, 2012.

[CMZ09] S. Chen, B. Ma, K. Zhang. On the similarity metric and the distance metric. Theor. Comput. Sci., 410(24-25), 2009, 2365–2376.

[DE95] I. Dagan, S. P. Engelson. Committee-Based Sampling For Training Probabilistic Classifiers. In ICML. 1995, 150–157.

[Fla12] P. Flach. Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, 2012.

[Kap] B. Kappen. Website: Machine Learning in the Netherlands. URL http://www.mlplatform.nl/what-is-machine-learning/.

[KL51] S. Kullback, R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 1951, 79–86.

[LM04] U. W. Lipeck, D. Mantel. Datenbankgestütztes Matching von Kartenobjekten. Mitteilungen des Bundesamtes für Kartographie und Geodäsie, 31, 2004, 145–153.

[MRT12] M. Mohri, A. Rostamizadeh, A. Talwalkar. Foundations of machine learning. The MIT Press, 2012.

[Ols09] F. Olsson. A literature survey of active machine learning in the context of natural language processing. SICS Technical Reports, Swedish Institute of Computer Science, 2009. URL http://soda.swedish-ict.se/3600/.

[Qui86] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1), 1986, 81–106.


[Sch13] M. Schäfers. Ähnlichkeitsbasiertes Matching linienförmiger räumlicher Objekte. Dissertation (in preparation), Institut für Praktische Informatik, Leibniz Universität Hannover, Hannover, 2013.

[Set12] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012.

[Sha48] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3), 1948, 379–423.

[SL12] M. Schäfers, U. W. Lipeck. Similarity-based Matching of Spatial Objects. Technical Report Informatik-Berichte DB-01/2012, Institut für Praktische Informatik, Leibniz Universität Hannover, Hannover, 2012.

[TKM02] S. Tejada, C. A. Knoblock, S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002, 350–359.

[TL07] M. Tiedge, U. W. Lipeck. Graphbasiertes Matching in räumlichen Datenbanken. In Grundlagen von Datenbanken. 2007, 42–46.


Declaration

I hereby declare that I have written this thesis and the accompanying implementation independently and have used only the sources and aids indicated.

Hannover, 22 October 2013

Joseph Eid