ATHABASCA UNIVERSITY

Modeling Uncertainty in datasets with FN-DBScan and Bayesian Networks

BY

AARON R. ULRICH

A project submitted in partial fulfillment

Of the requirements for the degree of

MASTER OF SCIENCE in INFORMATION SYSTEMS

Athabasca, Alberta

May, 2012

© Aaron R. Ulrich, 2012


DEDICATION

This work is dedicated to my loving wife Nicole and daughter Sophia.


ABSTRACT

This experiment investigates the effectiveness of a hybrid AI solution in a semi-supervised learning scenario. To achieve this, the hybrid applies clustering and classification techniques to the IRIS dataset and records statistics used to determine its level of success. The clustering methods consist of DBScan and FN-DBScan, whereas classification uses a Bayesian Network with hill-climbing search and a simple estimator. The measurements of the solution's effectiveness serve to answer the following three questions. First, is FN-DBScan more effective than the original DBScan? Next, can additional accuracy be gained by using a classification step to classify the noise produced by the clustering process? And finally, is there a significant performance gain in using classification for ad-hoc data rather than re-running the clustering process? The methodology first discovers an optimal set of parameters for the algorithms in question and then applies three rounds of 10,000 10-fold cross-validation iterations on the dataset to establish averages for the measurements. The outcome of this experiment empirically implies that FN-DBScan is indeed more effective than DBScan, that additional accuracy can be gained by classifying noise in some cases, and that using classification to assign ad-hoc data to clusters offers significant performance gains. It is also interesting to note that although DBScan ran faster than FN-DBScan, the classifier built from FN-DBScan's cluster set was able to classify data faster. Additionally, there appears to be an optimal noise level at which the classifier provides gains in accuracy. Although these results look promising, future work should provide additional verification by extending this experiment to additional datasets.


ACKNOWLEDGMENTS

I would like to acknowledge the support and encouragement of Larbi Esmahi and Dragan Gasevic who inspired

many aspects of my research.


TABLE OF CONTENTS

CHAPTER I - Introduction ........................................................................................................................... 9 

1.1 Statement of the Purpose................................................................................................................... 10 

1.2 Chapter Organization ........................................................................................................................ 10 

1.3 Definition of Terms ........................................................................................................................... 10 

CHAPTER II – Review of Related Literature ............................................................................................ 12 

2.1 Uncertainty ........................................................................................................................................ 12 

2.2 Data Clustering ................................................................................................................................. 21 

2.3 Applications for uncertainty .............................................................................................................. 26 

2.4 Summary ........................................................................................................................................... 27 

CHAPTER III – Methodology and Design ................................................................................................. 28 

3.1 The Experiment Tools ....................................................................................................................... 28 

3.2 Measurements & Parameters ............................................................................................................ 28 

3.3 The System Design Overview ........................................................................................................... 30 

3.4 The System Design Components ...................................................................................................... 32 

3.5 Summary ........................................................................................................................................... 38 

CHAPTER IV - Analysis ............................................................................................................................ 39 

4.1 Parameter Selection for the IRIS Dataset.......................................................................................... 39 

4.2 Optimal Parameter Discovery for DBScan and Bayes’ HC/SE ........................................................ 40 

4.3 Optimal Parameter Discovery for FN-DBScan and Bayes’ HC/SE ................................................. 42 

4.4 Optimal Parameter Analysis ............................................................................................................. 44 

4.5 Summary ........................................................................................................................................... 46 

CHAPTER V – Recommendations and Conclusions ................................................................................. 47 

5.1 Suggestions for Further Research ..................................................................................................... 47 

5.2 Conclusions ....................................................................................................................................... 48 

REFERENCES ........................................................................................................................................... 49 


LIST OF EQUATIONS

Equation 1 – Posterior Probability ............................................................................................................................... 13

Equation 2 - Posterior Probability #2 ......................................................................................................................... 13

Equation 3 – Bayesian Network JPD Factorization ..................................................................................................... 14

Equation 4 – Defuzzification Center of Gravity Calculation ....................................................................................... 21

Equation 5 – Euclidean Distance ................................................................................................................................. 23

Equation 6 – MinMax Normalization .......................................................................................................................... 23

Equation 7 – Cardinality Calculation........................................................................................................................... 25

Equation 8 – Exponential Neighborhood Relation Function ....................................................................................... 25

Equation 9 – Permutation Calculation ......................................................................................................................... 33


LIST OF TABLES

Table 1: Permutation Parameters ................................................................................................................................. 29

Table 2: Accuracy Measures ....................................................................................................................................... 29

Table 3: Timing Measures ........................................................................................................................................... 30

Table 4: Parameter Permutation Input Rules ............................................................................................................... 33

Table 5: DBScan vs. FN-DBScan ............................................................................................................................... 36

Table 6: DBScan Optimal Parameters ......................................................................................................................... 41

Table 7: FN-DBScan Optimal Parameters ................................................................................................................... 43

Table 8: Accuracy Analysis ......................................................................................................................................... 44

Table 9: Noise Classification Analysis ........................................................................................................................ 45

Table 10: Time Analysis .............................................................................................................................................. 45


LIST OF FIGURES

Figure 1: Trained Bayesian Network Visualization ..................................................................................................... 15

Figure 2: One Dimensional Membership Functions .................................................................................................... 16

Figure 3: Composite Membership Functions ............................................................................................................... 16

Figure 4: The Age Variable ......................................................................................................................................... 17

Figure 5: Deriving hedge terms ................................................................................................................................... 17

Figure 6: Simple Inference Engine .............................................................................................................................. 18

Figure 7: Fuzzification – Crisp Value to Fuzzy Value ................................................................................................ 19

Figure 8: Probability of Success .................................................................................................................................. 20

Figure 9: E-Neighborhood ........................................................................................................................................... 23

Figure 10: Density Connectivity .................................................................................................................................. 24

Figure 11: System Overview ....................................................................................................................................... 31

Figure 12: System Data Flow ...................................................................................................................................... 32

Figure 13: Permutation Engine .................................................................................................................................... 33

Figure 14: Data Engine ................................................................................................................................................ 34

Figure 15: AI Engine ................................................................................................................................................... 35

Figure 16: DBScan Parameter Analysis ...................................................................................................................... 41

Figure 17: High accuracy membership counts for e-radius ......................................................................................... 41

Figure 18: FN-DBScan Parameter Analysis ................................................................................................. 43


CHAPTER I

INTRODUCTION

We exist in a world rife with uncertain and imprecise data. The era when rigid boundaries between classifications could satisfy our analytic requirements is gone. In this world, modeling uncertainty provides a way to deal with an all-too-often partial view of reality for an ever-growing flow of data, and it is this perpetual flow of data which drives an insatiable need for effective clustering and rapid classification of real-world data.

Data clustering is a mostly unsupervised method used to group data into natural partitions while requiring a

minimal amount of human interaction. A popular clustering strategy is known as Density Based Spatial Clustering

of Applications with Noise (DBScan) which is capable of discovering arbitrarily shaped clusters and detecting noise.

A technique known as Fuzzy-Neighborhood DBScan (FN-DBScan) enhances the mechanism the original DBScan uses to discover core points; core points form the backbone of a cluster's shape. FN-DBScan is based on the

idea of passing a Fuzzy Neighborhood Relation Function (F-NRF) as a parameter to the clustering algorithm. The F-

NRF is a flexible concept which determines the strength with which a given point gravitates towards becoming a

core point based on its neighborhood. A common F-NRF is exponential, but it is also trivial to model the original

DBScan’s point count using this concept.

Data classification is used to classify new data based on the knowledge obtained from existing data. A

trained classifier represents a model of the dataset under study and will produce a best fit answer when given a new

data object. One such technique, which is deeply rooted in probability theory, is known as a Bayesian Network

(BN). A BN uses a probabilistic logic model in the form of a structured probability network to deduce a

classification with the highest probability. Classification can potentially assist the clustering process by attempting

to classify noisy data points and also perform rapid ad-hoc cluster assignment to new points.

A widely accepted method to deal with uncertain boundaries is known as fuzzy set theory. Fuzzy set theory

was developed to mathematically support a system where objects have multiple set memberships, each of which

represents a degree of belief in the set membership. Due to this inherent ability to belong to multiple sets, fuzzy

values provide a way to deal with non-crisp (fuzzy) knowledge.


1.1 Statement of the Purpose

The purpose of this experiment is to measure the success of a hybrid AI solution which is capable of semi-

supervised learning through the use of both clustering and classification with a focus on fuzzy methods. In the

current experiment, the clustering methods under observation include DBScan and FN-DBScan with the

classification process using a Bayesian Network with hill climbing search and a simple estimator. This experiment

will serve to explore three primary questions: First, is FN-DBScan more effective than the original DBScan?

Second, can additional accuracy be gained by performing an additional classification step to classify the noise

generated by the clustering process? And finally, what performance gain is achievable by using Bayesian

classification on new data, rather than re-running the clustering process?

1.2 Chapter Organization

The subsequent sections are organized as follows. Chapter two introduces background knowledge and

previous research dealing with uncertainty and data clustering. Once the background is established, chapter three describes the experiment and the system architecture used to generate the experimental results. Next, given the generated results, chapter four provides an analysis of how the results address the questions under study. Finally, conclusions and suggestions for further research are given.

1.3 Definition of Terms

Acronym – Definition
DBScan – Density Based Spatial Clustering of Applications with Noise
FN-DBScan – Fuzzy-Neighborhood DBScan
SFN-DBSCAN – Scalable FN-DBScan
F-NRF – Fuzzy Neighborhood Relation Function
BN – Bayesian Network
AI – Artificial Intelligence
GA – Genetic Algorithm
LLN – Law of Large Numbers
CLT – Central Limit Theorem
DAG – Directed Acyclic Graph
JPD – Joint Probability Distribution
CPT – Conditional Probability Table
MF – Membership Function
DIL – Dilation
CON – Concentration
INT – Contrast Intensification
DIM – Diminishing
COG – Center of Gravity
NRFJP – Noise Robust Fuzzy Joint Points
KNN – K-Nearest Neighbor
xFKM – Extended Fuzzy K-Means
CSV – Comma Separated Values


CHAPTER II

REVIEW OF RELATED LITERATURE

Uncertainty is an ever-present issue in data analysis that plagues the process with potential error at every step. This chapter will discuss several techniques which can help mitigate the effects of uncertainty. The first technique is based on probability theory and is presented in the form of Bayesian classification. The next is a branch of set theory known as fuzzy set theory. Finally, a branch of data mining known as density-based clustering will be introduced through a discussion of the DBScan algorithm.

2.1 Uncertainty

There are many instances where absolute data precision is simply not possible [1, 2, 3, 4], which is

particularly true when dealing with real-world models pertaining to time and space which have an infinite level of

detail. The field of uncertainty has two primary branches: probability theory, which is highly objective and based on the establishment of causal relationships through the collection of empirical data; and fuzzy set theory, which is usually considered more subjective in that the relationships modeled are based on the modeler's impression of the variable in question. Two other important uncertainty techniques are rough set theory and evolutionary theory. Rough

set theory, developed by Pawlak [56], is similar to fuzzy theory in that they both represent uncertainty about set

memberships. Rough set theory, however, uses the concept of set boundaries defined through topology operations to

express vagueness of membership rather than the fuzzy method of using membership functions to establish a degree

of belief. An important aspect of the Rough set approach is that it does not require any preliminary meta-data such

as the membership functions required by fuzzy sets which makes it an appealing candidate for unsupervised

solutions. Evolutionary theory represents a field of study which is influenced by biological models for evolution

such as genetics. A popular evolutionary design is known as a Genetic Algorithm (GA) which was predominantly

pioneered by Holland [57] and uses a probabilistic transition scheme through populations. The purpose of a GA is to

effectively and efficiently evolve an optimal solution to the problem under study by first applying a fitness test to

determine the strongest members of a population and then evolving new populations based on these members until a

terminating condition is met.

2.1.1 General Probability Theory


The undeniable need to understand uncertain events has led to a branch of mathematics known as

probability theory. Kolmogorov's Grundbegriffe [5] represents a milestone synthesis establishing the fundamental

foundation for modern probability theory [6]. Two fundamental results emerge from these foundations, the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), both of which apply to large sequences of independent random variables. The LLN states that as the sample size increases the sample average converges to the expected value of the underlying distribution, whereas the CLT states that, for a sufficiently large sample, the sum (or average) of the random variables will be approximately normally distributed.

When modeling a system using probability theory the thematic objects consist of random (stochastic)

variables, random processes, and events. An event represents a set of possible outcomes, while a random variable assigns a value to each outcome. The value taken by a random variable can be modeled using a probability

distribution. In the simplest case, which assumes discrete time, a sequence of random variables represents a random

process known as a time series. The two classes of probability distributions used to model random variables are

discrete and continuous. A discrete probability distribution deals with event distributions that occur in a countable

sample space, such as the probability of drawing a specific card from a 52-card deck. Continuous probability

distributions deal with event distributions that occur in a continuous sample space, such as a real number

measurement.

2.1.2 Bayesian Classification

The famous Bayes’ Theorem is commonly credited to the English mathematician and theologian Thomas

Bayes (1702 to 1761), although there is some evidence which implies otherwise [7]. This theorem is the basis for

much research dealing with situations that lack certainty, including Bayesian networks. Bayes’ theorem is based on

the concept of conditional probabilities (posterior probabilities):

P(B | A) = P(B ^ A) / P(A) Equation 1 – Posterior Probability

This reads: the probability of B given A is equal to the probability of the intersection of B and A (both B and

A are true) divided by the probability of A. In the above formula P(B|A) is referred to as the posterior probability of

B. This idea is the basis for Bayes’ Theorem which is often written as follows:

P(B | A) = [P(A | B) * P(B)] / P(A) Equation 2 - Posterior Probability #2

In its simplest implementation, Naïve Bayes’ (Simple Bayesian) directly applies Bayes’ theorem and will predict

that A belongs to the class having the highest posterior probability, P(B|A), conditioned on A. A key characteristic


of this algorithm is its simplicity, which is achieved by assuming class-conditional independence, that is, by treating all attribute values as independent of one another given the class.
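To make the classification rule concrete, the following Java sketch (an illustration only, with hypothetical names; it is not part of this experiment's Weka-based code) applies Equation 2 under the naïve independence assumption and selects the class with the highest posterior. Because P(A) is the same for every class, it can be dropped when comparing posteriors.

// Illustrative naive Bayes scoring, assuming precomputed priors and
// per-class conditional probabilities for discrete attribute values.
public final class NaiveBayesSketch {

    /** Returns the index of the class with the highest posterior for one instance. */
    static int classify(double[] priors, double[][][] condProb, int[] attributeValues) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < priors.length; c++) {
            // log P(class) + sum of log P(attribute = value | class); the shared
            // denominator P(A) from Equation 2 is ignored when comparing classes.
            double score = Math.log(priors[c]);
            for (int a = 0; a < attributeValues.length; a++) {
                score += Math.log(condProb[c][a][attributeValues[a]]);
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}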

2.1.2.1 Bayesian networks

Bayesian networks (BN’s) were first introduced by Pearl [8]. A key aspect of the design is the capability of

encoding variable dependencies as a causal model. To achieve dependent variables a finite set of random attributes

(variables) are positioned within a Directed Acyclic Graph (DAG) to form a causal topology of nodes which encodes the dependencies in the form of a network. Each node within the topology has a conditional probability table (CPT), which represents the probability of the node's values given a specific combination of direct parent variable values; taken together, these tables define the network's joint probability distribution (JPD). It is an important condition of Bayesian Networks that only the direct

parents need to be considered for the CPT; in this sense, BN’s satisfy the Markov condition, which means, that

given their parents the nodes within a network are conditionally independent of their non-descendants. Because of

this condition the JPD can be factorized to the following equation:

P(X1, X2, …, Xn) = Πi P(Xi | PA(Xi)) Equation 3 – Bayesian Network JPD Factorization

In Equation 3, Π is the product symbol, X1 … Xn are the variables in the DAG, and

PA(Xi) represents the parents of Xi in the network.

2.1.2.2 Network Inference

In a Bayesian Network inference is achieved by maximizing the posterior (conditional) probability of a

class for a given set of attribute values. The simplest method of performing Bayesian inference is known as the

enumeration algorithm [9], which can be greatly enhanced by applying variable elimination [10]. Put concisely, the enumeration algorithm recursively computes a summation of products of the conditional probabilities while ensuring probabilistic normalization [11], and in this way is able to achieve a maximal posterior

probability. Figure 1, adapted from [12], visualizes a simple trained network which can be used to infer the

probabilities of outcomes based on a set of evidence. For example, what is the probability that someone goes to

college (C), Studies (S), Does not party (¬P), Succeeds in their exams (E), and does not have fun (¬F)? The

calculation that follows can be traced by observing the highlighted areas in Figure 1.

P(C, S, ¬P, E, ¬F) = P(C) * P(S|C) * P(¬P|C) * P(E|S Ʌ¬P) * P(¬F|¬P)

= 0.2 * 0.8 * 0.4 * 0.9 * 0.3 = 0.01728


Figure 1: Trained Bayesian Network Visualization

2.1.2.3 Network Training

The goal of training is to learn the Conditional Probability Table (CPT) entries for the network. A popular

technique described by Russell et al. [13] utilizes a gradient-based strategy to perform greedy hill climbing. A similar strategy is also implemented by neural networks to back-propagate the error (gradient). It is important to note, however, that back-propagation is not required in a probabilistic network because the information needed to compute the gradient is available locally by accessing the direct parents. The purpose of the gradient strategy is to converge to a local optimum by updating the weights during each iteration. Learning the CPT entries (training the network) using the technique described by Russell et al. [13] proceeds as follows (a minimal sketch of the initialization and renormalization steps follows the listing):

1. Define or discover the topology

2. Set each CPT entry to a random probability value (between 0.0 and 1.0)

3. For each piece of training data, iterate the topology nodes in a top-down manner

a. Compute the gradient of the Node based on the current training data

b. Update the weights based on the learning rate and the newly discovered gradient. This moves towards a locally optimal solution without backtracking (greedy hill climbing).

c. Renormalize weights to ensure they add up to 1.0
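A minimal Java sketch of the initialization and renormalization steps (2 and 3c) is given below; the names are hypothetical and the gradient computation itself is omitted.

import java.util.Random;

// Random initialization and renormalization of one CPT row, as used in
// steps 2 and 3(c) of the training procedure above.
public final class CptRowSketch {

    /** Returns a randomly initialized probability row that sums to 1.0. */
    static double[] randomRow(int numValues, Random rng) {
        double[] row = new double[numValues];
        for (int i = 0; i < numValues; i++) {
            row[i] = rng.nextDouble();          // step 2: random entries in (0, 1)
        }
        renormalize(row);
        return row;
    }

    /** Step 3(c): rescale the row so its entries add up to 1.0. */
    static void renormalize(double[] row) {
        double sum = 0.0;
        for (double p : row) sum += p;
        for (int i = 0; i < row.length; i++) row[i] /= sum;
    }
}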


2.1.3 Fuzzy Set Theory

The concept of the fuzzy-based methodology has its origins in fuzzy-set theory as proposed by Zadeh [14]

and later refined by Kandel & Byatt [15]. The underlying concept behind Zadeh’s theory is simply that “a fuzzy set

expresses the degree to which an element belongs to a set" [14], where the degree is expressed as a value between 0.0 and 1.0 and is assigned to an object by a Membership Function (MF). Unlike classical sets, to which an object either belongs or does not, "a fuzzy set is a set without a crisp boundary" [16].

2.1.3.1 Fuzzy Membership Functions

A membership function is the mechanism which defines a mapping between an input x value and an output y value. Typical one-dimensional functions are shown in Figure 2 and include the triangular MF, trapezoidal MF, Gaussian MF, generalized bell MF, sigmoidal MF, and left-right MF. It is also possible to combine MFs using logical statements, such as AND/OR, as demonstrated in Figure 3, where the first image represents both functions, the second applies AND, and the third applies OR.

Figure 2: One Dimensional Membership Functions


Figure 3: Composite Membership Functions
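For illustration, the following Java sketch (a minimal example using the standard definitions of these functions, not any particular library) evaluates a triangular MF and a Gaussian MF and combines two membership degrees with the fuzzy AND (min) and OR (max) operators used later in this chapter.

// Simple membership function evaluations; parameters are illustrative.
public final class MembershipFunctions {

    /** Triangular MF with feet at a and c and peak at b. */
    static double triangular(double x, double a, double b, double c) {
        if (x <= a || x >= c) return 0.0;
        return (x <= b) ? (x - a) / (b - a) : (c - x) / (c - b);
    }

    /** Gaussian MF centered at c with width sigma. */
    static double gaussian(double x, double c, double sigma) {
        double d = (x - c) / sigma;
        return Math.exp(-0.5 * d * d);
    }

    /** Fuzzy AND (min) and OR (max) of two membership degrees. */
    static double and(double muA, double muB) { return Math.min(muA, muB); }
    static double or(double muA, double muB)  { return Math.max(muA, muB); }
}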

2.1.3.2 Linguistic Variables

A universe of discourse X is often called a linguistic variable. When this variable is a continuous space,

which is often the case, then it is usually modeled using multiple MFs, each of which defines a linguistic value (aka

linguistic term) for the variable. In a linguistic statement terms can be linked using connectives (and, or, etc.),


negated (not), and/or modified by hedges (very, sort-of, a-little-bit, etc.). Figure 4, adapted from [16], exemplifies the Age variable with three linguistic terms: Young, Middle Aged, and Old. Notice that the shaded areas in the figure represent regions in which both terms hold true; in this way it is possible to be in a state that has a degree of belief

associated with multiple linguistic terms.

Figure 4: The Age Variable

2.1.3.3 Language Modifiers

Several functions exist to generate derivative curves which establish an elegant method to handle language

modifiers (aka hedges). These include Dilation (DIL), Concentration (CON), Contrast Intensification (INT), and

Diminishing (DIM). These modifiers allow modeling more complex linguistic statements. For example, assume we

would like to extend the linguistic variable Age discussed above with terms modified by the hedges very, sort-of,

and definitely. In Figure 5 shown below, very young and very old have been modified with CON, sort-of young and

sort-of old are derived with DIL, and definitely middle aged is a result of applying INT which creates a crisper

curve. The modifier DIM, which is not represented below, is simply the inverse of INT.

Figure 5: Deriving hedge terms


2.1.3.4 Fuzzy Inference

The simple inference engine portrayed in Figure 6 takes a single input value X (could be crisp or fuzzy) and

outputs a fuzzy valued result Y with the option of defuzzification. Fuzzy values are used as inputs into the actual

inference engine. This engine contains a set of rules, sometimes referred to as a rule-base, and an aggregator which

composes the final result based on the combined outcome of the rule-base.

Figure 6: Simple Inference Engine

2.1.3.5 Fuzzification

The first step of inference is to find the fuzzy value for the input X. This is done by using the term MFs to look up a degree of belief (or membership) based on the input variable. In Figure 7 a fuzzification process for two variables is shown. Given the linguistic variable Age with an input parameter of a = 15, a table of belief values for the set of terms will be generated. Similarly, the Experience variable is given the parameter e = 38, which produces a

similar table. It is this table of belief values that establishes the foundation for fuzzy values.



Figure 7: Fuzzification – Crisp Value to Fuzzy Value

2.1.3.6 The Rule-base

The next step in the inference engine involves iterating the rules in the rule set and calculating the

respective output belief values. The following rule listing summarizes this process on a rule-base containing three

simple rules.

IF (Age = very young) OR (Experience = low) THEN Success Prob = Poor
Equation: µA˅B(x) = max[µA(x), µB(x)]
µsuccess=poor = max[0.25, 0.15] = 0.25

IF (Age = young) AND (Experience = moderate) THEN Success Prob = Good
Equation: µA˄B(x) = min[µA(x), µB(x)]
µsuccess=good = min[0.75, 0.35] = 0.35

IF (Age = middle aged) THEN Success Prob = Excellent
Equation: µA(x) = µA(x)
µsuccess=excellent = 0.1

The first rule performs an or-operation between the very young entry of the Age fuzzy value and the low entry of the

experience fuzzy value. The outcome of the first rule represents the degree to which the inference engine believes

this rule's implication, which in this case is the probability of success being poor. After the belief values have been

calculated for each rule, the output is passed on to the aggregator.
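The following Java sketch (illustrative only; the variable names are hypothetical and the degrees of belief are taken from the rule listing above) shows how the three rules would be evaluated with the min/max operators.

// Evaluating the three example rules with the fuzzified belief values.
public final class RuleBaseSketch {
    public static void main(String[] args) {
        // Fuzzified inputs (illustrative values from the running example).
        double ageVeryYoung = 0.25, ageYoung = 0.75, ageMiddleAged = 0.1;
        double expLow = 0.15, expModerate = 0.35;

        // Rule 1: IF Age = very young OR Experience = low THEN Success = Poor.
        double successPoor = Math.max(ageVeryYoung, expLow);        // 0.25
        // Rule 2: IF Age = young AND Experience = moderate THEN Success = Good.
        double successGood = Math.min(ageYoung, expModerate);       // 0.35
        // Rule 3: IF Age = middle aged THEN Success = Excellent.
        double successExcellent = ageMiddleAged;                    // 0.1

        System.out.printf("poor=%.2f good=%.2f excellent=%.2f%n",
                successPoor, successGood, successExcellent);
    }
}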

2.1.3.7 The Aggregator

The aggregator varies between implementations but will typically average the rule outcomes of each

linguistic term to establish a new fuzzy value. Building upon the outcomes of the rule listing described above this

process would produce a fuzzy value that contains the following:

µsuccess=poor = 0.25 µsuccess=good = 0.35 µsuccess=excellent = 0.1

2.1.3.8 Defuzzification

The final step in the process is known as defuzzification. A common technique used to perform this task is

the centroid method, which calculates the center of gravity of the area under a curve. Figure 8 highlights the area based on the

fuzzy value which was determined by the aggregator in the previous section.

Figure 8: Probability of Success

Calculation of the centroid, or center of gravity (COG), is done using the following formula:


COG = [ Σx=a..n x * µA(x) ] / [ Σx=a..n µA(x) ] Equation 4 – Defuzzification Center of Gravity Calculation

The COG formula given in Equation 4 steps through the range of values on the x-axis using some increment; using Figure 8 as an example, the range would be from 0 to 100 with an increment of 10. First, for each step in the range a weighted value is created by multiplying the x-axis value by the membership value (y-axis); the weighted values are then summed to create a weighted sum. Next, a participation weight for each membership value is created by multiplying the membership value by the number of steps at which it applies; these weights are then summed to create a participation sum. Finally, the weighted sum is divided by the participation sum. Applying this formula to the running example, with a curve sample rate of 10, yields the following calculation:

COG = { [ (0 + 10 + 20) * 0.25 ] + [ (30 + 40 + 50 + 60) * 0.35 ] + [ (70 + 80 + 90 + 100) * 0.1 ] } / [ (3 * 0.25) + (4 * 0.35) + (4 * 0.1) ]
= 104.5 / 2.55 = 40.98

The value of 40.98 represents the defuzzified Probability of Success given the input values (a = 15 and e = 38) and the rule set described previously.
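A small Java sketch of this centroid calculation is shown below (an illustration only; the sampled x values and memberships mirror the worked example rather than any particular implementation).

// Center-of-gravity defuzzification over a sampled output range.
public final class CentroidSketch {

    /** memberships[i] is the aggregated membership of the i-th sampled x value. */
    static double centerOfGravity(double[] xSamples, double[] memberships) {
        double weightedSum = 0.0, membershipSum = 0.0;
        for (int i = 0; i < xSamples.length; i++) {
            weightedSum += xSamples[i] * memberships[i];
            membershipSum += memberships[i];
        }
        return weightedSum / membershipSum;
    }

    public static void main(String[] args) {
        // The running example: 0..100 sampled every 10 units with the
        // aggregated memberships 0.25 (0-20), 0.35 (30-60) and 0.1 (70-100).
        double[] x  = { 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
        double[] mu = { 0.25, 0.25, 0.25, 0.35, 0.35, 0.35, 0.35, 0.1, 0.1, 0.1, 0.1 };
        System.out.printf("COG = %.2f%n", centerOfGravity(x, mu)); // about 40.98
    }
}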

2.2 Data Clustering

The data available in most disciplines is mostly unlabeled and unclassified [17]. The purpose of clustering

in data analysis is to partition a dataset into homogenous groupings known as clusters [18], and these groupings

represent naturally occurring data classifications. In data mining clustering is a progressive process that attempts to

optimally minimize inter-cluster similarity while maximizing intra-cluster similarity [19]. This process has its roots

in statistics and originates in a paper by Fisher [20], in which it is stated that “it is often important to know how a

population may be decomposed into sub-groups that contrast sharply with each other, individuals of the same group

being fairly alike.”

2.2.1 Types of Clustering

Two key classifications which often help in distinguishing between types of clustering algorithms are based

on the results produced and the algorithms used [21]. The first type of classification, discussed by Kaufman &

Rousseeuw [22], organizes clustering methods by the results produced and includes hierarchical as well as


partitioning-based methods. Hierarchical strategies split the data into dendrogram trees which can then be iterated

using either a top-down or bottom-up approach. Partitioning algorithms work by splitting the data into a set of k

clusters. The second classification organizes the methods by the style of the algorithm and includes optimization-

based, distance-based, and density-based. Of these the most popular algorithms are density and distance-based [23].

Distance-based methods are commonly used with partitioning algorithms and are limited to discovering spherically

shaped clusters. Density-based methods are focused on “growing” the clusters based on thresholds set in the input

parameters. Because of this dynamic growth, density-based clustering is capable of determining arbitrarily shaped

clusters. Early attempts at density clustering [24] suffered from prohibitive space and run-time complexity but

research continues to mitigate these shortcomings.

2.2.2 DBScan

The need for effective and efficient clustering has well established roots in the complexities of spatial

database systems [25, 26, 27, 28, 29], and more recently has been applied to the study of applications involving

imprecise data [1, 2, 3, 4]. Density-based algorithms are often preferred where imprecise data exists, and therefore

fuzzy distances naturally occur [30]. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a

density-based clustering algorithm capable of forming arbitrarily shaped clusters as well as detecting noise [31], which has proven popular through its many variants [32, 33] designed to mitigate various limitations. The fundamental idea behind DBScan is the core object, which is discovered by analyzing the E-Neighborhood and forms the backbone of the cluster's shape.

The E-Neighborhood is the set of objects within the radius (ε) of the object in question. The neighborhood of object X is shown as the dotted surrounding circle in Figure 9. All y objects, with the exception of y6, are within the radius ε = 0.5 and are therefore considered to be part of object X's neighborhood. To discover the set of neighbors

the simplest form of DBScan iterates the dataset and measures the distance between the object in question (object x)

and the object currently being iterated (object y). These distances are the numbers beside the dotted line connecting

the two objects in the figure. A common distance function is the Euclidean distance, given in the equation below. When applying this function to data objects it is important to note that it sums the squared differences between corresponding attributes, where (x1i − x2i) is the difference in attribute i between the two objects.


dist(x1, x2) = √( Σi=1..n (x1i − x2i)² ) Equation 5 – Euclidean Distance

It is often desirable to normalize the attribute values into a standard range such as [0, 1]. A common method to

achieve normalization is known as minmax and is given in Equation 6 shown below. It is important to realize that

minmax requires the maximum and minimum values for the attribute to which v (the value) belongs.

v' = (v − minA) / (maxA − minA) Equation 6 – MinMax Normalization

Figure 9: E-Neighborhood
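The following Java sketch illustrates Equations 5 and 6 together (an illustration only; it assumes the attribute vectors are numeric and that the per-attribute minimum and maximum values are already known).

// Normalized Euclidean distance between two attribute vectors (Equations 5 and 6).
public final class DistanceSketch {

    /** Min-max normalization of a value into [0, 1] given the attribute's range. */
    static double minMax(double v, double min, double max) {
        return (v - min) / (max - min);
    }

    /** Euclidean distance between two objects after per-attribute normalization. */
    static double distance(double[] x1, double[] x2, double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < x1.length; i++) {
            double d = minMax(x1[i], min[i], max[i]) - minMax(x2[i], min[i], max[i]);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}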

An object becomes a Core Object if the E-Neighborhood contains a number of objects equal to at least

MinPts. If an object is within the E-Neighborhood of a core object then it is directly density reachable from that core

object. Furthermore, two objects are indirectly density reachable if there is a path of core objects between them. A

walk of paths through the connectivity network is known as density connectivity.

In Figure 10, MinPts = 3 and the E-Radius creates the spheres encompassing the neighborhoods of the various points. The core points in this example are C, D, and E, each of which contains 3 points. Furthermore, point B is

directly density reachable from point C because B is within the neighborhood of point C which is a core point; point

F is directly density reachable from point E for similar reasons. B is indirectly density reachable from point E


because there is a path of core point neighborhoods between them; E is not indirectly density reachable from B because B is not a core point; F and C have a similar relationship. Points B, C, D, E, and F are all considered to be density

connected and would represent a discovered cluster in DBScan. The points A and G are outliers and would be

flagged as noise.

Figure 10: Density Connectivity
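To illustrate the core-object test described above, the sketch below (a simplified illustration, not the Weka DBScan implementation used in this experiment) computes the ε-neighborhood of a point and decides whether it is a core object.

import java.util.ArrayList;
import java.util.List;

// Epsilon-neighborhood query and core-object test at the heart of DBScan.
public final class CoreObjectSketch {

    /** Indices of all objects within radius eps of object x (x itself included). */
    static List<Integer> epsNeighborhood(double[][] data, int x, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int y = 0; y < data.length; y++) {
            if (euclidean(data[x], data[y]) <= eps) {
                neighbors.add(y);
            }
        }
        return neighbors;
    }

    /** An object is a core object if its neighborhood contains at least minPts objects. */
    static boolean isCoreObject(double[][] data, int x, double eps, int minPts) {
        return epsNeighborhood(data, x, eps).size() >= minPts;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}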

2.2.3 DBScan Variants

DBScan is an elegant algorithm with great potential, but it is not without its limitations. The four major

difficulties plaguing DBScan implementations in current research include computation complexity, memory

consumption, datasets with varying densities, and parameter selection. The computational complexity of DBScan can degrade to O(n^2) in the worst case [34], although its typical complexity is O(n log n). Various studies

have addressed this issue in certain circumstances and demonstrated performance ratings of linear [35,36] and near-

linear [37]. As dataset size increases the memory resources required by DBScan will grow proportionally due to its

algorithmic requirement to load the entire dataset into memory. Achieving a more scalable algorithm is addressed in

various branches of research [3, 38, 39]. The issue of varying densities has also been acknowledged in literature as a

critical limitation of the DBScan algorithm. The algorithm OPTICS [40], for example, is a cluster analysis method

concerned with computing an augmented cluster ordering. This augmented data can be used to enable varying

densities. Additionally, this ordering is conducive to human analysis and/or automated parameter selection. Uncu et al. [41] further extend the notion of OPTICS by developing an algorithm that focuses on density parameter values

for regions rather than across all data vectors.

2.2.4 Fuzzy Clustering

It has been said that the application of fuzzy set theory to the clustering problem can increase the adequacy of the results obtained [42]. There are currently two main branches of fuzzy clustering in research: the first defines


fuzzy boundaries between object classes, and the second regards the objects themselves as fuzzy representations of

their true values. This paper will focus on the former of these.

2.2.4.1 Fuzzy Neighborhood Relations

Traditional fuzzy clustering [43] evolved from a need for non-crisp, or fuzzy, boundaries between clusters

of objects. Traditionally each object in a dataset has a degree of membership for a set of clusters, where the

membership value is calculated by the specified MF. This is the premise behind the Fuzzy Neighborhood Relation

Function (F-NRF) [33] method which combines NRFJP (Noise robust fuzzy joint points) [44, 45] with DBScan [31]

to create FN-DBSCAN. This algorithm is further enhanced with scalability in [17] when Parker, Hall, and Kandel

combine FN-DBSCAN with SDBDC [3, 23] to produce an algorithm known as SFN-DBSCAN. In F-NRF the

fuzzy membership is established by measuring the sum of all comparisons between an object and its neighborhood

objects, where the comparison is defined by the neighborhood relation function.

2.2.4.2 Neighborhood Relation Function

A neighborhood relation function is used to calculate what has been referred to as the neighborhood

cardinality [17] which represents the cumulative density of the objects (y) surrounding the object (x) in question.

The cardinality calculation is shown below in Equation 7, which produces a summation of the exponential neighborhood relation function shown in Equation 8.

Card(x) = Σy∈Nε(x) Nx(y) Equation 7 – Cardinality Calculation

Nx(y) = exp( −( K * dist(x, y) / dmax )² ) Equation 8 – Exponential Neighborhood Relation Function

In Equation 8 dmax is the maximum distance possible between two data objects and dist(x, y) represents the actual

distance between the two objects in question. The parameter K, such that K>0, provides additional sensitivity, and in

the case of Equation 8 has an effect on the radius [33]. Although implementations may vary, it is common to use a distance of 1 if two nominal attributes differ and 0 if they are the same. Furthermore, if an attribute is numeric it is often normalized using MinMax normalization.
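A minimal Java sketch of this cardinality test is shown below (illustrative only; it assumes a precomputed distance matrix and mirrors Equations 7 and 8 rather than any particular FN-DBScan implementation). A point is treated as a core object in FN-DBScan when its cardinality reaches the MinCard threshold.

// Fuzzy neighborhood cardinality (Equations 7 and 8): a point is a core object
// in FN-DBScan when Card(x) reaches the MinCard threshold.
public final class FuzzyCardinalitySketch {

    /** Exponential neighborhood relation function Nx(y). */
    static double exponentialNrf(double dist, double dMax, double k) {
        double r = (k * dist) / dMax;
        return Math.exp(-(r * r));
    }

    /** Sum of Nx(y) over every object y within radius eps of x; x itself contributes 1. */
    static double cardinality(double[][] dist, int x, double eps, double dMax, double k) {
        double card = 0.0;
        for (int y = 0; y < dist.length; y++) {
            if (dist[x][y] <= eps) {
                card += exponentialNrf(dist[x][y], dMax, k);
            }
        }
        return card;
    }

    static boolean isCoreObject(double[][] dist, int x, double eps,
                                double dMax, double k, double minCard) {
        return cardinality(dist, x, eps, dMax, k) >= minCard;
    }
}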


2.2.4.3 Fuzzy Object Representations

Another branch of the fuzzy clustering paradigm, referred to as Fuzzy Object Representations [46], views

the objects themselves as fuzzy rather than the boundaries separating the different classifications. This paradigm

embraces the imprecise nature of measurement data and allows modeling uncertainty in terms of knowledge about

the actual objects; this is achieved by using fuzzy distance functions to measure the distance between fuzzy objects.

Furthermore, the solution established in [46], is able to deal with Multi-Represented Objects as described in [47].

Multi-represented objects are objects in which queries need to be made at multiple resolutions, which can be thought

of as fuzzy queries; an example of this would be a spatial map which requires queries to be made at various

resolutions of detail.

2.2.4.4 Other fuzzy clustering methods

Fuzzy clustering first emerged in 1973 when Dunn [48] defined a fuzzy generalization for the min-variance

partitioning problem, which is better known as fuzzy k-means and was further refined by Bezdek [49]. In 1985 Keller, Gray, and Givens [50] extended Cover and Hart's K-Nearest Neighbor (KNN) algorithm [51] with fuzzy set

theory. A significant limitation of early clustering methods is the poor ability to deal with non-numeric data types.

Recent research by Jiacai and Ruijun [52] describes an algorithm referred to as extended fuzzy k-means (xFKM)

which incorporates techniques used for clustering categorical data. Because of this a key advantage of xFKM is that

it does not require converting nominal attributes into binary attributes which it achieves by using the categorical

dissimilarity measure.

2.3 Applications for uncertainty

The effect of uncertainty can be found in any application requiring the measurement of real world

phenomena or data entry and/or collection conducted by human agents. Global Positioning Systems (GPS) provide position tracking of moving objects, but due to realities in timing and physical measurement the readings will always contain a

certain degree of uncertainty [53], which also extends to any measurements pertaining to moving objects [1, 2, 3, 4].

Another important aspect of uncertainty comes from human error and deception. Human error is commonly

associated with data entry related tasks with which a certain probability of error is always present. Deception

pertains to the possibility that data entered may have been falsified during the collection process.


To help accommodate uncertainty in data mining, tools need to be developed that allow crisp datasets to be interrelated using fuzzy concepts. In the following sections the Fuzzy-Neighborhood DBScan algorithm will be implemented for experimentation and its performance compared with that of the original DBScan.

2.4 Summary

This chapter introduced research surrounding several important techniques addressing uncertainty issues in

data analysis. The chapter started with an introduction to general probability theory which then worked into the

areas of Bayesian Networks and fuzzy set theory. Additionally, data clustering, with a focus on the density-based DBScan and FN-DBScan, was discussed, followed by a section describing various areas in which uncertainty raises

issues. The following chapter proposes a system architecture that will harmonize the use of Bayesian classification,

fuzzy set theory, and the density-based DBScan clustering algorithm.


CHAPTER III

METHODOLOGY AND DESIGN

The architecture implemented for this experiment is a Weka-based hybrid model which uses both clustering

and classification to establish and maintain a clustered dataset. The clustering strategies implemented are DBScan

and FN-DBScan, and the classification strategy currently uses a Bayesian Network to classify noise from the

clustering process. To verify the experimental results the architecture was designed to gather various measurements

pertaining to the accuracy and efficiency of the algorithms used. In this design the Weka API is made directly accessible and extensible by adding it as an external reference library to the Eclipse workspace. To make Weka's original DBScan easier to extend, a new version, DBScanV2, was implemented as a slightly modified variant.

3.1 The Experiment Tools

This experiment utilizes the Weka API in a Java simulation developed with the Eclipse IDE and was run on an Intel i7 quad-core machine with 12 GB of memory. For data analysis of the parameter permutations the heap size allocated for the Weka Explorer was also increased to 8192 MB. Further analysis was performed using 32-bit MS Excel, which required ensuring the permutation result dataset contained fewer than 1,048,576 entries.

3.2 Measurements & Parameters

The product of executing the broad exploration portion of the code is an output CSV file containing various

measures for each permutation explored. For each entry there will be a set of permutation parameters, accuracy

measures, and timing measures. The permutation parameters are used to initialize the hybrid algorithm, which establishes a test case from which the measurements are taken. The measurements, for each test case, relate to the

performance of the clustering algorithm (DBScan/FN-DBScan) and the effectiveness of the Bayesian Network.

3.2.1 Permutation Parameters

Permutation parameters represent the input parameters used for the result record's test case, and will be

further discussed in the permutation engine section. Table 1 shown below describes the various parameters which

are used in this experiment.


Table 1: Permutation Parameters

Column – Algorithm – Description
MinPts – DBScan – Determines the number of required points within a given neighborhood to signify a core point.
MinCard – FN-DBScan – Determines the minimum cardinality within a given neighborhood to signify a core point. The concept of cardinality represents a measure that encompasses both density and magnitude.
K – FN-DBScan – A constant value used with the neighborhood relation functions.
e – DBScan / FN-DBScan – The radius measure used to determine which objects fall within an object's neighborhood.
Alpha – Bayes' – Used in the calculation of conditional probabilities.
P – Bayes' – The maximum number of parents used in the search algorithm.
N – Bayes' – If true (1) then the initial structure is set to empty instead of naïve Bayes.
MBC – Bayes' – If true (1) then a Markov Blanket correction is applied to the learned network structure.
R – Bayes' – If true (1) then the arc reversal operation is used.
S – Bayes' – The score type: 0 = Bayes, 1 = MDL, 2 = Entropy, 3 = AIC, 4 = Cross Classic, 5 = Cross Bayes.
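For context, the sketch below shows how a Bayesian Network with hill-climbing search and a simple estimator can be configured through the Weka API (a hedged illustration: the class names are standard Weka 3 classes, but setter names and defaults may differ between Weka versions, and this is not the experiment's actual wrapper code; the file name is hypothetical). The Alpha, P, N, R, and S parameters in Table 1 correspond to options of SimpleEstimator and HillClimber.

import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.SimpleEstimator;
import weka.classifiers.bayes.net.search.local.HillClimber;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of training a BayesNet classifier with hill-climbing search and a
// simple estimator, roughly matching the permutation parameters in Table 1.
public final class BayesNetSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");  // hypothetical file path
        data.setClassIndex(data.numAttributes() - 1);   // class = cluster label in the hybrid

        BayesNet bayes = new BayesNet();
        bayes.setSearchAlgorithm(new HillClimber());    // structure search (P, N, R in Table 1)
        bayes.setEstimator(new SimpleEstimator());      // CPT estimation (Alpha in Table 1)
        bayes.buildClassifier(data);

        double predicted = bayes.classifyInstance(data.instance(0));
        System.out.println("Predicted class index: " + predicted);
    }
}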

3.2.2 Accuracy Measures

The primary measure for the accuracy of a permutation is based on the percentage of errors in the testing

set after noise classification has occurred (PostNC), but observing the error in the training set is also useful. For both

sets (testing and training) there is a Pre Noise Classification (PreNC) and Post Noise Classification (PostNC) error

value. The noise classification process pertains to using the Bayesian classifier to classify the noise generated by the

clustering process. In Table 2, shown below, the set of various accuracy measures are summarized.

Table 2: Accuracy Measures

Column – Description
ClusterCount – The average number of clusters produced during the clustering process.
ClusterNoisePER – The average percentage of noise produced during the clustering process.
IsBayesRun – The average number of times the Bayesian classifier was run; it is not run if the noise level is equal to or greater than 80%. A value of 1.0 indicates Bayes is always run and 0.0 indicates it is never run.
ErrTestPreNCPER – The percentage of incorrectly classified objects in the testing set during the Pre-Noise Classification (PreNC) step.
ErrTestPostNCPER – The percentage of incorrectly classified objects in the testing set during the Post-Noise Classification (PostNC) step.
ErrTrainPreNCPER – The percentage of incorrectly classified objects in the training set during the Pre-Noise Classification (PreNC) step.
ErrTrainPostNCPER – The percentage of incorrectly classified objects in the training set during the Post-Noise Classification (PostNC) step.


The measure of error for the testing set is the percentage of unsuccessfully classified test objects using the input Bayesian classifier, which is either the PreNC classifier or the PostNC classifier. Successful classification is determined by comparing the test object's actual class with the maximal class found within its assigned cluster. Unlike the testing set, measuring error in the training set does not directly use the classifier, and the PreNC training error is unaffected by classification, which makes it a direct measure of only the clustering process. As with test objects, the rate of error for training objects is measured by comparing the object's actual class with the maximal class of its cluster number; training objects differ, however, in that any object flagged as noise by the clusterer is automatically counted as an error.
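To make the maximal-class comparison concrete, the following Java sketch shows one way this error measure could be computed. It is an illustration only: the class and method names are hypothetical, and it simply assumes the actual class is available on each Weka instance while the assigned cluster numbers are supplied separately.

import java.util.HashMap;
import java.util.Map;
import weka.core.Instances;

/** Illustrative sketch of the maximal-class error measure: each cluster is represented
 *  by its most frequent actual class, and an object counts as an error when its actual
 *  class differs from its cluster's maximal class. */
public class MaximalClassErrorSketch {

    /** clusterOf[i] holds the cluster number assigned to instance i. */
    public static double errorPercentage(Instances data, int[] clusterOf) {
        // Count actual class occurrences per cluster: cluster -> (class value -> count).
        Map<Integer, Map<Double, Integer>> counts = new HashMap<>();
        for (int i = 0; i < data.numInstances(); i++) {
            counts.computeIfAbsent(clusterOf[i], c -> new HashMap<>())
                  .merge(data.instance(i).classValue(), 1, Integer::sum);
        }
        // Determine the maximal (most frequent) class for each cluster.
        Map<Integer, Double> maximalClass = new HashMap<>();
        counts.forEach((cluster, classCounts) -> maximalClass.put(cluster,
                classCounts.entrySet().stream()
                           .max(Map.Entry.comparingByValue()).get().getKey()));
        // An object is an error if its actual class differs from its cluster's maximal class.
        int errors = 0;
        for (int i = 0; i < data.numInstances(); i++) {
            if (data.instance(i).classValue() != maximalClass.get(clusterOf[i])) errors++;
        }
        return 100.0 * errors / data.numInstances();
    }
}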

3.2.3 Timing Measures

Time is a critical factor in the performance of real world applications. Table 3 below describes the various

values gathered by this experiment pertaining to time.

Table 3: Timing Measures

Column | Description
TimeEntirePermutation | The time taken to run through the entire K-Fold process for the given permutation. 10-Fold cross validation, for example, would run the process once for each fold, accumulating values to produce averages.
TimeDBScanALL | The time taken to perform the entire clustering process, including the extra steps required to make the cluster number readily available for the classification phase.
TimeDBScanBuild | The time taken to perform the actual clustering process when using the specified DBScan variant.
TimeBayesAll | The time taken to perform the entire Bayesian process.
TimeBayesTrainPreNC | The time taken to remove the noise and train the initial classifier.
TimeBayesTrainPostNC | The time taken to classify the noise and retrain the Bayesian classifier.
TimeBayesClassify | The average time taken to classify an object during PreNC and PostNC classification.

3.2.4 Noise and Noise Classification Measures

A measure of noise, ClusterNoisePER, is given as a percentage which represents the amount of singleton

(unclustered) data objects. Furthermore, both ErrTrainPostNC and ErrTestPostNC represent accuracy measures

after Bayesian noise classification has occurred. These values are used to help determine if there is any added

benefit in adding a noise classification phase after clustering occurs.

3.3 The System Design Overview

This system was designed to execute a set of test cases, also referred to as parameter permutations, in such a way that a set of empirical results is generated for further analysis. The design is composed of three core packages

shown in Figure 11: the permutation engine, the data engine, and the AI engine. The purpose of the permutation


engine is to encapsulate the mechanisms used to iterate through a set of possible parameter permutations while

gathering test case measurements to be used for further analysis. Next, the data engine provides a consolidation of

common tasks associated with dataset and data object management. And finally, the AI engine provides wrapper

classes for DBScan/FN-DBScan and Bayesian Networks which implement the hybrid logic and measurement

tracking required to conduct the experiment. In Figure 11, the AppMain and App classes provide the logic pertaining to this specific experiment. Each of these packages will be discussed further in subsequent sections.

Figure 11: System Overview

A key feature of the design is the ability to repeat any number of K-Fold cross validation iterations to

generate an average for each parameter permutation it attempts. For the test cases used in this experiment the

permutations were explored in two ways. The first method provided a broad exploration by iterating the permutation

space defined by an input CSV file (see Table 4 in section 3.4.1.1) and the second method allowed the specific

permutations to be set and attempted in rapid succession to obtain a more accurate average. By conducting the broad exploration first, potential parameter combinations yielding desirable outcomes can be isolated for further exploration. Once selected, each candidate becomes the focus of the narrow but exhaustive second method to establish an average set of representative measurements.

There are three elements of flow in the application shown in Figure 12: altering the training set, building

the clusters and classifiers, and performing tests using the testing set. Training set alterations occur at different

stages to prepare the data for the next step. These alterations include preparing a classless dataset for clustering,

updating the classless dataset with a cluster number, and removing noise for classifier training. Next, building the

clusters and classifiers is the central task of this experiment and starts with using the training data to establish

naturally occurring classifications of objects using DBScan or FN-DBScan. Following this, all noisy objects are


removed from the dataset and the initial Bayesian Network is trained using the Cluster Number as the class. By

applying the initial Bayesian classifier to the noise additional classifications are discovered, and a final Bayesian

network is trained. Finally, accuracy is measured by comparing the maximal class name found in a cluster number

with the expected class value for a data object; this method is applied to both the training data and the testing data.

The testing data, however, gives a more realistic measurement of how well the trained classifier will perform when

given a set of unknown data.

Figure 12: System Data Flow
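The noise-classification portion of this flow is sketched below in Java against the Weka 3.6 API. It is a minimal illustration rather than the thesis code: it assumes the training set already carries a nominal ClusterNumber class attribute in which unclustered objects are labelled "noise" (the label name is an assumption), and it omits discretization, search/estimator configuration (see Section 3.4.3.3), and measurement tracking.

import weka.classifiers.bayes.BayesNet;
import weka.core.Instance;
import weka.core.Instances;

/** Minimal sketch of the PreNC/PostNC flow: train on the noiseless data, use that
 *  classifier to assign noise objects to clusters, then retrain on the full set. */
public class NoiseClassificationSketch {

    public static BayesNet trainWithNoiseClassification(Instances train) throws Exception {
        int noiseIndex = train.classAttribute().indexOfValue("noise");

        // Split the training data into clustered (noiseless) and noise objects.
        Instances noiseless = new Instances(train, train.numInstances());
        Instances noise = new Instances(train, train.numInstances());
        for (int i = 0; i < train.numInstances(); i++) {
            Instance inst = train.instance(i);
            if ((int) inst.classValue() == noiseIndex) noise.add(inst);
            else noiseless.add(inst);
        }

        // PreNC: train the initial Bayesian classifier on the noiseless data only.
        BayesNet preNC = new BayesNet();
        preNC.buildClassifier(noiseless);

        // Assign each noise object to a best-fit cluster using the PreNC classifier.
        for (int i = 0; i < noise.numInstances(); i++) {
            Instance inst = noise.instance(i);
            inst.setClassValue(preNC.classifyInstance(inst));
            noiseless.add(inst);
        }

        // PostNC: retrain the Bayesian classifier on the full, re-labelled training set.
        BayesNet postNC = new BayesNet();
        postNC.buildClassifier(noiseless);
        return postNC;
    }
}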

3.4 The System Design Components

3.4.1 The Parameter Permutation Engine

Input parameters are a critical component of many clustering and classification solutions and it is often

difficult to quickly establish which parameters will generate an optimal solution. To help with this process a

parameter permutation engine was designed which is capable of iterating through all possible permutations

following the rules defined in an input CSV file (see Table 4). The rules simply describe the algorithm and its

parameters in terms of testable ranges and the size of each iterative step. The permutation engine is also responsible

for storing the various measures gathered during each iteration. Each measure is stored in a PermResultObject which

32  

Page 33: ATHABASCA UNIVERSITY Modeling Uncertainty in datasets with ...dtpr.lib.athabascau.ca/action/download.php?... · The measurements for the solutions effectiveness serve to answer the

maintains the average and min/max aggregates until the object is cleared. Figure 13, shown below, describes the

general structure of this package.

Figure 13: Permutation Engine
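The following small Java sketch illustrates the kind of aggregate a PermResultObject maintains; the class and method names here are illustrative assumptions, not the actual implementation.

/** Hypothetical accumulator in the spirit of PermResultObject: it keeps a running
 *  average, minimum, and maximum for one measure until it is cleared. */
public class MeasureAggregate {
    private double sum = 0.0;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;

    /** Record one observation of the measure (e.g., one fold's error percentage). */
    public void add(double value) {
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
    }

    public double average() { return count == 0 ? Double.NaN : sum / count; }
    public double minimum() { return min; }
    public double maximum() { return max; }

    /** Reset the aggregate before the next parameter permutation is attempted. */
    public void clear() {
        sum = 0.0;
        min = Double.POSITIVE_INFINITY;
        max = Double.NEGATIVE_INFINITY;
        count = 0;
    }
}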

To understand the time complexity of this engine, it is important to calculate the number of possible values

for each input parameter and then multiply the totals together. So, if there were n variables (V1…Vn), with the cardinality |Vi| representing the number of possible values for variable Vi, the number of permutations P could be calculated as follows:

P = |V1| * |V2| * … * |Vn|     (Equation 9: Permutation Calculation)

So, if there were 25,000 permutations, each of which took 100ms to execute, the engine would require 2500

seconds, or about 42 minutes to run. In terms of both time complexity and storage this is far from perfect and future

work may focus on applying evolutionary algorithms to guide test cases.
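As an illustration of this exhaustive iteration, the Java sketch below walks every combination of parameter values defined by (min, max, step) rules such as those in Table 4. The ParameterRule class and the callback interface are illustrative assumptions, not classes of the actual permutation engine.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Minimal sketch of exhaustive permutation iteration over parameter ranges. */
public class PermutationSketch {

    /** One rule row: a parameter explored from min to max in increments of step. */
    public static class ParameterRule {
        final String name;
        final double min, max, step;

        ParameterRule(String name, double min, double max, double step) {
            this.name = name;
            this.min = min;
            this.max = max;
            this.step = step;
        }

        List<Double> values() {
            List<Double> vals = new ArrayList<>();
            for (double v = min; v <= max + 1e-9; v += step) vals.add(v);
            return vals;
        }
    }

    /** Visit every combination of parameter values; |V1| * |V2| * ... * |Vn| in total. */
    public static void forEachPermutation(List<ParameterRule> rules, Consumer<double[]> testCase) {
        recurse(rules, 0, new double[rules.size()], testCase);
    }

    private static void recurse(List<ParameterRule> rules, int depth, double[] current,
                                Consumer<double[]> testCase) {
        if (depth == rules.size()) {
            testCase.accept(current.clone()); // one complete permutation assembled
            return;
        }
        for (double v : rules.get(depth).values()) {
            current[depth] = v;
            recurse(rules, depth + 1, current, testCase);
        }
    }
}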

3.4.1.1 Parameter permutations

It is often very difficult to select the input parameters required by various data mining algorithms. In order

to provide a thorough parameter result set for analysis, the testing methodology iterates through many input

parameter permutations for the data set under study. For each of these input parameter permutations 10-fold cross

validation is used to derive a set of averages to be recorded. The input parameter permutations for the test cases

discussed shortly were generated from a user defined CSV file containing the following:

Table 4: Parameter Permutation Input Rules

Algorithm | Variant | Parameter | Value Range | Step Size | Step Count
DBScan | Original | MinPts | 2-8 | 1 | 9
DBScan | Original | e | 0.05-1 | 0.05 | 20
DBScan | FN-DBScan | MinCard | 5-50 | 5 | 10
DBScan | FN-DBScan | e | 0.05-1 | 0.05 | 20
DBScan | FN-DBScan | K | 5-20 | 5 | 4
Bayesian Net | HC/SE | Alpha | 0.1-1 | 0.1 | 10
Bayesian Net | HC/SE | P | 1-2 | 1 | 2
Bayesian Net | HC/SE * | N | 0-1 | 1 | 2
Bayesian Net | HC/SE * | MBC | 0-1 | 1 | 2
Bayesian Net | HC/SE * | R | 0-1 | 1 | 2
Bayesian Net | HC/SE ** | S | 0-5 | 1 | 6

* Parameters with a 0-1 value range are booleans.
** S {0=Bayes, 1=MDL, 2=Entropy, 3=AIC, 4=Cross_Classic, 5=Cross_Bayes}


In Table 4 the Algorithm and Variant columns uniquely identify the AI algorithm which utilizes the input

parameters. The value range to test for each algorithm variant's parameter is specified by the Parameter, Value Range, and Step Size columns. Step Size represents the increment with which to iterate the parameter's value range. The Step Count column was added for reference; it represents the number of steps required to cover the

value range, which can be used to calculate the expected number of permutations. In terms of permutations, running

the Original DBScan with a HC/SE Bayesian Network requires 172,800 permutations and the FN-DBScan variant

requires 768,000.

3.4.2 The Data Engine

The Data Engine package was designed to help manage the data flow during the execution of the

experiment. The purpose of DataSet is to encapsulate various operations such as removing attributes and preparing

the training and testing sets using the cross-validation process. The key purpose of DataObject is to allow additional information for an Instance to be collected; in this experiment, an easy way to access the cluster number was necessary. DataSetFold provides a consistent lookup for each fold number's train/test partition, which ensures the same arrangement of instances occurs for each fold as the permutation set is iterated.

The CSVFileReader is used to read in the CSV file defining the algorithms and their respective parameter

permutations to test. Figure 14 below overviews the structure of this package.

Figure 14: Data Engine

3.4.2.1 Cross Validation

K-Fold cross validation is a process that randomly partitions the dataset into training/testing partitions k

times. The purpose of cross-validation is to help guard against training the algorithms under study to fit the only

available data, which is commonly known as over-fitting. Cross Validation is achieved by randomly selecting a test

subset from the complete data set to simulate the possibility of future data. For each fold, various statistics


surrounding the performance and accuracy of the algorithms under study are gathered; upon completion of all iterations these are averaged to obtain representative values.
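A minimal sketch of this partitioning, using Weka's built-in trainCV/testCV helpers, is shown below. The fixed random seed is an assumption used here to illustrate how the same fold arrangement can be reproduced across permutations (the role attributed to DataSetFold above).

import java.util.Random;
import weka.core.Instances;

/** Minimal sketch of K-Fold cross validation with Weka; the hybrid pipeline that
 *  would run inside the loop is omitted. */
public class CrossValidationSketch {

    public static void kFold(Instances data, int k) {
        Instances randomized = new Instances(data);
        randomized.randomize(new Random(1)); // fixed seed -> reproducible fold arrangement

        for (int fold = 0; fold < k; fold++) {
            Instances train = randomized.trainCV(k, fold);
            Instances test = randomized.testCV(k, fold);
            // ... cluster the training set, build the Bayesian classifier, evaluate on
            // the test set, and accumulate this fold's statistics for averaging ...
        }
    }
}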

3.4.2.2 Attribute Maintenance

When training/building a classifier or clusterer for experimentation it is important to remove the class label,

unique id, and any redundant/irrelevant attributes. Currently this experiment manually selects the attributes to

remove, however future work will automate the selection using an attribute selection measure such as Information

Gain [54, 55]. Furthermore, this experiment requires the dataset to be altered slightly for various tasks. During the

DBScan phase the training data is classless, but upon completion the ClusterNumber attribute is added as a nominal

attribute. It is important that ClusterNumber is set as the nominal class prior to running Bayes’, which does not

support continuous valued attributes.
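These alterations can be illustrated with standard Weka filters, as in the sketch below; the attribute indices and the cluster label names are illustrative assumptions rather than the thesis code, and the incoming data is assumed to have no class index set yet.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Add;
import weka.filters.unsupervised.attribute.Remove;

/** Minimal sketch of the attribute maintenance steps using standard Weka filters. */
public class AttributeMaintenanceSketch {

    /** Drop the original class label (assumed here to be the last attribute) before clustering. */
    public static Instances makeClassless(Instances data) throws Exception {
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        return Filter.useFilter(data, remove);
    }

    /** After clustering, append a nominal ClusterNumber attribute and use it as the class. */
    public static Instances addClusterNumber(Instances data, int clusterCount) throws Exception {
        Add add = new Add();
        add.setAttributeName("ClusterNumber");
        StringBuilder labels = new StringBuilder("noise"); // label names are an assumption
        for (int c = 0; c < clusterCount; c++) labels.append(",cluster").append(c);
        add.setNominalLabels(labels.toString());
        add.setInputFormat(data);
        Instances withCluster = Filter.useFilter(data, add);
        withCluster.setClassIndex(withCluster.numAttributes() - 1);
        return withCluster;
    }
}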

3.4.3 The AI Engine

The AI Engine, shown in Figure 15, encapsulates the various requirements of the Data Mining techniques

under study and leverages several core Weka AI algorithms.

Figure 15: AI Engine

The BayesNetWrap class serves to encapsulate the experiment's Bayesian-based process. The process

consists of training a Bayesian classifier with the noiseless training set, then using this trained classifier to classify

the noise, and finally re-training the Bayesian classifier with the full training set. This process produces accuracy

results based on Pre-Noise Classification testing and Post-Noise Classification testing.

The DBScanWrap component encapsulates both FN-DBScan and DBScanV2 for the experiment's testing

purposes and utilizes the ClusterSet to store information about ClusterObjects. In terms of efficiency the ClusterSet

should be an integral part of DBScan, but is currently generated after the buildCluster process has been run, thus

adding an additional O(n) requirement to the computational complexity. Future work should continue the re-design

of DBScanV2 which is an alteration of the original Weka version; the alterations have currently been limited to


simply allowing better sub-classing for FNDBScan. For similar reasons SequentialDatabaseV2 re-implemented

SequentialDatabase to allow better sub-classing for the FNSequentialDatabase. The algorithms for DBScan and FN-

DBScan are contrasted in Table 5; the steps in which the algorithms differ are the parameter initialization and the core point test.

Table 5: DBScan vs. FN-DBScan

Original DBScan:

[Start]
1. Set parameters minpts/mincard and e-radius.
2. Set all data objects as unclassified, then [Process Data Set].

[Process Data Set]
3. For each data object X in the data set: if X is unclassified then [Expand Cluster(C, X)], where C is the current cluster.

[Expand Cluster(C, X)]
4. seedSetX = [Get Seed Set(X)]
5. If [Is Core Point(seedSetX)]
   5.1. Add X to cluster C and remove X from seedSetX.
   5.2. While seedSetX.size > 0
        5.2.1. Get next object Y from seedSetX and remove Y from seedSetX.
        5.2.2. If Y is unclassified OR noise
               5.2.2.1. seedSetY = [Get Seed Set(Y)]
               5.2.2.2. If [Is Core Point(seedSetY)] then add seedSetY to seedSetX.

[Get Seed Set(X)]
- Get all objects within the e-neighbourhood of X.

[Is Core Point(X, seedSet)]
- If seedSet.size >= minpoints then return true.

FN-DBScan:

[Start]
1. Set parameters minpts/mincard, e-radius, k, and f.
2. Set all data objects as unclassified, then [Process Data Set].

[Process Data Set]
3. For each data object X in the data set: if X is unclassified then [Expand Cluster(C, X)], where C is the current cluster.

[Expand Cluster(C, X)]
4. seedSetX = [Get Seed Set(X)]
5. If [Is Core Point(seedSetX)]
   5.1. Add X to cluster C and remove X from seedSetX.
   5.2. While seedSetX.size > 0
        5.2.1. Get next object Y from seedSetX and remove Y from seedSetX.
        5.2.2. If Y is unclassified OR noise
               5.2.2.1. seedSetY = [Get Seed Set(Y)]
               5.2.2.2. If [Is Core Point(seedSetY)] then add seedSetY to seedSetX.

[Get Seed Set(X)]
- Get all objects within the e-neighbourhood of X.

[Is Core Point(X, seedSet)]
- CARD = Neighborhood Relation(X, seedSet)
  - The neighborhood relation function is specified by parameter f; the parameter k is also used in f.
  - Cardinality (CARD) is calculated by applying f to all objects within the neighborhood.
- If CARD >= mincard then return true.

3.4.3.1 Original DBScan Clustering

Density based clustering has the ability to detect naturally occurring groupings within a set of objects

without the need for human supervision. These groupings, known as clusters, represent object classifications. The

obvious naturally occurring class label for a clustered instance is simply the cluster number to which it belongs. For

the purpose of testing this experiment, the class attribute for each training instance is set to the nominal ClusterNumber; it is nominal due to the requirements of the Bayesian Network implementation.


An implementation of the original DBScan algorithm, as shown in Table 5, is supplied with Weka. This

algorithm required some minor alterations to allow better inheritability but was preserved as much as possible. FN-

DBScan, for instance, is ideally a subclass of the original DBScan but required many of the internal class properties

to be declared as protected rather than private. An important note about the Weka (3.6.6) DBScan clusterInstance

method is that it does not work how one might initially suspect. This method takes an instance as a parameter but the

implementation ignores the input parameter and simply returns the cluster number of a data object based on an

internal counter. This means that as long as the entire data set (instance set) is iterated the counter will properly

retrieve the cluster number of the instance at the indicated position. In this way the cluster numbers assigned to the

instances can be gathered without adding significant alterations to the DBScan algorithm found in Weka.
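The Java sketch below illustrates this gathering pattern against the Weka 3.6 API. It is a sketch rather than the thesis implementation: the parameter values are example values taken from the optimal settings discussed later, and handling of noise objects is omitted for brevity.

import weka.clusterers.DBScan;
import weka.core.Instances;

/** Minimal sketch of gathering cluster assignments from Weka 3.6's DBScan. Because
 *  clusterInstance ignores its argument and relies on an internal counter, the whole
 *  training set must be iterated in the same order it was clustered. */
public class ClusterNumberSketch {

    public static int[] gatherClusterNumbers(Instances train) throws Exception {
        DBScan dbscan = new DBScan();
        dbscan.setEpsilon(0.4); // example values taken from Table 6
        dbscan.setMinPoints(5);
        dbscan.buildClusterer(train);

        int[] clusterNumbers = new int[train.numInstances()];
        for (int i = 0; i < train.numInstances(); i++) {
            // The passed instance is ignored; the internal counter yields instance i's cluster.
            clusterNumbers[i] = dbscan.clusterInstance(train.instance(i));
        }
        return clusterNumbers;
    }
}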

3.4.3.2 Implementing FN-DBScan Clustering

Implementing FN-DBScan required a clustering class to be derived from DBScanV2 and a Sequential

Database class derived from SequentialDatabaseV2 (see section 3.4.3). These classes implement the specific

functionality to allow the additional parameters and the alterations to the core point test shown above in Table 5.

The current implementation of FN-DBScan only allows for the Exponential Neighborhood Relation function, but

this could easily be expanded with additional functions and even an additional input parameter to specify the

function to use. Although not explored in this experiment, the implementation added an optional parameter to normalize the neighborhood cardinality, constraining the value between 0 and 1; a normalized value is purely a measure of density and in essence nullifies the importance of magnitude, which causes the algorithm to behave differently.
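The sketch below illustrates the fuzzy core point test in Java. The exponential relation shown (membership decaying as exp(-k * distance) inside the e-neighborhood) is one plausible form of the neighborhood relation function; the exact function and any normalization used by the implementation may differ, so this is an illustration only.

/** Illustrative sketch of the FN-DBScan core point test: cardinality is the sum of
 *  fuzzy neighborhood memberships rather than a simple point count. */
public class FuzzyCorePointSketch {

    /** Exponential neighborhood relation: membership decays with distance, zero outside e. */
    static double membership(double distance, double eps, double k) {
        return distance <= eps ? Math.exp(-k * distance) : 0.0;
    }

    /** A point is a core point if the fuzzy cardinality of its neighborhood reaches
     *  mincard (compare with seedSet.size >= minpts in the original DBScan). */
    static boolean isCorePoint(double[] neighbourDistances, double eps, double k, double minCard) {
        double cardinality = 0.0;
        for (double d : neighbourDistances) {
            cardinality += membership(d, eps, k);
        }
        return cardinality >= minCard;
    }
}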

3.4.3.3 Bayesian Network Classification

Classification using Bayesian networks is based on finding the maximal posterior probability for a class

given a set of evidence. The Weka API offers many variations of Bayesian networks by allowing the search

algorithm and estimator to be passed as parameters. The search algorithm is used to learn the structure of the

network based on the dataset’s attributes and the estimator is used to learn the conditional probability distributions

for each node in the learned structure. Currently this experiment focuses only on the Local Hill Climber search

algorithm and the Simple Estimator. Furthermore, the training data provided to the Bayesian network currently does

not allow for missing values, and it must have all attributes discretized (no continuous values). The DataSet class applies Weka's unsupervised discretization filter prior to building a Bayesian classifier.
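A minimal configuration sketch against the Weka 3.6 API is shown below. The parameter values are illustrative (drawn from the ranges in Table 4), and the Markov Blanket correction and score type options are omitted for brevity.

import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.SimpleEstimator;
import weka.classifiers.bayes.net.search.local.HillClimber;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

/** Minimal sketch of the Bayesian Network configuration used in this experiment:
 *  unsupervised discretization, a local Hill Climber search, and the Simple Estimator. */
public class BayesNetSketch {

    public static BayesNet buildClassifier(Instances train) throws Exception {
        // Bayes' does not support continuous attributes, so discretize first.
        Discretize discretize = new Discretize();
        discretize.setInputFormat(train);
        Instances discretized = Filter.useFilter(train, discretize);

        HillClimber search = new HillClimber();
        search.setMaxNrOfParents(1);       // P
        search.setInitAsNaiveBayes(true);  // N = 0 (start from a naive Bayes structure)
        search.setUseArcReversal(false);   // R

        SimpleEstimator estimator = new SimpleEstimator();
        estimator.setAlpha(0.6);           // Alpha

        BayesNet bayes = new BayesNet();
        bayes.setSearchAlgorithm(search);
        bayes.setEstimator(estimator);
        bayes.buildClassifier(discretized); // class attribute is the nominal ClusterNumber
        return bayes;
    }
}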


3.5 Summary

This chapter discussed a Weka-based implementation used to conduct the experiment described in the following

chapter. The implementation leverages both clustering and classification techniques and gathers various

measurements used during analysis. Although the scope of this experiment is small, the implementation was

designed to be broadened in future work.


CHAPTER IV

ANALYSIS

To effectively conduct this experiment the overall success of the previously proposed system must be

measured. All measurements are taken using the IRIS dataset and fall loosely into three categories: Time, Accuracy,

and Noise. Additionally, this experiment will focus on comparing DBScan & FN-DBScan with a Bayesian Network

using a simple estimator and a local hill climber search. The following sections start by introducing the IRIS dataset

and the parameters passed to the implementation. Next, the optimal parameters are discovered for both the DBScan

& FN-DBScan variants of the experiment. And finally, an analysis of the results generated by focusing on the

optimal parameters is conducted and discussed.

4.1 Parameter Selection for the IRIS Dataset

To help analyze the success of the proposed system the IRIS dataset [58] was selected for two important

reasons. First, it is a very popular data set used in many pattern recognition experiments [58] as well as in research directly related to this experiment [17]. Second, the relatively small size of the dataset (150 entities) made the time complexity of this experiment viable without considerable optimization effort. The

IRIS data set consists of 150 instances with 50 instances existing in each of the 3 classifications. Existing research

[17] has established that performing density based clustering can yield 67% accuracy (or a 33% error rate). A

common outcome for clustering the IRIS dataset is a 2-cluster solution in which one cluster contains only iris-setosa and the other contains both iris-versicolor and iris-virginica, thus producing a 66.66% (67%) success rate.

In the following test cases 10-Fold cross validation is performed against the dataset under study using a

broad and exhaustive parameter permutation engine. Once the results of the permutation engine have been collected

analysis of the optimal set of parameters is conducted by independently observing the average accuracy achieved by

each parameter value. To help initially limit the values of interest, Figures 16 and 18 use the Weka explorer to analyze the accuracy ratings of various parameter values. The Y axis on all figures presented is the percentage of errors

when running the final Bayesian classifier against the test cases (post noise classification).

The Bayesian Classifier is trained to classify test cases into cluster numbers. During the testing process the

cluster number to which a test case is assigned is used to find the max-class representing the cluster. This max-class

is then compared with the actual class of the test case, and an error is produced if they differ.


After analysis is completed the full permutation set is reduced based on criteria which independently imply

optimal solutions. With this reduced dataset the data is sorted and patterns meeting optimal solution requirements

are observed. Once an optimal pattern is selected it is run through an intensive set of tests to ensure that an accurate

representation of its average accuracy is established. This intensive process consists of performing 3 rounds of

10000 iterations using 3-fold, 6-fold, and 10-fold cross validation against the specified parameters, from which the

results are averaged.
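The structure of this intensive verification run is sketched below in Java; the evaluateFold helper standing in for the full hybrid pipeline (cluster, train the Bayesian classifier, test) is hypothetical.

import java.util.Random;
import weka.core.Instances;

/** Minimal sketch of the focused verification run: repeat K-Fold cross validation many
 *  times for one selected parameter permutation and average the resulting error. */
public class FocusedRunSketch {

    public static double averageError(Instances data, int folds, int iterations) throws Exception {
        double errorSum = 0.0;
        for (int iter = 0; iter < iterations; iter++) {     // e.g., 10,000 iterations
            Instances randomized = new Instances(data);
            randomized.randomize(new Random(iter));         // a new shuffle for each iteration
            for (int fold = 0; fold < folds; fold++) {      // e.g., 3-, 6-, or 10-fold
                Instances train = randomized.trainCV(folds, fold);
                Instances test = randomized.testCV(folds, fold);
                errorSum += evaluateFold(train, test);
            }
        }
        return errorSum / (iterations * folds);
    }

    /** Hypothetical: run the clustering and Bayesian training for this fold and return
     *  the PostNC test error percentage. */
    static double evaluateFold(Instances train, Instances test) throws Exception {
        return 0.0; // placeholder for the hybrid pipeline
    }
}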

4.2 Optimal Parameter Discovery for DBScan and Bayes’ HC/SE

The first broad exploration against the Iris dataset used the original DBScan algorithm and a Bayesian

Network with local hill climbing search and a simple estimator which required testing 134,400 parameter

permutations and resulted in approximately 2 hours of execution time (7,609,118 ms). The parameter trends shown

in Figure 16 are generated using the Weka explorer and are based on analysis of the permutations after an initial filtering: removing permutations with DBScan noise levels of 80% or more reduced the set to 127,680 records (this will vary slightly between runs). The most observable trend is e-radius, shown in 16b, which visually demonstrates that an e-radius

value between 0.15 and 0.45 has a high concentration of accurate results. The trends for MinPts (16a), Alpha (16c),

and S (16d) are difficult to observe from the graph and all seem to imply a large percentage of high accuracy results.

Additionally, we can observe that good results can be obtained when the total cluster count is between 2 and 4.7

(16e). And finally, a noise level between 0 and 10% seems to be an acceptable range found in the high accuracy

permutations.

Figure 16 panels: (a) DBScan MinPts, (b) DBScan e-radius, (c) DBScan Alpha, (d) DBScan S, (e) DBScan cluster count, (f) DBScan noise.

Figure 16: DBScan Parameter Analysis

To help further limit the set of possible high accuracy permutations, several strategies are implemented. First, only permutations achieving a 33.33% (or better) accuracy rating need to be observed, which reduces the set to 26,582 entries. This set represents the parameter permutations that achieved optimal solutions.

Next, to isolate the optimal set of parameters, each parameter will be analyzed individually to determine which value

has the largest number of entries; the values with the largest number, for each parameter, will be chosen as the

proposed optimal set.

In Figure 17, high accuracy membership counts for e-radius were gathered and summarized on the chart.

Put another way, the chart shows the number of highly accurate permutations (33.3% or better) for each important

value of e-radius (0.15-0.45). The chart shows that when e-radius is between 0.35 and 0.45 the highest number of

highly accurate permutations result (all are exactly 4648).

Figure 17: High accuracy membership counts for e-radius

Performing a similar analysis of the remaining parameters produces the findings summarized below in Table 6, which will be used to generate the results discussed shortly.

Table 6: DBScan Optimal Parameters

Parameter/Result | High Performers | Selected Best
MinPts | 3, 5 | 5
E-Radius | 0.35-0.45 | 0.4
Alpha | 0.4-0.8 | 0.6
P | 1 | 1
N | 0 | 0
MBC | 1 | 1
R | 0 | 0
S | 0, 4, 5 | 0

 

4.3 Optimal Parameter Discovery for FN-DBScan and Bayes’ HC/SE

The next broad exploration test is based on the Iris dataset using the FN-DBScan algorithm and a Bayesian

Network with local hill climbing search and a simple estimator. This test required running 768,000 parameter

permutations and resulted in an execution time of around 12 hours (43,665,616 ms). In the following figures the trends shown are based on analysis of the parameter permutations tested by this process. For each permutation, if the noise level produced by any of the folds is 80% or greater, the system will not build a Bayesian classifier and will therefore not produce any results for analysis; these entries are purged from the initial dataset using the IsBayesRun filter, which reduces the dataset to 479,040 records. As with the analysis of the original DBScan,

the permutation set can be further limited. Observing only permutations achieving a 33.33% (or better) accuracy

rating reduces the set to 37,870 optimal solution entries.

Figure 18, shown below, helps visualize the patterns of these optimal solutions found with 10-Fold Cross

Validation. The coloring represents cluster noise, blue being low noise and orange being high noise. It is a point of

importance that optimal solutions achieving accuracy greater than 33.3% (better than DBScan) have substantially more noise and thus rely more heavily on the classifier to classify the noise.

Figure 18 panels: (a) MinCard, (b) e-radius, (c) K, (d) Alpha, (e) P, (f) N, (g) MBC, (h) R, (i) S.

Figure 18: FN-DBScan Parameter Analysis

Although the above figure gives a definite visual direction for optimal parameter values, some parameters, such as Alpha and S, require additional exploration. To achieve this, the parameters are analyzed individually to help determine an optimal parameter configuration using the method described during the DBScan parameter analysis. The findings of this investigation imply that the parameters found in Table 7 represent an optimal solution.

Table 7: FN-DBScan Optimal Parameters

Parameter/Result | Independent Best | Selected Best
MinCard | 10 | 10
E-Radius | 0.15 | 0.15
K | 15 | 15
Alpha | 0.2-0.5 | 0.3
P | 1 | 1
N | 0 | 0
MBC | 0 | 1
R | 0-1 | 0
S | 0-5 | 3


4.4 Optimal Parameter Analysis

4.4.1 Accuracy Analysis

To confirm the results implied by the parameters in the selected best columns a narrowly focused algorithm

was designed to test the selected permutation 3 times using 10000 iterations of 10-Fold, 6-Fold, and 3-Fold cross

validation. In all test cases discussed below the results produced by 10-Fold cross-validation are the best. This is due

to the simple fact that the full data set is split into 10 testing sets, which in the case of IRIS means there are only 15

test cases in each fold and 135 training cases. Put another way, more folds means larger training sets and smaller test

sets, and with larger training sets a tendency towards increased accuracy is observable.

The results for a focused run using the DBScan selected best parameters found in Table 6 and the FN-DBScan selected best found in Table 7 are summarized below in Table 8. The 10-Fold DBScan results for both the test error and the training error generated a 33.3% error rate, which is in agreement with previous research [17]. As expected, both 6-Fold and 3-Fold performed slightly worse, coming in at 34.68% and 34.18%, respectively.

Furthermore, the results of this focused algorithm in all cases show that the cluster count is exactly 2 and the noise

level is 0%.

Table 8: Accuracy Analysis

Folds | Train/Test Count | Algorithm | Train Error Pre-Noise Classify | Train Error Post-Noise Classify | Test Error Pre-Noise Classify | Test Error Post-Noise Classify | Noise | Clusters
10 | 135 / 15 | DBScan | 33.33% | 33.33% | 33.33% | 33.33% | 0% | 2
6 | 125 / 25 | DBScan | 33.06% | 33.06% | 34.68% | 34.68% | 0% | 2
3 | 100 / 50 | DBScan | 33.01% | 33.01% | 34.18% | 34.18% | 0% | 2
10 | 135 / 15 | FN-DBScan | 40.9% | 28.3% | 38.9% | 30.4% | 39.3% | 2.7
6 | 125 / 25 | FN-DBScan | 48.4% | 34.8% | 42.7% | 36.9% | 47.2% | 2.3
3 | 100 / 50 | FN-DBScan | 65.5% | 55.6% | 59.3% | 58.5% | 65.2% | 1.5

By performing a focused FN-DBScan run on the optimal pattern found in Table 7 an ErrTestPostNC value of 30.4%

is achievable, which represents an accuracy gain over using the traditional DBScan. Furthermore, it is interesting to note that the training set achieves a 28.3% error rate, which is significantly better than the original DBScan (33.33%).

4.4.2 Noise Classification Analysis

Noise classification is a method which may enhance standard clustering by first training a classifier with

clustered objects and then using that classifier to assign noise objects to a best-fit cluster. By using such a hybrid


methodology in this experiment, accuracy increases are observable, as shown below in Table 9, which presents averages across the entire permutation set grouped by noise level. An interesting pattern is that both the original DBScan and FN-DBScan benefit most from noise classification when the amount of noise is between 11% and 20%.

This pattern merits additional future study to determine whether this will generally hold true with other data sets.

Table 9: Noise Classification Analysis

Noise Level | DBScan Accuracy Increase | DBScan Average Accuracy (ErrTestPostNCPER) | FN-DBScan Accuracy Increase | FN-DBScan Average Accuracy (ErrTestPostNCPER)
0-10% | 0.08% | 56% | 0.2% | 58%
11-20% | 6.74% | 40% | 4.9% | 42%
21-30% | 5.27% | 54% | 2.6% | 52%
31-40% | - | - | 0.7% | 63%
41-50% | 3.45% | 52% | 1.2% | 60%
51-60% | 6.83% | 51% | 0.4% | 55%
61-70% | 3% | 54% | 2.0% | 56%
71-80% | 1.27% | 64% | 0.1% | 67%
81-90% | - | - | - | -
91-100% | - | - | - | -

4.4.3 Time Analysis

Both the DBScan and FN-DBScan perform at a similar time complexity level which is shown below in

Table 10. As expected, however, the FN-DBScan requires slightly more time due to its more complicated core point

calculations. In the table below the “Entire Process” column represents the average time to perform the experiment

for a single permutation. “DBScan Build” is the portion of the experiment dealing with only building the clusters,

where-as “DBScan All” consists of not only building the clusters but also the various book keeping activities

required to ensure the cluster numbers are available to the noise classification phase. It is interesting that although a

time increase occurs with FN-DBScan in the clustering phase, the Bayesian classification phase is slightly less on

average, which in real-time applications where ad-hoc classification occurs very frequently could provide a

significant performance increase. Furthermore, it is apparent that utilizing Bayes’ classification for ad-hoc

classifications would yield significant performance increases over re-running the clustering process.

Table 10: Time Analysis

Algorithm | Entire Process | DBScan All | DBScan Build | Bayes All | BayesTrain PreNC | BayesTrain PostNC | Bayes Classify
DBScan | 55.7 | 4.21 | 4.06 | 1.26 | 0.61 | 0.63 | 0.0008
FN-DBScan | 56.82 | 4.76 | 4.58 | 1.18 | 0.56 | 0.6 | 0.0007

4.5 Summary

This experiment demonstrated several important performance gains. First, it is shown that FN-DBScan produces a higher level of accuracy than the original DBScan. Next, performing noise classification in cases with a moderate amount of noise will potentially increase accuracy. And finally, using a Bayesian Network for ad-hoc classification (into clusters) is significantly faster than re-generating the clusters.


CHAPTER V

RECOMMENDATIONS AND CONCLUSIONS

5.1 Suggestions for Further Research

To help further verify the results of this experiment it should be expanded beyond the IRIS dataset. Future

work should continue to expand on exploring the advantages found by combining a clustering algorithm with noise

classification on other datasets. The current system architecture should allow for the addition of new datasets but

will require some slight modifications to allow for missing data.

Broadening this experiment also requires additional consideration in terms of time complexity. The current

permutation engine is exhaustive but requires a significant amount of time to run. Therefore, more intelligence in the parameter permutation engine would allow better coverage without extreme computation time. A genetic

algorithm with random mutations could be well suited to this task. A guiding principle of evolutionary algorithms is

known as a fitness test. The purpose of this test is to measure the success (or health) of a given population. This test

could be governed by fuzzy parameters such as “moderate noise” and “slightly low cluster count” to help direct the

generation of test cases.

Once time complexity is addressed and some measure of fitness is in place, it would be interesting to

explore additional clustering/classification mechanisms, including the many Bayesian Network variants, Artificial

Neural Networks, and other fuzzy clustering methods. It would also be of notable interest to explore which clustering methods and data sets produce noise patterns that yield higher rates of classification success; in the previous analysis section it was noted that optimal classification success occurred when the noise level was between 11% and 20%.

Another point of importance is furthering Weka as a research platform. Weka is a very useful tool,

however, as with any tool there is always room for improvement. The current implementation of DBScan is limited

in its ability to be an effective super class which led to the development of DBScanV2 and SequentialDatabaseV2.

Future work on the Weka implementations of DBScanV2 and SequentialDatabaseV2 should focus on offering the

ability to better maintain a lookup of data objects with their respective assigned cluster number and also allow a

more effective overloading of the core point test. Additionally, in order to properly link a new implementation of


Weka’s DBScan to existing research it will be important to ensure comparison testing is performed between the

original and the newly proposed implementations.

With an effective base class for DBScan in place it will become substantially easier to experiment with variants, particularly those which simply mutate the core point test, which is ultimately the crux of neighborhoods and the formation of a cluster's shape. The core point test is commonly based on the idea of a neighborhood measure, which could be a simple point count (original DBScan) or a more complicated measure such as density or magnitude based on a neighborhood relation function (FN-DBScan). Future work should continue to explore ways

to mutate the core point test.

5.2 Conclusions

The results of this paper have empirically shown that by using FN-DBScan a Bayesian Network can be trained and verified using 10-Fold cross validation on the IRIS dataset to achieve a 30.4% classification error rate on the testing data and 28.3% on the training data. The classic DBScan algorithm achieves only a 33.33% error rate on both the testing and training data. The accuracy increase achievable through the use of FN-DBScan is due to the generated noise, which can then be classified using a Bayes’ Net; noise generated by the original DBScan does not offer any improvement beyond the standard 33.33% error rate. Although further analysis is required on additional datasets, these results imply that, when verified with a Bayesian classifier, FN-DBScan may be more effective and accurate when clustering similar data objects and isolating noisy data, while requiring only a minimal increase in computational time. Furthermore, leveraging a trained Bayesian Network for ad-hoc classification into previously defined clusters offers significant performance increases over re-running DBScan. Overall, a hybrid FN-DBScan and Bayesian network solution has the potential to offer increased accuracy and efficiency when dealing with real world data analysis situations.


REFERENCES

[1] Y. Li, J. Han, and J. Yang, “Clustering Moving Objects,” KDD, pp.617-622, 2004.

[2] R. Cheng, D.V. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” SIGMOD, pp.551-562, 2003.

[3] E. Januzaj, H. P. Kriegel and M. Pfeifle, “Scalable Density-Based Distributed Clustering,” The 15th European Conference on Machine Learning (ECML) and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Pisa, Italy, September 2004.

[4] H.P. Kriegel, P. Kunath, M. Pfeifle, and M. Renz, “Approximated Clustering of Distributed High Dimensional Data,” PAKDD, pp. 432-441. 2005.

[5] A. Kolmogorov, Foundations of the Theory of Probability, 2nd ed., New York: Chelsea, 1956.

[6] G. Shafer and V. Vovk, "The origins and legacy of Kolmogorov's Grundbegriffe," [Online]. Available: http://www.probabilityandfinance.com/articles/04.pdf. [Accessed 02 December 2012].

[7] K. Murphy, "A brief introduction to Bayes' Rule," [Online]. Available: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html. [Accessed 5 December 2011].

[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, San Francisco, CA: Morgan Kaufmann, 1988.

[9] N. L. Zang and D. Poole, “A simple approach to Bayesian network computations,” Proc. 10th Canadian Conference on Artificial Intelligence, pp. 171-178. 1994.

[10] R. D. Shachter, B. D’Ambrosio, and B.A. Del Favero, “Symbolic probabilistic inference in belief networks,” AAAI, pp. 126-131, 1990.

[11] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd Ed, New Jersey: Prentice Hall, 2010.

[12] B. Coppin, Artificial Intelligence Illuminated, Sudbury: Jones and Bartlett Publishers, 2004.

[13] S. Russell, J. Binder, D. Koller, and K. Kanazawa, “Local learning in probabilistic networks with hidden variables,” Proc. 1995 Joint Int. Conf. Artificial Intelligence (IJCAI), pages 1146–1152, 1995.

[14] L.A. Zadeh, Fuzzy sets, Information and Control, 8:338-353, 1965.

[15] A. Kandel and W. Byatt, “Fuzzy Sets, Fuzzy Algebra and Fuzzy Statistics”, Proceedings of the IEEE, vol 66, no 12, pp. 1619-1639, 1978.

[16] J.-S.R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, New Jersey: Prentice Hall, 1996.

[17] J.K. Parker, L.O. Hall, and A. Kandel, "Scalable fuzzy neighborhood DBSCAN," IEEE International Conference on Fuzzy Systems (FUZZ), pp.1-8, 18-23 July 2010.

[18] A.K. Jain and R.C. Dubes, Algorithm for Clustering Data, New Jersey: Prentice Hall, 1998.

[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Diego: Acad. Press, 2001.

[20] W.D. Fisher, "On grouping for maximum homogeneity," J. Amer. Statist. Assoc., Vol. 53, pp. 789-798, 1958.

[21] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 265-323, 1999.

[22] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[23] E. Januzaj, H. P. Kriegel and M. Pfeifle, “DBDC:Density Based Distributed Clustering”, Proc. 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, pp. 88-105, 2004.

[24] A.K. Jain, Algorithms for Clustering Data, New Jersey: Prentice Hall, 1988.

[25] H. Samet, The Design and Analysis of Spatial Data Structures, Reading, MA: Addison-Wesley, 1990.


[26] R.H. Gueting, “An Introduction to Spatial Database Systems,” The VLDB Journal 3(4), pp. 357-399, 1994.

[27] T. Abraham and J.F. Roddick, “Survey of spatio-temporal databases,” GeoInformatica 3(1), pp. 61–99, 1999.

[28] J. Han, M. Kamber, and A.K.H. Tung, “Spatial clustering methods in data mining: a survey,” Geographic Data Mining and Knowledge Discovery, London: Taylor & Francis, 2001.

[29] D. Birant and A. Kut, “ST-DBSCAN: an algorithm for clustering spatial–temporal data,” Data & Knowledge Engineering 60, pp. 208–221, 2007.

[30] M. L. Yiu, N. Mamoulis, “Clustering Objects On a Spatial Network,” SIGMOD, pp.443-454, 2004.

[31] M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.

[32] T. Ali, S. Asghar, N.A. Sajid, "Critical analysis of DBSCAN variations," International Conference on Information and Emerging Technologies (ICIET), pp.1-6, 14-16 June 2010.

[33] E.N. Nasibov and G. Ulutagay, “Robustness of density-based clustering methods with various neighborhood relations,” Fuzzy Sets and Systems, Volume 160, Issue 24, pp. 3601-3615, 16 December 2009.

[34] P. Viswanath, R. Pinkesh, "l-DBSCAN: A Fast Hybrid Density Based Clustering Method," 18th International Conference on Pattern Recognition (ICPR), pp.912-915, 2006.

[35] Y. Wu, J. Guo, X. Zhang, "A Linear DBSCAN Algorithm Based on LSH," International Conference on Machine Learning and Cybernetics, vol.5, pp.2608-2614, 19-22 August 2007.

[36] B. Liu, "A Fast Density-Based Clustering Algorithm for Large Databases," International Conference on Machine Learning and Cybernetics, pp.996-1000, 13-16 August. 2006.

[37] S. Jiang, X. Li, "A Hybrid Clustering Algorithm," Sixth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp.366-370, 14-16 Aug. 2009.

[38] Y. El-Sonbaty, M.A. Ismail, M. Farouk, "An efficient density based clustering algorithm for large databases," 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 673- 677, 15-17 Nov. 2004.

[39] R.T. Ng and J. Han, ”Efficient and Effective Clustering Methods for spatial data mining,” Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp.144-155, 1994.

[40] M. Ankerst, M.M. Breunig, H.P. Kriegel, and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, PA, pp. 49-60, 1999.

[41] O. Uncu, W.A. Gruver, D.B. Kotak, D. Sabaz, Z. Alibhai, C. Ng, "GRIDBSCAN: GRId Density-Based Spatial Clustering of Applications with Noise," IEEE International Conference on Systems, Man and Cybernetics (SMC), vol.4, pp.2976-2981, 8-11 October 2006.

[42] D. Dumitrescu, B. Lazzerini, and L. C. Jain, Fuzzy Sets and Their Application to Clustering and Training, New York: CRC Press LLC, 2000.

[43] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, England: John Wiley & Sons, 1999.

[44] E.N. Nasibov, “A robust algorithm for fuzzy clustering problem on the base of fuzzy joint points method,” Cybernetics and Systems Analysis 44 (1), 2008.

[45] E. N. Nasibov, G. Ulutagay, “A new unsupervised approach for fuzzy clustering,” Fuzzy Sets and Systems, Volume 158, Issue 19, pp.2118-2133, October 2007.

[46] H.P. Kriegel and M. Pfeifle, “Density-based clustering of uncertain data,” In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD), pp.672-677. 2005.

[47] H.-P Kriegel, K. Kailing, A. Pryakin, M. Schubert, “Clustering Multi-Represented Objects with Noise,” PAKDD, pp.394-403, 2004.


[48] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, 3:32–57, 1973.

[49] J.C. Bezdek, "A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.PAMI-2, no.1, pp.1-8, January 1980.

[50] J.M. Keller, M.R. Gray, J.A. Givens Jr., “A fuzzy K-nearest neighbor algorithm,” IEEE Trans., Syst., Man, and Cybern., Vol.15, No.4, pp. 580–585, 1985.

[51] T.M. Cover and P.E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan. 1967.

[52] W. Jiacai, G. Ruijun, "An Extended Fuzzy k-Means Algorithm for Clustering Categorical Valued Data," International Conference on Artificial Intelligence and Computational Intelligence (AICI), vol.2, pp.504-507, 23-24 Oct. 2010.

[53] A. Tepwankul and S. Maneewongwattana, "U-DBSCAN : A density-based clustering algorithm for uncertain objects," IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp.136-143, 1-6 March 2010.

[54] J. R. Quinlan, “Induction of decision trees”, Machine Learning, 1:81-106, 1986.

[55] C.E. Shannon and W. Weaver, The mathematical theory of communication, Illinois: University of Illinois Press, 1949.

[56] Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, 11, pp. 341-356, 1982.

[57] J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor: The University of Michigan Press, 1975.

[58] R. A. Fisher, "IRIS Data Set," UCI Machine Learning Repository, [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Iris. [Accessed 5 December 2011].