ATHABASCA UNIVERSITY
Modeling Uncertainty in datasets with FN-DBScan and Bayesian Networks
2.2 Data Clustering ................................................................................................................................. 21
2.3 Applications for uncertainty .............................................................................................................. 26
Figure 4: The Age Variable ......................................................................................................................................... 17
Figure 7: Fuzzification – Crisp Value to Fuzzy Value ................................................................................................ 19
Figure 8: Probability of Success .................................................................................................................................. 20
Figure 10: Density Connectivity .................................................................................................................................. 24
Figure 11: System Overview ....................................................................................................................................... 31
Figure 12: System Data Flow ...................................................................................................................................... 32
We exist in a world rife with magnitudes of uncertain and imprecise data. The era when rigid boundaries
between classifications would placate our analytic requirements is gone. In this world, modeling uncertainty provides
a way to deal with an all-too-often partial view of reality for the ever-growing flow of data. And it is this perpetual
flow of data which drives an insatiable need for effective clustering and rapid classification of real-world data.
Data clustering is a mostly unsupervised method used to group data into natural partitions while requiring a
minimal amount of human interaction. A popular clustering strategy is known as Density Based Spatial Clustering
of Applications with Noise (DBScan) which is capable of discovering arbitrarily shaped clusters and detecting noise.
A technique known as Fuzzy-Neighborhood DBScan (FN-DBScan) enhances the mechanism by which the original
DBScan discovers core points; core points form the backbone of a cluster's shape. FN-DBScan is based on the
idea of passing a Fuzzy Neighborhood Relation Function (F-NRF) as a parameter to the clustering algorithm. The F-
NRF is a flexible concept which determines the strength with which a given point gravitates towards becoming a
core point based on its neighborhood. A common F-NRF is exponential, but it is also trivial to model the original
DBScan’s point count using this concept.
Data classification is used to classify new data based on the knowledge obtained from existing data. A
trained classifier represents a model of the dataset under study and will produce a best fit answer when given a new
data object. One such technique, which is deeply rooted in probability theory, is known as a Bayesian Network
(BN). A BN uses a probabilistic logic model in the form of a structured probability network to deduce a
classification with the highest probability. Classification can potentially assist the clustering process by attempting
to classify noisy data points and also perform rapid ad-hoc cluster assignment to new points.
A widely accepted method to deal with uncertain boundaries is known as fuzzy set theory. Fuzzy set theory
was developed to mathematically support a system where objects have multiple set memberships, each of which
represents a degree of belief in the set membership. Due to this inherent ability to belong to multiple sets, fuzzy
values provide a way to deal with non-crisp (fuzzy) knowledge.
1.1 Statement of the Purpose
The purpose of this experiment is to measure the success of a hybrid AI solution which is capable of semi-
supervised learning through the use of both clustering and classification with a focus on fuzzy methods. In the
current experiment, the clustering methods under observation include DBScan and FN-DBScan with the
classification process using a Bayesian Network with hill climbing search and a simple estimator. This experiment
will serve to explore three primary questions: First, is FN-DBScan more effective than the original DBScan?
Second, can additional accuracy be gained by performing a subsequent classification step to classify the noise
generated by the clustering process? And finally, what performance gain is achievable by using Bayesian
classification on new data, rather than re-running the clustering process?
1.2 Chapter Organization
The subsequent sections are organized as follows. Chapter two introduces background knowledge and
previous research dealing with uncertainty and data clustering. Once the background is established chapter three
describes the experiment and the system architecture used to generate the experimental results. Next, given the
generated results chapter four provides an analytical study into how the results support the questions under study.
Finally, a conclusion and suggestions for further research are given.
1.3 Definition of Terms
Acronym      Definition
DBScan       Density Based Spatial Clustering of Applications with Noise
FN-DBScan    Fuzzy-Neighborhood DBScan
SFN-DBSCAN   Scalable FN-DBScan
F-NRF        Fuzzy Neighborhood Relation Function
BN           Bayesian Network
AI           Artificial Intelligence
GA           Genetic Algorithm
LLN          Law of Large Numbers
CLT          Central Limit Theorem
DAG          Directed Acyclic Graph
JPD          Joint Probability Distribution
CPT          Conditional Probability Table
MF           Membership Function
DIL          Dilation
CON          Concentration
INT          Contrast Intensification
DIM          Diminishing
COG          Center of Gravity
NRFJP        Noise Robust Fuzzy Joint Points
KNN          K-Nearest Neighbor
xFKM         Extended Fuzzy K-Means
CSV          Comma Separated Values
CHAPTER II
REVIEW OF RELATED LITERATURE
Uncertainty is an ever-present issue surrounding data analysis that plagues the process with potential error
at every step. This chapter will discuss several techniques which can help mitigate the effects of uncertainty. The
first technique is based on probability theory and is presented in the form of Bayesian classification. The next is a
branch of set theory known as fuzzy set theory. Finally, a branch of data mining known as density-based
clustering will be introduced through a discussion of the DBScan algorithm.
2.1 Uncertainty
There are many instances where absolute data precision is simply not possible [1, 2, 3, 4], which is
particularly true when dealing with real-world models pertaining to time and space, which have an infinite level of
detail. There are two primary branches in the field of uncertainty. The first, probability theory, is highly objective
and is based on the establishment of causal relationships through the collection of empirical data. The second,
fuzzy set theory, is usually considered more subjective in that the relationships modeled are based on the modeler's
impression of the variable in question. Two other important uncertainty techniques are rough set theory and
evolutionary theory. Rough
set theory, developed by Pawlak [56], is similar to fuzzy theory in that they both represent uncertainty about set
memberships. Rough set theory, however, uses the concept of set boundaries defined through topology operations to
express vagueness of membership rather than the fuzzy method of using membership functions to establish a degree
of belief. An important aspect of the Rough set approach is that it does not require any preliminary meta-data such
as the membership functions required by fuzzy sets which makes it an appealing candidate for unsupervised
solutions. Evolutionary theory represents a field of study which is influenced by biological models for evolution
such as genetics. A popular evolutionary design is known as a Genetic Algorithm (GA) which was predominantly
pioneered by Holland [57] and uses a probabilistic transition scheme through populations. The purpose of a GA is to
effectively and efficiently evolve an optimal solution to the problem under study by first applying a fitness test to
determine the strongest members of a population and then evolving new populations based on these members until a
terminating condition is met.
2.1.1 General Probability Theory
The undeniable need to understand uncertain events has led to a branch of mathematics known as
probability theory. Kolmogorov's Grundbegriffe [5] represents a milestone synthesis establishing the fundamental
foundation for modern probability theory [6]. Two fundamental theorems emerge from the foundations of probability:
the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), both of which apply to large sequences of
independent random variables. The LLN states that as a sample set increases, the sample average converges to the
expected value of the underlying distribution, whereas the CLT states that, given a sufficiently large sample set, the
average of the random variables will be approximately normally distributed.
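The convergence described by the LLN can be observed empirically. The following Java sketch (the sample sizes and random seed are illustrative, and this code is not part of the experiment's implementation) averages a growing number of simulated fair-coin flips and shows the sample mean settling near the true mean of 0.5:

```java
import java.util.Random;

// Illustrative simulation of the Law of Large Numbers: the sample
// average of fair-coin flips (0 or 1) converges to the true mean 0.5.
public class LawOfLargeNumbersDemo {

    // Returns the sample average of n simulated fair-coin flips.
    public static double sampleAverage(int n, long seed) {
        Random rng = new Random(seed);
        long heads = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextBoolean()) heads++;
        }
        return (double) heads / n;
    }

    public static void main(String[] args) {
        // Larger samples land progressively closer to 0.5.
        for (int n : new int[] {10, 1_000, 100_000}) {
            System.out.printf("n=%d average=%.4f%n", n, sampleAverage(n, 42L));
        }
    }
}
```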
When modeling a system using probability theory the thematic objects consist of random (stochastic)
variables, random processes, and events. A random variable assigns a value to each possible outcome, and an event
represents a set of possible outcomes. The value associated with a random variable can be modeled using a probability
distribution. In the simplest case, which assumes discrete time, a sequence of random variables represents a random
process known as a time series. The two classes of probability distributions used to model random variables are
discrete and continuous. A discrete probability distribution deals with event distributions that occur in a countable
sample space, such as the probability of drawing a specific card from a 52-card deck. Continuous probability
distributions deal with event distributions that occur in a continuous sample space, such as a real number
measurement.
2.1.2 Bayesian Classification
The famous Bayes’ Theorem is commonly credited to the English mathematician and theologian Thomas
Bayes (1702 to 1761), although there is some evidence which implies otherwise [7]. This theorem is the basis for
much research dealing with situations that lack certainty, including Bayesian networks. Bayes’ theorem is based on
the concept of conditional probabilities (posterior probabilities):
P(B | A) = P(B ∩ A) / P(A)    Equation 1 – Posterior Probability
This reads: the probability of B given A is equal to the probability of the intersection of B and A (both B and
A being true) divided by the probability of A. In the above formula, P(B|A) is referred to as the posterior probability of
B. This idea is the basis for Bayes' Theorem, which is often written as follows:
P(B | A) = [P(A | B) * P(B)] / P(A) Equation 2 - Posterior Probability #2
In its simplest implementation, the Naïve Bayes (Simple Bayesian) classifier directly applies Bayes' theorem and
predicts that A belongs to the class having the highest posterior probability, P(B|A), conditioned on A. A key
characteristic of this algorithm is its simplicity, which is achieved by adopting class conditional independence,
treating all attribute values as independent of one another given the class.
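As a sketch of this idea, the following Java fragment (the class names and probability values are illustrative, not taken from the experiment) computes the unnormalized posterior for two hypothetical classes and predicts the one with the higher score:

```java
// Minimal Naive Bayes sketch: picks the class with the highest
// posterior P(class | attributes), treating attributes as
// conditionally independent given the class (all numbers illustrative).
public class NaiveBayesSketch {

    // P(class) times the product of P(attribute_i | class); the evidence
    // P(attributes) is a shared constant across classes, so it can be
    // omitted when only the argmax over classes is needed.
    public static double unnormalizedPosterior(double prior, double[] likelihoods) {
        double p = prior;
        for (double l : likelihoods) p *= l;
        return p;
    }

    public static void main(String[] args) {
        // Two hypothetical classes with illustrative priors and likelihoods.
        double scoreA = unnormalizedPosterior(0.6, new double[] {0.2, 0.7}); // 0.084
        double scoreB = unnormalizedPosterior(0.4, new double[] {0.5, 0.9}); // 0.180
        System.out.println(scoreA > scoreB ? "predict class A" : "predict class B");
    }
}
```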
2.1.2.1 Bayesian networks
Bayesian networks (BNs) were first introduced by Pearl [8]. A key aspect of the design is the capability of
encoding variable dependencies as a causal model. To achieve dependent variables, a finite set of random attributes
(variables) is positioned within a Directed Acyclic Graph (DAG) to form a causal topology of nodes which encodes
the dependencies in the form of a network. Each node within the topology has a conditional probability table (CPT),
which gives the probability of the node's value for each combination of direct parent variable values; together, these
tables encode the network's joint probability distribution (JPD). It is an important condition of Bayesian networks
that only the direct parents need to be considered for the CPT; in this sense, BNs satisfy the Markov condition, which
means that, given their parents, the nodes within a network are conditionally independent of their non-descendants.
Because of this condition the JPD can be factorized to the following equation:
P(X1, X2, …, Xn) = Πi P(Xi | PA(Xi))    Equation 3 – Bayesian Network JPD Factorization

In Equation 3, Π is the product symbol, X1, …, Xn are the set of variables in the DAG, and
PA(Xi) represents the parents of Xi in the network.
2.1.2.2 Network Inference
In a Bayesian Network inference is achieved by maximizing the posterior (conditional) probability of a
class for a given set of attribute values. The simplest method of performing Bayesian inference is known as the
enumeration algorithm [9], which can be greatly enhanced by applying variable elimination [10]. Put concisely, the
enumeration algorithm is a recursive algorithm that performs a summation of products of the conditional
probabilities while ensuring probabilistic normalization [11], and in this way is able to achieve a maximal posterior
probability. Figure 1, adapted from [12], visualizes a simple trained network which can be used to infer the
probabilities of outcomes based on a set of evidence. For example, what is the probability that someone goes to
college (C), studies (S), does not party (¬P), succeeds in their exams (E), and does not have fun (¬F)? The
calculation that follows can be traced by observing the highlighted areas in Figure 1.
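Since Figure 1 is not reproduced here, the following Java sketch assumes a hypothetical network structure and illustrative CPT values (C and P as root nodes, S depending on C and P, E depending on S, and F depending on P; none of these numbers come from the source) to show how such a joint probability is computed as a product of node-given-parent terms, per Equation 3:

```java
// Joint probability via Bayesian-network factorization (Equation 3).
// The network structure and every CPT value below are illustrative
// assumptions, not the actual Figure 1 network: C and P are assumed
// roots, S depends on C and P, E depends on S, and F depends on P.
public class JointProbabilityDemo {

    // Multiplies the node-given-parents factors P(Xi | PA(Xi)) together.
    public static double joint(double... factors) {
        double p = 1.0;
        for (double f : factors) p *= f;
        return p;
    }

    public static void main(String[] args) {
        double pC         = 0.4; // P(C): goes to college (assumed)
        double pNotP      = 0.5; // P(not P): does not party (assumed)
        double pS_CNotP   = 0.8; // P(S | C, not P): studies (assumed)
        double pE_S       = 0.9; // P(E | S): succeeds in exams (assumed)
        double pNotF_NotP = 0.7; // P(not F | not P): no fun (assumed)

        // P(C, S, notP, E, notF) = P(C)*P(notP)*P(S|C,notP)*P(E|S)*P(notF|notP)
        System.out.printf("P(C, S, notP, E, notF) = %.4f%n",
                joint(pC, pNotP, pS_CNotP, pE_S, pNotF_NotP));
    }
}
```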
DBScan [31] has proven popular through its many variants [32, 33] designed to mitigate various limitations. The
fundamental idea behind DBScan is the core object, which is discovered by analyzing the E-Neighborhood and
forms the backbone of the cluster's shape.
The E-Neighborhood is the set of objects within the radius (ε) of the object in question. The neighborhood
of object X is shown as the dotted surrounding circle in Figure 9. All y objects, with the exception of y6, are within the
radius ε = 0.5 and are therefore considered to be part of object X's neighborhood. To discover the set of neighbors,
the simplest form of DBScan iterates over the dataset and measures the distance between the object in question (object x)
and the object currently being iterated (object y). These distances are the numbers beside the dotted lines connecting
the two objects in the figure. A common distance function is known as Euclidean distance, given in the equation
below. When applying this function to measure the distance between data objects, it is important to note that it
sums the squared distance between each attribute, where (x1i - x2i) is the difference between the attribute values for the
two objects.
dist(x1, x2) = sqrt( Σi=1..n (x1i - x2i)² )    Equation 5 – Euclidean Distance
It is often desirable to normalize the attribute values into a standard range such as [0, 1]. A common method to
achieve normalization is known as minmax and is given in Equation 6 shown below. It is important to realize that
minmax requires the maximum and minimum values for the attribute to which v (the value) belongs.
v′ = (v - minA) / (maxA - minA)    Equation 6 – MinMax Normalization
Figure 9: E-Neighborhood
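A minimal Java sketch of both calculations (Equation 5 and Equation 6) follows; the attribute vectors and value ranges used in the example are illustrative:

```java
// Euclidean distance (Equation 5) and min-max normalization
// (Equation 6) for numeric attribute vectors; a minimal sketch.
public class DistanceUtils {

    // dist(x1, x2) = sqrt( sum over i of (x1[i] - x2[i])^2 )
    public static double euclidean(double[] x1, double[] x2) {
        double sum = 0.0;
        for (int i = 0; i < x1.length; i++) {
            double d = x1[i] - x2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // v' = (v - minA) / (maxA - minA), mapping v into [0, 1].
    public static double minMax(double v, double minA, double maxA) {
        return (v - minA) / (maxA - minA);
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.0};
        double[] b = {3.0, 4.0};
        System.out.println(euclidean(a, b));        // 5.0
        System.out.println(minMax(5.0, 0.0, 10.0)); // 0.5
    }
}
```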
An object becomes a Core Object if the E-Neighborhood contains a number of objects equal to at least
MinPts. If an object is within the E-Neighborhood of a core object then it is directly density reachable from that core
object. Furthermore, two objects are indirectly density reachable if there is a path of core objects between them.
Objects linked through this network of core-object neighborhoods are said to be density connected.
In Figure 10, MinPts = 3 and the E-Radius creates the spheres encompassing the neighborhoods of the various
points. The core points in this example are C, D and E, each of which contains 3 points. Furthermore, point B is
directly density reachable from point C because B is within the neighborhood of point C, which is a core point; point
F is directly density reachable from point E for similar reasons. B is indirectly density reachable from point E
because there is a path of core point neighborhoods between them; E, however, is not density reachable from B because B is
not a core point; F and C have a similar relationship. Points B, C, D, E and F are all considered to be density
connected and would represent a discovered cluster in DBScan. The points A and G are outliers and would be
flagged as noise.
Figure 10: Density Connectivity
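The core-point test underlying these definitions can be sketched as follows in Java; the one-dimensional points, radius, and MinPts value are illustrative, and a full DBScan implementation would additionally expand clusters through density reachability:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of DBScan's core-point test: an object is a core
// object if at least MinPts objects lie within radius eps of it.
// One-dimensional points are used here for brevity; a point is
// counted as a member of its own neighborhood.
public class CorePointCheck {

    public static List<Integer> neighborhood(double[] points, int idx, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (Math.abs(points[j] - points[idx]) <= eps) neighbors.add(j);
        }
        return neighbors;
    }

    public static boolean isCore(double[] points, int idx, double eps, int minPts) {
        return neighborhood(points, idx, eps).size() >= minPts;
    }

    public static void main(String[] args) {
        double[] pts = {0.0, 0.1, 0.2, 5.0};
        System.out.println(isCore(pts, 1, 0.15, 3)); // dense point: core
        System.out.println(isCore(pts, 3, 0.15, 3)); // isolated point: noise candidate
    }
}
```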
2.2.3 DBScan Variants
DBScan is an elegant algorithm with great potential, but it is not without its limitations. The four major
difficulties plaguing DBScan implementations in current research are computational complexity, memory
consumption, datasets with varying densities, and parameter selection. The computational complexity of DBScan
can degrade to an undeniably poor O(n²) [34], although its typical rating is O(n log n). Various studies
have addressed this issue in certain circumstances and demonstrated linear [35, 36] and near-linear [37]
performance. As dataset size increases, the memory resources required by DBScan grow proportionally due to its
algorithmic requirement to load the entire dataset into memory. Achieving a more scalable algorithm is addressed in
various branches of research [3, 38, 39]. The issue of varying densities has also been acknowledged in the literature as a
critical limitation of the DBScan algorithm. The algorithm OPTICS [40], for example, is a cluster analysis method
concerned with computing an augmented cluster ordering. This augmented data can be used to handle varying
densities. Additionally, this ordering is conducive to human analysis and/or automated parameter selection. Uncu et
al. [41] further extend the notion of OPTICS by developing an algorithm that focuses on density parameter values
for regions rather than across all data vectors.
2.2.4 Fuzzy Clustering
It has been said that applying fuzzy set theory to the clustering problem can increase the adequacy of
the results obtained [42]. There are currently two main branches of fuzzy clustering in research: the first defines
fuzzy boundaries between object classes, and the second regards the objects themselves as fuzzy representations of
their true values. This paper will focus on the former.
2.2.4.1 Fuzzy Neighborhood Relations
Traditional fuzzy clustering [43] evolved from a need for non-crisp, or fuzzy, boundaries between clusters
of objects. Traditionally each object in a dataset has a degree of membership for a set of clusters, where the
membership value is calculated by the specified MF. This is the premise behind the Fuzzy Neighborhood Relation
Function (F-NRF) [33] method which combines NRFJP (Noise robust fuzzy joint points) [44, 45] with DBScan [31]
to create FN-DBSCAN. This algorithm is further enhanced with scalability in [17] when Parker, Hall, and Kandel
combine FN-DBSCAN with SDBDC [3, 23] to produce an algorithm known as SFN-DBSCAN. In F-NRF, the
fuzzy membership is established by measuring the sum of all comparisons between an object and its neighborhood
objects, where the comparison is defined by the neighborhood relation function.
2.2.4.2 Neighborhood Relation Function
A neighborhood relation function is used to calculate what has been referred to as the neighborhood
cardinality [17] which represents the cumulative density of the objects (y) surrounding the object (x) in question.
The cardinality calculation is shown below in Equation 7, which produces a summation of the exponential
neighborhood relation function shown in Equation 8.
Card(x) = Σ y∈Nε(x) Nx(y)    Equation 7 – Cardinality Calculation

Nx(y) = exp( -( K * dist(x, y) / dmax )² )    Equation 8 – Exponential Neighborhood Relation Function
In Equation 8, dmax is the maximum distance possible between two data objects and dist(x, y) represents the actual
distance between the two objects in question. The parameter K, such that K > 0, provides additional sensitivity and, in
the case of Equation 8, has an effect on the radius [33]. Although implementations may vary, it is common to use a
distance of 1 if two nominal attributes differ and 0 if they are the same. Furthermore, if an attribute is numeric it is
often normalized using MinMax normalization.
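A Java sketch of Equations 7 and 8 follows; the distances and the parameter values (dmax, K) in the example are illustrative assumptions:

```java
// Sketch of the fuzzy neighborhood cardinality used by FN-DBScan:
// Card(x) sums, over the neighbors y of x, the exponential
// neighborhood relation function Nx(y) = exp(-(K * dist(x,y)/dmax)^2).
// All parameter values here are illustrative.
public class FuzzyCardinality {

    public static double exponentialNrf(double dist, double dMax, double k) {
        double r = k * dist / dMax;
        return Math.exp(-(r * r));
    }

    // Sums the relation value over the distances from x to each
    // object in its neighborhood; closer objects contribute more.
    public static double cardinality(double[] distances, double dMax, double k) {
        double card = 0.0;
        for (double d : distances) card += exponentialNrf(d, dMax, k);
        return card;
    }

    public static void main(String[] args) {
        double[] distances = {0.1, 0.2, 0.9}; // illustrative neighbor distances
        System.out.printf("Card(x) = %.4f%n", cardinality(distances, 1.0, 2.0));
    }
}
```

Note how a point at distance 0 contributes exactly 1, reproducing the original DBScan point count in the limit, which is why an F-NRF can also model plain DBScan's MinPts behavior.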
2.2.4.3 Fuzzy Object Representations
Another branch of the fuzzy clustering paradigm, referred to as Fuzzy Object Representations [46], views
the objects themselves as fuzzy rather than the boundaries separating the different classifications. This paradigm
embraces the imprecise nature of measurement data and allows modeling uncertainty in terms of knowledge about
the actual objects; this is achieved by using fuzzy distance functions to measure the distance between fuzzy objects.
Furthermore, the solution established in [46] is able to deal with Multi-Represented Objects as described in [47].
Multi-represented objects are objects in which queries need to be made at multiple resolutions, which can be thought
of as fuzzy queries; an example of this would be a spatial map which requires queries to be made at various
resolutions of detail.
2.2.4.4 Other fuzzy clustering methods
Fuzzy clustering first emerged in 1973 when Dunn [48] defined a fuzzy generalization for the min-variance
partitioning problem, better known as fuzzy k-means, which was further refined by Bezdek [49]. In 1985,
Keller, Gray and Givens [50] extended Cover and Hart's K-Nearest Neighbor (KNN) algorithm [51] with fuzzy set
theory. A significant limitation of early clustering methods is their poor ability to deal with non-numeric data types.
Recent research by Jiacai and Ruijun [52] describes an algorithm referred to as extended fuzzy k-means (xFKM)
which incorporates techniques used for clustering categorical data. A key advantage of xFKM is that
it does not require converting nominal attributes into binary attributes, which it achieves by using a categorical
dissimilarity measure.
2.3 Applications for uncertainty
The effects of uncertainty can be found in any application requiring the measurement of real-world
phenomena, or data entry and/or collection conducted by human agents. Global Positioning Systems (GPS) provide
position tracking of moving objects, but due to the realities of timing and physical measurement these positions will
always contain a certain degree of uncertainty [53], which extends to any measurements pertaining to moving objects [1, 2, 3, 4].
Another important aspect of uncertainty comes from human error and deception. Human error is commonly
associated with data entry tasks, in which a certain probability of error is always present. Deception
pertains to the possibility that data entered may have been falsified during the collection process.
To help accommodate uncertainty in data mining, tools need to be developed that allow crisp data sets to be
interrelated using fuzzy concepts. In the following sections the Fuzzy-Neighborhood DBScan algorithm will be
implemented for experimentation and its performance compared with the original DBScan.
2.4 Summary
This chapter introduced research surrounding several important techniques addressing uncertainty issues in
data analysis. The chapter started with an introduction to general probability theory, which then led into the
areas of Bayesian Networks and fuzzy set theory. Additionally, data clustering, with a focus on the density-based
DBScan and FN-DBScan, was discussed, followed by a section describing various areas in which uncertainty raises
issues. The following chapter proposes a system architecture that will harmonize the use of Bayesian classification,
fuzzy set theory, and the density-based DBScan clustering algorithm.
CHAPTER III
METHODOLOGY AND DESIGN
The architecture implemented for this experiment is a Weka based hybrid model which uses both clustering
and classification to establish and maintain a clustered dataset. The clustering strategies implemented are DBScan
and FN-DBScan, and the classification strategy currently uses a Bayesian Network to classify noise from the
clustering process. To verify the experimental results the architecture was designed to gather various measurements
pertaining to the accuracy and efficiency of the algorithms used. In this design the Weka API is directly accessible
and extensible by the architecture, the API being added as an external reference library to the Eclipse workspace. To
extend Weka's original DBScan implementation, a new version, DBScanV2, was implemented as a slightly
modified variant.
3.1 The Experiment Tools
This experiment utilizes the Weka API in a Java simulation developed with the Eclipse IDE and was run
on an Intel i7 quad-core machine with 12 GB of memory. For data analysis of the parameter permutations, the
heap size allocated for the Weka explorer was also increased to 8192 MB. Further analysis was performed using 32-
bit MS Excel, which required ensuring the permutation result dataset contained fewer than 1,048,576 entries.
3.2 Measurements & Parameters
The product of executing the broad exploration portion of the code is an output CSV file containing various
measures for each permutation explored. For each entry there will be a set of permutation parameters, accuracy
measures, and timing measures. The permutation parameters are used to initialize the hybrid algorithm, which
establishes a test case from which measurements are taken. The measurements, for each test case, relate to the
performance of the clustering algorithm (DBScan/FN-DBScan) and the effectiveness of the Bayesian Network.
3.2.1 Permutation Parameters
Permutation parameters represent the input parameters used for the result record's test case, and will be
further discussed in the permutation engine section. Table 1, shown below, describes the various parameters which
are used in this experiment.
Table 1: Permutation Parameters
Column    Algorithm            Description
MinPts    DBScan               Determines the number of required points within a given neighborhood to signify a core point.
MinCard   FN-DBScan            Determines the minimum cardinality within a given neighborhood to signify a core point. The concept of cardinality represents a measure that encompasses both density and magnitude.
K         FN-DBScan            A constant value used with the neighborhood relation functions.
e         DBScan / FN-DBScan   The radius measure used to determine which objects fall within an object's neighborhood.
Alpha     Bayes'               Used in the calculation of conditional probabilities.
P         Bayes'               The maximum number of parents used in the search algorithm.
N         Bayes'               If true (1), sets the initial structure to empty instead of naïve Bayes.
MBC       Bayes'               If true (1), a Markov Blanket correction is applied to the learned network structure.
R         Bayes'               If true (1), the arc reversal operation is used.
S         Bayes'               The score type such that
3.2.2 Accuracy Measures

The primary measure for the accuracy of a permutation is based on the percentage of errors in the testing
set after noise classification has occurred (PostNC), but observing the error in the training set is also useful. For both
sets (testing and training) there is a Pre Noise Classification (PreNC) and a Post Noise Classification (PostNC) error
value. The noise classification process pertains to using the Bayesian classifier to classify the noise generated by the
clustering process. In Table 2, shown below, the various accuracy measures are summarized.
Table 2: Accuracy Measures
Column               Description
ClusterCount         The average number of clusters produced during the clustering process.
ClusterNoisePER      The average percentage of noise produced during the clustering process.
IsBayesRun           The average number of times the Bayesian classifier was run; it won't run if the noise level is equal to or greater than 80%. A value of 1.0 indicates Bayes is always run and 0.0 indicates it is never run.
ErrTestPreNCPER      The percentage of incorrectly classified objects in the testing set during the Pre-Noise Classification (PreNC) step.
ErrTestPostNCPER     The percentage of incorrectly classified objects in the testing set during the Post-Noise Classification (PostNC) step.
ErrTrainPreNCPER     The percentage of incorrectly classified objects in the training set during the Pre-Noise Classification (PreNC) step.
ErrTrainPostNCPER    The percentage of incorrectly classified objects in the training set during the Post-Noise Classification (PostNC) step.
The measure of error for the testing set is the percentage of unsuccessfully classified test objects using the
input Bayesian classifier, which would be either the PreNC classifier or the PostNC classifier. Successful classification is
determined by comparing the test object's actual class with the maximal class found within the cluster. Unlike test
cases, measuring error in the training set does not directly use the classifier, and the PreNC error is unaffected by
classification, which makes it a direct measure of only the clustering process. Just like test cases, the rate of error
in training cases is measured by comparing the object's actual class with the cluster's maximal class; training
objects differ, however, in that any objects flagged as noise by the clusterer are automatically considered errors.
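The maximal-class error measure described above can be sketched as follows; the data layout (parallel arrays of class labels and cluster assignments) is a simplifying assumption and not the experiment's actual data structures:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the cluster-purity error measure: an object counts as an
// error if its actual class differs from the most frequent (maximal)
// class of its assigned cluster.
public class ClusterErrorRate {

    // Most frequent class label among objects assigned to the given cluster.
    public static String maximalClass(String[] classes, int[] clusterOf, int cluster) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < classes.length; i++) {
            if (clusterOf[i] == cluster) counts.merge(classes[i], 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }

    // Percentage of objects whose class differs from their cluster's maximal class.
    public static double errorPercent(String[] classes, int[] clusterOf) {
        int errors = 0;
        for (int i = 0; i < classes.length; i++) {
            if (!classes[i].equals(maximalClass(classes, clusterOf, clusterOf[i]))) errors++;
        }
        return 100.0 * errors / classes.length;
    }

    public static void main(String[] args) {
        String[] classes = {"a", "a", "b", "b"};
        int[] clusterOf  = {0, 0, 0, 1};
        System.out.println(errorPercent(classes, clusterOf)); // 25.0
    }
}
```

For training-set error, the text above additionally counts every noise-flagged object as an error, which this sketch omits.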
3.2.3 Timing Measures
Time is a critical factor in the performance of real-world applications. Table 3 below describes the various
values gathered by this experiment pertaining to time.
Table 3: Timing Measures
Column                  Description
TimeEntirePermutation   The time taken to run through the entire K-Fold process for the given permutation. 10-Fold Cross Validation, for example, would run the process once for each fold, accumulating values to produce averages.
TimeDBScanALL           The time taken to perform the entire clustering process, including the extra steps required to make the cluster number readily available for the classification phase.
TimeDBScanBuild         The time taken to perform the actual clustering process when using the specified DBScan variant.
TimeBayesAll            The time taken to perform the entire Bayesian process.
TimeBayesTrainPreNC     The time taken to remove the noise and train the initial classifier.
TimeBayesTrainPostNC    The time taken to classify the noise and retrain the Bayesian classifier.
TimeBayesClassify       The average time taken to classify an object during PreNC and PostNC classification.

3.2.4 Noise and Noise Classification Measures
A measure of noise, ClusterNoisePER, is given as a percentage which represents the amount of singleton
(unclustered) data objects. Furthermore, both ErrTrainPostNC and ErrTestPostNC represent accuracy measures
after Bayesian noise classification has occurred. These values are used to help determine if there is any added
benefit in adding a noise classification phase after clustering occurs.
3.3 The System Design Overview
This system was designed to execute a set of test cases, also referred to as parameter permutations, in such
a way that a set of empirical results is generated for further analysis. The design is composed of three core packages
shown in Figure 11: the permutation engine, the data engine, and the AI engine. The purpose of the permutation
engine is to encapsulate the mechanisms used to iterate through a set of possible parameter permutations while
gathering test case measurements to be used for further analysis. Next, the data engine provides a consolidation of
common tasks associated with dataset and data object management. And finally, the AI engine provides wrapper
classes for DBScan/FN-DBScan and Bayesian Networks which implement the hybrid logic and measurement
tracking required to conduct the experiment. In Figure 11, the AppMain and App classes are used to provide the
logic pertaining to this specific experiment. Each of these packages will be discussed further in subsequent sections.
Figure 11: System Overview
A key feature of the design is the ability to repeat any number of K-fold cross validation iterations to
generate an average for each parameter permutation it attempts. For the test cases used in this experiment the
permutations were explored in two ways. The first method provided a broad exploration by iterating the permutation
space defined by an input CSV file (see Table 4 in section 3.4.1.1), and the second method allowed specific
permutations to be set and attempted in rapid succession to obtain a more accurate average. By conducting the broad
exploration first, potential parameter combinations yielding desirable outcomes can be isolated for further
exploration. Once selected, each candidate becomes the focus of the narrow but exhaustive second method to
establish an average set of representative measurements.
There are three elements of flow in the application shown in Figure 12: altering the training set, building
the clusters and classifiers, and performing tests using the testing set. Training set alterations occur at different
stages to prepare the data for the next step. These alterations include preparing a classless dataset for clustering,
updating the classless dataset with a cluster number, and removing noise for classifier training. Next, building the
clusters and classifiers is the central task of this experiment and starts with using the training data to establish
naturally occurring classifications of objects using DBScan or FN-DBScan. Following this, all noisy objects are
removed from the dataset and the initial Bayesian Network is trained using the Cluster Number as the class. By
applying the initial Bayesian classifier to the noise, additional classifications are discovered, and a final Bayesian
Network is trained. Finally, accuracy is measured by comparing the maximal class name found in a cluster
with the expected class value for a data object; this method is applied to both the training data and the testing data.
The testing data, however, gives a more realistic measurement of how well the trained classifier will perform when
given a set of unknown data.
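The scoring rule above, where an object counts as correct when the maximal (majority) class of its cluster matches the object's expected class, can be sketched as follows; the names and the use of -1 for noise are illustrative assumptions, not the system's actual identifiers:

```python
from collections import Counter

def cluster_accuracy(cluster_ids, true_classes):
    """Map each cluster to its majority (maximal) class, then score each
    object against that mapping; noise objects (id -1) count as wrong."""
    majority = {}
    for cid in set(cluster_ids) - {-1}:
        members = [c for i, c in zip(cluster_ids, true_classes) if i == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    hits = sum(1 for i, c in zip(cluster_ids, true_classes)
               if majority.get(i) == c)
    return hits / len(true_classes)

print(cluster_accuracy([1, 1, 1, 2, 2, -1],
                       ["a", "a", "b", "b", "b", "a"]))  # 4 of 6 correct
```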
Figure 12: System Data Flow
3.4 The System Design Components
3.4.1 The Parameter Permutation Engine
Input parameters are a critical component of many clustering and classification solutions, and it is often
difficult to quickly establish which parameters will generate an optimal solution. To help with this process a
parameter permutation engine was designed which is capable of iterating through all possible permutations
following the rules defined in an input CSV file (see Table 4). The rules simply describe the algorithm and its
parameters in terms of testable ranges and the size of each iterative step. The permutation engine is also responsible
for storing the various measures gathered during each iteration. Each measure is stored in a PermResultObject which
maintains the average and min/max aggregates until the object is cleared. Figure 13, shown below, describes the
general structure of this package.
Figure 13: Permutation Engine
To understand the time complexity of this engine, it is important to calculate the number of possible values
for each input parameter and then multiply the totals together. So, if there were n variables (V1…Vn), where the
cardinality |Vi| represents the number of possible values for variable Vi, the number of permutations P would be:

P = |V1| × |V2| × … × |Vn|

So, if there were 25,000 permutations, each of which took 100ms to execute, the engine would require 2,500
seconds, or about 42 minutes, to run. In terms of both time complexity and storage this is far from perfect, and future
work may focus on applying evolutionary algorithms to guide test cases.
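As a sketch of the calculation above (the function names are ours, for illustration only):

```python
from math import prod

def permutation_count(cardinalities):
    """P = |V1| * |V2| * ... * |Vn| for n input parameters."""
    return prod(cardinalities)

def sweep_seconds(n_permutations, ms_per_permutation):
    """Rough wall-clock estimate for an exhaustive sweep."""
    return n_permutations * ms_per_permutation / 1000.0

print(sweep_seconds(25_000, 100))  # 2500.0 seconds, about 42 minutes
```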
3.4.1.1 Parameter permutations
It is often very difficult to select the input parameters required by various data mining algorithms. In order
to provide a thorough parameter result set for analysis, the testing methodology iterates through many input
parameter permutations for the dataset under study. For each of these input parameter permutations, 10-fold cross
validation is used to derive a set of averages to be recorded. The input parameter permutations for the test cases
discussed shortly were generated from a user-defined CSV file containing the following:
Table 4: Parameter Permutation Input Rules
Algorithm      Variant     Parameter  Value Range  Step Size  Step Count
DBScan         Original    MinPts     2-8          1          9
DBScan         Original    e          0.05-1       0.05       20
DBScan         FN-DBScan   MinCard    5-50         5          10
DBScan         FN-DBScan   e          0.05-1       0.05       20
DBScan         FN-DBScan   K          5-20         5          4
Bayesian Net   HC/SE       Alpha      0.1-1        0.1        10
Bayesian Net   HC/SE       P          1-2          1          2
Bayesian Net   HC/SE *     N          0-1          1          2
Bayesian Net   HC/SE *     MBC        0-1          1          2
Bayesian Net   HC/SE *     R          0-1          1          2
Bayesian Net   HC/SE **    S          0-5          1          6
* Parameters with a 0-1 value range are booleans
** S: {0=Bayes, 1=MDL, 2=Entropy, 3=AIC, 4=Cross_Classic, 5=Cross_Bayes}
In Table 4 the Algorithm and Variant columns uniquely identify the AI algorithm which utilizes the input
parameters. The value range to test for each algorithm variant's parameter is specified by the Parameter, Value
Range, and Step Size columns. Step Size represents the incremental value with which to iterate over the parameter's
value range. The Step Count column was added for reference; it represents the number of steps required to cover the
value range and can be used to calculate the expected number of permutations. In terms of permutations, running
the original DBScan with an HC/SE Bayesian Network requires 172,800 permutations, and the FN-DBScan variant
requires 768,000.
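These totals follow directly from the Step Count column of Table 4; the variable names below are illustrative groupings of those step counts:

```python
from math import prod

# Per-parameter step counts taken from the Step Count column of Table 4.
dbscan_original = [9, 20]              # MinPts, e
fn_dbscan       = [10, 20, 4]          # MinCard, e, K
bayes_hcse      = [10, 2, 2, 2, 2, 6]  # Alpha, P, N, MBC, R, S

print(prod(dbscan_original) * prod(bayes_hcse))  # 172800
print(prod(fn_dbscan) * prod(bayes_hcse))        # 768000
```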
The Data Engine package was designed to help manage the data flow during the execution of the
experiment. The purpose of DataSet is to encapsulate various operations such as removing attributes and preparing
the training and testing sets using the cross-validation process. The key purpose of DataObject is to allow
additional information for an Instance to be collected; in the case of this experiment a way to easily access the
cluster number was necessary. DataSetFold provides a consistent lookup for each fold number's train/test
partition, which ensures the same arrangement of instances for each fold as the permutation set is iterated.
The CSVFileReader reads in the CSV file defining the algorithms and their respective parameter
permutations to test. Figure 14 below overviews the structure of this package.
Figure 14: Data Engine
3.4.2.1 Cross Validation
K-Fold cross validation is a process that randomly partitions the dataset into training/testing partitions k
times. The purpose of cross-validation is to help guard against training the algorithms under study to fit the only
available data, which is commonly known as over-fitting. Cross Validation is achieved by randomly selecting a test
subset from the complete data set to simulate the possibility of future data. For-each fold, various statistics
34
surrounding performance and accuracy of the algorithms under study are gathered, which upon completion of all
iterations are averaged to obtain representatives values.
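A minimal sketch of this partitioning, assuming a simple shuffled modulo split rather than the system's actual implementation:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k disjoint folds; for
    iteration i, fold i is the test set and the rest form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(150, 10)          # e.g. the 150-instance IRIS dataset
test_set = set(folds[0])
train_set = [i for i in range(150) if i not in test_set]
```

Seeding the shuffle gives the consistent fold lookup the text attributes to DataSetFold: the same train/test arrangement recurs as the permutation set is iterated.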
3.4.2.2 Attribute Maintenance
When training/building a classifier or clusterer for experimentation it is important to remove the class label,
unique id, and any redundant/irrelevant attributes. Currently this experiment manually selects the attributes to
remove, however future work will automate the selection using an attribute selection measure such as Information
Gain [54, 55]. Furthermore, this experiment requires the dataset to be altered slightly for various tasks. During the
DBScan phase the training data is classless, but upon completion the ClusterNumber attribute is added as a nominal
attribute. It is important that ClusterNumber is set as the nominal class prior to running Bayes’, which does not
support continuous valued attributes.
The AI Engine, shown in Figure 15, encapsulates the various requirements of the data mining techniques
under study and leverages several core Weka AI algorithms.
Figure 15: AI Engine
The BayesNetWrap class serves to encapsulate the experiment's Bayesian-based process. The process
consists of training a Bayesian classifier with the noiseless training set, then using this trained classifier to classify
the noise, and finally re-training the Bayesian classifier with the full training set. This process produces accuracy
results based on Pre-Noise Classification testing and Post-Noise Classification testing.
The DBScanWrap component encapsulates both FN-DBScan and DBScanV2 for the experiment's testing
purposes and utilizes the ClusterSet to store information about ClusterObjects. In terms of efficiency the ClusterSet
should be an integral part of DBScan, but it is currently generated after the buildCluster process has been run, thus
adding an additional O(n) term to the computational complexity. Future work should continue the re-design
of DBScanV2, which is an alteration of the original Weka version; the alterations have currently been limited to
simply allowing better sub-classing for FN-DBScan. For similar reasons, SequentialDatabaseV2 re-implemented
SequentialDatabase to allow better sub-classing for FNSequentialDatabase. The algorithms for DBScan and FN-
DBScan are contrasted in Table 5. Yellow highlighting has been used to mark the areas in which the algorithms
differ.
Table 5: DBScan vs. FN-DBScan
Original DBScan
[Start]
1. Set parameters minpts/mincard and e-radius
2. Set all data objects as unclassified, then [Process Data Set]
[Process Data Set]
3. For each data object X in the data set
   If X is unclassified then [Expand Cluster(C, X)], where C is the current cluster
[Expand Cluster(C, X)]
5.1. Add X to cluster C and remove X from seedSetX
5.2. While seedSetX.size > 0
  5.2.1. Get next object Y from seedSetX and remove Y from seedSetX
  5.2.2. If Y is unclassified OR noise
    5.2.2.1. seedSetY = [Get Seed Set(Y)]
    5.2.2.2. If [Is Core Point(seedSetY)] then add seedSetY to seedSetX
[Get Seed Set(X)]
- Get all objects within the E-Neighbourhood of X
[Is Core Point(X, seedSet)]
- If seedSet.size >= minpoints then return true

FN-DBScan
[Start]
1. Set parameters minpts/mincard, e-radius, k, and f
2. Set all data objects as unclassified, then [Process Data Set]
[Process Data Set]
3. For each data object X in the data set
   If X is unclassified then [Expand Cluster(C, X)], where C is the current cluster
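As an illustrative companion to the pseudocode in Table 5, the following is a minimal DBScan sketch assuming a plain Euclidean e-neighbourhood; all names (eps, min_pts, region_query) are ours, not identifiers from the thesis implementation:

```python
from math import dist

def region_query(data, x, eps):
    """[Get Seed Set(X)]: all points within the e-neighbourhood of X."""
    return [i for i, p in enumerate(data) if dist(data[x], p) <= eps]

def dbscan(data, eps, min_pts):
    NOISE, UNCLASSIFIED = -1, None
    labels = [UNCLASSIFIED] * len(data)
    cluster = 0
    for x in range(len(data)):
        if labels[x] is not UNCLASSIFIED:
            continue
        seeds = region_query(data, x, eps)
        if len(seeds) < min_pts:          # [Is Core Point] fails for X
            labels[x] = NOISE
            continue
        cluster += 1
        for s in seeds:                   # start [Expand Cluster(C, X)]
            labels[s] = cluster
        while seeds:                      # grow while core points remain
            y = seeds.pop()
            seeds_y = region_query(data, y, eps)
            if len(seeds_y) >= min_pts:   # Y is a core point
                for z in seeds_y:
                    if labels[z] in (UNCLASSIFIED, NOISE):
                        if labels[z] is UNCLASSIFIED:
                            seeds.append(z)
                        labels[z] = cluster
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10)], eps=1.5, min_pts=3)
print(labels)  # [1, 1, 1, -1]
```

The FN-DBScan variant differs chiefly in the core point test, replacing the raw seed-set count with a fuzzy-neighbourhood cardinality.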
Both DBScan and FN-DBScan perform at a similar time complexity level, as shown below in
Table 10. As expected, however, FN-DBScan requires slightly more time due to its more complicated core point
calculations. In the table below the "Entire Process" column represents the average time to perform the experiment
for a single permutation. "DBScan Build" is the portion of the experiment dealing only with building the clusters,
whereas "DBScan All" consists of not only building the clusters but also the various bookkeeping activities
required to ensure the cluster numbers are available to the noise classification phase. It is interesting that although a
time increase occurs with FN-DBScan in the clustering phase, the Bayesian classification phase is slightly faster on
average; in real-time applications where ad-hoc classification occurs very frequently this could provide a
significant performance increase. Furthermore, it is apparent that utilizing Bayes' classification for ad-hoc
classifications would yield significant performance increases over re-running the clustering process.
Table 10: Time Analysis
            Entire Process  DBScan All  DBScan Build  Bayes All  BayesTrain PreNC  BayesTrain PostNC  Bayes Classify
DBScan      55.7            4.21        4.06          1.26       0.61              0.63               0.0008
FN-DBScan   56.82           4.76        4.58          1.18       0.56              0.6                0.0007
4.5 Summary
This experiment demonstrated several important performance gains. First, it is shown that FN-DBScan
produces a higher level of accuracy than the original DBScan. Next, performing noise classification in cases with a
moderate amount of noise can potentially increase accuracy. Finally, using a Bayesian Network for ad-hoc
classification (into clusters) is significantly faster than re-generating the clusters.
CHAPTER V
RECOMMENDATIONS AND CONCLUSIONS
5.1 Suggestions for Further Research
To help further verify the results of this experiment, it should be expanded beyond the IRIS dataset. Future
work should continue to explore the advantages found by combining a clustering algorithm with noise
classification on other datasets. The current system architecture should allow for the addition of new datasets but
will require some slight modifications to handle missing data.
Broadening this experiment also requires additional consideration in terms of time complexity. The current
permutation engine is exhaustive but requires a significant amount of time to run. Therefore, more intelligence in the
parameter permutation engine would allow better coverage without extreme computation time. A genetic
algorithm with random mutations could be well suited to this task. A guiding principle of evolutionary algorithms is
the fitness test, whose purpose is to measure the success (or health) of a given population. This test
could be governed by fuzzy parameters such as "moderate noise" and "slightly low cluster count" to help direct the
generation of test cases.
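One possible shape for such a search is sketched below; this is a toy illustration rather than the proposed system, and the fitness function is left as an arbitrary callable (higher is better):

```python
import random

def evolve(param_ranges, fitness, generations=20, pop_size=10, seed=0):
    """Toy genetic search over parameter vectors: keep the fittest half,
    breed children by crossover, and apply occasional random mutation."""
    rng = random.Random(seed)
    sample = lambda: [rng.choice(r) for r in param_ranges]
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # crossover
            if rng.random() < 0.2:                            # random mutation
                i = rng.randrange(len(child))
                child[i] = rng.choice(param_ranges[i])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve([list(range(10)), list(range(10))], fitness=sum)
```

In the setting described above, fitness would score a candidate parameter permutation against targets such as a desired noise level or cluster count rather than a simple sum.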
Once time complexity is addressed and some measure of fitness is in place, it would be interesting to
explore additional clustering/classification mechanisms, including the many Bayesian Network variants, Artificial
Neural Networks, and other fuzzy clustering methods. It would also be of notable interest to explore which
clustering methods and datasets yield noise patterns that produce higher rates of classification success; in the
previous analysis section it was noted that optimal classification success occurred when the noise level was between
11% and 20%.
Another point of importance is furthering Weka as a research platform. Weka is a very useful tool;
however, as with any tool there is always room for improvement. The current implementation of DBScan is limited
in its ability to serve as an effective super class, which led to the development of DBScanV2 and SequentialDatabaseV2.
Future work on the Weka implementations of DBScanV2 and SequentialDatabaseV2 should focus on better
maintaining a lookup of data objects with their respective assigned cluster numbers and on allowing a
more effective overloading of the core point test. Additionally, in order to properly link a new implementation of
Weka's DBScan to existing research it will be important to ensure comparison testing is performed between the
original and the newly proposed implementations.
With an effective base class for DBScan in place it will become substantially easier to experiment with
variants, particularly those which simply mutate the core point test, which is ultimately the crux of neighborhoods
and the formation of a cluster's shape. The core point test is commonly based on the idea of a neighborhood measure,
which could be a simple point count (original DBScan) or a more complicated measure such as density or
magnitude based on a neighborhood relation function (FN-DBScan). Future work should continue to explore ways
to mutate the core point test.
5.2 Conclusions
The results of this paper have empirically shown that, by using FN-DBScan, a Bayesian Network can be
trained and verified using 10-fold cross validation on the IRIS dataset to achieve a 30.4% classification accuracy
rate on the testing data and 28.3% on the training data. The classic DBScan algorithm can only achieve 33.33%
accuracy on both the testing and training data. The accuracy increase achievable through the use of FN-DBScan is
due to the generated noise, which can then be classified using a Bayes' Net; noise generated by the original DBScan
does not offer any increase beyond the standard 33.33% accuracy. Although further analysis is required on
additional datasets, these results imply that when verified with a Bayesian classifier, FN-DBScan may be more
effective and accurate when clustering similar data objects and isolating noisy data, while requiring only a minimal
increase in computational time. Furthermore, leveraging a trained Bayesian Network for ad-hoc classification into
previously defined clusters offers significant performance increases over re-running DBScan. Overall, a hybrid FN-
DBScan and Bayesian Network solution has the potential to offer increased accuracy and efficiency when dealing
with real-world data analysis situations.
REFERENCES
[1] Y. Li, J. Han, and J. Yang, “Clustering Moving Objects,” KDD, pp.617-622, 2004.
[2] R. Cheng, D.V. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” SIGMOD, pp.551-562, 2003.
[3] E. Januzaj, H. P. Kriegel and M. Pfeifle, “Scalable Density-Based Distributed Clustering,” The 15th European Conference on Machine Learning (ECML) and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Pisa, Italy, September 2004.
[4] H.P. Kriegel, P. Kunath, M. Pfeifle, and M. Renz, “Approximated Clustering of Distributed High Dimensional Data,” PAKDD, pp. 432-441. 2005.
[5] A. Kolmogorov, Foundations of the Theory of Probability, 2nd ed., New York: Chelsea, 1956.
[6] G. Shafer and V. Vovk, "The origins and legacy of Kolmogorov's Grundbegriffe," [Online]. Available: http://www.probabilityandfinance.com/articles/04.pdf. [Accessed 02 December 2012].
[7] K. Murphy, "A brief introduction to Bayes' Rule," [Online]. Available: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html. [Accessed 5 December 2011].
[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, San Francisco, CA: Morgan Kaufmann, 1988.
[9] N. L. Zhang and D. Poole, “A simple approach to Bayesian network computations,” Proc. 10th Canadian Conference on Artificial Intelligence, pp. 171-178, 1994.
[10] R. D. Shachter, B. D’Ambrosio, and B.A. Del Favero, “Symbolic probabilistic inference in belief networks,” AAAI, pp. 126-131, 1990.
[11] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd Ed, New Jersey: Prentice Hall, 2010.
[12] B. Coppin, Artificial Intelligence Illuminated, Sudbury: Jones and Bartlett Publishers, 2004.
[13] S. Russell, J. Binder, D. Koller, and K. Kanazawa, “Local learning in probabilistic networks with hidden variables,” Proc. 1995 Joint Int. Conf. Artificial Intelligence (IJCAI), pages 1146–1152, 1995.
[14] L.A. Zadeh, Fuzzy sets, Information and Control, 8:338-353, 1965.
[15] A. Kandel and W. Byatt, “Fuzzy Sets, Fuzzy Algebra and Fuzzy Statistics”, Proceedings of the IEEE, vol 66, no 12, pp. 1619-1639, 1978.
[16] J.-S.R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, New Jersey: Prentice Hall, 1996.
[17] J.K. Parker, L.O. Hall, and A. Kandel, "Scalable fuzzy neighborhood DBSCAN," IEEE International Conference on Fuzzy Systems (FUZZ), pp.1-8, 18-23 July 2010.
[18] A.K. Jain and R.C. Dubes, Algorithm for Clustering Data, New Jersey: Prentice Hall, 1998.
[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Diego: Acad. Press, 2001.
[20] W.D. Fisher, "On grouping for maximum homogeneity," J. Amer. Statist. Assoc., Vol. 53, pp. 789-798, 1958.
[21] A. K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 265-323, 1999.
[22] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[23] E. Januzaj, H. P. Kriegel and M. Pfeifle, “DBDC: Density Based Distributed Clustering,” Proc. 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, pp. 88-105, 2004.
[24] A. K. Jain, Algorithms for Clustering Data, New Jersey: Prentice Hall, 1988.
[25] H. Samet, The Design and Analysis of Spatial Data Structures, Reading, MA: Addison-Wesley, 1990.
[26] R.H. Gueting, “An Introduction to Spatial Database Systems,” The VLDB Journal 3(4), pp. 357-399, 1994.
[27] T. Abraham and J.F. Roddick, “Survey of spatio-temporal databases,” GeoInformatica 3(1), pp. 61–99, 1999.
[28] J. Han, M. Kamber, and A.K.H. Tung, “Spatial clustering methods in data mining: a survey,” Geographic Data Mining and Knowledge Discovery, London: Taylor & Francis, 2001.
[29] D. Birant and A. Kut, “ST-DBSCAN: an algorithm for clustering spatial–temporal data,” Data & Knowledge Engineering 60, pp. 208–221, 2007.
[30] M. L. Yiu, N. Mamoulis, “Clustering Objects On a Spatial Network,” SIGMOD, pp.443-454, 2004.
[31] M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[32] T. Ali, S. Asghar, N.A. Sajid, "Critical analysis of DBSCAN variations," International Conference on Information and Emerging Technologies (ICIET), pp.1-6, 14-16 June 2010.
[33] E.N. Nasibov and G. Ulutagay, “Robustness of density-based clustering methods with various neighborhood relations,” Fuzzy Sets and Systems, Volume 160, Issue 24, pp. 3601-3615, 16 December 2009.
[34] P. Viswanath, R. Pinkesh, "l-DBSCAN: A Fast Hybrid Density Based Clustering Method," 18th International Conference on Pattern Recognition (ICPR), pp.912-915, 2006.
[35] Y. Wu, J. Guo, X. Zhang, "A Linear DBSCAN Algorithm Based on LSH," International Conference on Machine Learning and Cybernetics, vol.5, pp.2608-2614, 19-22 August 2007.
[36] B. Liu, "A Fast Density-Based Clustering Algorithm for Large Databases," International Conference on Machine Learning and Cybernetics, pp. 996-1000, 13-16 August 2006.
[37] S. Jiang, X. Li, "A Hybrid Clustering Algorithm," Sixth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp.366-370, 14-16 Aug. 2009.
[38] Y. El-Sonbaty, M.A. Ismail, M. Farouk, "An efficient density based clustering algorithm for large databases," 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 673- 677, 15-17 Nov. 2004.
[39] R.T. Ng and J. Han, ”Efficient and Effective Clustering Methods for spatial data mining,” Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp.144-155, 1994.
[40] M. Ankerst, M.M. Breunig, H.P. Kriegel, and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, PA, pp. 49-60, 1999.
[41] O. Uncu, W.A. Gruver, D.B. Kotak, D. Sabaz, Z. Alibhai, C. Ng, "GRIDBSCAN: GRId Density-Based Spatial Clustering of Applications with Noise," IEEE International Conference on Systems, Man and Cybernetics (SMC), vol.4, pp.2976-2981, 8-11 October 2006.
[42] D. Dumitrescu, B. Lazzerini, and L. C. Jain, Fuzzy Sets and Their Application to Clustering and Training, New York: CRC Press LLC, 2000.
[43] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, England: John Wiley & Sons, 1999.
[44] E.N. Nasibov, “A robust algorithm for fuzzy clustering problem on the base of fuzzy joint points method,” Cybernetics and Systems Analysis 44 (1), 2008.
[45] E. N. Nasibov, G. Ulutagay, “A new unsupervised approach for fuzzy clustering,” Fuzzy Sets and Systems, Volume 158, Issue 19, pp.2118-2133, October 2007.
[46] H.P. Kriegel and M. Pfeifle, “Density-based clustering of uncertain data,” In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD), pp.672-677. 2005.
[47] H.-P Kriegel, K. Kailing, A. Pryakin, M. Schubert, “Clustering Multi-Represented Objects with Noise,” PAKDD, pp.394-403, 2004.
[48] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, 3:32–57, 1973.
[49] J.C. Bezdek, "A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-2, no. 1, pp. 1-8, January 1980.
[50] J.M. Keller, M.R. Gray, J.A. Givens Jr., “A fuzzy K-nearest neighbor algorithm,” IEEE Trans., Syst., Man, and Cybern., Vol.15, No.4, pp. 580–585, 1985.
[51] T.M. Cover and P.E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan 1967.
[52] W. Jiacai, G. Ruijun, "An Extended Fuzzy k-Means Algorithm for Clustering Categorical Valued Data," International Conference on Artificial Intelligence and Computational Intelligence (AICI), vol.2, pp.504-507, 23-24 Oct. 2010.
[53] A. Tepwankul and S. Maneewongwattana, "U-DBSCAN : A density-based clustering algorithm for uncertain objects," IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp.136-143, 1-6 March 2010.
[54] J. R. Quinlan, “Induction of decision trees”, Machine Learning, 1:81-106, 1986.
[55] C.E. Shannon and W. Weaver, The mathematical theory of communication, Illinois: University of Illinois Press, 1949.
[56] Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, 11, pp. 341-356, 1982.
[57] J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor: The University of Michigan Press, 1975.
[58] R. A. Fisher, "IRIS Data Set," UCI Machine Learning Repository, [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Iris. [Accessed 5 December 2011].