MULTI-LEVEL RELATIONSHIP OUTLIER DETECTION

by

Qiang Jiang
B.Eng., East China Normal University, 2010

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

in the School of Computing Science
Faculty of Applied Sciences

© Qiang Jiang 2012
SIMON FRASER UNIVERSITY
Summer 2012

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for "Fair Dealing." Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.
Input : input: The relationships to detect.
Output: outlierRec: Stores the detected relationship outliers.
        outlierType: The outlier type of the relationships.
 1 Relationship-outlier-detection(input)
 2     CubingAlgorithm ← {TDC, BUC, eBUC};
 3     for each cubing algorithm in CubingAlgorithm do
 4         if isOutlier then
 5             outlierRec.add();
 6         if isAggregateOutlier then
 7             compute KL-divergence;
 8             write outlierType;
 9         end
10     end
11 end
4.2 The Top-down Cubing (TDC) Approach
4.2.1 Top-down Cubing
The top-down cubing (TDC) [45] algorithm starts from the base level groups, i.e., the groups that do not have any descendant, and proceeds to the more general levels.
Top-down cubing, represented by the Multiway Array Cube algorithm, aggregates on multiple dimensions simultaneously, using a multidimensional array as its basic data structure. The cubing process of TDC on a 4-D data cube is presented in Figure 4.1. We present how to detect outliers using TDC in Section 4.2.2.
4.2.2 Outlier Detection Using TDC
By Definition 5, given a fact table F and a threshold parameter l, a relationship t is an outlier if |t.F_{k+1} − m̄| > lδ, where m̄ is the average and δ is the standard deviation of the sample relationships. We use two cubing methods to process the cube lattice, in top-down and bottom-up manners respectively. In our example, we use the average charge amount as the aggregate measurement. The measure of an aggregate relationship t is

t.F_{k+1} = aggr({ s.F_{k+1} | s ∈ F, s is a descendant of t }),

where the aggregate of a set of (average, count) pairs is

aggr({ (avg_i, count_i) }) = ( (∑_i avg_i · count_i) / (∑_i count_i), ∑_i count_i ).
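The pair form of aggr lets an ancestor's average be maintained incrementally from its children's (average, count) pairs, without revisiting the base tuples. Below is a minimal sketch of this combination step in Python; the function name combine_pairs is ours, not from the thesis.

    def combine_pairs(pairs):
        """Combine the (avg, count) pairs of child groups into the
        (avg, count) pair of their common ancestor group."""
        total = sum(count for _, count in pairs)                 # sum_i count_i
        weighted = sum(avg * count for avg, count in pairs)      # sum_i avg_i * count_i
        return (weighted / total, total)                         # groups are non-empty

    # Two children with averages 4.0 (3 tuples) and 10.0 (1 tuple) combine to
    # the ancestor pair (5.5, 4), the same as averaging the 4 tuples directly.
    print(combine_pairs([(4.0, 3), (10.0, 1)]))  # (5.5, 4)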
Figure 4.1: TDC cubing
Details of the TDC algorithm are provided in Algorithm 2. In the beginning, dimensionRec contains all dimensions; for example, if the relational input table has 4 dimensions A, B, C and D, then dimensionRec = {A, B, C, D}. We then scan each combination of the dimensions in dimensionRec and compute its average value. If the average value falls in the defined outlier range, we add the relationship to outlierRec. If the relationship is a base level group, we add it to the baseGroup list and set the Boolean flag isOutlier, which indicates whether this base level relationship is an outlier. If the outlier relationship is an aggregate group, we add it to the aggregateOutlier list. For the aggregate groups, we further compute the KL-divergence value using the function computeKL-divergence(); the concept of KL-divergence and the details of this function are presented in Section 4.5. Then we set the current dimension in dimensionRec to ALL (for example, dimensionRec = {∗, B, C, D}) and recursively call the algorithm with the start dimension incremented by 1.
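To make the recursion concrete, here is a self-contained Python sketch of the TDC traversal order and the outlier test, written by us as a simplified rendering of Algorithm 2 over a three-tuple toy table; it is not the thesis's implementation.

    from collections import defaultdict

    ALL = "*"

    # Toy fact table in the style of Table 4.1: (A, B, C, D, measure).
    FACTS = [
        ("a1", "b1", "c1", "d1", 7.0),
        ("a1", "b1", "c2", "d1", 9.0),
        ("a2", "b2", "c2", "d2", 120.0),
    ]

    def aggregate(cuboid):
        """Group FACTS by the dimensions kept in `cuboid` (positions not set
        to ALL) and return {group: (avg, count)} for that group-by."""
        sums, counts = defaultdict(float), defaultdict(int)
        for *dims, m in FACTS:
            key = tuple(ALL if c == ALL else d for d, c in zip(dims, cuboid))
            sums[key] += m
            counts[key] += 1
        return {k: (sums[k] / counts[k], counts[k]) for k in sums}

    def tdc(cuboid, start_dim, l, m_bar, delta, outlier_rec):
        """Simplified TDC recursion: start from the base cuboid and
        generalize one dimension at a time toward (*, *, *, *)."""
        for group, (avg, count) in aggregate(cuboid).items():
            if abs(avg - m_bar) > l * delta:   # outlier criterion of Definition 5
                outlier_rec.append((group, avg))
        for i in range(start_dim, len(cuboid)):
            child = cuboid[:i] + (ALL,) + cuboid[i + 1:]
            tdc(child, i + 1, l, m_bar, delta, outlier_rec)

    measures = [m for *_, m in FACTS]
    m_bar = sum(measures) / len(measures)
    delta = (sum((m - m_bar) ** 2 for m in measures) / len(measures)) ** 0.5
    outliers = []
    # The call visits (A,B,C,D) first, then (*,B,C,D), (*,*,C,D), and so on,
    # matching the depth-first order of Figure 4.1.
    tdc(("A", "B", "C", "D"), 0, 1.0, m_bar, delta, outliers)
    print(outliers)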
4.2.3 Example of TDC
In this section, we give a running example of top-down cubing.
Suppose the input table is F = (A, B, C, D, M), where A, B, C and D are dimensions and M is the measurement value. The table has 4 dimensions, namely A = {a1, a2}, B = {b1, b2}, C = {c1, c2} and D = {d1, d2}. The original table is listed in Table 4.1, and the cubing process is shown in Figure 4.2.
Algorithm 2: TDC(input, startDim, l)
Input : input: The relationships to aggregate.
        startDim: The starting dimension for this iteration.
        l: The threshold parameter.
Global : constant numDims: The total number of dimensions.
         constant m̄: The average value of the sample relationships.
         constant δ: The standard deviation of the sample relationships.
Output: outlierRec: Stores the detected relationship outliers.
 1 TDC(input, startDim, l)
 2 for i ← startDim to numDims do
 3     currentGroup ← dimensionRec                ▷ current combination of dimensions
 4     avg ← aggregate(currentGroup)
 5     if |avg − m̄| > lδ then
 6         if currentGroup is base then
 7             baseGroup.add(currentGroup);
 8             baseGroup.isOutlier = true;
 9         else
10             aggregateOutlier.add(currentGroup);
11             computeKL-divergence(currentGroup); ▷ get outlier type
12         end
13         outlierRec.add(currentGroup);
14     else
15         if currentGroup is base then
16             baseGroup.add(currentGroup);
17             baseGroup.isOutlier = false;
For the first recursive call (Figure 4.2(a)), the TDC algorithm scans the whole table because it starts from the base level, the most detailed groups. So the first recursion starts from the relationship (A, B, C, D).
The cubing process in Figure 4.1 shows that the second recursion of TDC cubing processes the relationship (∗, B, C, D). We list the cubing examples of the second, third, fourth and fifth recursions in Figures 4.2(b), 4.2(c), 4.2(d) and 4.2(e) respectively.
Id  Relationship      Measure
1   (a1, b1, c1, d1)  m1
2   (a1, b1, c2, d1)  m2
3   (a1, b2, c2, d2)  m3
4   (a2, b1, c1, d1)  m4
5   (a2, b2, c1, d1)  m5
6   (a2, b2, c2, d2)  m6
(a) first recursion (A, B, C, D)

Id  Relationship      Measure
1   (∗, b1, c1, d1)   (m1 + m4)/2
2   (∗, b1, c2, d1)   m2
3   (∗, b2, c1, d1)   m5
4   (∗, b2, c2, d2)   (m3 + m6)/2
(b) second recursion (∗, B, C, D)

Id  Relationship      Measure
1   (∗, ∗, c1, d1)    (m1 + m4 + m5)/3
2   (∗, ∗, c2, d1)    m2
3   (∗, ∗, c2, d2)    (m3 + m6)/2
(c) third recursion (∗, ∗, C, D)

Id  Relationship      Measure
1   (∗, ∗, ∗, d1)     (m1 + m2 + m4 + m5)/4
2   (∗, ∗, ∗, d2)     (m3 + m6)/2
(d) fourth recursion (∗, ∗, ∗, D)

Id  Relationship      Measure
1   (∗, ∗, ∗, ∗)      (m1 + m2 + m3 + m4 + m5 + m6)/6
(e) fifth recursion (∗, ∗, ∗, ∗)
Figure 4.2: TDC algorithm example
The algorithm continues after these five recursions according to the cubing process in Figure 4.1. When the top-down cubing finishes, we obtain a list of outliers and the types of the aggregate outliers.
4.3 The Bottom-up Cubing (BUC) Approach
4.3.1 Bottom-up Cubing
The bottom-up cubing (BUC) algorithm [4] builds the cube by starting from a group-by on a single attribute, then a group-by on a pair of attributes, then a group-by on three attributes, and so on. It processes the cube lattice from the most general group toward the base level groups. The cubing process of BUC on a 4-D data cube is presented in Figure 4.3. BUC begins with the apex cuboid, counts the frequency of the first dimension, and partitions the table based on the frequent values. It then recursively counts the value combinations of the next dimension and partitions the table in the same manner. BUC can take advantage of Apriori pruning: if a given cell does not satisfy minimum support, then neither does any of its descendants. We present how to detect outliers for our problem using BUC in Section 4.3.2.
4.3.2 Outlier Detection Using BUC
Details of the algorithm are provided in Algorithm 3. We can use BUC to compute all aggregate cells, and apply the outlier determination criterion |t.F_{k+1} − m̄| > lδ as a post-processing step to capture all outliers, where m̄ and δ are the average and the standard deviation of F_{k+1} over all relationships among the base level groups.
The first step of the BUC algorithm is to aggregate the entire input and write the result. To get the aggregate value, we need to scan the input; when scanning the whole table, we obtain the baseGroup list as a byproduct. The for loop in line 15 controls the current dimension and partitions the current input. On return from Partition(), dataCount contains the number of records for each distinct value of the d-th dimension, while dataAvg contains the average value of all the samples in the d-th dimension. If the average value falls in the defined outlier range, we add the current dimension and measurement information to the final outlierRec list. The partition then becomes the input relation in the next recursive call, with the start dimension incremented by 1. We store the detected outliers in outlierRec and all the aggregate ones in the aggregateOutlier list. For an aggregate outlier group, we compute the KL-divergence of its descendants to get the outlier type; details of the function computeKL-divergence() are described in Section 4.5.

Figure 4.3: BUC cubing
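The partition-and-recurse control flow described above can be sketched as follows. This is our own simplified Python rendering of the recursion (with Partition() done by sorting, as in the original BUC paper), not the thesis's code.

    from itertools import groupby
    from operator import itemgetter

    ALL = "*"

    def buc(rows, start_dim, num_dims, l, m_bar, delta, prefix, outlier_rec):
        """Simplified BUC recursion: aggregate the current partition, apply
        the outlier test, then partition on each remaining dimension in turn.
        `rows` holds (d_0, ..., d_{num_dims-1}, measure) tuples and `prefix`
        records the dimension values fixed by the current partition."""
        avg = sum(r[-1] for r in rows) / len(rows)
        if abs(avg - m_bar) > l * delta:                  # outlier criterion
            outlier_rec.append((tuple(prefix), avg))
        for d in range(start_dim, num_dims):
            ordered = sorted(rows, key=itemgetter(d))     # Partition() by sorting
            for value, part in groupby(ordered, key=itemgetter(d)):
                child_prefix = prefix[:d] + [value] + prefix[d + 1:]
                buc(list(part), d + 1, num_dims, l, m_bar, delta,
                    child_prefix, outlier_rec)

    # Two-dimensional toy data: the first call covers the apex (*, *), then
    # (a1, *), (a1, b1), ..., following the bottom-up order of Figure 4.3.
    rows = [("a1", "b1", 7.0), ("a1", "b2", 9.0), ("a2", "b2", 120.0)]
    measures = [r[-1] for r in rows]
    m_bar = sum(measures) / len(measures)
    delta = (sum((m - m_bar) ** 2 for m in measures) / len(measures)) ** 0.5
    found = []
    buc(rows, 0, 2, 1.0, m_bar, delta, [ALL, ALL], found)
    print(found)   # e.g. (('a2', '*'), 120.0) and (('a2', 'b2'), 120.0)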
One tricky issue here is that BUC relies on the monotonicity of iceberg conditions, but the iceberg condition in our problem is not monotonic. For the problem in this thesis, BUC cannot take advantage of Apriori pruning because the measure here is not a count: we use the average as the measure, which does not have the property that if a given cell is not an outlier, then neither is any of its descendants.
Fortunately, Theorem 3 identifies a weak monotonic property of the iceberg condition in our problem. We can thus adopt iceberg cube computation methods designed for weakly monotonic conditions. Specifically, we use eBUC [43], which "looks ahead" to check whether an aggregate cell is an ancestor of some outliers. Details of eBUC are presented in Section 4.4.
4.3.3 Example of BUC
In this section, we give a running example of bottom-up cubing.
Suppose the input table is F = (A, B, C, D, M), where A, B, C and D are dimensions and M is the measurement value. The table has 4 dimensions, namely A = {a1, a2}, B = {b1, b2}, C = {c1, c2} and D = {d1, d2}. The original table is listed in Table 4.2, and the cubing process is shown in Figure 4.4.
For the first recursive call (Figure 4.4(a)), the BUC algorithm aggregates the whole table, so the first recursion starts from the relationship (∗, ∗, ∗, ∗).
Algorithm 3: BUC(input, startDim, l)
Input : input: The relationships to aggregate.
        startDim: The starting dimension for this iteration.
        l: The threshold parameter.
Global : constant numDims: The total number of dimensions.
         constant m̄: The average value of the sample relationships.
         constant δ: The standard deviation of the sample relationships.
         dataAvg[numDims]: Stores the average value of each partition.
         dataCount[numDims]: Stores the size of each partition.
Output: outlierRec: Stores the detected outlier relationships.
 1 BUC(input, startDim, l)
 2 for currentGroup ← tuple do                    ▷ Aggregate(input) and get baseGroup
 3     avg = currentGroup.measure;
 4     if |avg − m̄| > lδ then
 5         baseGroup.isOutlier = true;
 6         outlierRec.add(currentGroup);
 7     else
 8         baseGroup.isOutlier = false;
 9     end
10 end
11 if input.count() == 1 then
12     WriteAncestors(input[0], startDim);
13     return;
14 end
15 for d ← startDim to numDims do
16     Let C = cardinality[d];
17     Partition(input, d, C, dataCount[d], dataAvg[d]);
18     Let k ← 0;
19     for i ← 0 to C do
20         Let c = dataCount[d][i], a = dataAvg[d][i];
21         currentGroup ← dimensionRec            ▷ current combination of dimensions
22         if |avg − m̄| > lδ then
23             if currentGroup is aggregate then
24                 aggregateOutlier.add(currentGroup);
25                 computeKL-divergence(currentGroup); ▷ get outlier type
The cubing process in Figure 4.3 shows that the second recursion of BUC cubing processes the relationship (A, ∗, ∗, ∗). We list the cubing examples of the second, third, fourth and fifth recursions in Figures 4.4(b), 4.4(c), 4.4(d) and 4.4(e) respectively.
The algorithm continues after these five recursions according to the cubing process in Figure 4.3. When the bottom-up cubing finishes, we obtain a list of outliers and the types of the aggregate outliers.
4.4 The Extended BUC (eBUC) Approach
Although BUC is efficient for computing the complete data cube, it cannot take advantage of the Apriori property for pruning in our problem. This is because our problem does not satisfy the monotonic condition that if an aggregate relationship is not an outlier, then no descendant of it is an outlier either. If we used BUC directly, we would have to compute the whole data cube without any pruning. To make the BUC algorithm more efficient for our problem, we use the extended BUC approach [43].
4.4.1 Outlier Detection Using eBUC
The key idea of eBUC is to take advantage of the weak monotonicity in Theorem 3: if an aggregate relationship t is an outlier, then there exists at least one descendant of t that is also an outlier.
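In the notation of Definition 5, the property eBUC relies on can be written as follows; this is our paraphrase of Theorem 3, not a restatement of its proof.

    % Weak monotonicity (our paraphrase of Theorem 3): an outlier aggregate
    % relationship has at least one outlier descendant and hence, by induction,
    % at least one outlier base-level descendant.
    \[
      |t.F_{k+1} - \bar{m}| > l\delta
      \;\Longrightarrow\;
      \exists\, s \ \text{a base-level descendant of}\ t :\;
      |s.F_{k+1} - \bar{m}| > l\delta .
    \]
    % Contrapositive: if t is not an ancestor of any base-level outlier, then
    % t is not an outlier, and the subtree below t can be pruned.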
Details of the eBUC algorithm are given in Algorithm 4. The eBUC algorithm follows the structure of the BUC algorithm, using depth-first search. It first aggregates the whole table and computes the base level relationships, obtaining a list of base-level relationship outliers as a byproduct. The difference between eBUC and BUC is that, when eBUC encounters an aggregate relationship t, it "looks ahead".
Id  Relationship      Measure
1   (∗, ∗, ∗, ∗)      (m1 + m2 + m3 + m4 + m5 + m6)/6
(a) first recursion (∗, ∗, ∗, ∗)

Id  Relationship      Measure
1   (a1, ∗, ∗, ∗)     (m1 + m2 + m3)/3
2   (a2, ∗, ∗, ∗)     (m4 + m5 + m6)/3
(b) second recursion (A, ∗, ∗, ∗)

Id  Relationship      Measure
1   (a1, b1, ∗, ∗)    (m1 + m2)/2
2   (a1, b2, ∗, ∗)    m3
3   (a2, b1, ∗, ∗)    (m4 + m6)/2
4   (a2, b2, ∗, ∗)    m5
(c) third recursion (A, B, ∗, ∗)

Id  Relationship      Measure
1   (a1, b1, c1, ∗)   m1
2   (a1, b1, c2, ∗)   m2
3   (a1, b2, c2, ∗)   m3
4   (a2, b1, c1, ∗)   (m4 + m6)/2
5   (a2, b2, c1, ∗)   m5
(d) fourth recursion (A, B, C, ∗)

Id  Relationship      Measure
1   (a1, b1, c1, d1)  m1
2   (a1, b1, c2, d1)  m2
3   (a1, b2, c2, d2)  m3
4   (a2, b1, c1, d1)  m4
5   (a2, b2, c1, d1)  m5
6   (a2, b1, c1, d2)  m6
(e) fifth recursion (A, B, C, D)
Figure 4.4: BUC algorithm example
When eBUC encounters an aggregate relationship t, it checks whether t is an ancestor of some base-level outlier relationship. If not, then following Theorem 3, t cannot be an outlier relationship. Moreover, no descendant of t is an ancestor of a base-level outlier relationship, so we do not need to search any descendant of t.
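The look-ahead itself reduces to a containment test between a cell and the pre-computed list of base-level outliers. A minimal sketch in Python (the function names are ours):

    ALL = "*"

    def is_ancestor(cell, base_cell):
        """A cell is an ancestor of a base-level cell in the cube lattice
        iff every non-* position of the cell matches the base-level cell."""
        return all(c == ALL or c == b for c, b in zip(cell, base_cell))

    def look_ahead(cell, base_outliers):
        """eBUC explores `cell` only if it is an ancestor of at least one
        base-level outlier; otherwise the whole subtree below it is pruned."""
        return any(is_ancestor(cell, b) for b in base_outliers)

    base_outliers = [("a2", "b3", "c2")]
    print(look_ahead(("a1", ALL, ALL), base_outliers))  # False: prune, as in Figure 4.5(b)
    print(look_ahead(("a2", ALL, ALL), base_outliers))  # True: keep exploring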
4.4.2 Example of eBUC
In this section, we show a running example of the eBUC approach. The input table is F = (A, B, C, M), where A, B and C are dimensions and M is the measurement value. The table has 3 dimensions, namely A = {a1, a2}, B = {b1, b2, b3} and C = {c1, c2}. The original table is listed in Table 4.3. The average value of the whole table is 17.25 and the standard deviation is 31.12. Following the outlier condition |t.F_{k+1} − m̄| > lδ, we first obtain the base table (Table 4.4) when eBUC runs with l = 1. In this example, we have only one base-level outlier relationship, (a2, b3, c2). Any aggregate relationship that is not an ancestor of it is pruned. In Figure 4.5(b), the relationship (a1, ∗, ∗) is pruned because it is not an ancestor of (a2, b3, c2); therefore, every descendant of (a1, ∗, ∗) is also pruned. In Figure 4.5(c), we only check the descendants of (a2, ∗, ∗).
Id A B C M
1 a1 b1 c1 7
2 a1 b1 c2 9
3 a1 b2 c1 4
4 a1 b2 c2 7
5 a1 b3 c1 10
6 a1 b3 c2 8
7 a2 b1 c1 5
8 a2 b1 c2 11
9 a2 b2 c1 7
10 a2 b2 c2 15
11 a2 b3 c1 4
12 a2 b3 c2 120
Table 4.3: eBUC algorithm example: original table
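Plugging the twelve measures of Table 4.3 into the criterion confirms the quoted statistics:

    \[
      \bar{m} = \frac{7+9+4+7+10+8+5+11+7+15+4+120}{12} = \frac{207}{12} = 17.25,
      \qquad
      \delta = \sqrt{\tfrac{1}{12}\sum_{i=1}^{12} (m_i - \bar{m})^2} \approx 31.12 .
    \]

With l = 1, the tuple (a2, b3, c2) satisfies |120 − 17.25| = 102.75 > 31.12, while the largest deviation among the remaining tuples is |4 − 17.25| = 13.25, well below the threshold; hence (a2, b3, c2) is indeed the only base-level outlier.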
Algorithm 4: eBUC(input, startDim, l)
Input : input: The relationships to aggregate.
        startDim: The starting dimension for this iteration.
        l: The threshold parameter.
Global : constant numDims: The total number of dimensions.
         constant m̄: The average value of the sample relationships.
         constant δ: The standard deviation of the sample relationships.
         dataAvg[numDims]: Stores the average value of each partition.
         dataCount[numDims]: Stores the size of each partition.
Output: outlierRec: Stores the detected relationship outliers.
 1 eBUC(input, startDim, l)
 2 for currentGroup ← tuple do                    ▷ Aggregate(input) and get baseGroup
 3     avg = currentGroup.measure;
 4     if |avg − m̄| > lδ then
 5         baseGroup.isOutlier = true;
 6         outlierRec.add(currentGroup);
 7     else
 8         baseGroup.isOutlier = false;
 9     end
10 end
11 if input.count() == 1 then
12     WriteAncestors(input[0], startDim);
13     return;
14 end
15 for d ← startDim to numDims do
16     Let C = cardinality[d];
17     Partition(input, d, C, dataCount[d], dataAvg[d]);
18     Let k ← 0;
19     for i ← 0 to C do
20         Let c = dataCount[d][i], a = dataAvg[d][i];
21         currentGroup ← dimensionRec            ▷ current combination of dimensions
22         if currentGroup is an ancestor of some base-level outlier then ▷ look ahead
23             if |avg − m̄| > lδ then
24                 if currentGroup is aggregate then
25                     aggregateOutlier.add(currentGroup);
26                     computeKL-divergence(currentGroup); ▷ get outlier type
Input : aggregateOutlier: The detected outlier aggregate group.
Global : baseGroup: The base level groups.
Output: outlierType: The outlier type of the input.
cases and 2 are outliers. For R1, KL(S|S0) = 2.39 and KL(S|S1) = 63.74, so KL(S|S0) < KL(S|S1): the distribution of R1 is more similar to the distribution of its normal base level descendant relationships. Thus, R1 is of type I. R1 being a type-I outlier indicates that R1 contains some strong base level outliers (those of amount 1000.00 or more), but the majority in R1 is still normal. To this extent, R1 as an outlier may not be very interesting; instead, its outlier descendants should be checked.
As shown in Table 5.3, R2 has 10 base-level relationships. Among them, 2 are normal cases and 8 are outliers. KL(S|S0) = 5.01182 and KL(S|S1) = 0.454034, so KL(S|S0) > KL(S|S1): the distribution of R2 is more similar to the distribution of its outlier base level descendant relationships. Thus, R2 is of type II. R2 being a type-II outlier indicates that most of the base level descendant relationships of R2 are outliers and can be summarized by R2.
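The type decision above amounts to comparing two KL-divergences. A minimal Python sketch of that comparison (the smoothing constant eps and the toy distributions are our assumptions; Section 4.5 defines the actual procedure used in the thesis):

    import math

    def kl_divergence(p, q, eps=1e-9):
        """KL(P || Q) for two discrete distributions given as aligned lists;
        eps-smoothing (our choice) avoids division by zero on empty bins."""
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

    def outlier_type(s, s0, s1):
        """Type I if the aggregate's distribution S is closer to the distribution
        S0 of its normal descendants, type II if closer to the distribution S1
        of its outlier descendants."""
        return "I" if kl_divergence(s, s0) < kl_divergence(s, s1) else "II"

    # An R2-like case: the aggregate's distribution resembles its outlier descendants'.
    s  = [0.1, 0.2, 0.7]
    s0 = [0.8, 0.1, 0.1]   # distribution of the normal base-level descendants
    s1 = [0.1, 0.3, 0.6]   # distribution of the outlier base-level descendants
    print(outlier_type(s, s0, s1))  # "II"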
5.2 Efficiency and Scalability
We tested the efficiency of the outlier detection methods on both a real data set and synthetic ones. We compare three cubing methods: TDC [45], BUC [4] and eBUC [43]. TDC and BUC compute the whole cube, while eBUC uses the outlier detection criterion directly to compute an iceberg cube.
We used a larger real data set to test the efficiency, and used random samples of various sizes of the data set to test the scalability.
Figure 5.1: The running time of TDC, BUC and eBUC with respect to the number of tuples
Figure 5.1 shows the scalability of the three methods with respect to the database size, where the outlier threshold is l = 1. Note that the smaller the value of l, the more outliers there are, and thus the less pruning power eBUC has. All three methods show linear scalability. The runtimes of TDC and BUC are very close; eBUC clearly takes advantage of pruning directly with the outlier detection criterion and thus runs faster.
Figure 5.2 shows the scalability of the three methods with respect to the parameter l and the number of base level relationships on the real data set. The larger the value of l, the fewer outliers, and thus the less computation is needed in all three methods to determine the types of the outliers. Compared with TDC and BUC, eBUC can further use the outlier criterion to prune normal relationships during the cubing process and thus reduces the runtime further.
[Six panels, (a)–(f): runtime (s) versus threshold l ∈ {1, 2, 3, 4} for TDC, BUC and eBUC, at 5,000, 10,000, 15,000, 20,000, 25,000 and 30,000 tuples.]
Figure 5.2: The running time of TDC, BUC and eBUC
The larger the value of l, the more advantage eBUC has. The figure also indicates that determining the types of outliers takes substantial cost.
Figure 5.3 shows the numbers of relationship outliers with respect to the parameter l and the number of base level relationships on the real data set. As explained, the larger the value of l, the fewer outliers. The trend in Figure 5.3 is consistent with the runtime trend in Figure 5.2. Moreover, most of the detected outliers are aggregate ones, and there are many more type-II outliers than type-I ones. The results clearly justify the effectiveness of our method in summarizing the outlier information.
We also tested the efficiency and scalability using synthetic data sets. We generated synthetic data sets with dimension attributes in discrete domains and the measurement in a continuous domain. We consider 3 factors in our experiments: the dimensionality d, the number of tuples n in the data set, and the distribution of the dimensions (uniform distribution versus normal distribution).
To be specific, we generated two data sets, each of 100,000 tuples and 4 dimensions. The cardinality of each dimension is 20. The tuples in the first data set follow the uniform distribution in each dimension, and the tuples in the second data set follow a (discretized) normal distribution. That is, we used a normal distribution with µ = 10 and σ = 2 to generate the synthetic data, and rounded the values into 20 bins in the range [1, 20]. Figure 5.4 shows the results, where the threshold parameter is l = 1. Clearly, our outlier detection methods run much faster on the normally distributed data set, where outliers are meaningful.
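The discretized-normal generation can be reproduced along the following lines; this is a sketch with numpy under our own assumptions (the exact generator, seed and measure distribution used in the thesis are not specified):

    import numpy as np

    rng = np.random.default_rng(42)    # the seed is our choice
    n, d = 100_000, 4                  # tuples and dimensions, as in this section

    # Uniform data set: each dimension takes one of 20 discrete values.
    uniform_dims = rng.integers(1, 21, size=(n, d))

    # Normal data set: draw from N(mu=10, sigma=2), round, and clip into [1, 20].
    normal_dims = np.clip(np.rint(rng.normal(10, 2, size=(n, d))), 1, 20).astype(int)

    # Continuous measurement column attached to every tuple.
    measure = rng.normal(size=n)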
Figure 5.5 shows the scalability of our detection methods on the synthetic data set following the normal distribution. In Figure 5.5(a), the number of tuples is set to 10,000. The runtime increases dramatically as the dimensionality goes up; this is expected, since computing a data cube of high dimensionality is known to be challenging. In Figure 5.5(b), the dimensionality is set to 4, and the number of tuples varies from 100,000 to 1 million. Interestingly, the runtime grows more slowly than the number of tuples. The reason is that, given the cardinality of each dimension, the number of possible group-bys is fixed: when a fact table becomes very large, many aggregate cells are populated with a non-trivial number of tuples, and the number of group-bys grows more slowly (Figure 5.6). The results clearly show that our methods are scalable on large data sets.
[Six panels, (a)–(f): number of outliers versus threshold l ∈ {1, 2, 3, 4}, broken down into all cases, aggregate cases, type I and type II, at 5,000, 10,000, 15,000, 20,000, 25,000 and 30,000 tuples.]
Figure 5.3: Number of Detected Outliers
[Three panels: runtime (s) versus threshold l ∈ {1, 2, 3, 4} under the uniform and normal distributions, for (a) the TDC algorithm, (b) the BUC algorithm and (c) the eBUC algorithm.]
Figure 5.4: The running time of TDC, BUC and eBUC with different distributions
[Two panels: runtime (s) of TDC, BUC and eBUC versus (a) dimensionality and (b) database size (number of tuples up to 10^6).]
Figure 5.5: Scalability on synthetic data.
[Number of detected outliers (all cases and aggregate cases) versus the number of tuples, up to 10^6.]
Figure 5.6: Number of detected outliers
Chapter 6
Conclusion
In this thesis, we tackle the problem of detecting multi-level relationship outliers. Relationships are very important in business: good relationships build a win-win situation for both consumers and providers, while fraudulent relationships destroy normal business patterns and create a vicious circle between consumers and providers. We develop a simple yet effective model to detect and handle relationship outliers between groups and individuals in groups. Importantly, the attributes in the relationships form a hierarchy, so traditional outlier detection methods cannot be applied directly to our problem. We apply three cubing algorithms, TDC, BUC and eBUC, and use KL-divergence as a similarity measure to identify the outlier type. In business, the most challenging issue for analysts is that they do not know the fraud patterns in the existing relationships; the outlier type in this thesis characterizes such a fraud pattern. Once we know what the most common fraud patterns look like, we can devise policies to prevent such situations. We use KL-divergence as the similarity measure because relationships are complicated and not independent: merely comparing the values of two relationships is not sufficient to estimate their similarity, and we need to compare the distributions of the two relationships. KL-divergence proved useful for finding potential outlier patterns in our data set. The detected outliers show that our detection methods are effective, and the running times measured on both real and synthetic data sets show that our detection methods are efficient.
As future work, there are several interesting directions to pursue.

• Extending our techniques to arbitrarily many parties. In this thesis, we only tackle fact tables of two parties. In the future, we can test our detection methods on data sets with multiple parties; the problem will be much more complicated and challenging.
• Improving the cubing process. In this thesis, we use the eBUC algorithm to improve on BUC cubing, and our experiments show that eBUC is more efficient than BUC for our problem. An interesting question is: can we do even better? The answer is yes; we can further improve the cubing process using Dynamic Ordering (DYNO), which shows better performance than eBUC in [43].
• Refining the results with data summarization to make our methods more useful in business. Our methods can detect a list of outlier relationships; however, to make the list useful in business we need to refine it further. For example, we can apply the algorithm in [43] to obtain "probable groups". A probable group is an aggregate cell that satisfies a requirement like "What are the groups of relationships that have 80% or more outlier descendants?" We can generate the most general probable groups; for example, if both relationships (female, 20, B.C.) and (∗, 20, B.C.) are probable, we only return the latter. In this way, the pattern of fraudulent relationships between service providers and consumers becomes succinct, and the detected results can be better understood by business people.
Bibliography
[1] M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.

[2] Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in data mining. In Proc. 2006 IEEE Conf. Cybernetics and Intelligent Systems, pages 1–6, Bangkok, Thailand, 2006.

[3] M. Berry and G. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. John Wiley & Sons, Inc., 1999.

[4] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube. ACM SIGMOD Record, 1999.

[5] R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17:235–249, 2002.

[6] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 2009.

[7] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009.

[8] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

[9] P. L. Chebyshev. Sur les valeurs limites des intégrales. Imprimerie de Gauthier-Villars, 1874.

[10] J. B. R. Cheng and A. R. Hurson. Effective clustering of complex objects in object-oriented databases. ACM SIGMOD Record, 1991.

[11] M. C. Cooper, D. M. Lambert, and J. D. Pagh. Supply chain management: more than a new name for logistics. The International Journal of Logistics Management, 1997.

[12] D. Dasgupta and N. S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. In Proc. 2002 Congress on Evolutionary Computation (CEC'02), pages 1039–1044, Washington DC, 2002.

[13] E. Eskin. Anomaly detection over noisy data using learned probability distributions. In Proc. 17th Int. Conf. Machine Learning (ICML'00), 2000.

[14] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.

[15] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of Data Mining in Computer Security, 6:77–102, 2002.

[16] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.

[17] T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.

[18] R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. In Proc. 2005 Int. Workshop on Link Discovery (LinkKDD'05), pages 401–410, Chicago, Illinois, 2005.

[19] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational operator generalizing group-by, cross-tab and sub-totals. In Proc. 1996 Int. Conf. Data Engineering (ICDE'96), pages 152–159, New Orleans, Louisiana, Feb. 1996.

[20] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann, 2011.

[21] D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.

[22] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9):1641–1650, 2003.

[23] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 2004.

[24] V. J. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22:85–126, 2004.

[25] J. Huang, H. Shimizu, and S. Shioya. Clustering gene expression pattern and extracting relationship in gene network based on artificial neural networks. Journal of Bioscience and Bioengineering, 2003.

[26] W. H. Inmon. Building the Data Warehouse. Wiley-India, 2005.

[27] B. Jiang, J. Pei, Y. Tao, and X. Lin. Clustering uncertain data based on probability distribution similarity. IEEE Transactions on Knowledge and Data Engineering, 2011.

[28] M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining needle in a haystack: classifying rare classes via two-phase rule induction. In ACM SIGMOD Record, volume 30, pages 91–102. ACM, 2001.

[29] M. V. Joshi, R. C. Agarwal, and V. Kumar. Predicting rare classes: Can boosting make any weak learner strong? In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 297–306. ACM, 2002.

[30] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, pages 392–403, 1998.

[31] Y. Kou, C. T. Lu, S. Sirwongwattana, and Y. P. Huang. Survey of fraud detection techniques. In IEEE International Conference on Networking, Sensing and Control. IEEE, 2004.

[32] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 1951.

[33] C. X. Lin, B. Zhao, Q. Mei, and J. Han. A statistical model for popular event tracking in social communities. In Proc. 2010 ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD'10), Washington D.C., July 2010.

[34] A. Payne and P. Frow. A strategic framework for customer relationship management. Journal of Marketing, 2005.

[35] C. Phua, D. Alahakoon, and V. Lee. Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1):50–59, 2004.

[36] K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pages 116–125, Athens, Greece, Aug. 1997.

[37] D. W. Scott. Multivariate Density Estimation. Wiley, 1992.

[38] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

[39] A. Strehl and J. Ghosh. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 2003.

[40] R. Vilalta and S. Ma. Predicting rare events in temporal domains. In Proc. 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 474–481. IEEE, 2002.

[41] G. M. Weiss and H. Hirsh. Learning to predict rare events in event sequences. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 359–363, 1998.

[42] D. Xin, J. Han, X. Li, and B. W. Wah. Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), pages 476–487, Berlin, Germany, Sept. 2003.

[43] H. Yu, J. Pei, S. Tang, and D. Yang. Mining most general multidimensional summarization of probable groups in data warehouses. In Proceedings of the 17th International Conference on Scientific and Statistical Database Management, 2005.

[44] K. Zhang, S. Shi, H. Gao, and J. Li. Unsupervised outlier detection in sensor networks using aggregation tree. Advanced Data Mining and Applications, pages 158–169, 2007.

[45] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 159–170, Tucson, AZ, May 1997.