Christian Böhm Ludwig Maximilians Universität München The Similarity Join: A Powerful Database Primitive for High Performance Data Mining Tutorial, 17th Int. Conf. on Data Engineering, 2001-04-02


1. Motivation

High Performance Data Mining

 

Fast decisions require knowledge just in time

Marketing, Fraud Detection, CRM, Online Scoring, OLAP

Previous Approaches to Fast Data Mining

• Sampling, approximations (grid), dimensionality reduction: loss of quality
• Parallelism: expensive & complex
• All approaches are combinable with the join
• KDD applications get parallelism for free

Feature Based Similarity

Simple Similarity Queries

• Specify a query object and
- Find similar objects – range query
- Find the k most similar objects – nearest neighbor query

Similarity – Range Queries

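• Given: query point q, maximum distance ε

• Formal definition: the result is the set $\{\, p \in DB \mid \|p - q\| \le \varepsilon \,\}$

• Cardinality of the result set is difficult to control: if ε is too small, there are no results; if ε is too large, the complete DB is returned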
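The range query and its sensitivity to ε can be sketched with a plain linear scan. This is only an illustration under assumptions: points are tuples, the distance is Euclidean, and the names `range_query`, `db`, and `q` are hypothetical.

```python
import math

def range_query(db, q, eps):
    """Return all points of db whose Euclidean distance to q is at most eps."""
    return [p for p in db if math.dist(p, q) <= eps]

# Hypothetical toy database and query point.
db = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.5), (1.0, 1.0)]
q = (0.3, 0.3)

# The result size jumps from nothing to the whole database as eps grows.
sizes = [len(range_query(db, q, eps)) for eps in (0.1, 0.4, 1.0)]
```

With this toy data the three radii return an empty result, a partial result, and the complete database.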

Index Based Processing of Range Queries

Similarity – Nearest Neighbor Queries

• Given: query point q

• Formal definition: the result is the set $\{\, p \in DB \mid \forall p' \in DB: \|p - q\| \le \|p' - q\| \,\}$

• Ties must be handled:
- Result set enlargement
- Non-determinism (don’t care)

Index Based Processing of NN Queries

k-Nearest Neighbor Search and Ranking

• k-nearest neighbor query:
- Search not only for one nearest neighbor but for k
- Stop distance is the distance of the kth (last) candidate point

• Ranking query:
- Incremental version of k-nearest neighbor search
- First call of FetchNext() returns the first neighbor
- Second call of FetchNext() returns the second neighbor, ...
- Typically only few results are fetched, so don’t generate all of them!
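A ranking query can be sketched as a Python generator, where each `next()` call plays the role of FetchNext(). This sketch assumes a linear scan over an in-memory list rather than an index, and the names `ranking` and `knn` are hypothetical.

```python
import heapq
import math
from itertools import islice

def ranking(db, q):
    """Yield (distance, point) in ascending distance from q, one per request."""
    heap = [(math.dist(p, q), i, p) for i, p in enumerate(db)]
    heapq.heapify(heap)                 # linear time; no complete sort up front
    while heap:
        dist, _, p = heapq.heappop(heap)
        yield dist, p

def knn(db, q, k):
    """k-nearest neighbor query: fetch only the first k results of the ranking."""
    return list(islice(ranking(db, q), k))

db = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cursor = ranking(db, (0.0, 0.0))
first = next(cursor)                    # corresponds to the first FetchNext()
second = next(cursor)                   # corresponds to the second FetchNext()
```

Because the generator is lazy, a caller that stops after a few results never pays for ordering the remaining points.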

Advanced Applications: Duplicates

• Duplicate detection
- E.g. astronomic catalogue matching

• Similarity queries for a large number of query objects

[Figure: catalogues C1 and C2]

Advanced Applications: Data Mining

• Density based clustering (DBSCAN)

What is a Similarity Join?

• Given two sets R, S of points
• Find all pairs of points according to similarity

• Various exact definitions for the similarity join

[Figure: point sets R and S]

What is a Similarity Join?

• Similarity join corresponds to set of identical similarity queries, evaluated for a large number of query points

• Sequential evaluation of similarity queries with index is the easiest similarity join algorithm

• Many more sophisticated approaches exist
• Powerful database primitive to support modern applications of data analysis and data mining

Curse of Dimensionality

• Index structures fail (outperformed by the sequential scan) if the data space dimension becomes too high

• Many effects usually called Curse of Dimensionality

Curse of Dimensionality

[Berchtold, Böhm, Keim, Kriegel: A Cost Model for High-Dim. Nearest Neighb. Search, PODS 1997]

With increasing dimension, the following also increase:
- Typical radius of range queries
- Distance of a point to its nearest neighbor
- Edge length of regions of index structures

$0.5^1 = 0.5$, $0.7^2 \approx 0.5$, $0.8^3 \approx 0.5$
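The growth of the edge length can be checked with a few lines of arithmetic. The sketch below assumes a hypercube-shaped region that covers a fixed fraction of a unit data space; the function name `edge_length` is hypothetical.

```python
def edge_length(volume, d):
    """Edge length of a d-dimensional hypercube with the given volume."""
    return volume ** (1.0 / d)

# A region covering half of the unit data space needs an edge length
# that approaches 1 as the dimension d grows.
for d in (1, 2, 3, 10, 100):
    print(d, round(edge_length(0.5, d), 3))
```

For d = 1, 2, 3 this gives roughly 0.5, 0.71, and 0.79, in line with the rounded values above.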

Curse of Dimensionality

A cost model for the access probability of index pages using the concept of Minkowski Sum

Curse of Dimensionality

Binomial formula:
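Assuming the page region is a hypercube with edge length a and the query is a sphere with radius r, the volume of their Minkowski sum can be written with binomial coefficients as $\sum_{k=0}^{d} \binom{d}{k} a^{d-k} V_k(r)$, where $V_k(r) = \frac{\pi^{k/2}}{\Gamma(k/2+1)} r^k$ is the volume of a k-dimensional sphere. The sketch below evaluates this expression; the function names, the uniform unit data space, and the neglect of boundary effects are assumptions of this illustration.

```python
from math import comb, gamma, pi

def sphere_volume(k, r):
    """Volume of a k-dimensional sphere of radius r (equals 1 for k = 0)."""
    return pi ** (k / 2) / gamma(k / 2 + 1) * r ** k

def minkowski_volume(d, a, r):
    """Volume of the Minkowski sum of a d-dim. cube (edge a) and sphere (radius r)."""
    return sum(comb(d, k) * a ** (d - k) * sphere_volume(k, r)
               for k in range(d + 1))

def access_probability(d, a, r):
    """Access probability of a page in a unit data space, capped at 100%."""
    return min(1.0, minkowski_volume(d, a, r))
```

In two dimensions the sum reduces to the familiar rounded rectangle, $a^2 + 4ar + \pi r^2$.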

Curse of Dimensionality

• Asymptotic behavior of similarity search

• Suppose number of points: $V_{Mink} \approx 2^d \cdot V_{Sphere}$

• Access probability = O(2^d), but limited by 100%
• Saturation area with near linear I/O cost O(n)

Curse of Dimensionality

• For high dimension: Each similarity query accesses a considerable fraction of all index pages.
• The index does not pay off; a sequential scan is needed anyway
• Strategies needed for efficient evaluation
• Join: Base applications on a powerful database primitive that exploits the high number of queries
• Efficient algorithms for the Similarity Join

Organization of the Tutorial

1. Motivation

2. Defining the Similarity Join

3. Applications of the Similarity Join

4. Similarity Join Algorithms

5. Conclusion & Future Potential

2. Defining the Similarity Join

What Is a Similarity Join?

Intuitive notion: 3 properties of the similarity join

1. The similarity join is a join in the relational sense: two sets R and S are combined into one such that the new set contains pairs of points that fulfill a join condition

2. Vector or metric objects rather than ordinary tuples of any type

3. The join condition involves similarity

What Is a Similarity Join?

Similarity Join
- Distance Range Join
- NN-based Approaches
  - Closest Pair Query
  - k-NN Join

Distance Range Join (ε-Join)

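• Intuition: Given parameter ε, find all pairs of points where distance ≤ ε

• Formal Definition: $R \bowtie_\varepsilon S = \{\, (r_i, s_j) \in R \times S \mid \|r_i - s_j\| \le \varepsilon \,\}$

• In SQL-like notation:
SELECT * FROM R, S WHERE ||R.obj − S.obj|| ≤ ε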
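Read literally, the definition corresponds to a nested loop over both sets. The following sketch assumes Euclidean distance on tuples and uses the hypothetical name `epsilon_join`; it is the naive baseline, not one of the optimized join algorithms.

```python
import math

def epsilon_join(R, S, eps):
    """Distance range join: all pairs (r, s) with distance at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
pairs = epsilon_join(R, S, 1.5)
```

Every one of the |R|·|S| candidate pairs is tested, which is what the more sophisticated algorithms try to avoid.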

Distance Range Join (ε-Join)

• Most widespread and best evaluated join
• Often also called the similarity join

Distance Range Join (ε-Join)

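• The distance range self join $R \bowtie_\varepsilon R$ is of particular importance for data mining (clustering) and robust similarity search

• Change the definition to exclude trivial results:
$R \bowtie_\varepsilon R = \{\, (r_i, r_j) \in R \times R \mid i \ne j \wedge \|r_i - r_j\| \le \varepsilon \,\}$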

Distance Range Join (ε-Join)

• Disadvantage for the user: result cardinality is difficult to control:
- ε too small: no result pairs are produced
- ε too large: all pairs from R × S are produced

• Worst case complexity is Ω(|R|·|S|), because the result itself can contain all |R|·|S| pairs
• For a reasonable result set size, advanced join algorithms yield asymptotic behavior which is better than O(|R|·|S|)

k-Closest Pair Query

• Intuition: Find those k pairs that yield least distance

• The principle of nearest neighbor search is applied on a per-pair basis

• Classical problem of Computational Geometry
• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
• There called distance join

k-Closest Pair Query

• Formal Definition:

• Ties solved by result set enlargement

• Other possibility: Non-determinism (don’t care which of the tie tuples are reported)
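One possible way to write such a definition, with ties resolved by result set enlargement, is as the smallest set satisfying the two conditions below. The symbol $\mathrm{CP}_k$ is a hypothetical name chosen for this sketch, and a Euclidean norm is assumed.

```latex
\[
\mathrm{CP}_k(R,S) \subseteq R \times S, \qquad |\mathrm{CP}_k(R,S)| \ge k,
\]
\[
\forall (r,s) \in \mathrm{CP}_k(R,S)\;\; \forall (r',s') \in (R \times S) \setminus \mathrm{CP}_k(R,S):
\quad \|r - s\| < \|r' - s'\|
\]
```

The strict inequality forces all pairs tied with the kth pair into the result, which is why the result may contain more than k pairs.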

k-Closest Pair Query

In SQL notation:
SELECT * FROM R, S
ORDER BY ||R.obj − S.obj||
STOP AFTER k
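The ORDER BY ... STOP AFTER k semantics can be mimicked by selecting the k smallest elements of the cross product. This is a brute-force sketch with the hypothetical name `k_closest_pairs`; ties are broken by enumeration order, which corresponds to the non-deterministic variant rather than result set enlargement.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k):
    """Return the k pairs of R x S with the smallest Euclidean distance."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
closest = k_closest_pairs(R, S, 2)
```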

k-Closest Pair Query

• Self-join:
- Exclude |R| trivial pairs (ri, ri) with distance 0
- Result is symmetric

• Applications:
- Find all pairs of stock quotes in a database that are most similar to each other
- Find music scores which are similar to each other
- Noise robust duplicate elimination

k-Closest Pair Query

• Incremental ranking instead of exact specification of k

• No STOP AFTER clause:

SELECT * FROM R, S ORDER BY ||R.obj − S.obj||

• Open cursor and fetch results one-by-one
• Important: Only few results are typically fetched, so don’t determine the complete ranking

k-Nearest Neighbor Join

• Intuition: Combine each point with its k nearest neighbors

• The principle of nearest neighbor search is applied for each point of R

• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]

• There called distance semijoin

k-Nearest Neighbor Join

• Formal Definition:

• Ties solved by result set enlargement

• Other possibility: Non-determinism (don’t care which of the tie tuples are reported)
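One possible way to write such a definition, again with ties resolved by result set enlargement, is as the smallest subset of $R \times S$ satisfying the two conditions below. The symbol $\mathrm{NN}_k$ is a hypothetical name chosen for this sketch, and a Euclidean norm is assumed.

```latex
\[
\forall r \in R:\quad \bigl|\{\, s \in S \mid (r,s) \in \mathrm{NN}_k(R,S) \,\}\bigr| \ge k,
\]
\[
\forall (r,s) \in \mathrm{NN}_k(R,S)\;\; \forall (r,s') \in (R \times S) \setminus \mathrm{NN}_k(R,S):
\quad \|r - s\| < \|r - s'\|
\]
```

In contrast to the closest pair query, the second condition compares only pairs that share the same point r of R.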

k-Nearest Neighbor Join

In SQL notation: (limited to k = 1)

SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj − S.obj||
STOP AFTER K (* k *)

k-Nearest Neighbor Join

• The k-NN-join is inherently asymmetric: joining R with S generally yields different pairs than joining S with R
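The k-NN join and its asymmetry can be illustrated with a brute-force sketch. It assumes hashable tuple points and Euclidean distance; the name `knn_join` is hypothetical, and ties are broken by input order.

```python
import heapq
import math

def knn_join(R, S, k):
    """Combine each point of R with its k nearest neighbors in S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (1.0, 0.0)]
S = [(2.0, 0.0)]

r_to_s = knn_join(R, S, 1)   # every point of R gets a partner from S
s_to_r = knn_join(S, R, 1)   # the single point of S picks only one point of R
```

Here the join of R with S produces two pairs, while the join of S with R produces one, so swapping the operands changes the result.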

k-Nearest Neighbor Join

• Applications of the k-NN-join:
- k-means and k-medoid clustering
- Simultaneous nearest neighbor classification: a large set of new objects without class label is assigned according to the majority of the k nearest neighbors of each of the new objects
  • Astronomic observation
  • Online customer scoring

• Ranking on the k-NN-join is difficult to define

Further possible definitions

• Inverse nearest neighbor join: Combine each point ri of R with every point of S which considers ri to be its nearest neighbor

• Metric data sets: Instead of vectors use arbitrary objects with a distance metric
- E.g. text sequences with edit distance
- Text mining using the similarity join applies A*
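For metric data, the only thing the join needs is a distance function. The sketch below pairs a standard dynamic-programming edit distance with a nested-loop range join over strings; the function names are hypothetical, and no A* search is involved.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-wise dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def metric_epsilon_join(R, S, eps, dist):
    """Distance range join for arbitrary objects with a distance function."""
    return [(r, s) for r in R for s in S if dist(r, s) <= eps]

matches = metric_epsilon_join(["join", "tree"], ["joins", "free", "index"],
                              1, edit_distance)
```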

3. Applications

Density Based Data Mining

Schema for Data Mining Algorithms

Algorithmic Schema A1

foreach Point p ∈ D
    PointSet S := SimilarityQuery (p, ε) ;
    foreach Point q ∈ S
        DoSomething (p,q) ;
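Schema A1 can be sketched in runnable form as follows. The similarity query is assumed to be a range query answered by a linear scan, and `do_something` is a callback supplied by the mining algorithm; all names are hypothetical.

```python
import math

def similarity_query(D, p, eps):
    """Range query: all points of D within distance eps of p."""
    return [q for q in D if math.dist(p, q) <= eps]

def schema_a1(D, eps, do_something):
    """One similarity query per point, then process every query result."""
    for p in D:
        S = similarity_query(D, p, eps)
        for q in S:
            do_something(p, q)

calls = []
schema_a1([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)], 1.0,
          lambda p, q: calls.append((p, q)))
```

The order of the callback invocations is dictated by the order of the query points, which is the property the query order transformation later gives up.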

Iterative similarity queries and cache

Due to the curse of dimensionality: no sufficient inter-query locality of the pages

[Figure: average cache hit ratio (axis 0.00 to 0.08) versus dimension d (0 to 40), for 10-nn queries and similarity range queries]


Idea: Query Order Transformation

[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]

Transform order of similarity queries such that packing of points into pages is considered

If one pair of index pages is in the cache: process all sim. queries regarding this pair

Each pair of pages is considered at most once


Transform the Original Schema A1…

Algorithmic Schema A1

foreach Point p ∈ D
    PointSet S := SimilarityQuery (p, ε) ;
    foreach Point q ∈ S
        DoSomething (p,q) ;

…Into a New Algorithmic Schema A2

foreach DataPage P
    LoadAndPinPage (P) ;
    foreach DataPage Q
        if (mindist (P,Q) ≤ ε)
            CachedAccess (Q) ;
            foreach Point p ∈ P
                foreach Point q ∈ Q
                    if (distance (p,q) ≤ ε)
                        DoSomething’ (p,q) ;
    UnFixPage (P) ;
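Schema A2 can be sketched with pages modelled as in-memory lists of points. This illustration assumes that mindist is computed between the minimum bounding rectangles of two pages; the buffer operations LoadAndPinPage, CachedAccess, and UnFixPage are omitted because no disk is involved, and all function names are hypothetical.

```python
import math

def mbr(page):
    """Minimum bounding rectangle of a page as (lower corner, upper corner)."""
    dims = range(len(page[0]))
    return ([min(p[i] for p in page) for i in dims],
            [max(p[i] for p in page) for i in dims])

def mindist(box_p, box_q):
    """Smallest possible distance between any two points of two rectangles."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    gaps = [max(lq - hp, lp - hq, 0.0)
            for lp, hp, lq, hq in zip(lo_p, hi_p, lo_q, hi_q)]
    return math.sqrt(sum(g * g for g in gaps))

def schema_a2(pages, eps, do_something):
    """Process all point pairs of a page pair while both pages are at hand."""
    boxes = [mbr(page) for page in pages]
    for P, box_p in zip(pages, boxes):
        for Q, box_q in zip(pages, boxes):
            if mindist(box_p, box_q) <= eps:      # page pair may hold join mates
                for p in P:
                    for q in Q:
                        if math.dist(p, q) <= eps:
                            do_something(p, q)

calls = []
schema_a2([[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0)]], 1.0,
          lambda p, q: calls.append((p, q)))
```

The mindist test skips page pairs that cannot contribute any result, so the inner point loops run only for promising pairs.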

Similarity Join

A2 is a Similarity-Join-Algorithm:

foreach PointPair (p,q) ∈ (R ⋈ R)
    DoSomething’ (p,q) ;

Where ⋈ denotes the Similarity-Join:

SELECT * FROM R r1, R r2
WHERE distance (r1.object, r2.object) ≤ ε

Implementation Variants

• The change of the order in which points are combined must partially be considered

Implementation:
- Semantic rewriting: change the algorithm to take the unknown order into account
- Materialization: materialize the join result j and answer the original queries by j

Example Clustering Algorithms

DBSCAN [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD 1996]
- Flat clustering (non-hierarchical)
- Implemented by semantic rewriting

OPTICS [Ankerst, Breunig, Kriegel, Sander: OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD Conf. 1999]
- Hierarchical cluster structure
- Implemented by materialization

Transformation by Semantic Rewriting

• Rewrite the algorithm to take the changed order of pairs into account

• Don’t assume any specific order in which pairs are generated, so that an arbitrary similarity join algorithm is possible

Example: DBSCAN

p is a core object in D wrt. ε, MinPts: $|N_\varepsilon(p)| \ge MinPts$

p is directly density-reachable from q in D wrt. ε, MinPts:
1) $p \in N_\varepsilon(q)$ and
2) q is a core object wrt. ε, MinPts

density-reachable: transitive closure.

cluster:
- maximal wrt. density reachability
- any two points are density-reachable from a third object

Implementation of DBSCAN on Join

Core point property:
DoSomething() increments a counter attribute

Determination of maximal density-reachable clusters:
DoSomething():
- Assign ID of known cluster point to unknown cluster points
- Unify two known clusters
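The two ingredients, a neighbor counter and the unification of clusters, can be sketched on top of a self-join result as shown below. This is a simplified illustration, not the tutorial's implementation: it makes two passes over the pair list instead of doing everything inside one DoSomething() call, it uses a nested-loop self join, and it tracks cluster membership with a union-find structure. All names are hypothetical.

```python
import math
from itertools import combinations

def epsilon_self_join(points, eps):
    """Index pairs (i, j) with i < j and distance at most eps."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) <= eps]

def join_dbscan(points, eps, min_pts):
    pairs = epsilon_self_join(points, eps)       # order of pairs is irrelevant

    # Pass 1: core point property via one counter per point.
    count = [1] * len(points)                    # a point is its own neighbor
    for i, j in pairs:
        count[i] += 1
        count[j] += 1
    core = [c >= min_pts for c in count]

    # Pass 2: unify clusters of core points with union-find.
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        if core[i] and core[j]:
            parent[find(i)] = find(j)            # unify two known clusters

    labels = [find(i) if core[i] else None for i in range(len(points))]
    for i, j in pairs:                           # attach border points
        for border, c in ((i, j), (j, i)):
            if not core[border] and core[c] and labels[border] is None:
                labels[border] = find(c)
    return labels                                # None marks noise

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
labels = join_dbscan(points, eps=1.5, min_pts=3)
```

Because neither pass depends on the order of the pairs, any similarity join algorithm could supply them.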


Implementing OPTICS (Materialization)

• The join result is determined before starting the actual OPTICS algorithm

• The result is materialized in some table with GROUP-BY on the first point of the pair

• The OPTICS algorithm runs unchanged
• Similarity queries are answered from the join materialization table (much faster)
• Disadvantage: High memory requirements
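The materialization idea can be sketched with a dictionary that groups join partners by the first point of each pair. The nested-loop join, the use of point indices as keys, and the function names are assumptions of this illustration; the OPTICS algorithm itself is not shown.

```python
import math
from collections import defaultdict

def materialize_join(points, eps):
    """Precompute the self join, grouped by the index of the first point."""
    table = defaultdict(list)
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            d = math.dist(p, q)
            if d <= eps:
                table[i].append((d, j))
    for neighbors in table.values():
        neighbors.sort()                 # nearest partners first
    return table

def similarity_query(table, i):
    """Answer a range query for point i from the materialized table."""
    return table[i]

table = materialize_join([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)], 1.0)
```

The table holds one entry per result pair, which makes the memory requirement noted above directly visible.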

Experimental Results: Page Capacity

[Figure: run time [sec] (axis 100 to 1,000,000) versus page capacity, two panels: meteorology data (9-dimensional) and color image data (64-dimensional). Series: Q-DBSCAN (Seq. Scan), Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree), J-DBSCAN (X-tree)]

Experimental Results: Scalability

[Figure: run time [sec] (axis 0 to 150,000) versus size of database [points], two panels: color image data and meteorology data. Series: Q-DBSCAN (Seq. Scan), Q-DBSCAN (X-tree), J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree), J-OPTICS (X-tree)]

Experimental Results: Query Range

[Figure: run time [sec] versus epsilon, two panels, both on color image data. Series: query-based (X-tree) and join-based (X-tree), labelled Q-DBSCAN (X-tree) and J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree), J-OPTICS (X-tree)]

Robust Similarity Search

[Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]

• Usual similarity search with feature vectors: not robust with respect to
- Noise: Euclidean distance is sensitive to a mismatch in a single dimension
- Partial similarity: not complete objects are similar, but parts thereof

• Concept to achieve robustness: decompose each data object and query object into sub-objects and search for a maximum number of similar sub-objects

Robust Similarity Search

• Prominent concept borrowed from IR research: string decomposition. Search for similar words by indexing character triplets (n-lets)

• The query is transformed into a set of similarity queries, i.e. a similarity join between query set and data set

• Robustness achieved in result recombination:
- Noise robustness: ignore missing matches
- Partial search: don’t enforce complete recombination
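The string decomposition idea can be sketched with character triplets. In this illustration the sub-object comparison is exact equality of triplets, which is the simplest special case of a similarity predicate; the function names and the `min_matches` threshold are hypothetical.

```python
def trigrams(word):
    """Decompose a word into its set of character triplets."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def robust_search(query, dictionary, min_matches):
    """Rank words by the number of triplets shared with the query.

    Missing matches are simply ignored, so a typo in the query
    costs a few triplets instead of ruling out the right word.
    """
    q = trigrams(query)
    scored = [(len(q & trigrams(word)), word) for word in dictionary]
    return sorted(((w, s) for s, w in scored if s >= min_matches),
                  key=lambda item: -item[1])

hits = robust_search("similarty", ["similarity", "simulation", "join"], 3)
```

The misspelled query still shares five triplets with "similarity", while the other words fall below the threshold.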

Robust Similarity Search

Applications:
• Robust search for sequences: [Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]

• Principle can be generalized for objects like
- Raster images
- CAD objects
- 3D molecules
- etc.

Astronomic Catalogue Matching

• Relative position of catalogues approximately known:
- Position and intensity parameters in different bands

[Figure: catalogues C1 and C2]

• $C_1 \bowtie_\varepsilon C_2$

• Determine ε according to device tolerance

Astronomic Catalogue Matching

• Relative position unknown:
- Match according to triangles and intensity

[Figure: catalogues C1 and C2]

• Search triangles and store parameters (height, ...)
• triangles (C1) ⋈ triangles (C2)

k-Nearest Neighbor Classification

• Simultaneous classification of many objects [Braunmüller, Ester, Kriegel, Sander: Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases, ICDE 2000]

- Astronomy
  • Some 10,000 new objects collected per night
  • Classify according to some millions of known objects

- Online customer scoring
  • Some 1,000 customers online
  • Rate them according to some millions of known patterns

k-Nearest Neighbor Classification

• Example:

[Figure: objects with known class and new objects, k = 3]

• New objects ⋈ Known objects (k-NN join)
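Simultaneous classification can be sketched as a k-NN join followed by a majority vote per new object. The brute-force neighbor search, the data layout of `known` as (point, label) pairs, and the function name are assumptions of this illustration.

```python
import heapq
import math
from collections import Counter

def knn_classify(new_objects, known, k):
    """Assign each new object the majority label of its k nearest known objects."""
    result = []
    for o in new_objects:
        nearest = heapq.nsmallest(k, known, key=lambda pl: math.dist(o, pl[0]))
        votes = Counter(label for _, label in nearest)
        result.append(votes.most_common(1)[0][0])
    return result

known = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
predicted = knn_classify([(0.2, 0.2), (5.5, 5.5)], known, k=3)
```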

k-Means and k-Medoid Clustering

• k points are initially selected at random ("centers")
• Each database point is assigned to its nearest center
• Centers are re-determined
- k-means: mean of all assigned points (artificial point)
- k-medoid: one central database point of the cluster

• Assignment and center determination are repeated until convergence

k-Means and k-Medoid Clustering

• Example: (k-means with k = 3)

[Figure: iterations of k-means until convergence]

• Each assignment phase: DB-Points ⋈ Centers (1-NN join)
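The k-means loop can be sketched with the assignment phase written as a 1-nearest-neighbor join of the database points with the centers. In this illustration the initial centers are passed in explicitly instead of being drawn at random, so that the run is reproducible; the function names are hypothetical.

```python
import math

def assign(points, centers):
    """Assignment phase: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def k_means(points, centers, max_iter=100):
    """Alternate assignment and center determination until nothing changes."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep center of empty cluster
        if new_centers == centers:               # convergence
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = k_means(points, [(0.0, 0.0), (10.0, 0.0)])
```

For k-medoid, the center determination step would pick a database point of each cluster instead of the mean.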

4. Similarity Join Algorithms

Algorithms’ Overview

[Figure: overview of similarity join algorithms. Join types: range dist. join, closest pair query, k-NN join. Approaches: index based, hashing based, sorting based, on-the-fly index. Optimization: cost modeling, CPU optimizing]

Algorithms’ Overview

Distance range join (ε-join)

Index joins with depth-first and breadth-first search
[Brinkhoff, Kriegel, Seeger: Efficient Proc. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
[Huang, Jing, Rundensteiner: Spatial Joins Usg. R-trees: Breadth-First Traversal..., VLDB 1997]

Index construction on-the-fly
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
[van den Bercken, Schneider, Seeger: Plug&Join, EDBT 2000]

Join-algorithms based on hashing
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]

Algorithms' Overview

• Join algorithms based on sorting
  [Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
  [Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1998]
  [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]

• Closest pair query and nearest neighbor join
  [Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
  [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
  [Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000]

• Optimization approaches
  [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 16:30]
  [Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted]

Nested Loop Join

• Simple nested loop join:
  - Iterate over R-points
  - Nested iteration over S-points
  → S is scanned |R| times, high I/O cost

• Nested block loop join:
  - First iterate over blocks
  - Nested iteration over tuples
  → S is scanned |R|/|B| times

[Figure: R ⋈ S shown as R-tuples against S-tuples and as R-blocks against S-blocks]
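The nested block loop join can be sketched as follows, assuming both sets are in-memory lists and a "block" is simply a slice of `block_size` points. The counter `s_scans` is an illustrative stand-in for I/O cost and shows why S is read once per R-block rather than once per R-point.

```python
import math

def nested_block_loop_join(R, S, eps, block_size):
    """Report all pairs (r, s) with distance <= eps; also count scans of S."""
    result, s_scans = [], 0
    for i in range(0, len(R), block_size):
        r_block = R[i:i + block_size]       # one block of R held in memory
        s_scans += 1                        # S is read once per R-block
        for j in range(0, len(S), block_size):
            s_block = S[j:j + block_size]
            for r in r_block:               # nested iteration over tuples
                for s in s_block:
                    if math.dist(r, s) <= eps:
                        result.append((r, s))
    return result, s_scans
```

With `block_size = 1` this degenerates to the simple nested loop join, where S is scanned |R| times.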

Indexed Nested Loop Join

• Iterate over every point of R
• Determine matches in S by similarity queries on the index
• Due to the curse of dimensionality:
  - Performance of the similarity queries deteriorates
  - Then not competitive with the nested loop join
  (depends on the dimensionality and on the selectivity determined by ε)

Spatial Join vs. Similarity Join

Spatial join:
• 2D polygon databases
• Join predicate: overlap
• Conservative approximation: MBR (axis-parallel rectangle)

Similarity join:
• High-dimensional point databases
• Join predicate: distance
• Map the ε-join to a spatial join: cube with edge length ε

• Some strategies can be borrowed from the spatial join
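One way to read the cube mapping is sketched below, under the assumption that each point is surrounded by an axis-parallel cube of edge length ε centered at the point. Two such cubes overlap exactly when the points differ by at most ε in every dimension, so cube intersection is a conservative filter and the exact distance is checked in a refinement step. The function names are hypothetical.

```python
import math

def cubes_intersect(p, q, eps):
    # Cubes of edge length eps centered at p and q overlap exactly when
    # the points differ by at most eps in every dimension.
    return all(abs(a - b) <= eps for a, b in zip(p, q))

def eps_join_via_cubes(R, S, eps):
    # Filter step: cube intersection (spatial join predicate)
    candidates = [(p, q) for p in R for q in S if cubes_intersect(p, q, eps)]
    # Refinement step: exact Euclidean distance
    return [(p, q) for p, q in candidates if math.dist(p, q) <= eps]
```

The filter never loses a result, because a Euclidean distance of at most ε implies a difference of at most ε in each coordinate.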

R-tree Spatial Join (RSJ)

[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

• Originally: spatial join for 2D rectangle intersection
• Depth-first search in R-trees and similar indexes
• Assumption: index preconstructed on R and S
• Simple recursion scheme (equal tree height):

  procedure r_tree_join (R, S: page)
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if intersect (r, s) then
          r_tree_join (r, s);

R-tree Spatial Join (RSJ)

• Adaptation for the similarity join: distance predicate rather than intersection
• For a pair (R,S) of pages: mindist (R,S) ≤ ε
  mindist (R,S): least possible distance between two points in (R,S)

R-tree Spatial Join (RSJ)

  procedure r_tree_sim_join (R, S, ε)
    if IsDirpg (R) ∧ IsDirpg (S) then
      foreach r ∈ R.children do
        foreach s ∈ S.children do
          if mindist (r, s) ≤ ε then
            CacheLoad (r); CacheLoad (s);
            r_tree_sim_join (r, s, ε);
    else (* assume R, S both DataPg *)
      foreach p ∈ R.points do
        foreach q ∈ S.points do
          if |p − q| ≤ ε then report (p, q);
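The recursion can be tried out on a small in-memory structure. The `Node` class, the helpers `leaf` and `inner`, and the omission of `CacheLoad` are assumptions of this sketch; like the pseudocode, it assumes both trees have equal height.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:                                        # hypothetical in-memory page
    lo: tuple                                      # lower corner of the MBR
    hi: tuple                                      # upper corner of the MBR
    children: list = field(default_factory=list)   # set for directory pages
    points: list = field(default_factory=list)     # set for data pages

def leaf(points):
    return Node(tuple(map(min, zip(*points))), tuple(map(max, zip(*points))),
                points=points)

def inner(children):
    lo = tuple(map(min, zip(*(c.lo for c in children))))
    hi = tuple(map(max, zip(*(c.hi for c in children))))
    return Node(lo, hi, children=children)

def mindist(a, b):
    # Least possible distance between a point in a's MBR and one in b's MBR
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(R, S, eps, out):
    if R.children and S.children:                  # both directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:           # prune distant page pairs
                    r_tree_sim_join(r, s, eps, out)
    else:                                          # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
```

Page pairs whose MBRs are more than ε apart are never descended into, which is where the savings over the nested loop join come from.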

R-tree Spatial Join (RSJ)

• Extension to different tree heights is straightforward
• Several additional optimizations are possible
• CPU-bound:
  - Cost dominated by point-distance calculations
• Disadvantages:
  - No clear strategy for page access prioritization
  - Single page accesses
  → Can be outperformed by the nested block loop join

Parallel RSJ

[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]

• Again spatial join for 2D rectangle intersection
• Three phases of parallel execution:
  - Task creation (non-parallel)
  - Task assignment (non-parallel)
  - Task execution (completely parallel)
• A task corresponds to a pair of subtrees
  - At a high tree level (e.g. root or second level)

Parallel RSJ

• Example for the task definition
• Strategy 1: Static range assignment
• Strategy 2: Static round-robin assignment
• Strategy 3: Dynamic task assignment
  - A processor requests a task when idle
  - Best load balancing
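The three assignment strategies can be compared with a small simulation. The per-task costs are hypothetical numbers chosen for illustration (in practice they are not known in advance), and "idle" is modeled as the processor with the smallest accumulated load.

```python
import heapq

def static_range(costs, n):
    # Contiguous ranges of the task list, one range per processor
    size = -(-len(costs) // n)                  # ceiling division
    return [sum(costs[i * size:(i + 1) * size]) for i in range(n)]

def static_round_robin(costs, n):
    # Task t goes to processor t mod n
    return [sum(costs[i::n]) for i in range(n)]

def dynamic(costs, n):
    # The processor that becomes idle first requests the next task
    loads = [0] * n
    heap = [(0, i) for i in range(n)]
    for c in costs:
        load, i = heapq.heappop(heap)
        loads[i] = load + c
        heapq.heappush(heap, (loads[i], i))
    return loads
```

The maximum entry of each returned list is the parallel completion time; in the skewed example used in the test below it shrinks from range assignment to round-robin to dynamic assignment.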

Breadth-First R-tree Join (BFRJ)

[Huang, Jing, Rundensteiner: Spatial Joins Using R-trees: Breadth-First Traversal..., VLDB 1997]

• Again spatial join for 2D rectangle intersection
• Shortcomings of RSJ:
  - No strategy in the outer loop for improving locality in the inner loop
  - Depth-first traversal is not flexible, because a pair of tree branches must be finished before the next pair is started
  → Unnecessary page accesses

Breadth-First R-tree Join (BFRJ)

• Solution:
  - Proceed level by level (breadth-first traversal)
  - Determine all relevant pairs for the next level → intermediate join index (IJI)
  - Sort the IJI according to a suitable order before accessing the next level → global optimization strategy

Breadth-First R-tree Join (BFRJ)

Options for ordering:
1. No particular order
2. Lower x-coordinate of R's nodes
3. Sum of the x-coordinates of the centers of R and S
4. x-coordinate of the center of the common MBR
5. Hilbert value of the center of the common MBR

Better ordering strategies yield higher locality (better cache hit rates).

Approaches without Preconstructed Index

• Indexes can be constructed temporarily for the join
• R-tree construction by INSERT is too expensive
  → Use cheap bottom-up construction:
  - Hilbert R-trees: O(n log n)
    [Kamel, Faloutsos: Hilbert R-trees: An Improved R-tree using Fractals, VLDB 1994]
    Sort points by space-filling curve (SFC) and pack adjacent points into a page
  - Buffer trees
    [van den Bercken, Seeger, Widmayer: A Generic Approach to Bulk Loading.., VLDB 1997]
  - Repeated partitioning
    [Berchtold, Böhm, Kriegel: Improving the Query Performance ..., EDBT 1998]
• Index construction can amortize during the join
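The sort-and-pack idea can be sketched in a few lines. Note the swap: for brevity this example uses the Z-order (bit interleaving) instead of the Hilbert curve, and it only builds the leaf level; points are assumed to lie in [0,1)^d and the names are hypothetical.

```python
def z_value(point, bits=8):
    # Interleave the bits of the integer grid coordinates (Z-order)
    top = (1 << bits) - 1
    cells = [min(int(x * (1 << bits)), top) for x in point]
    z = 0
    for b in range(bits - 1, -1, -1):           # most significant bit first
        for c in cells:
            z = (z << 1) | ((c >> b) & 1)
    return z

def bulk_load_leaves(points, capacity):
    ordered = sorted(points, key=z_value)       # O(n log n) sort by SFC value
    # Pack adjacent points of the sort order into one page each
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
```

Upper index levels could be built the same way by packing adjacent pages, which avoids the repeated splits of insert-based construction.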

Seeded Trees

[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]

• Again spatial join for 2D rectangle intersection
• Assumption: only one data set (R) is supported by an index
• Typical application: set S is a subquery result
• Idea: use the partitioning of R as a template for S

Seeded Trees

• Motivation:
  - Early inserts into R-trees decide the initial organization
  - We know that S will be matched with R
  - Start with a small template tree instead of an empty root → seed levels

Seeded Trees

• Tree consists of:
  - Seed levels
  - Grown levels
• Tree is unbalanced
• Phases of tree construction:
  - Seeding phase
  - Growing phase
  - Cleanup phase

Seeded Trees

• Seeding phase:
  - Copy k levels of the R-tree of set R
  - Last level: defined MBRs, but empty child pointers, called slots
  - Three strategies for (slot and other) MBRs:
    • Copy complete MBR
    • Use only center point rather than complete MBR
    • Center point at slot level, otherwise complete MBR

Seeded Trees

• Growing phase:
  - Insertion of points: choose the subtree as in the R*-tree
  - Seed levels are not affected during the growing phase:
    • No insertions into seed level nodes
    • No splits of seed level nodes
  - If a point is inserted into an empty slot (NULL pointer):
    • A new empty data node is allocated
    • From then on, this node is treated like a root in R-trees: on overflow, no split is propagated upward (new root)
    • The R-trees in the slots are called grown subtrees

Seeded Trees

• Growing phase (cont.):
  - Various strategies for updating the MBRs in the seed levels during insert operations:
    • No updates
    • Enlarge the bounding box after the insert of a point not contained in it
    • Determine the minimum bounding rectangle after the insert
    • ...
  - In the seed levels, the page regions are in general ...
    • not bounding rectangles, i.e. no conservative approximation of the set
    • not minimal

Seeded Trees

• Cleanup phase:
  - The MBR property of page regions is needed ...
    • ... not for tree construction
    • ... but for join processing
  - Therefore, the actual MBRs are determined during cleanup
  - Empty slots (without grown subtrees) are deleted
  - No attempt is made to balance the tree
• Join the two indexed sets R and S as in RSJ

Seeded Trees

• Experimental results (spatial data)

The ε-kdB-tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

• Algorithm for the range distance self join
• General idea: grid approximation where the grid line distance = ε
• Not all dimensions are used for decomposition: only as many dimensions as needed for the defined node capacity

The ε-kdB-tree

• Node fanout: 1/ε (assuming data space [0..1]^d)
• Tree structure is specific to the given parameter ε → must be constructed for each join
• The ε-kdB-trees of two adjacent stripes are assumed to fit into main memory

The ε-kdB-tree

  procedure t_match (R, S: node)
    if is_leaf (R) ∧ is_leaf (S) then
      ...
    else
      for i := 1 to 1/ε − 1 do
        t_match (R.child[i], S.child[i]);
        t_match (R.child[i], S.child[i+1]);
        t_match (R.child[i+1], S.child[i]);
      t_match (R.child[1/ε], S.child[1/ε]);
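The matching scheme can be illustrated with a simplified main-memory sketch for two point sets in [0,1]^d. Several things here are assumptions rather than the published algorithm: leaves are plain Python lists, directory nodes are dicts, dimensions are split in fixed order, and a leaf that meets a directory node is simply matched against all of its children.

```python
import math

def build(points, eps, capacity, dim=0):
    """Return a leaf (list of points) or a directory node (dict)."""
    if len(points) <= capacity or dim == len(points[0]):
        return points
    n = math.ceil(1 / eps)                      # node fanout 1/eps
    cells = [[] for _ in range(n)]
    for p in points:
        cells[min(int(p[dim] / eps), n - 1)].append(p)
    return {"child": [build(c, eps, capacity, dim + 1) if c else []
                      for c in cells]}

def t_match(R, S, eps, out):
    r_leaf, s_leaf = isinstance(R, list), isinstance(S, list)
    if r_leaf and s_leaf:
        out.extend((p, q) for p in R for q in S if math.dist(p, q) <= eps)
    elif r_leaf:                                # structures differ: descend S only
        for s in S["child"]:
            t_match(R, s, eps, out)
    elif s_leaf:                                # structures differ: descend R only
        for r in R["child"]:
            t_match(r, S, eps, out)
    else:
        n = len(R["child"])
        for i in range(n):
            t_match(R["child"][i], S["child"][i], eps, out)
            if i + 1 < n:                       # only adjacent cells can match
                t_match(R["child"][i], S["child"][i + 1], eps, out)
                t_match(R["child"][i + 1], S["child"][i], eps, out)
```

Because the grid line distance equals ε, points in non-adjacent cells of a split dimension are more than ε apart, so only cell i against cells i and i+1 (in both directions) has to be examined.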

The ε-kdB-tree

• Limitation: not really scalable for large ε values
• In high-dimensional cases, ε = 0.3 can be typical → 60% of the data must be held in main memory
• As long as the data fit into main memory: the ε-kdB-tree is one of the best similarity join algorithms
• Unfortunately: IBM does not provide any code for comparison

The Parallel ε-kdB-tree

[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]

• Parallel construction of the ε-kdB-tree:
  - Each processor holds a random subset of the data (1/N)
  - Each processor constructs the ε-kdB-tree of its own set
  - Identical structure is enforced, e.g. by split broadcast

The Parallel ε-kdB-tree

• Workload distribution:
  - Global determination of the cumulated node sizes
  - A unit workload is a pair (r,s) of leaf nodes
  - The cost of a workload is |r|·|s| for different leaves and |r|·(|r|+1)/2 for a single leaf (self join)
  - Data is redistributed: each processor gets 1/N of the work
    • Join units are clustered to preserve locality
    • Redistribution (communication) and replication are minimized

The Parallel ε-kdB-tree

• Workload execution:
  - Delete internal structure
  - Cumulated node size too large → second growth phase
  - Data redistribution is performed asynchronously: data are sent in depth-first order of the tree traversal to avoid network flooding

Plug & Join

[van den Bercken, Schneider, Seeger: Plug&Join: An Easy-to-Use Generic Algorithm, EDBT 2000]

• Generic technique for several kinds of join:
  - Main-memory R-tree constructed from a sample of R
  - Partition R and S according to the R-tree (buffers at the leaves)

[Figure: partitions 1 to 4 of R and S buffered in main memory and flushed]

Spatial Hash Join

[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]

• Method for the spatial join using replication:
  - Set R is partitioned without replication
  - Set S is partitioned according to R's buckets; replication if intersection with more than one R-bucket
  - Join only corresponding buckets

Spatial Hash Join

• Partitioning of R:
  - Uses bootstrap seeding, which generates a seeded tree
  - A suitable number # of slots is determined
  - The set R is sampled (sample size c · #)
  - Using some clustering method, # cluster centers are determined in the sample
  - The cluster centers are the slots in the seeded tree
  - Each R-object is assigned to the slot with least enlargement

Spatial Hash Join

• Partitioning of S and join phase:
  - Bucket extents of R are copied to S-buckets
  - For the spatial join: each object s of S is assigned to all buckets b which are intersected by s
  - For the similarity join: each object s of S is assigned to all buckets b with mindist (s,b) ≤ ε
  - All corresponding bucket pairs (r,s) are joined by constructing a quadratic-split R-tree on r
  - Each object in s is probed against the R-tree on r
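The similarity-join variant of the S-partitioning step can be sketched as follows, assuming bucket extents are given as (lower corner, upper corner) pairs. The function names are hypothetical; the point is that an S-point is replicated into every bucket it could still find a join mate in.

```python
import math

def mindist_point_box(p, lo, hi):
    # Distance from point p to the nearest point of the box [lo, hi]
    gaps = (max(l - x, x - h, 0.0) for x, l, h in zip(p, lo, hi))
    return math.sqrt(sum(g * g for g in gaps))

def partition_S(S, bucket_extents, eps):
    """bucket_extents: list of (lo, hi) copied from R's buckets."""
    buckets = [[] for _ in bucket_extents]
    for s in S:
        for b, (lo, hi) in enumerate(bucket_extents):
            if mindist_point_box(s, lo, hi) <= eps:   # s may be replicated
                buckets[b].append(s)
    return buckets
```

Points of S that are farther than ε from every bucket extent are dropped, since they cannot have a join mate in R.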

Partition Based Spatial Merge Join

[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1996]

• Again a spatial join method using replication:
  - Both sets R and S are partitioned with replication
  - Space is regularly decomposed into tiles
  - Partitions either correspond to tiles or are determined from them using hashing

Partition Based Spatial Merge Join

• Duplicate pairs can be generated → duplicate elimination by sorting according to (OID_R, OID_S)
• Initial number of partitions: (|R| + |S|) · size_pt / memsize
  This formula does not take into account:
  - replication
  - data skew
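Both points can be made concrete with a short sketch. Rounding the partition estimate up to an integer is an assumption of this example, as are the function names; result pairs are assumed to be (OID_R, OID_S) tuples.

```python
import math

def initial_partitions(card_R, card_S, size_pt, memsize):
    # (|R| + |S|) * size_pt / memsize, rounded up (the rounding is an assumption)
    return math.ceil((card_R + card_S) * size_pt / memsize)

def eliminate_duplicates(pairs):
    # Sort by (OID_R, OID_S); duplicates then sit next to each other
    out = []
    for pair in sorted(pairs):
        if not out or out[-1] != pair:
            out.append(pair)
    return out
```

Duplicates arise because an object replicated into several partitions can meet the same partner in more than one of them.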

Approaches Using Space Filling Curves

• Space filling curves recursively decompose the data space into uniform pieces
• Various different orders are possible

Approaches Using Space Filling Curves

• Efficient filter for the join: objects in different cells cannot intersect each other → sort-merge join, e.g. on Z-order
• Problem: an object may cross grid lines
  - either decompose the object (redundant)
  - or assign it to the containing cell

Approaches Using Space Filling Curves

• If all cells have uniform size: equi-join on grid cell numbers (bit strings)
• If cells have varying size: bit strings of varying length
• Objects may intersect ...
  - if bitstr (r) is a prefix of bitstr (s)
  - or bitstr (s) is a prefix of bitstr (r)

Orenstein's Spatial Join

[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]

• Allows (limited) redundancy through object decomposition
• Algorithm:
  - Objects are decomposed
  - Partial objects are ordered according to the lexicographical order of the bit strings
  - Objects are accessed in a sort-merge-like fashion
  - Two stacks are maintained to keep track of the prefix objects of R and S

Orenstein's Spatial Join

• Stacks for prefix objects
• Mergesort principle: from the two files, read the next element, i.e. the one that is smaller according to the lexicographical order
• The stacks are updated: discard anything that is not a prefix of the new string
• The new object is compared to every object on the other stack
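The merge with two prefix stacks can be sketched as below. The inputs are assumed to be lists of (bit string, object id) pairs, already sorted lexicographically by bit string; the function name and the candidate-pair output format are choices of this example, and the refinement step on the actual geometry is left out.

```python
def z_merge_join(R, S):
    """R, S: lists of (bitstring, oid), sorted by bitstring. Returns candidate pairs."""
    stacks = {"R": [], "S": []}
    out = []
    i = j = 0
    while i < len(R) or j < len(S):
        # Mergesort principle: take the smaller next element of the two files
        take_r = j >= len(S) or (i < len(R) and R[i][0] <= S[j][0])
        if take_r:
            bits, oid = R[i]
            i += 1
        else:
            bits, oid = S[j]
            j += 1
        # Discard everything that is not a prefix of the new string
        for stack in stacks.values():
            while stack and not bits.startswith(stack[-1][0]):
                stack.pop()
        # Compare the new object with every object on the other stack
        other = stacks["S"] if take_r else stacks["R"]
        for _, other_oid in other:
            out.append((oid, other_oid) if take_r else (other_oid, oid))
        stacks["R" if take_r else "S"].append((bits, oid))
    return out
```

Since a prefix sorts before all of its extensions, every object still on a stack encloses the cell of the newly read object, so each reported pair satisfies the prefix condition.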

Orenstein's Spatial Join

• Controlling redundancy:
  - Allowing no redundancy → many objects are approximated by the empty string
  - Decomposing every object down to the basis resolution → no manageable set of objects
• Two methods for controlling redundancy:
  - Size-bound: given a maximum number of partial objects
  - Error-bound: given a maximum error volume of the approximation

Multidimensional Spatial Join

[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1998, Best Paper Award]

• No redundancy allowed at all
• Instead of stacks: separate level files for different bit string lengths
• Problems with no redundancy:
  - With increasing dimension: increasing ε
  - Increasing chance that an object intersects one of the primary decomposition lines → approximated by the empty bit string ⟨⟩

Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]

• Motivation like the ε-kdB-tree: based on a grid with grid line distance ε
• Possible join mates are restricted to 3^d cells
• Here no tree structure, but a sort order of the points based on the lexicographical order of the grid cells
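A minimal sketch of such a sort order, assuming grid cells are identified by the integer tuple floor(x_i / ε) and relying on Python's lexicographic tuple comparison. The names `ego_key`, `ego_sort` and `cells_adjacent` are illustrative, not taken from the paper.

```python
import math

def ego_key(p, eps):
    # Grid cell coordinates of p; tuples compare lexicographically
    return tuple(math.floor(x / eps) for x in p)

def ego_sort(points, eps):
    return sorted(points, key=lambda p: ego_key(p, eps))

def cells_adjacent(p, q, eps):
    # Join mates can only lie in the 3^d cells around a point's own cell
    return all(abs(a - b) <= 1 for a, b in zip(ego_key(p, eps), ego_key(q, eps)))
```

No index has to be built: one external sort of the file establishes the order on which the join then operates.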

Epsilon Grid Order

• A simple exclusion test (used for I/O): a point q that lies before p − [ε,...,ε]^T or after p + [ε,...,ε]^T in epsilon grid order cannot be a join mate of point p or of any point beyond p (with respect to epsilon grid order)
• The interval between p − [ε,...,ε]^T and p + [ε,...,ε]^T is called the ε-interval

Epsilon Grid Order

• Sort file and decompose it into I/O units

Closest Pair Queries

[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]

• For both point objects and spatial objects
• Find the k object pairs with least distance
• Basis algorithm* for nearest neighbor search, extended to take point pairs into account

* [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]

Basis Algorithm for NN Search

Active Page List:
root
p2 | p1 | p4 | p3
p1 | p4 | p24 | p3 | p23 | p21 | p22
p14 | p4 | p24 | p3 | p12 | p23 | p13 | p21 | p22

[Figure: pages 1 to 4 with their subpages 11 to 44]

Hjaltason/Samet: Closest Pair Queries

Nearest neighbor search → closest pair query:
• k result points → k point pairs
• Active page list → list of active page pairs
• Initialization: root → root pair (root_R, root_S)
• Distance point/query → distance of the point pair
• mindist page/query → mindist between the page pair
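The correspondence can be made concrete with a best-first sketch over a priority queue of pairs. The data layout is hypothetical: a point is a tuple, a page is a Python list of points or pages, and bounding boxes are recomputed on demand for brevity. Only one node of a pair is expanded at a time.

```python
import heapq
import itertools
import math

def box(node):
    # Bounding box of a point (tuple) or of a page (list of entries)
    if isinstance(node, tuple):
        return node, node
    los, his = zip(*(box(c) for c in node))
    return tuple(map(min, zip(*los))), tuple(map(max, zip(*his)))

def mindist(a, b):
    (alo, ahi), (blo, bhi) = box(a), box(b)
    return math.sqrt(sum(max(al - bh, bl - ah, 0.0) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def k_closest_pairs(root_r, root_s, k):
    tie = itertools.count()                     # tie-breaker, avoids comparing nodes
    heap = [(mindist(root_r, root_s), next(tie), root_r, root_s)]
    out = []
    while heap and len(out) < k:
        d, _, r, s = heapq.heappop(heap)
        if isinstance(r, tuple) and isinstance(s, tuple):
            out.append((d, r, s))               # point pair: next closest pair
        elif isinstance(r, list):               # expand r only
            for c in r:
                heapq.heappush(heap, (mindist(c, s), next(tie), c, s))
        else:                                   # r is a point: expand s
            for c in s:
                heapq.heappush(heap, (mindist(r, c), next(tie), r, c))
    return out
```

A point pair can be reported as soon as it reaches the front of the queue, because the mindist of every remaining page pair is a lower bound for all point pairs it contains.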

Hjaltason/Samet: Closest Pair Queries

Active Page List:
(root,root)
(root,p1) | (root,p2) | (root,p3) | (root,p4)

Hjaltason/Samet: Closest Pair Queries

• Unidirectional node expansion: given a pair (r_i, s_j), only one node is expanded
• Closest pair ranking: incremental version of k-closest pair queries; the stopping criterion is the validation of the next pair
• k-nearest neighbor join: runs a closest pair ranking and filters out the (k+1)st and later occurrences of each point of R
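The filtering step of the k-nearest neighbor join can be sketched on top of any closest pair ranking. Here the ranking is a brute-force stand-in (a heap over all pairs) and not the incremental algorithm itself; both function names are assumptions.

```python
import heapq
import math
from collections import Counter

def closest_pair_ranking(R, S):
    # Stand-in for the incremental algorithm: yields pairs by ascending distance
    heap = [(math.dist(r, s), r, s) for r in R for s in S]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)

def knn_join(R, S, k):
    seen, out = Counter(), []
    for _, r, s in closest_pair_ranking(R, S):
        if seen[r] < k:                 # drop the (k+1)st and later occurrences of r
            seen[r] += 1
            out.append((r, s))
        if len(out) == k * len(R):      # every point of R has its k neighbors
            break
    return out
```

Because the ranking delivers pairs in ascending distance, the first k pairs that contain a given point of R are exactly its k nearest neighbors in S.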

Hjaltason/Samet: Closest Pair Queries

• Two strategies for tie breaks (same distance):
  - Depth-first
  - Breadth-first
• Three policies for tree traversal:
  - Basic (one tree determines priority)
  - Even (priority to the node with shallower depth)
  - Simultaneous (all possible pairs are candidates for traversal)

Alternative Approaches

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

• Various improvements and optimizations:
  - Bidirectional node expansion
  - Plane sweep technique for bidirectional node expansion
  - Adaptive multi-stage algorithm
    • Aggressive pruning using estimated distances

Active page list:
(root,root)
(p1,p3) | (p2,p3) | (p2,p4) | (p1,p2) | (p3,p4) | (p1,p4)

Page 141: Christian Böhm Ludwig Maximilians Universität München The Similarity Join: A Powerful Database Primitive for High Performance Data Mining Tutorial, 17th.

Chr

isti

an B

öhm

141150

Alternative ApproachesAlternative Approaches

[Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000]

• 5 different algorithms for closest pair queries:
- Naive: depth-first traversal of the two R-trees with a recursive call for each child pair (ri,sj) of (r,s)
- Exhaustive: like naive, but prunes page pairs whose mindist exceeds the current k-CP-dist
- Simple recursive: additionally prunes using minmaxdist

[Figure: mindist, minmaxdist and maxdist between two page regions]
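The pruning distances named above can be made concrete for axis-parallel page regions. The following is a minimal sketch, assuming Euclidean distance and boxes given as `(lower, upper)` coordinate tuples; the function names and the `can_prune` helper are illustrative only, and minmaxdist is left out because its definition is more involved.

```python
import math


def mindist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(g * g for g in gaps))


def maxdist(a, b):
    """Largest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    spans = (max(ah - bl, bh - al)
             for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(s * s for s in spans))


def can_prune(a, b, k_cp_dist):
    """A page pair cannot contribute if even its mindist exceeds the
    distance of the current k-th closest pair."""
    return mindist(a, b) > k_cp_dist
```

For example, the boxes `((0, 0), (1, 1))` and `((4, 5), (6, 6))` are separated by a gap of 3 in x and 4 in y, so their mindist is 5, and the pair is pruned as soon as the current k-CP-dist drops below 5.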

Alternative Approaches

• 5 different algorithms (...):
- Sorted distances recursive: before descending, sort the child pairs according to their mindist in order to quickly obtain a good distance for pruning. Analogous to [Roussopoulos, Kelley, Vincent: Nearest Neighbor Queries, SIGMOD Conf. 1995]
- Heap algorithm: similar to the algorithm by Hjaltason & Samet, with some minor differences

• New strategies for ties and for trees of different height

[Figure: mindist, minmaxdist and maxdist between two page regions]
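The effect of sorting page pairs by mindist before inspecting them can be illustrated on a single level of pages. The sketch below is a simplification, not the recursive R-tree algorithm from the paper: it assumes flat lists of pages (each page a list of point tuples), searches only the single closest pair (k = 1), and uses hypothetical helper names.

```python
import math
from itertools import product


def bounding_box(points):
    """Axis-parallel bounding box of a page as (lower, upper)."""
    lows = tuple(min(c) for c in zip(*points))
    highs = tuple(max(c) for c in zip(*points))
    return lows, highs


def box_mindist(a, b):
    """Lower bound for the distance of any point pair from boxes a and b."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))


def closest_pair(pages_r, pages_s):
    """Return ((distance, r, s), number of page pairs actually inspected)."""
    # Sort all page pairs by their mindist so that promising pairs come first.
    candidates = sorted(
        (box_mindist(bounding_box(pr), bounding_box(ps)), i, j)
        for (i, pr), (j, ps) in product(enumerate(pages_r), enumerate(pages_s))
    )
    best = (math.inf, None, None)
    inspected = 0
    for lower_bound, i, j in candidates:
        if lower_bound >= best[0]:
            break  # all remaining page pairs are at least this far apart
        inspected += 1
        for r, s in product(pages_r[i], pages_s[j]):
            d = math.dist(r, s)
            if d < best[0]:
                best = (d, r, s)
    return best, inspected
```

Because the first page pair usually yields a small distance right away, most of the remaining pairs fail the mindist test and are never opened.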

Modeling and Optimization

[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630]

Mating probability of index pages:
• Probability that the distance between two pages does not exceed ε
• Two-fold application of the Minkowski sum
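One way to read the two-fold Minkowski sum is to enlarge the first page region by the ε-neighborhood and then by the second page region; the volume of the result is the probability that the two pages have to be joined. The sketch below works this out only for a deliberately simple setting that is an assumption of this example and not the cost model of the cited paper: maximum norm, page regions placed uniformly and independently in the unit data space, and boundary effects ignored. The function name and parameters are hypothetical.

```python
from math import prod


def mating_probability(extent_r, extent_s, eps):
    """Probability that two page regions are within distance eps.

    extent_r and extent_s hold the side lengths of the two regions per
    dimension. In each dimension the enlarged region has the side length
    a + b + 2 * eps, capped at the extent of the data space (1.0).
    """
    return prod(min(1.0, a + b + 2.0 * eps)
                for a, b in zip(extent_r, extent_s))
```

With side lengths (0.1, 0.2) for both pages and ε = 0.05, the enlarged region measures 0.3 by 0.5, which gives a mating probability of about 0.15.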

Modeling and Optimization

• I/O cost:
- High constant cost per page
- Optimum at a large page capacity

• CPU cost:
- Low constant cost per page
- Optimum at a low page capacity

CPU performance like a CPU-optimized index

I/O performance like an I/O-optimized index

Plane Sweep Optimization

[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

For the directory in the R-tree spatial join (RSJ):
- Avoid computation of all C² box overlaps/distances
- Sort the boxes according to their lower x-coordinates
- Plane sweep to determine the box pairs
- Hold all rectangles intersected by the sweep plane in the status structure

[Figure: sweep plane]
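The steps listed above can be sketched as follows. This is a simplified illustration rather than the RSJ implementation: rectangles are 2-d tuples `(xlo, ylo, xhi, yhi)`, the status structure is a plain list per input, and the function name is hypothetical.

```python
def sweep_box_pairs(boxes_r, boxes_s):
    """Return the index pairs (i, j) of intersecting rectangles."""
    # Sort all rectangles of both inputs by their lower x-coordinate.
    events = sorted(
        [(b[0], 0, i) for i, b in enumerate(boxes_r)] +
        [(b[0], 1, j) for j, b in enumerate(boxes_s)]
    )
    active_r, active_s = [], []  # rectangles currently cut by the sweep plane
    pairs = []
    for x, side, idx in events:
        if side == 0:
            box = boxes_r[idx]
            # Drop rectangles of S that end before the sweep plane.
            active_s = [j for j in active_s if boxes_s[j][2] >= x]
            for j in active_s:
                other = boxes_s[j]
                if box[1] <= other[3] and other[1] <= box[3]:  # y-overlap
                    pairs.append((idx, j))
            active_r.append(idx)
        else:
            box = boxes_s[idx]
            # Drop rectangles of R that end before the sweep plane.
            active_r = [i for i in active_r if boxes_r[i][2] >= x]
            for i in active_r:
                other = boxes_r[i]
                if box[1] <= other[3] and other[1] <= box[3]:  # y-overlap
                    pairs.append((i, idx))
            active_s.append(idx)
    return sorted(pairs)
```

Each intersecting pair is reported exactly once, namely when the rectangle with the larger lower x-coordinate enters the sweep plane; rectangles that are far apart in x are never compared.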

Plane Sweep Optimization

[Arge, Procopiuc, Ramaswamy, Suel, Vitter: Scalable Sweeping Based Spatial Join, VLDB 1998]

• A plane sweep algorithm for the spatial join:
- Partition the space into k stripes such that at most 2N/k objects start or end in each stripe
- A rectangle contained in a single stripe is called small
- Other rectangles are decomposed: start piece, end piece, centerpiece
- Recursive determination of intersections for start pieces, end pieces and small rectangles

• Optimum complexity O(n log n + |R ⋈ S|)

Plane Sweep Optimization

[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted for pub.]

• Reduction of the computational cost of point distances:
- Most important cost factor for all similarity join algorithms

• Plane-sweep (also called sort-merge) method:
- Sort the points on both pages according to a selected dimension
- Many point pairs can be excluded beforehand

• Crucial: choice of the dimension
- Distance or overlap
- Extent of the pages
- Probability model
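The sort-merge idea for a single pair of pages can be sketched as follows. The example assumes Euclidean distance and takes the sweep dimension as a given parameter `dim`; how to select that dimension is exactly the open point named above and is not modeled here. The function name and the `computed` counter are illustrative only.

```python
import math


def sweep_point_join(page_r, page_s, eps, dim=0):
    """Return (pairs with distance <= eps, number of distance computations)."""
    r_sorted = sorted(page_r, key=lambda p: p[dim])
    s_sorted = sorted(page_s, key=lambda p: p[dim])
    result = []
    computed = 0
    start = 0
    for r in r_sorted:
        # Points of S lying more than eps below r in the sweep dimension
        # cannot match r or any later point of R.
        while start < len(s_sorted) and s_sorted[start][dim] < r[dim] - eps:
            start += 1
        for s in s_sorted[start:]:
            if s[dim] > r[dim] + eps:
                break  # all further points of S are too far above r
            computed += 1
            if math.dist(r, s) <= eps:
                result.append((r, s))
    return result, computed
```

On two pages with 3 and 4 points, a nested loop would compute 12 distances; with a well-chosen dimension the sweep only computes distances for points that are within ε of each other in that dimension.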

5 Conclusions

Summary

• The similarity join is a powerful database primitive

• Supports many new applications of
- Data mining
- Data analysis

• Considerable performance improvements

Summary

• Many different algorithms for the similarity join:
- Most for the distance range join (ε-join)
- Some approaches for closest pair queries
- The important nearest neighbor join operation has hardly been considered yet

• All 3 types of join have different applications

• Comparison of different join algorithms:
- Mostly a competition for speed

Summary

• Only few other advantages/disadvantages:
- Scalability:
  • MSJ and the ε-kdB-tree have high main memory requirements in high-dimensional spaces
- Existence of an index:
  • Hardly matters, because R-trees can be constructed quickly bottom-up; the construction time is often much less than the join time
  • Even if preconstructed indexes exist, approaches based on sorting are often better
- No good criteria known for algorithm selection

Future Research Directions

• Applications:
- Many standard data mining methods can be accelerated:
  • Outlier detection
  • Various clustering algorithms (e.g. obstacle clustering)
  • Hough transformation and similar analysis methods
  • ...
- New data mining methods will become feasible:
  • Subspace clustering & correlation detection
  • Methods may become interactive
  • ...

Future Research Directions

• Algorithms:
- Sufficient research on the ε-join and the closest pair query
- Almost no convincing approaches for the k-NN join, an important database primitive for many applications
- Parallel algorithms
- Non-vector metric data (e.g. text mining)
- Approximate join algorithms:
  • Similarity search: approximate search is often sufficient
  • Join performance could be considerably improved
- ...

Future Research Directions

• Optimization of various critical parameters:
- Dimension
- Replication
- Index scan strategies
- ...

Questions?

Comparison with Multiple Queries

[Figure: run time [sec] (0 to 70000) versus epsilon (0.00 to 0.20) for SQ-DBSCAN (X-tree), MQ-DBSCAN (Scan), MQ-DBSCAN and J-DBSCAN (X-tree)]


Experiments: Page Capacity

[Figure: two panels, meteorology data (9-dimensional) and color image data (64-dimensional); run time [sec] (100 to 1000000) versus page capacity for Q-DBSCAN (Seq. Scan), Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree) and J-DBSCAN (X-tree)]


Experiments: Query Region

[Figure: two panels on color image data; run time [sec] versus epsilon for the query-based and the join-based variant: Q-DBSCAN (X-tree) and J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree) and J-OPTICS (X-tree)]


Experiments: Synthetic Data

[Figure: three panels, 4d-UNIFORM, 8d-UNIFORM and 8d-UNIFORM]

Future Work

• Base further KDD algorithms on the join:
- E.g. outlier detection
- Subspace clustering, detection of correlations
- Interactivity

• New algorithms for the similarity join:
- Exploiting the optimization potential (dimension, ...)
- Parallelization
- Approximate join processing
- "k-nearest-neighbor joins" and "k-best-pair joins"

KDD Algorithms Based on Similarity Queries

[Figure: KDD algorithms based on similarity queries: DBSCAN, OPTICS, ...; LOF, distance-based outliers, ...; simultaneous nearest neighbor classification, ...; spatial trend detection, spatial association rules]

Curse of Dimensionality

Cost model opens optimization potential:

• Optimization of the page capacity (# points)
[Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index, EDBT 2000]

• Optimized index compression
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An Index Compression Technique for High-Dimensional Spaces, ICDE 2000]

• Optimized dimension assignment
[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal Multidimensional Query Processing Using Tree Striping, DaWaK 2000]