Christian Böhm, Ludwig Maximilians Universität München

The Similarity Join: A Powerful Database Primitive for High Performance Data Mining

Tutorial, 17th Int. Conf. on Data Engineering, 2001-04-02

1. Motivation

High Performance Data Mining

Fast decisions require knowledge just in time

Marketing, fraud detection, CRM, online scoring, OLAP

Previous Approaches to Fast Data Mining

• Sampling, approximations (grid), dimensionality reduction: loss of quality
• Parallelism: expensive & complex
• All approaches can be combined with the join
• KDD applications get parallelism for free

Feature Based Similarity

Simple Similarity Queries

• Specify query object and
  - Find similar objects – range query
  - Find the k most similar objects – nearest neighbor query

Similarity – Range Queries

• Given: query point q, maximum distance ε
• Formal definition: { p ∈ DB : ||p − q|| ≤ ε }
• Cardinality of the result set is difficult to control: ε too small → no results; ε too large → complete DB
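A range query can be sketched as a linear scan. This is only an illustration under assumptions not stated on the slide: points are tuples of floats, the distance is Euclidean, and the names `range_query`, `db`, `q` and `eps` are hypothetical.

```python
import math

def range_query(db, q, eps):
    """Return all points of db whose Euclidean distance to q is at most eps."""
    return [p for p in db if math.dist(p, q) <= eps]

db = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(range_query(db, (0.0, 0.0), 0.8))  # the two points within distance 0.8
print(range_query(db, (0.0, 0.0), 5.0))  # eps too large: the complete DB
```

The second call shows the cardinality problem: once eps exceeds the diameter of the data, every point is returned.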

Index Based Processing of Range Queries

Similarity – Nearest Neighbor Queries

• Given: Query point q

• Formal definition:

• Ties must be handled:
  - Result set enlargement
  - Non-determinism (don't care)

Index Based Processing of NN Queries

k-Nearest Neighbor Search and Ranking

• k-nearest neighbor query:
  - Search not only for one nearest neighbor but for k
  - Stop distance is the distance of the k-th (last) candidate point
• Ranking query:
  - Incremental version of k-nearest neighbor search
  - First call of FetchNext() returns first neighbor
  - Second call of FetchNext() returns second neighbor, ...
  - Typically only few results are fetched → don't generate all!
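The incremental behavior of a ranking query can be sketched with a Python generator, where each `next()` plays the role of FetchNext(). This is a scan-based stand-in, not the index-based algorithm of the tutorial; `ranking` and `knn` are hypothetical names and Euclidean distance is assumed.

```python
import heapq
import math
from itertools import islice

def ranking(db, q):
    """Yield (distance, point) pairs in order of increasing distance to q."""
    heap = [(math.dist(p, q), p) for p in db]
    heapq.heapify(heap)            # O(n) setup, no full sort
    while heap:
        yield heapq.heappop(heap)  # O(log n) per fetched neighbor

def knn(db, q, k):
    """k-nearest neighbor query expressed as k fetches from the ranking."""
    return list(islice(ranking(db, q), k))

db = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cursor = ranking(db, (0.0, 0.0))
print(next(cursor))  # first neighbor
print(next(cursor))  # second neighbor
```

Only the neighbors actually fetched pay the per-element cost, which mirrors the point that the complete ranking should not be generated in advance.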

Advanced Applications: Duplicates

• Duplicate detection
  - E.g. astronomic catalogue matching (catalogues C1 and C2)
• Similarity queries for a large number of query objects

Advanced Applications: Data Mining

• Density based clustering (DBSCAN)

What is a Similarity Join?

• Given two sets R, S of points
• Find all pairs of points according to similarity

• Various exact definitions for the similarity join


• Similarity join corresponds to set of identical similarity queries, evaluated for a large number of query points

• Sequential evaluation of similarity queries with index is the easiest similarity join algorithm

• Many more sophisticated approaches exist
• Powerful database primitive to support modern applications of data analysis and data mining

Curse of Dimensionality

• Index structures fail (outperformed by the sequential scan) if the data space dimension becomes too high

• Many effects usually called Curse of Dimensionality



[Berchtold, Böhm, Keim, Kriegel: A Cost Model for High-Dim. Nearest Neighb. Search, PODS 1997]

With increasing dimension, the following also increase:
• Typical radius of range queries
• Distance of a point to its nearest neighbor
• Edge length of regions of index structures

0.5^1 = 0.5, 0.7^2 ≈ 0.5, 0.8^3 ≈ 0.5
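The growth of edge lengths can be made concrete with a small calculation. The sketch reads the numbers above as the edge length of a cube-shaped region that covers half of the unit data space; that reading is an interpretation, and `edge_length` is a hypothetical helper.

```python
def edge_length(volume, d):
    """Edge length of a d-dimensional cube with the given volume."""
    return volume ** (1.0 / d)

for d in (1, 2, 3, 10, 100):
    print(d, round(edge_length(0.5, d), 3))
```

Even though the region covers only half of the data space, its edge length approaches the full extent of the space as d grows.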



A cost model for the access probability of index pages using the concept of Minkowski Sum



Binomial formula:



• Asymptotic behavior of similarity search

• Suppose number points VMink 2d VSphere

• Access probability = O(2^d), but limited by 100%
• Saturation area with near linear I/O cost O(n)
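The saturation effect can be illustrated numerically. The sketch uses the standard binomial expansion for the volume of the Minkowski sum of a cube (edge length a) and a sphere (radius r); whether this matches the exact formula shown on the slides is an assumption, as are the uniform data distribution in the unit data space, the neglect of boundary effects, and all function names.

```python
import math

def sphere_volume(k, r):
    """Volume of a k-dimensional sphere of radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) * r ** k / math.gamma(k / 2 + 1)

def minkowski_volume(d, a, r):
    """Volume of a d-dimensional cube with edge a, enlarged by a sphere of radius r."""
    return sum(math.comb(d, k) * a ** (d - k) * sphere_volume(k, r)
               for k in range(d + 1))

def access_probability(d, a, r):
    """Probability that a page region is touched by a query, capped at 100%."""
    return min(1.0, minkowski_volume(d, a, r))

for d in (2, 4, 8, 16):
    print(d, round(access_probability(d, 0.5, 0.5), 3))
```

In two dimensions the expansion gives a² + 4ar + πr², which is the area of a square with rounded, enlarged borders; in higher dimensions the value quickly reaches the 100% cap.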



• For high dimension: each similarity query accesses a considerable fraction of all index pages
• The index does not pay off; a sequential scan is needed anyway
• Strategies needed for efficient evaluation
• Join: base applications on a powerful database primitive that exploits the high number of queries
• Efficient algorithms for the similarity join

Organization of the Tutorial

1. Motivation

2. Defining the Similarity Join

3. Applications of the Similarity Join

4. Similarity Join Algorithms

5. Conclusion & Future Potential

2. Defining the Similarity Join

What Is a Similarity Join?

Intuitive notion: 3 properties of the similarity join
1. The similarity join is a join in the relational sense: two sets R and S are combined into one such that the new set contains pairs of points that fulfill a join condition
2. Vector or metric objects rather than ordinary tuples of any type
3. The join condition involves similarity

Similarity join:
• Distance range join
• NN-based approaches
  - Closest pair query
  - k-NN join

Distance Range Join (ε-Join)

• Intuition: given parameter ε, find all pairs of points whose distance is ≤ ε
• Formal definition: R ⋈ε S = { (r,s) ∈ R × S : ||r − s|| ≤ ε }
• In SQL-like notation:
  SELECT * FROM R, S WHERE ||R.obj − S.obj|| ≤ ε
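The distance range join can be sketched as a plain nested loop over both sets. This is the naive baseline rather than one of the optimized algorithms of the tutorial; Euclidean distance, tuple-valued points and the name `range_join` are assumptions.

```python
import math

def range_join(R, S, eps):
    """All pairs (r, s) from R x S with Euclidean distance at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]
print(range_join(R, S, 1.0))
```

Every pair is compared, so the cost is |R|·|S| distance computations regardless of the result size.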



• Most widespread and best evaluated join
• Often also called the similarity join



• The distance range self join R ⋈ε R is of particular importance for data mining (clustering) and robust similarity search
• Change definition to exclude trivial results (pairs of a point with itself)



• Disadvantage for the user: result cardinality is difficult to control
  - ε too small → no result pairs are produced
  - ε too large → all pairs from R × S are produced
• Worst case complexity is at least Ω(|R|·|S|)
• For reasonable result set size, advanced join algorithms yield asymptotic behavior which is better than O(|R|·|S|)

k-Closest Pair Query

• Intuition: Find those k pairs that yield least distance

• The principle of nearest neighbor search is applied on a basis per pair

• Classical problem of Computational Geometry
• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
• There called distance join




• Ties solved by result set enlargement

• Other possibility: non-determinism (don't care which of the tie tuples are reported)



In SQL notation:
  SELECT * FROM R, S
  ORDER BY ||R.obj − S.obj||
  STOP AFTER k
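The k-closest pair query can be sketched by ranking all pairs of R × S by distance and keeping the k smallest. The sketch enumerates every pair, so it only illustrates the semantics; ties are resolved non-deterministically in the "don't care" sense, and `k_closest_pairs` is a hypothetical name.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k):
    """The k pairs (r, s) from R x S with the smallest Euclidean distance."""
    return heapq.nsmallest(k, product(R, S), key=lambda rs: math.dist(*rs))

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0)]
print(k_closest_pairs(R, S, 2))
```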



• Self-join:
  - Exclude |R| trivial pairs (ri,ri) with distance 0
  - Result is symmetric
• Applications:
  - Find all pairs of stock quotes in a database that are most similar to each other
  - Find music scores which are similar to each other
  - Noise robust duplicate elimination



• Incremental ranking instead of exact specification of k

• No STOP AFTER clause:

  SELECT * FROM R, S ORDER BY ||R.obj − S.obj||

• Open cursor and fetch results one-by-one
• Important: typically only few results are fetched → don't determine the complete ranking

k-Nearest Neighbor Join

• Intuition: Combine each point with its k nearest neighbors

• The principle of nearest neighbor search is applied for each point of R

• In the database context introduced by [Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]

• There called distance semijoin




• Ties solved by result set enlargement

• Other possibility: non-determinism (don't care which of the tie tuples are reported)



In SQL notation (limited to k = 1):
  SELECT * FROM R, S
  GROUP BY R.obj
  ORDER BY ||R.obj − S.obj||
  STOP AFTER K (* k *)
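For general k, the k-NN join can be sketched as one k-nearest neighbor search per point of R. This is a scan-based illustration with hypothetical names; points must be hashable (tuples) because they serve as dictionary keys.

```python
import heapq
import math

def knn_join(R, S, k):
    """Map each point r of R to its k nearest neighbors in S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0)]
S = [(1.0, 0.0), (3.0, 0.0)]
print(knn_join(R, S, 1))  # one result entry, for the single point of R
print(knn_join(S, R, 1))  # two result entries, one per point of S
```

Swapping the arguments changes the result, which shows that this join is not symmetric in R and S.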



• The k-NN-join is inherently asymmetric:



• Applications of the k-NN-join:
  - k-means and k-medoid clustering
  - Simultaneous nearest neighbor classification: a large set of new objects without class label are assigned according to the majority of the k nearest neighbors of each of the new objects
    • Astronomic observation
    • Online customer scoring
• Ranking on the k-NN-join is difficult to define

Further possible definitions

• Inverse nearest neighbor join: combine each point ri of R with every point of S which considers ri to be its nearest neighbor
• Metric data sets: instead of vectors use arbitrary objects with a distance metric
  - E.g. text sequences with edit distance
  - Text mining using the similarity join applies A*

3. Applications

Density Based Data Mining

Schema for Data Mining Algorithms

Algorithmic Schema A1

  foreach Point p ∈ D
      PointSet S := SimilarityQuery (p, ε) ;
      foreach Point q ∈ S
          DoSomething (p,q) ;

Iterative similarity queries and cache

Due to the curse of dimensionality: no sufficient inter-query locality of the pages

[Figure: average cache hit ratio (0.00 to 0.08) over dimension d (0 to 40) for 10-nn queries and similarity range queries]


Idea: Query Order Transformation

[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]

Transform order of similarity queries such that packing of points into pages is considered

If one pair of index pages is in the cache: process all sim. queries regarding this pair

Each pair of pages is considered at most once


Transform the Original Schema A1…

Algorithmic Schema A1

  foreach Point p ∈ D
      PointSet S := SimilarityQuery (p, ε) ;
      foreach Point q ∈ S
          DoSomething (p,q) ;

…Into a New Algorithmic Schema A2

  foreach DataPage P
      LoadAndPinPage (P) ;
      foreach DataPage Q
          if (mindist (P,Q) ≤ ε)
              CachedAccess (Q) ;
              foreach Point p ∈ P
                  foreach Point q ∈ Q
                      if (distance (p,q) ≤ ε)
                          DoSomething’ (p,q) ;
      UnFixPage (P) ;
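Schema A2 can be sketched in Python with pages modeled as in-memory lists of points. Page pinning and the cache are left out, the page regions are assumed to be minimum bounding rectangles computed on the fly, and `mbr`, `mindist` and `schema_a2` are hypothetical names.

```python
import math

def mbr(page):
    """Minimum bounding rectangle (lower corner, upper corner) of a page."""
    dims = range(len(page[0]))
    return ([min(p[i] for p in page) for i in dims],
            [max(p[i] for p in page) for i in dims])

def mindist(box_a, box_b):
    """Least possible Euclidean distance between points of two rectangles."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = (max(lb - ha, la - hb, 0.0)
            for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))
    return math.sqrt(sum(g * g for g in gaps))

def schema_a2(pages, eps, do_something):
    boxes = [mbr(page) for page in pages]
    for P, box_p in zip(pages, boxes):        # P would stay pinned here
        for Q, box_q in zip(pages, boxes):
            if mindist(box_p, box_q) <= eps:  # page pair may hold result pairs
                for p in P:
                    for q in Q:
                        if math.dist(p, q) <= eps:
                            do_something(p, q)

pages = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0)], [(0.2, 0.1)]]
found = []
schema_a2(pages, 0.25, lambda p, q: found.append((p, q)))
print(len(found))
```

Page pairs whose regions are farther apart than eps are skipped without any point comparison, and each remaining page pair is processed once for all of its points.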

Similarity Join

A2 is a Similarity-Join-Algorithm:

  foreach PointPair (p,q) ∈ (R ⋈ε R)
      DoSomething’ (p,q) ;

where R ⋈ε R denotes the similarity join:

  SELECT * FROM R r1, R r2
  WHERE distance (r1.object, r2.object) ≤ ε

Implementation Variants

• Change of the order in which points are combined must partially be considered
• Two implementation variants:
  - Semantic rewriting: change the algorithm to take the unknown order into account
  - Materialization: materialize the join result j and answer the original queries from j

Example Clustering Algorithms

• DBSCAN [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD 1996]: flat clustering (non-hierarchical)
• OPTICS [Ankerst, Breunig, Kriegel, Sander: OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD Conf. 1999]: hierarchical cluster structure
• Implementation variants: semantic rewriting, materialization

Transformation by Semantic Rewriting

• Rewrite the algorithm to take the changed order of pairs into account

• Don't assume any specific order in which pairs are generated → arbitrary similarity join algorithm possible

Example: DBSCAN

• p is a core object in D wrt. ε, MinPts: |Nε(p)| ≥ MinPts
• p is directly density-reachable from q in D wrt. ε, MinPts:
  1) p ∈ Nε(q) and
  2) q is a core object wrt. ε, MinPts
• density-reachable: transitive closure
• cluster:
  - maximal wrt. density reachability
  - any two points are density-reachable from a third object

Implementation of DBSCAN on Join

• Core point property: DoSomething() increments a counter attribute
• Determination of maximal density-reachable clusters: DoSomething()
  - Assigns ID of known cluster point to unknown cluster points
  - Unifies two known clusters
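The two roles of DoSomething() can be sketched as follows. This is a simplified two-pass variant over a materialized pair list and not necessarily the authors' procedure: the join is a nested-loop stand-in, clusters are unified with a union-find structure, each point counts itself as a neighbor, and all names are hypothetical.

```python
import math

def range_self_join(D, eps):
    """Stand-in join: index pairs (i, j), i != j, with distance <= eps."""
    n = len(D)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and math.dist(D[i], D[j]) <= eps]

def join_dbscan(D, eps, min_pts):
    pairs = range_self_join(D, eps)      # any pair order is acceptable
    count = [1] * len(D)                 # a point belongs to its own neighborhood
    for i, _ in pairs:                   # pass 1: increment counter attribute
        count[i] += 1
    core = [c >= min_pts for c in count]

    parent = list(range(len(D)))         # union-find over point indices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    border_of = [None] * len(D)
    for i, j in pairs:                   # pass 2: assign and unify cluster IDs
        if core[i] and core[j]:
            parent[find(i)] = find(j)    # unify two known clusters
        elif core[i] and border_of[j] is None:
            border_of[j] = i             # border point takes a core point's ID
    labels = []
    for i in range(len(D)):
        if core[i]:
            labels.append(find(i))
        elif border_of[i] is not None:
            labels.append(find(border_of[i]))
        else:
            labels.append(None)          # noise
    return labels

D = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(join_dbscan(D, 1.5, 3))
```

Because both passes only consume pairs, the result does not depend on the order in which the join produces them, which is what allows an arbitrary join algorithm to be plugged in.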

Implementing OPTICS (Materialization)

• The join result is predetermined before starting the actual OPTICS algorithm

• The result is materialized in some table with GROUP-BY on the first point of the pair

• The OPTICS algorithm runs unchanged
• Similarity queries are answered from the join materialization table (much faster)
• Disadvantage: high memory requirements

Experimental Results: Page Capacity

[Figure: run time [sec] (logarithmic axis, 100 to 1000000) over page capacity, two panels (page capacity 0 to 10000 and 0 to 300), for meteorology data (9-dimensional) and color image data (64-dimensional); curves: Q-DBSCAN (Seq. Scan), Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree), J-DBSCAN (X-tree)]

Experimental Results: Scalability

[Figure: run time [sec] over size of database [points], two panels: Q-DBSCAN (X-tree) and J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree) and J-OPTICS (X-tree); data sets: color image data, meteorology data]

Experimental Results: Query Range

[Figure: run time [sec] over epsilon on color image data, two panels: query-based vs. join-based (X-tree) for Q-DBSCAN (X-tree) and J-DBSCAN (X-tree); Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree) and J-OPTICS (X-tree)]

Robust Similarity Search

[Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]

• Usual similarity search with feature vectors is not robust with respect to
  - Noise: Euclidean distance is sensitive to a mismatch in a single dimension
  - Partial similarity: not the complete objects are similar, but parts thereof
• Concept to achieve robustness: decompose each data object and query object into sub-objects and search for a maximum number of similar sub-objects



• Prominent concept borrowed from IR research: string decomposition, i.e. search for similar words by indexing of character triplets (n-lets)
• Query transformed to a set of similarity queries → similarity join between query set and data set
• Robustness achieved in result recombination:
  - Noise robustness: ignore missing matches
  - Partial search: don't enforce complete recombination



Applications:
• Robust search for sequences: [Agrawal, Lin, Sawhney, Shim: Fast Similarity Search in the Presence of Noise,...., VLDB 1995]
• Principle can be generalized for objects like
  - Raster images
  - CAD objects
  - 3D molecules
  - etc.

Astronomic Catalogue Matching

• Relative position of catalogues C1 and C2 approximately known:
  - Position and intensity parameters in different bands
• C1 ⋈ε C2
• Determine ε according to device tolerance


• Relative position of catalogues C1 and C2 unknown:
  - Match according to triangles and intensity
• Search triangles and store parameters (height,...)
• triangles (C1) ⋈ε triangles (C2)

k-Nearest Neighbor Classification

• Simultaneous classification of many objects [Braunmüller, Ester, Kriegel, Sander: Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases, ICDE 2000]
  - Astronomy
    • Some 10,000 new objects collected per night
    • Classify according to some millions of known objects
  - Online customer scoring
    • Some 1,000 customers online
    • Rate them according to some millions of known patterns


• Example with k = 3: [Figure: objects with known class and new objects]
• New objects ⋈k-NN known objects

k-Means and k-Medoid Clustering

• k points initially randomly selected ("centers")
• Each database point assigned to nearest center
• Centers are re-determined
  - k-means: mean of all assigned points (artificial point)
  - k-medoid: one central database point of the cluster
• Assignment and center determination are repeated until convergence


• Example: (k-means with k = 3) [Figure: assignment and center determination steps until convergence]
• Each assignment phase: DB-points ⋈ centers (nearest neighbor join, k = 1)

4. Similarity Join Algorithms

Algorithms' Overview

Similarity join
• Join types: range distance join, closest pair query, k-NN join
• Algorithm classes: index based, on-the-fly index, hashing based, sorting based
• Optimization: cost modeling, CPU optimizing



Distance range join (ε-join)

Index joins with depth-first and breadth-first search
[Brinkhoff, Kriegel, Seeger: Efficient Proc. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
[Huang, Jing, Rundensteiner: Spatial Joins Usg. R-trees: Breadth-First Traversal..., VLDB 1997]

Index construction on-the-fly
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
[van den Bercken, Schneider, Seeger: Plug&Join, EDBT 2000]

Join-algorithms based on hashing
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]



Join-algorithms based on sorting
[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997]
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]

Closest pair query and nearest neighbor join
[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
[Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000]

Optimization approaches
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday 1630]
[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted]

Nested Loop Join

• Simple nested loop join:
  - Iterate over R-points
  - Nested iteration over S-points
  → S is scanned |R| times, high I/O cost
• Nested block loop join:
  - First iterate over blocks
  - Nested iteration over tuples
  → S is scanned |R|/|B| times

[Figure: R ⋈ S processed tuple by tuple (R-tuples, S-tuples) and block by block (R-blocks, S-blocks)]

Indexed Nested Loop Join

• Iterate over every point of R
• Determine matches in S by similarity queries on the index
• Due to the curse of dimensionality: performance deterioration of the similarity queries → then not competitive with the nested loop join (depends on the dimensionality and on the selectivity determined by ε)

Spatial Join vs. Similarity Join

Spatial join:
• 2D polygon databases
• Join predicate: overlap
• Conservative approximation: MBR (axis-parallel rectangle)

Similarity join:
• High-D point databases
• Join predicate: distance
• Map ε-join to spatial join: cube with edge length ε

• Some strategies can be borrowed from the spatial join

R-tree Spatial Join (RSJ)

[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

• Originally: spatial join for 2D rectangle intersection
• Depth-first search in R-trees and similar indexes
• Assumption: index preconstructed on R and S
• Simple recursion scheme (equal tree height):

  procedure r_tree_join (R, S: page)
      foreach r ∈ R.children do
          foreach s ∈ S.children do
              if intersect (r,s) then
                  r_tree_join (r,s) ;



• Adaptation for the similarity join: distance predicate rather than intersection
• For a pair (R,S) of pages: mindist (R,S) is the least possible distance of two points in (R,S)



  procedure r_tree_sim_join (R, S, ε)
      if IsDirpg (R) ∧ IsDirpg (S) then
          foreach r ∈ R.children do
              foreach s ∈ S.children do
                  if mindist (r,s) ≤ ε then
                      CacheLoad(r); CacheLoad(s);
                      r_tree_sim_join (r,s,ε) ;
      else (* assume R,S both DataPg *)
          foreach p ∈ R.points do
              foreach q ∈ S.points do
                  if |p − q| ≤ ε then
                      report (p,q);
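The recursion can be sketched in Python on a toy in-memory tree. The `Node` class, the `leaf` and `directory` builders and the callback `report` are hypothetical; cache loading is omitted, both trees are assumed to have equal height, and page regions are minimum bounding rectangles.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the page region
    hi: tuple                                      # upper corner of the page region
    children: list = field(default_factory=list)   # directory page: child nodes
    points: list = field(default_factory=list)     # data page: points

def leaf(points):
    dims = range(len(points[0]))
    return Node(tuple(min(p[i] for p in points) for i in dims),
                tuple(max(p[i] for p in points) for i in dims), points=points)

def directory(children):
    dims = range(len(children[0].lo))
    return Node(tuple(min(c.lo[i] for c in children) for i in dims),
                tuple(max(c.hi[i] for c in children) for i in dims),
                children=children)

def mindist(a, b):
    """Least possible distance between a point in region a and one in region b."""
    gaps = (max(b.lo[i] - a.hi[i], a.lo[i] - b.hi[i], 0.0)
            for i in range(len(a.lo)))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(R, S, eps, report):
    if R.children and S.children:                  # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:           # prune distant subtree pairs
                    r_tree_sim_join(r, s, eps, report)
    else:                                          # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    report(p, q)

tree_r = directory([leaf([(0.0, 0.0), (0.2, 0.1)]), leaf([(4.0, 4.0)])])
tree_s = directory([leaf([(0.1, 0.1)]), leaf([(4.1, 4.0), (9.0, 9.0)])])
result = []
r_tree_sim_join(tree_r, tree_s, 0.3, lambda p, q: result.append((p, q)))
print(result)
```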



• Extension to different tree heights is straightforward
• Several additional optimizations possible
• CPU-bound
  - Cost dominated by point-distance calculations
• Disadvantages
  - No clear strategies for page access prioritization
  - Single page accesses
  → Can be outperformed by nested block loop join

Parallel RSJ

[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]

• Again a spatial join for 2D rectangle intersection
• Three phases of parallel execution:
  - Task creation (non-parallel)
  - Task assignment (non-parallel)
  - Task execution (completely parallel)
• A task corresponds to a pair of subtrees
  - At high tree level (e.g. root or second level)



• Example for the task definition

• Strategy 1: Static range assignment
• Strategy 2: Static round-robin assignment
• Strategy 3: Dynamic task assignment
  - Processor requests a task when idle
  - Best load balancing

Breadth-First R-tree Join (BFRJ)

[Huang, Jing, Rundensteiner: Spatial Joins Using R-trees: Breadth-First Traversal..., VLDB 1997]

• Again a spatial join for 2D rectangle intersection
• Shortcomings of RSJ:
  - No strategy in the outer loop improving locality in the inner loop
  - Depth-first traversal not flexible, because a pair of tree branches must be finished before the next pair is started
  → unnecessary page accesses



• Solution:
  - Proceed level by level (breadth-first traversal)
  - Determine all relevant pairs for the next level → intermediate join index (IJI)
  - Sort the IJI according to a suitable order before accessing the next level → global optimization strategy



Options for ordering:
1. No particular order
2. Consider the lower x-coordinate of R's nodes
3. Sum of the centers of x-coordinates of R and S
4. x-coordinate of center of common MBR
5. Hilbert value of center of common MBR

Higher locality (better cache hit rates) for better ordering strategies.

Approaches without Preconstructed Index

• Indexes can be constructed temporarily for the join
• R-tree construction by INSERT is too expensive → use cheap bottom-up construction
  - Hilbert R-trees: O (n log n) [Kamel, Faloutsos: Hilbert R-trees: An Improved R-tree using Fractals, VLDB 1994]; sort points by SFC and pack adjacent points into a page
  - Buffer trees [van den Bercken, Seeger, Widmayer: A Generic Approach to Bulk Loading.., VLDB 1997]
  - Repeated partitioning [Berchtold, Böhm, Kriegel: Improving the Query Performance ..., EDBT 1998]
• Index construction can amortize during the join

Seeded Trees

[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]

• Again a spatial join for 2D rectangle intersection
• Assumption: only one data set (R) is supported by an index
• Typical application: set S is a subquery result
• Idea: use the partitioning of R as a template for S



• Motivation
  - Early inserts to R-trees decide the initial organization
  - We know that S will be matched with R
  - Start with a small template tree (seed levels) instead of an empty root



• Tree consists of
  - Seed levels
  - Grown levels
• Tree is unbalanced
• Phases of tree construction:
  - Seeding phase
  - Growing phase
  - Cleanup phase



• Seeding phase:
  - Copy k levels of the R-tree of set R
  - Last level: defined MBRs, but empty child pointers, called slots
  - Three strategies for (slot and other) MBRs:
    • Copy complete MBR
    • Use only center point rather than complete MBR
    • Center point at slot level, otherwise complete MBR



• Growing phase
  - Insert of points: choose subtree like in the R*-tree
  - Seed levels are not affected during the growing phase:
    • No insertions into seed level nodes
    • No split of seed level nodes
  - If a point is inserted into an empty slot (NULL pointer):
    • A new empty data node is allocated
    • Further, this node is treated like a root in R-trees: on overflow, no split is propagated upward (new root)
    • The R-trees in the slots are called grown subtrees



• Growing phase (cont.)
  - Various strategies for update of the MBRs in the seed levels during insert operations:
    • No updates
    • Enlarge bounding box after insert of a point that is not contained
    • Determine minimum bounding rectangle after insert
    • ...
  - In seed levels: in general, the page regions are
    • Not bounding rectangles, i.e. no conservative approximation of the set
    • Not minimal



• Cleanup phase
  - The MBR property of page regions is needed
    • ... not for tree construction
    • ... but for join processing
  - Therefore, actual MBRs are determined in cleanup
  - Empty slots (without grown subtrees) are deleted
  - No attempt to make the tree balanced
• Join the two indexed sets R and S like in RSJ



• Experimental results (spatial data)

The ε-kdB-tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

• Algorithm for the range distance self join

• General idea: grid approximation where grid line distance = ε
• Not all dimensions used for decomposition: as many dimensions as needed to reach a defined node capacity



• Node fanout: 1/ε (assuming data space [0..1]^d)
• Tree structure is specific to the given parameter ε → must be constructed for each join
• The ε-kdB-trees of two adjacent stripes are assumed to fit into main memory



  procedure t_match (R, S: node)
      if is_leaf (R) ∧ is_leaf (S) then
          ...
      else
          for i := 1 to 1/ε − 1 do
              t_match (R.child[i], S.child[i]) ;
              t_match (R.child[i], S.child[i+1]) ;
              t_match (R.child[i+1], S.child[i]) ;
          t_match (R.child[1/ε], S.child[1/ε]) ;



• Limitation: for large ε values not really scalable
• In high-dimensional cases, ε = 0.3 can be typical → 60% of the data must be held in main memory
• As long as the data fit into main memory: the ε-kdB-tree is one of the best similarity join algorithms
• Unfortunately: IBM does not provide any code for comparison

The Parallel ε-kdB-tree

[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]

• Parallel construction of the ε-kdB-tree:
  - Each processor has a random subset of the data (1/N)
  - Each processor constructs the ε-kdB-tree of its own set
  - Identical structure is enforced e.g. by split broadcast

[Figure: CPU1, CPU2]



• Workload distribution:
  - Global determination of the cumulated node sizes
  - A unit workload is a pair (r,s) of leaf nodes
  - The cost of a workload is |r|·|s| for different leaves and |r|·(|r|+1)/2 for a single leaf (self join)
  - Data is redistributed: each processor gets 1/N of the work
    • Join units are clustered to preserve locality
    • Minimize redistribution (communication) and replication



• Workload execution:
  - Delete internal structure
  - Cumulated node size too large → second growth phase
  - Data redistribution performed asynchronously: data sent in depth-first order of tree traversal to avoid network flooding

Plug & Join

[van den Bercken, Schneider, Seeger: Plug&Join: An Easy-to-Use Generic Algorithm, EDBT 2000]

Generic technique for several kinds of join
- Main-memory R-tree constructed from R-sample
- Partition R and S according to R-tree (buffers at leaves)

[Figure: main-memory R-tree with leaf buffers 1 to 4, which are flushed; shown for R and for S]

Spatial Hash Join

[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]

• Method for the spatial join using replication
  - Set R is partitioned without replication
  - Set S is partitioned according to R's buckets; replication if intersection with more than 1 R-bucket
  - Join only corresponding buckets



• Partitioning of R:
  - Using bootstrap-seeding, generates a seeded tree
  - A suitable number # of slots is determined
  - The set R is sampled (sample size c · #)
  - Using some clustering method, # cluster centers are determined in the set
  - The cluster centers are the slots in the seeded tree
  - Assign each R-object to the slot with least enlargement



• Partitioning of S and join phase:
  - Bucket extents of R are copied to S-buckets
  - For spatial join: each object s of S is assigned to all buckets b which are intersected by s
  - For similarity join: each object s of S is assigned to all buckets b with mindist (s,b) ≤ ε
  - All corresponding bucket pairs (r,s) are joined by constructing a quadratic split-R-tree on r
  - Each object in s is probed against the R-tree on r

Partition Based Spatial Merge Join

[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]

• Again a spatial join method using replication
  - Both sets R and S are partitioned with replication
  - Space is regularly decomposed into tiles
  - Partitions either correspond to tiles or are determined from them using hashing



• Duplicate pairs can be generated → duplicate elimination by sorting according to (OID_R, OID_S)
• Initial number of partitions determined: (|R| + |S|) · size_pt / memsize
• This formula does not take into account:
  - Replication
  - Data skew

Approaches Using Space Filling Curves

• Space filling curves recur- sively decompose the data space in uniform pieces

• Various different orders:



• Efficient filter for the join: objects in different cells cannot intersect each other → sort-merge join, e.g. on Z-order

• Problem: an object may cross grid lines
- either decompose the object (redundant)
- or assign it to the containing cell



• If all cells have uniform size: equi-join on grid cell numbers (bit strings)

• If cells have varying size: bit strings of varying length

• Objects may intersect ...
- if bitstr(r) is a prefix of bitstr(s)
- or bitstr(s) is a prefix of bitstr(r)
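The prefix condition can be illustrated with a small Python sketch. It assumes a data space [0, 1)^d that is halved in one dimension per bit (cycling through the dimensions), and it assigns each box to the smallest cell that fully contains it; `cell_bitstring` and `may_intersect` are hypothetical names.

```python
def cell_bitstring(lo, hi, max_bits):
    """Bit string of the smallest cell of a recursive halving that contains box [lo, hi]."""
    d = len(lo)
    bounds = [[0.0, 1.0] for _ in range(d)]
    bits = []
    for level in range(max_bits):
        dim = level % d
        mid = (bounds[dim][0] + bounds[dim][1]) / 2
        if hi[dim] < mid:        # box lies completely in the lower half
            bits.append("0")
            bounds[dim][1] = mid
        elif lo[dim] >= mid:     # box lies completely in the upper half
            bits.append("1")
            bounds[dim][0] = mid
        else:                    # box crosses the split line: stop here
            break
    return "".join(bits)

def may_intersect(bits_r, bits_s):
    """Two objects can only intersect if one bit string is a prefix of the other."""
    return bits_r.startswith(bits_s) or bits_s.startswith(bits_r)

small = cell_bitstring((0.1, 0.1), (0.2, 0.2), max_bits=4)
crossing = cell_bitstring((0.4, 0.1), (0.6, 0.2), max_bits=4)
far = cell_bitstring((0.6, 0.6), (0.7, 0.7), max_bits=2)
```

An object that crosses the first split line receives the empty bit string, which is a prefix of every other bit string, so it has to be compared with all objects of the other set.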


Orenstein's Spatial Join

[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]

• Allows (limited) redundancy, object decomposition
• Algorithm:
- Objects are decomposed
- Partial objects are ordered according to the lexicographical order of the bit strings
- Objects are accessed in sort-merge like fashion
- Two stacks are maintained to keep track of the prefix objects of R and S



• Stacks for prefix objects:



• Mergesort principle: from the two files, read the next element, i.e. the one which is smaller according to the lexicographical order

• The stacks are updated: discard anything that is not a prefix of the new string

• The new object is compared to every object on the other stack
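The three steps can be sketched as follows. This is a simplified in-memory illustration, not the original algorithm: each input is assumed to be a list of `(bitstring, oid)` entries, and the function only reports candidate pairs that would still have to be checked exactly.

```python
def prefix_merge_join(r_items, s_items):
    """Merge two bit-string-sorted inputs, keeping one stack of prefix objects per set."""
    r, s = sorted(r_items), sorted(s_items)
    stack_r, stack_s = [], []
    i = j = 0
    candidates = []
    while i < len(r) or j < len(s):
        # Mergesort principle: take the lexicographically smaller next element.
        take_r = j >= len(s) or (i < len(r) and r[i][0] <= s[j][0])
        if take_r:
            item = r[i]
            i += 1
        else:
            item = s[j]
            j += 1
        bits = item[0]
        # Update both stacks: discard anything that is not a prefix of the new string.
        for stack in (stack_r, stack_s):
            while stack and not bits.startswith(stack[-1][0]):
                stack.pop()
        # Compare the new object with every object on the other stack.
        if take_r:
            candidates.extend((item[1], other[1]) for other in stack_s)
            stack_r.append(item)
        else:
            candidates.extend((other[1], item[1]) for other in stack_r)
            stack_s.append(item)
    return candidates

r_items = [("", "r0"), ("01", "r1"), ("1", "r2")]
s_items = [("0", "s0"), ("011", "s1"), ("10", "s2")]
pairs = prefix_merge_join(r_items, s_items)
```

Since a prefix always precedes its extensions in lexicographical order, every pair whose bit strings are in a prefix relationship is found while both objects are on the stacks.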



• Controlling redundancy:
- Allowing no redundancy → many objects approximated by the empty string
- Decomposing every object down to the basis resolution → no manageable set of objects

• 2 methods for controlling redundancy:
- Size-bound: given a max. number of partial objects
- Error-bound: given a max. error volume of the approximation


Multidimensional Spatial Join

[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997, Best Paper Award]

• No redundancy allowed at all
• Instead of stacks: separate level files for different bit string lengths
• Problems with no redundancy:
- With increasing dimension: increasing ε
- Increasing chance that an object intersects one of the primary decomposition lines → approximated by the empty bit string < >


Multidimensional Spatial Join


Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]

• Motivation like ε-kdB-tree: based on a grid with grid line distance ε

• Possible join mates restricted to 3^d cells

• Here no tree structure but a sort order of points based on the lexicographical order of the grid cells
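A minimal Python sketch of such a sort order, assuming that the grid cell of a point is obtained by dividing each coordinate by ε and rounding down (the helper names are hypothetical):

```python
from math import floor

def grid_cell(point, eps):
    """Grid cell of a point for grid line distance eps."""
    return tuple(floor(x / eps) for x in point)

def ego_sort(points, eps):
    """Sort points by the lexicographical order of their grid cells."""
    return sorted(points, key=lambda p: grid_cell(p, eps))

def cells_adjacent(p, q, eps):
    """Join mates must lie in the same or a neighboring cell (one of 3^d cells)."""
    return all(abs(a - b) <= 1 for a, b in zip(grid_cell(p, eps), grid_cell(q, eps)))

points = [(0.3, 0.9), (0.1, 0.95), (0.3, 0.1)]
ordered = ego_sort(points, eps=0.25)
```

Points in non-adjacent cells differ by more than ε in at least one coordinate, so only pairs passing `cells_adjacent` need an exact distance computation.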





• A simple exclusion test (used for I/O): a point q with q < p − [ε,...,ε]^T or q > p + [ε,...,ε]^T (with respect to epsilon grid order) cannot be join mate of point p or any point beyond p

• The interval between p − [ε,...,ε]^T and p + [ε,...,ε]^T is called ε-interval



• Sort file and decompose it into I/O units


Closest Pair Queries

[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]

• For both point objects and spatial objects
• Find the k object pairs with least distance

• Basis algorithm* for nearest neighbor search extended to take point pairs into account

* [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]


Basis Algorithm for NN Search

Active Page List:
root
p2 | p1 | p4 | p3
p1 | p4 | p24 | p3 | p23 | p21 | p22
p14 | p4 | p24 | p3 | p12 | p23 | p13 | p21 | p22

[Figure: directory pages 1 to 4 with data pages 11 to 14, 21 to 24, 31 to 34 and 41 to 44]


Hjaltason/Samet: Closest Pair Queries

Nearest Neighbor → Closest Pair Query
• k result points → k point pairs
• active page list → list of active page pairs
• initialization: root → pair (root_R, root_S)
• distance point/query → distance of point pair
• mindist page/query → mindist between page pair
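The correspondence can be sketched with a priority queue of page pairs. The sketch below is a simplified in-memory illustration under assumptions: `Page` is a hypothetical stand-in for an index page that stores the bounding box of its points, and only one node of a pair is expanded at a time.

```python
import heapq
from itertools import count
from math import sqrt

class Page:
    """Hypothetical index page: children are points (tuples) or further pages."""
    def __init__(self, children):
        self.children = children
        pts = list(self.points())
        dims = range(len(pts[0]))
        self.lo = tuple(min(p[i] for p in pts) for i in dims)
        self.hi = tuple(max(p[i] for p in pts) for i in dims)

    def points(self):
        for child in self.children:
            if isinstance(child, Page):
                yield from child.points()
            else:
                yield child

def box(x):
    return (x.lo, x.hi) if isinstance(x, Page) else (x, x)

def mindist(a, b):
    """Minimum distance between two boxes; for two points, their distance."""
    (alo, ahi), (blo, bhi) = box(a), box(b)
    return sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                    for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def k_closest_pairs(root_r, root_s, k):
    tie = count()  # tie breaker so that the heap never compares pages
    heap = [(0.0, next(tie), root_r, root_s)]  # initialization: pair of roots
    result = []
    while heap and len(result) < k:
        dist, _, a, b = heapq.heappop(heap)
        if not isinstance(a, Page) and not isinstance(b, Page):
            result.append((dist, a, b))  # a point pair at the top is final
            continue
        # Expand one node of the pair and enqueue the resulting pairs.
        if isinstance(a, Page):
            pairs = [(child, b) for child in a.children]
        else:
            pairs = [(a, child) for child in b.children]
        for x, y in pairs:
            heapq.heappush(heap, (mindist(x, y), next(tie), x, y))
    return result

root_r = Page([Page([(0, 0), (1, 0)]), Page([(10, 10)])])
root_s = Page([Page([(0, 3)]), Page([(1, 1), (9, 10)])])
closest = k_closest_pairs(root_r, root_s, k=3)
```

A point pair can be reported as soon as it reaches the top of the queue, because every remaining entry has a mindist that is at least as large.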



Active Page List:
(root,root)
(root,p1) | (root,p2) | (root,p3) | (root,p4)

[Figure: pages 1 to 4]



• Unidirectional node expansion: given a pair (ri, sj), only one node is expanded

• Closest pair ranking: incremental version of k-closest pair queries → stopping criterion is the validation of the next pair

• k-nearest neighbor join: runs a closest pair ranking and filters out the (k+1)st occurrence (and more) of each point of R
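The filtering step of the k-nearest neighbor join can be sketched as follows. The ranking is replaced by a brute-force stand-in that sorts all pairs by distance, which is an assumption made only to keep the example self-contained; the points of R are assumed to be distinct.

```python
from collections import Counter
from itertools import product
from math import dist

def closest_pair_ranking(r_points, s_points):
    """Stand-in for an incremental ranking: all (r, s) pairs by ascending distance."""
    yield from sorted(product(r_points, s_points), key=lambda pair: dist(*pair))

def knn_join(r_points, s_points, k):
    """Keep the first k occurrences of each point of R in the ranking."""
    taken = Counter()
    result = []
    for r, s in closest_pair_ranking(r_points, s_points):
        if taken[r] < k:  # the (k+1)st and later occurrences are filtered out
            taken[r] += 1
            result.append((r, s))
    return result

r_points = [(0, 0), (10, 0)]
s_points = [(1, 0), (2, 0), (9, 0)]
joined = knn_join(r_points, s_points, k=2)
```

With an incremental ranking, the loop could stop as soon as every point of R has received its k partners.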



• Two strategies for tie breaks (same distance):
- Depth-first
- Breadth-first

• Three policies for tree traversal:
- Basic (one tree determines priority)
- Even (priority to node with shallower depth)
- Simultaneous (all possible pairs are candidates for traversal)


Alternative Approaches

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

• Various improvements and optimizations
- Bidirectional node expansion
- Plane sweep technique for bidirectional node expansion
- Adaptive multi-stage algorithm

• Aggressive pruning using estimated distances

(root,root)
(p1,p3) | (p2,p3) | (p2,p4) | (p1,p2) | (p3,p4) | (p1,p4)



[Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000]

• 5 different algorithms for closest pair queries
- Naive: depth-first traversal of the two R-trees, recursive call for each child pair (ri, sj) of (r, s)
- Exhaustive: like naive, but prune page pairs whose mindist exceeds the current k-CP-dist
- Simple recursive: additionally prune using minmaxdist

[Figure: mindist, minmaxdist and maxdist between two page regions]



• 5 different algorithms (...)
- Sorted distances recursive: before descending, sort the child pairs according to their mindist → quickly obtains a good distance for pruning. Analogous to [Roussopoulos, Kelley, Vincent: Nearest Neighbor Queries, SIGMOD Conf. 1995]
- Heap algorithm: similar to the algorithm by Hjaltason & Samet with some minor differences

• New strategies for ties and different tree heights

[Figure: mindist, minmaxdist and maxdist between two page regions]


Modeling and Optimization

[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 16:30]

• Mating probability of index pages: probability that the distance between two pages is ≤ ε
→ Two-fold application of Minkowski sum


Modeling and Optimization

• I/O cost:
- High const. cost per page
- Large capacity optimum

• CPU cost:
- Low const. cost per page
- Low capacity optimum

CPU performance like CPU-optimized index
I/O performance like I/O-optimized index


Plane Sweep Optimization

[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

For the directory in the R-tree spatial join (RSJ):
- Avoid computation of all C² box overlaps/distances
- Sort boxes according to lower x-coordinates
- Plane sweep to determine the box pairs
- Hold all rectangles intersected by the sweep plane in the status structure

[Figure: sweep plane]
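A minimal Python sketch of this plane sweep for box intersection, assuming boxes are given as `(lo, hi)` pairs of coordinate tuples and the status structure is a plain list per input set (the names are hypothetical):

```python
def overlaps_rest(a, b):
    """Overlap test in all dimensions except the sweep dimension 0."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(1, len(a[0])))

def sweep_box_pairs(boxes_r, boxes_s):
    """Return index pairs (i, j) of intersecting boxes using a sweep along x."""
    # Sort all boxes of both sets by their lower x-coordinate.
    events = sorted(
        [(b[0][0], 0, i) for i, b in enumerate(boxes_r)]
        + [(b[0][0], 1, j) for j, b in enumerate(boxes_s)]
    )
    active = ([], [])  # status structure: boxes currently cut by the sweep plane
    pairs = []
    for x, side, idx in events:
        own = boxes_r if side == 0 else boxes_s
        other = boxes_s if side == 0 else boxes_r
        # Drop boxes of the other set that the sweep plane has already passed.
        active[1 - side][:] = [k for k in active[1 - side] if other[k][1][0] >= x]
        for k in active[1 - side]:
            if overlaps_rest(own[idx], other[k]):
                pairs.append((idx, k) if side == 0 else (k, idx))
        active[side].append(idx)
    return pairs

boxes_r = [((0, 0), (2, 2)), ((5, 5), (6, 6))]
boxes_s = [((1, 1), (3, 3)), ((1, 3), (2, 4)), ((5.5, 0), (7, 1)), ((6, 6), (8, 8))]
box_pairs = sweep_box_pairs(boxes_r, boxes_s)
```

Only boxes that are cut by the sweep plane at the same time are tested against each other, which avoids most of the C² comparisons when the boxes are spread along the x-axis.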



[Arge, Procopiuc, Ramaswamy, Suel, Vitter: Scalable Sweeping Based Spatial Join, VLDB 1998]

• A plane sweep algorithm for the spatial join
- Partition space into k stripes such that at most 2N/k objects start/end in each stripe
- A rectangle contained in a single stripe is called small
- Other rectangles are decomposed: start piece, end piece, centerpiece
- Recursive determination of intersections for start and end pieces and small rectangles

• Optimum complexity O(n log n + |R ⋈ S|)



[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted for pub.]

• Reduction of the computational cost of point distances
- Most important cost factor for all similarity join algorithms

• Plane-sweep or also sort-merge method:
- Sort points on both pages according to a selected dimension
- Many point pairs can be excluded beforehand

• Crucial: dimension
- Distance or overlap
- Extent of the pages
- Probability model
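The sort-merge idea for the points of two pages can be sketched as follows. The sketch assumes Euclidean distance and takes the sweep dimension as a parameter `dim`; how the best dimension is chosen is not covered here.

```python
from math import dist

def sweep_point_pairs(page_r, page_s, eps, dim=0):
    """Report all pairs (r, s) with distance <= eps using a sweep along dimension dim."""
    r_sorted = sorted(page_r, key=lambda p: p[dim])
    s_sorted = sorted(page_s, key=lambda p: p[dim])
    pairs = []
    start = 0
    for p in r_sorted:
        # Points of S more than eps below p in the sweep dimension are excluded for good.
        while start < len(s_sorted) and s_sorted[start][dim] < p[dim] - eps:
            start += 1
        j = start
        # Only points within eps in the sweep dimension need a distance computation.
        while j < len(s_sorted) and s_sorted[j][dim] <= p[dim] + eps:
            if dist(p, s_sorted[j]) <= eps:
                pairs.append((p, s_sorted[j]))
            j += 1
    return pairs

page_r = [(0.0, 0.0), (1.0, 1.0)]
page_s = [(0.0, 0.5), (1.0, 3.0), (3.0, 1.0)]
mates = sweep_point_pairs(page_r, page_s, eps=1.0)
```

The result does not depend on `dim`, but the number of exact distance computations does, which is why the choice of the dimension is crucial.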


5 Conclusions


Summary

• Similarity join is a powerful database primitive
• Supports many new applications of
- Data mining
- Data analysis

• Considerable performance improvements


Summary

• Many different algorithms for the similarity join
- Most for the distance range join (ε join)
- Some approaches for closest pair queries
- The important operation of the nearest neighbor join has hardly been considered yet

• All 3 types of join have different applications
• Comparison of different join algorithms:
- Mostly a competition for speed


Summary

• Only few other advantages/disadvantages:
- Scalability:
  • MSJ and ε-kdB-tree have high main memory requirements in high-dimensional spaces
- Existence of an index:
  • Actually does not matter, because R-trees can be constructed quickly bottom-up. Construction time is often much less than join time
  • Even if preconstructed indexes exist: approaches based on sorting are often better
- No good criteria known for algorithm selection


Future Research Directions

• Applications:
- Many standard data mining methods can be accelerated:
  • Outlier detection
  • Various clustering algorithms (e.g. obstacle clustering)
  • Hough transformation and similar analysis methods
  • ...
- New data mining methods will become feasible:
  • Subspace clustering & correlation detection
  • Methods may become interactive
  • ...



• Algorithms
- Sufficient research for the ε join and the closest pair query
- Almost no convincing approaches for the k-NN join → important database primitive for many applications
- Parallel algorithms
- Non-vector metric data (e.g. text mining)
- Approximate join algorithms
  • Similarity search: approximate search is often sufficient
  • Join performance could be considerably improved
- ...



• Optimization of various critical parameters
- Dimension
- Replication
- Index scan strategies
- ...


Questions?


Comparison with Multiple Queries

[Figure: run time [sec] (0 to 70000) over epsilon (0.00 to 0.20) for SQ-DBSCAN (X-tree), MQ-DBSCAN (Scan), MQ-DBSCAN and J-DBSCAN (X-tree)]


Experiments: Page Capacity

[Figure, two panels: run time [sec] (logarithmic scale, 100 to 1000000) over page capacity for Q-DBSCAN (R*-tree), Q-DBSCAN (X-tree), J-DBSCAN (R*-tree) and J-DBSCAN (X-tree). Panels: meteorology data, 9-dimensional; color image data, 64-dimensional. One panel covers page capacities from 0 to 10000, the other from 0 to 300]


Experiments: Query Region

[Figure, two panels on color image data: run time [sec] over epsilon. One panel (epsilon 0.00 to 0.25, run time 0 to 70000) compares the query-based Q-DBSCAN (X-tree) with the join-based J-DBSCAN (X-tree); the other panel (epsilon 0.1 to 0.3, run time 0 to 140000) compares Q-OPTICS (Seq. Scan), Q-OPTICS (X-tree) and J-OPTICS (X-tree)]


Experiments: Synthetic Data

[Figure, three panels: 4d-UNIFORM, 8d-UNIFORM, 8d-UNIFORM]


Future Work

• Base further KDD algorithms on the join
- E.g. outlier detection
- Subspace clustering, detection of correlations
- Interactivity

• New algorithms for the similarity join
- Exploiting the optimization potential (dimension, ...)
- Parallelization
- Approximate join processing
- "k-nearest-neighbor joins" and "k-best-pair joins"


KDD Algorithms Based on Similarity Queries

[Figure: KDD algorithms built on similarity queries]
- DBSCAN
- OPTICS
- LOF
- Distance-based outliers
- Simultaneous nearest neighbor classification
- Spatial trend detection
- Spatial association rules
- ...



Cost model opens optimization potential:

• Optimization of the page capacity (# points)
[Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index, EDBT 2000]

• Optimized index compression
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An Index Compression Technique for High-Dimensional Spaces, ICDE 2000]

• Optimized dimension assignment
[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal Multidimensional Query Processing Using Tree Striping, DaWaK 2000]
