Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich. Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Feature Based Similarity

Feb 25, 2016


Page 1: Feature Based Similarity


Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich

Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Page 2: Feature Based Similarity

Feature Based Similarity

Page 3: Feature Based Similarity

Simple Similarity Queries

Specify a query object and
• Find similar objects: range query
• Find the k most similar objects: nearest neighbor query
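
The two query types on this slide can be illustrated with a minimal sketch. It assumes feature vectors are tuples of floats compared by Euclidean distance and uses a linear scan; the function names `range_query` and `knn_query` are illustrative and do not come from the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within Euclidean distance eps of the query."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Small hypothetical data set of 2-d feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
near = range_query(points, (0.0, 0.0), 1.5)
closest = knn_query(points, (0.0, 0.0), 2)
```

A linear scan is only the baseline; the point of the sketch is the difference between a distance threshold (range query) and a result-size limit (nearest neighbor query).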

Page 4: Feature Based Similarity

Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomical catalogues

[Figure: two catalogues, R and S]

Page 5: Feature Based Similarity

Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join
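
As a point of reference, the similarity self-join can be sketched as a nested-loop baseline: report every pair of points whose distance is at most ε. This is an illustrative quadratic sketch assuming Euclidean distance, not the algorithm presented in the talk; the name `similarity_self_join` is hypothetical.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within eps of each other."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Two tight groups of points; only pairs inside a group should join.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.0, 5.5)]
pairs = similarity_self_join(points, 1.0)
```

A density-based clustering method such as DBSCAN needs exactly this information (the ε-neighbors of each point), which is why it can be built on top of a similarity self-join.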

Page 6: Feature Based Similarity

Grid partitioning

General idea: Grid approximation where grid line distance = ε

Similar idea in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Disadvantage of any grid approach: the number of neighboring grid cells is 3^d − 1
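
The neighbor count can be checked with a short sketch. It assumes a grid anchored at the origin with cell width ε, so a point maps to the cell given by the floor of each coordinate divided by ε; the helper names are illustrative.

```python
import math
from itertools import product

def grid_cell(point, eps):
    """Map a point to its integer grid cell for grid line distance eps."""
    return tuple(math.floor(x / eps) for x in point)

def neighbor_cells(cell):
    """Enumerate all cells adjacent to `cell`, excluding the cell itself."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [
        tuple(c + o for c, o in zip(cell, off))
        for off in offsets
        if any(off)  # skip the all-zero offset (the cell itself)
    ]

cell = grid_cell((0.25, 1.7), 0.5)
counts = {d: len(neighbor_cells((0,) * d)) for d in (2, 3, 8)}
```

For d = 8 this already gives 3^8 − 1 = 6560 neighboring cells per cell, which illustrates why explicitly visiting neighbor cells does not scale to high dimensions.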

Page 7: Feature Based Similarity

Scalability of the ε-kdB-tree

Assumption: two adjacent ε-stripes fit in main memory. This is unrealistic for large data sets that are
• clustered
• skewed
• high-dimensional

Page 8: Feature Based Similarity

Epsilon Grid Order

Page 9: Feature Based Similarity

ε-Grid-Order Is a Total Strict Order

Strict order:
• Irreflexivity
• Transitivity
• Asymmetry

ε-grid-order can be used in any sorting algorithm
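
The definition of the order itself did not survive in this transcript. The sketch below assumes it is the lexicographic order on grid cell coordinates (floor of each coordinate divided by ε), which is how the order is usually described; `ego_less` and `ego_sort` are illustrative names. In this sketch, two points in the same cell are not ordered relative to each other.

```python
import math

def ego_less(p, q, eps):
    """True if p precedes q: lexicographic comparison of grid cell coordinates."""
    for a, b in zip(p, q):
        ca, cb = math.floor(a / eps), math.floor(b / eps)
        if ca != cb:
            return ca < cb
    return False  # same cell: neither point precedes the other

def ego_sort(points, eps):
    """Sort points by grid cell using the standard library sort."""
    return sorted(points, key=lambda p: tuple(math.floor(x / eps) for x in p))

points = [(0.6, 0.0), (0.1, 0.9), (0.1, 0.2)]
ordered = ego_sort(points, 0.5)
```

Because the comparison is irreflexive, asymmetric, and transitive, it can be handed to any comparison-based sort, including an external merge sort for files that do not fit in memory.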

Page 10: Feature Based Similarity

ε-Interval

Coarse approximation of join mates: used for I/O processing
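
The slide's figure is lost, so the following is a sketch under an assumption: the ε-interval of a point p is taken to be the range of the sorted order between p shifted by −ε and by +ε in every dimension. In grid terms, that is the lexicographic range between the cell of p minus one and the cell of p plus one in each coordinate. Helper names are illustrative.

```python
import math

def cell(point, eps):
    """Integer grid cell of a point for grid line distance eps."""
    return tuple(math.floor(x / eps) for x in point)

def eps_interval(p, eps):
    """Lower and upper bounding cells of the assumed epsilon-interval of p."""
    c = cell(p, eps)
    return tuple(x - 1 for x in c), tuple(x + 1 for x in c)

def may_be_join_mate(p, q, eps):
    """Coarse filter: is q's cell inside p's interval in lexicographic order?"""
    lower, upper = eps_interval(p, eps)
    return lower <= cell(q, eps) <= upper

p = (1.25, 1.25)
true_mate = may_be_join_mate(p, (1.5, 0.875), 0.5)     # distance about 0.45
far_but_kept = may_be_join_mate(p, (1.0, 10.0), 0.5)   # distance 8.75, still inside
pruned = may_be_join_mate(p, (3.0, 1.25), 0.5)         # outside the interval
```

Any point within distance ε of p differs from it by at most one cell per dimension, so it always falls inside the interval. The converse does not hold, as the second example shows, which matches the slide's description of the interval as a coarse approximation.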

Page 11: Feature Based Similarity

I/O Processing for the Self Join

Decompose the sorted file into I/O units

Page 12: Feature Based Similarity

Epsilon Grid Order

Page 13: Feature Based Similarity

CPU Processing

I/O units are further decomposed before joining
Simple divide-and-conquer: no further sorting
Decomposition: maximize active dimensions

Page 14: Feature Based Similarity

CPU Processing

Point distance computations: order of dimensions
• Neighboring inactive dimensions
• Unspecified dimensions
• Active dimension
• Aligned inactive dimensions
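
One plausible reason the order of dimensions matters is early termination: a distance test can stop as soon as the partial sum of squared differences exceeds ε², so visiting the most discriminating dimensions first saves work. The sketch below assumes Euclidean distance and takes the dimension order as a plain list of indices; how that list is derived from active and inactive dimensions is not shown here, and `within_eps` is an illustrative name.

```python
def within_eps(p, q, eps, dim_order):
    """Test dist(p, q) <= eps, visiting dimensions in dim_order with early abort."""
    limit = eps * eps
    total = 0.0
    for d in dim_order:
        diff = p[d] - q[d]
        total += diff * diff
        if total > limit:
            return False  # remaining dimensions cannot reduce the sum
    return True

# Dimension 2 separates the first pair, so visiting it first aborts after one step.
far = within_eps((0.0, 0.0, 0.0), (0.1, 0.1, 5.0), 1.0, [2, 0, 1])
close = within_eps((0.0, 0.0, 0.0), (0.1, 0.2, 0.3), 1.0, [2, 0, 1])
```

The result is the same for any dimension order; only the number of dimensions inspected before an abort changes.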

Page 15: Feature Based Similarity

Experimental Results

8-dimensional uniformly distributed vectors

Page 16: Feature Based Similarity

Experimental Results (2)

16-d feature vectors from CAD application

Page 17: Feature Based Similarity

Conclusions

Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU

Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms