Clustered Pivot Tables for I/O-optimized Similarity Search

Juraj Moško, Jakub Lokoč, Tomáš Skopal

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

SISAP 2011, Lipari

Mar 31, 2015


Presentation outline

Similarity search in metric spaces

Pivot tables

Clustered pivot tables
◦ Static variant
◦ Dynamic variant

Experiments

Similarity search

Suitable for unstructured data; the query object is often not in the DB

Similarity is often modeled by a metric distance d

Expensive distance functions: EMD, SQFD, DTW, …

Metric indexing
◦ Based on lower-bounding
◦ For a pivot p, query q with radius r, and object o: if abs(d(p, q) – d(p, o)) > r, filter out object o
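
The filtering rule rests on the triangle inequality: for any pivot p, abs(d(p, q) – d(p, o)) never exceeds d(q, o), so once the left side is larger than r the object cannot fall within the query radius. The following is a minimal sketch of that check, assuming Euclidean distance as a cheap stand-in for the expensive metrics named above; the function names are illustrative and do not come from the slides.

```python
import math

def euclidean(a, b):
    """Stand-in metric; the slides target costlier ones such as EMD or DTW."""
    return math.dist(a, b)

def lower_bound(d_pq, d_po):
    """Triangle-inequality lower bound on d(q, o) obtained through one pivot p."""
    return abs(d_pq - d_po)

def can_filter(d_pq, d_po, r):
    """True when object o provably lies outside the query ball (q, r)."""
    return lower_bound(d_pq, d_po) > r

# Pivot, query and object placed on a line so the bound is easy to check by hand.
p, q, o = (0.0, 0.0), (1.0, 0.0), (5.0, 0.0)
d_pq = euclidean(p, q)
d_po = euclidean(p, o)
print(lower_bound(d_pq, d_po), euclidean(q, o))
print(can_filter(d_pq, d_po, r=3.0))
```

In this collinear layout the bound is tight; in general it only underestimates d(q, o), which is why surviving objects still need the real distance.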

Pivot tables

Simple yet efficient main-memory metric index

Given k static pivots Pi and a database S of n objects Oj, the pivot table stores all the distances d(Pi, Oj) in a matrix of size k x n

Pivot tables = two structures: distance matrix + data file

Cheap filtering of non-relevant objects (lower-bounding)

Non-filtered objects are refined by the original expensive distance function
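
The filter-then-refine scheme on this slide can be sketched as follows. This is an illustrative in-memory version under simple assumptions (Euclidean distance, hand-picked pivots, hypothetical function names), not the authors' implementation.

```python
import math

def build_pivot_table(pivots, objects, dist):
    """Return a k x n matrix with table[i][j] = dist(pivots[i], objects[j])."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(q, r, pivots, objects, table, dist):
    """Filter with the pivot table, then refine the survivors with dist."""
    d_q = [dist(p, q) for p in pivots]  # k distance computations up front
    result, refined = [], 0
    for j, o in enumerate(objects):
        # Tightest lower bound on dist(q, o) over all pivots.
        lb = max(abs(d_q[i] - table[i][j]) for i in range(len(pivots)))
        if lb > r:
            continue  # filtered out without computing dist(q, o)
        refined += 1
        if dist(q, o) <= r:
            result.append(o)
    return result, refined

objects = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0)]
pivots = [(0, 0), (10, 0)]  # pivot choice is an assumption of this sketch
table = build_pivot_table(pivots, objects, math.dist)
hits, refined = range_query((1, 0), 1.5, pivots, objects, table, math.dist)
print(hits, refined)
```

Here only two of the five objects reach the refinement step; the other three are discarded using stored distances alone.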

Clustered pivot tables

What if the pivot table does not fit into main memory?

Solution 1 – just slice the data file
◦ + simple to construct
◦ - sequential scan => high I/O cost

Solution 2 – reorganize and slice the data file
◦ + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
◦ - metric clustering is expensive
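
A toy illustration of why the reorganized layout helps, assuming 1-D integer objects, a single pivot at 0, and plain sorting as a stand-in for metric clustering. The data and helper names are made up for the example and say nothing about the measured results.

```python
def slice_into_pages(objects, page_size):
    """Cut a sequence of objects into consecutive fixed-size pages."""
    return [objects[i:i + page_size] for i in range(0, len(objects), page_size)]

def pages_to_load(pages, q, r, pivot=0):
    """Count pages holding at least one object the pivot cannot filter out."""
    d_pq = abs(q - pivot)
    return sum(
        any(abs(d_pq - abs(o - pivot)) <= r for o in page)
        for page in pages
    )

values = list(range(16))
# Solution 1: objects stay in arrival order (here deliberately interleaved).
arbitrary = [v for start in range(4) for v in values[start::4]]
# Solution 2: similar objects are placed next to each other before slicing.
clustered = sorted(values)

sliced_cost = pages_to_load(slice_into_pages(arbitrary, 4), q=5, r=1)
clustered_cost = pages_to_load(slice_into_pages(clustered, 4), q=5, r=1)
print(sliced_cost, clustered_cost)
```

With the interleaved order the three surviving objects sit in three different pages, while the clustered order keeps them in one.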

Metric clustering? M-tree!

Dynamic, persistent, balanced structure

Leaf node represents a cluster of similar objects

Many construction strategies considering the quality of the M-tree hierarchy, with complexity < O(n²)
◦ Single/Multi/Hybrid-way leaf selection
◦ Slim-down algorithm
◦ Reinsertions

Static CPT

Data file = objects serialized from M-tree leaves
◦ Classic pivot table reorganizing input

Fixed page size in a paged data file

Preserve M-tree?
◦ Future re-indexing
◦ Query processing

Dynamic CPT

Data file = set of M-tree leaves
◦ Distance matrix connected to the M-tree leaves

Internal fragmentation
◦ M-tree leaves contain different numbers of data objects, utilization is not 100%

Dynamic operations do not degenerate created clusters
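
The difference between the static and dynamic data-file layouts can be sketched with a few hypothetical leaves. The sketch assumes the static variant simply concatenates leaf contents and cuts fixed-size pages, which the slides do not spell out; all names and the leaf contents are illustrative.

```python
def static_layout(leaves, page_size):
    """Serialize leaf contents in leaf order, then cut fixed-size pages."""
    ordered = [o for leaf in leaves for o in leaf]
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

def dynamic_layout(leaves):
    """Each leaf stays its own page, so pages may be partly empty."""
    return [list(leaf) for leaf in leaves]

def utilization(pages, page_size):
    """Fraction of page slots that actually hold an object."""
    return sum(len(p) for p in pages) / (len(pages) * page_size)

# Hypothetical M-tree leaves with uneven fill, capacity 4 objects each.
leaves = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"], ["j"]]
static_pages = static_layout(leaves, page_size=4)
dynamic_pages = dynamic_layout(leaves)
print(len(static_pages), utilization(static_pages, 4))
print(len(dynamic_pages), utilization(dynamic_pages, 4))
```

The dynamic layout spends one extra page here, which is the internal fragmentation the slide mentions; in exchange, each page still corresponds to exactly one leaf.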

CPT - Querying

Filtering based on lower-bounding

If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
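
A rough sketch of a range query that skips fully filtered pages. The `PagedDataFile` class is a hypothetical stand-in that only counts page reads, the distance matrix is assumed to be stored per page, and Euclidean distance replaces the expensive metric.

```python
import math

class PagedDataFile:
    """Hypothetical stand-in for a paged data file that counts page reads."""
    def __init__(self, pages):
        self._pages = pages
        self.reads = 0

    def load(self, page_no):
        self.reads += 1
        return self._pages[page_no]

def cpt_range_query(q, r, pivots, page_tables, data_file, dist):
    """page_tables[p][j][i] = dist(pivots[i], j-th object of page p)."""
    d_q = [dist(p, q) for p in pivots]
    result = []
    for page_no, rows in enumerate(page_tables):
        survivors = [
            j for j, row in enumerate(rows)
            if max(abs(dq - d) for dq, d in zip(d_q, row)) <= r
        ]
        if not survivors:
            continue  # every object on the page is filtered: no I/O
        page = data_file.load(page_no)
        result += [page[j] for j in survivors if dist(q, page[j]) <= r]
    return result

pages = [[(0, 0), (1, 1)], [(5, 5), (6, 5)], [(9, 9), (10, 0)]]
pivots = [(0, 0), (10, 0)]
page_tables = [[[math.dist(p, o) for p in pivots] for o in page] for page in pages]
data_file = PagedDataFile(pages)
hits = cpt_range_query((1, 0), 1.5, pivots, page_tables, data_file, math.dist)
print(hits, data_file.reads)
```

Only the first page is read in this example; the other two are rejected from the distance matrix alone.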

CPT - Querying problems

Problem 1 – the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object – not optimal for I/O cost
◦ Solution - CPT does not sort objects => objects are processed sequentially
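
One way to picture the sequential kNN processing is the sketch below: pages are visited in file order, and the dynamic radius (distance to the current k-th neighbor) decides whether a page is loaded at all. The data layout, function name, and use of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math

def cpt_knn(q, k, pivots, page_tables, pages, dist):
    """Sequential kNN: pages are visited in file order, not sorted by lower bound."""
    d_q = [dist(p, q) for p in pivots]
    heap = []           # max-heap via negated distances: (-d, object)
    radius = math.inf   # dynamic radius = distance to the current k-th neighbor
    pages_read = 0
    for rows, page in zip(page_tables, pages):
        lbs = [max(abs(dq - d) for dq, d in zip(d_q, row)) for row in rows]
        if all(lb > radius for lb in lbs):
            continue  # the whole page is filtered, so it is never loaded
        pages_read += 1
        for lb, o in zip(lbs, page):
            if lb > radius:
                continue
            d = dist(q, o)
            if len(heap) < k:
                heapq.heappush(heap, (-d, o))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, o))
            if len(heap) == k:
                radius = -heap[0][0]
    neighbors = sorted((-nd, o) for nd, o in heap)
    return neighbors, pages_read

pages = [[(0, 0), (1, 1)], [(5, 5), (6, 5)], [(9, 9), (10, 0)]]
pivots = [(0, 0), (10, 0)]
page_tables = [[[math.dist(p, o) for p in pivots] for o in page] for page in pages]
neighbors, pages_read = cpt_knn((1, 0), 2, pivots, page_tables, pages, math.dist)
print(neighbors, pages_read)
```

Because the radius starts at infinity, the first page is always read; later pages are skipped once the radius has shrunk enough.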

CPT – Querying problems

Problem 2 – in CPT the dynamic radius decreases more slowly during kNN processing
◦ Solution - the first bunch of objects is not clustered
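
The effect of an unclustered first bunch can be seen on a toy 1-D example, assuming a single pivot at 0 and hand-made pages. The numbers are purely illustrative and are not taken from the experiments.

```python
import math

def pages_read_knn(pages, q, k, pivot=0):
    """Count pages loaded by a sequential kNN scan with a shrinking radius."""
    d_pq = abs(q - pivot)
    best = []            # sorted distances of the current k nearest objects
    radius = math.inf
    reads = 0
    for page in pages:
        lbs = [abs(d_pq - abs(o - pivot)) for o in page]
        if all(lb > radius for lb in lbs):
            continue  # page skipped, no I/O
        reads += 1
        for o in page:
            best = sorted(best + [abs(q - o)])[:k]
        if len(best) == k:
            radius = best[-1]
    return reads

# Fully clustered file: the first page is one tight cluster far from the query.
clustered = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
# First page holds a spread-out sample; the remaining pages stay clustered.
sample_first = [[0, 5, 10, 15], [1, 2, 3, 4], [6, 7, 8, 9], [11, 12, 13, 14]]

reads_clustered = pages_read_knn(clustered, q=14, k=1)
reads_sample_first = pages_read_knn(sample_first, q=14, k=1)
print(reads_clustered, reads_sample_first)
```

In the fully clustered file the radius shrinks only a little per page, so every page is read; with the spread-out first page the radius is already small after one read and two pages can be skipped.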


Experiments (1)

2 real datasets
◦ subset of CoPhIR, subset of Corel

2 synthetic datasets
◦ Cloud, PolygonSet

We considered more M-tree variants
◦ Single/Multi-way leaf selection
◦ Reinsertions

Measured I/O cost

CPT vs. PT vs. M-tree

Experiments (2)

Experiments (3)

Conclusion

We have designed an I/O-optimized method for persistent pivot tables

Future work
◦ Thorough experiments on SSDs
◦ Use other metric clustering techniques

Thank you