Clustered Pivot Tables for I/O-optimized Similarity Search
Juraj Moško, Jakub Lokoč, Tomáš Skopal
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
SISAP 2011, Lipari
Mar 31, 2015
Presentation outline
Similarity search in metric spaces
Pivot tables
Clustered pivot tables
◦ Static variant
◦ Dynamic variant
Experiments
Similarity search
Suitable for unstructured data; the query object is often not in the DB
Similarity is often modeled by a metric distance
Expensive distance functions: EMD, SQFD, DTW, …
Metric indexing
◦ Based on lower-bounding
◦ For a pivot p, a query q with radius r, and a database object o: if |d(p, q) – d(p, o)| > r, filter out object o
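The filtering rule above can be illustrated with a minimal sketch. The function names and the 1-D toy metric are assumptions made for this example; they are not taken from the slides.

```python
def lower_bound(d_pq: float, d_po: float) -> float:
    """Triangle-inequality lower bound on d(q, o) obtained from one pivot p."""
    return abs(d_pq - d_po)


def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """True if object o can be discarded without computing d(q, o)."""
    return lower_bound(d_pq, d_po) > r


def d(x: float, y: float) -> float:
    """Toy metric on the real line (stand-in for an expensive distance)."""
    return abs(x - y)


# Pivot p = 0, query q = 10 with radius r = 2.
p, q, r = 0.0, 10.0, 2.0
far_obj, near_obj = 3.0, 9.0

filtered_far = can_filter(d(p, q), d(p, far_obj), r)    # |10 - 3| = 7 > 2
filtered_near = can_filter(d(p, q), d(p, near_obj), r)  # |10 - 9| = 1 <= 2
```

Here only `far_obj` is filtered; `near_obj` would still have to be checked with the real distance.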
Pivot tables
Simple yet efficient main-memory metric index
Given k static pivots Pi and a database S of n objects Oj, the pivot table stores all distances d(Pi, Oj) in a matrix of size k × n
Pivot table = two structures: distance matrix + data file
Cheap filtering of non-relevant objects (lower-bounding)
Non-filtered objects are refined by the original expensive distance function
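A minimal sketch of the filter-and-refine idea described on this slide, under stated assumptions: Euclidean distance stands in for an expensive metric, everything is kept in memory, and the names `build_pivot_table` and `range_query` are illustrative only.

```python
import math


def build_pivot_table(pivots, objects, dist):
    """Return the k x n matrix of distances d(P_i, O_j)."""
    return [[dist(p, o) for o in objects] for p in pivots]


def range_query(q, r, pivots, objects, table, dist):
    """Return (ids of objects within r of q, number of refined objects)."""
    d_q = [dist(q, p) for p in pivots]  # k distance computations up front
    hits, refined = [], 0
    for j, o in enumerate(objects):
        # Best lower bound on d(q, O_j) over all pivots.
        lb = max(abs(d_q[i] - table[i][j]) for i in range(len(pivots)))
        if lb > r:
            continue          # filtered cheaply, no expensive distance needed
        refined += 1          # refinement with the original distance
        if dist(q, o) <= r:
            hits.append(j)
    return hits, refined


objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0), (10.0, 0.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = build_pivot_table(pivots, objects, math.dist)
hits, refined = range_query((1.0, 0.0), 1.5, pivots, objects, table, math.dist)
```

In this toy run, three of the five objects are discarded by the lower bound alone, so only two are refined.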
Clustered pivot tables
What if the pivot table does not fit into main memory?
Solution 1 – just slice the data file
◦ + simple to construct
◦ - sequential scan => high I/O cost
Solution 2 – reorganize and slice the data file
◦ + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
◦ - metric clustering is expensive
Metric clustering? M-tree!
Dynamic, persistent, balanced structure
Leaf node represents a cluster of similar objects
Many construction strategies that consider the quality of the M-tree hierarchy, with complexity < O(n²)
◦ Single/Multi/Hybrid-way leaf selection
◦ Slim-down algorithm
◦ Reinsertions
Static CPT
Data file = objects serialized from M-tree leaves
◦ Classic pivot table with reorganized input
Fixed page size in a paged data file
Preserve M-tree?
◦ Future re-indexing
◦ Query processing
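The static layout can be sketched as follows. This is an illustration only: the leaf contents are hypothetical object ids, no actual M-tree is built, and `paginate` is a name invented for this example.

```python
def paginate(leaves, page_size):
    """Serialize objects leaf by leaf, then cut the sequence into fixed-size pages."""
    serialized = [obj_id for leaf in leaves for obj_id in leaf]
    return [serialized[i:i + page_size]
            for i in range(0, len(serialized), page_size)]


# Hypothetical leaves of an already built M-tree, in leaf order.
leaves = [[4, 7, 1], [0, 9], [3, 8, 2, 6], [5]]
pages = paginate(leaves, page_size=4)
```

Objects from the same leaf stay adjacent, so most pages hold objects from one or two neighboring clusters. The columns of the distance matrix would have to follow the same order.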
Dynamic CPT
Data file = set of M-tree leaves
◦ Distance matrix connected to the M-tree leaves
Internal fragmentation
◦ M-tree leaves contain different numbers of data objects, so utilization is not 100%
Dynamic operations do not degenerate created clusters
CPT - Querying
Filtering based on lower-bounding
If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
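The page-skipping rule can be sketched with a simulated data file. The assumptions are: pages are plain Python lists, a "page load" is just a counter increment, Euclidean distance is the metric, and `cpt_range_query` is an illustrative name rather than the authors' implementation.

```python
import math


def cpt_range_query(q, r, pivots, data_file, matrix, dist):
    """Range query over a paged data file.

    data_file[p] holds the objects of page p (on disk in a real system);
    matrix[p][j][i] is the precomputed distance d(P_i, j-th object of page p).
    Returns (hits, pages_loaded).
    """
    d_q = [dist(q, p) for p in pivots]
    hits, pages_loaded = [], 0
    for p, page_rows in enumerate(matrix):
        # Positions on this page that survive lower-bound filtering.
        survivors = [j for j, row in enumerate(page_rows)
                     if max(abs(dq - d_po) for dq, d_po in zip(d_q, row)) <= r]
        if not survivors:
            continue              # whole page filtered out: no I/O for it
        page = data_file[p]       # simulated page read
        pages_loaded += 1
        hits.extend(page[j] for j in survivors if dist(q, page[j]) <= r)
    return hits, pages_loaded


def distance_matrix(pivots, data_file, dist):
    """Per-page pivot distances, kept in main memory."""
    return [[[dist(p, o) for p in pivots] for o in page] for page in data_file]


pivots = [(0.0, 0.0), (10.0, 10.0)]
q, r = (1.0, 0.0), 1.5

# Same six objects, once grouped by similarity and once mixed.
clustered = [[(0.0, 0.0), (1.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)], [(9.0, 9.0), (10.0, 9.0)]]
mixed = [[(0.0, 0.0), (5.0, 5.0)], [(1.0, 1.0), (9.0, 9.0)], [(5.0, 6.0), (10.0, 9.0)]]

hits_c, loaded_c = cpt_range_query(q, r, pivots, clustered,
                                   distance_matrix(pivots, clustered, math.dist), math.dist)
hits_m, loaded_m = cpt_range_query(q, r, pivots, mixed,
                                   distance_matrix(pivots, mixed, math.dist), math.dist)
```

Both layouts return the same two objects, but in this toy case the clustered layout reads one page and the mixed layout reads two.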
CPT - Querying problems
Problem 1 – the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object – not optimal for I/O cost
◦ Solution - CPT does not sort objects => objects are processed sequentially
CPT – Querying problems
Problem 2 – in CPT the dynamic radius decreases more slowly during kNN processing
◦ Solution - the first bunch of objects is not clustered
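A sketch of sequential kNN processing with a dynamic radius, as discussed in the two problems above. It rests on several assumptions: pages are in-memory lists, Euclidean distance is the metric, the first page is deliberately a mix of objects to play the role of the unclustered first bunch, and `cpt_knn` is an illustrative name. The toy data does not measure how fast the radius shrinks; it only shows the mechanics.

```python
import heapq
import math


def cpt_knn(q, k, pivots, data_file, matrix, dist):
    """kNN over a paged data file, visiting pages in file order.

    matrix[p][j][i] is the precomputed distance d(P_i, j-th object of page p).
    Returns (sorted list of (distance, object), pages_loaded).
    """
    d_q = [dist(q, p) for p in pivots]
    best = []               # max-heap via negated distances: (-d, object)
    radius = math.inf       # dynamic radius: distance to the current k-th neighbor
    pages_loaded = 0
    for p, page_rows in enumerate(matrix):   # no sorting by lower bound
        lbs = [max(abs(dq - d_po) for dq, d_po in zip(d_q, row))
               for row in page_rows]
        if min(lbs) > radius:
            continue                         # whole page filtered: skip the read
        page = data_file[p]                  # simulated page read
        pages_loaded += 1
        for j, o in enumerate(page):
            if lbs[j] > radius:
                continue                     # filtered by the current radius
            d = dist(q, o)
            if len(best) < k:
                heapq.heappush(best, (-d, o))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, o))
            if len(best) == k:
                radius = -best[0][0]         # radius shrinks as better candidates arrive
    return sorted((-nd, o) for nd, o in best), pages_loaded


pivots = [(0.0, 0.0), (10.0, 10.0)]
data_file = [
    [(1.0, 1.0), (9.0, 9.0)],     # first bunch: not clustered
    [(0.0, 0.0), (0.0, 1.0)],     # cluster near the origin
    [(5.0, 5.0), (5.0, 6.0)],     # cluster in the middle
    [(10.0, 9.0), (9.0, 10.0)],   # cluster near the second pivot
]
matrix = [[[math.dist(p, o) for p in pivots] for o in page] for page in data_file]
neighbors, pages_loaded = cpt_knn((1.0, 0.0), 2, pivots, data_file, matrix, math.dist)
```

Once the radius has dropped after the first two pages, the remaining two pages are skipped without being read.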
Experiments (1)
2 real datasets
◦ subset of CoPhIR, subset of Corel
2 synthetic datasets
◦ Cloud, PolygonSet
We considered several M-tree variants
◦ Single/Multi-way leaf selection
◦ Reinsertions
Measured I/O cost
CPT vs. PT vs. M-tree
Experiments (2)
Experiments (3)
Conclusion
We have designed an I/O-optimized method for persistent pivot tables
Future work
◦ Thorough experiments on SSDs
◦ Use other metric clustering techniques
Thank you