Searching in High-Dimensional Spaces

Index Structures for Improving the Performance of Multimedia Databases

Christian Böhm, Stefan Berchtold, Daniel A. Keim
ACM Computing Surveys, 2001

Introduction

Multimedia databases have become increasingly important in many application areas
• Content-based retrieval of similar objects
• Similarity search

Feature transformation
• Multimedia object → high-dimensional point (feature vector)
• Search for points in the feature space that are close to a given query point

Traditional Databases

Point, range, partial match query

Multimedia Databases

Similarity search

Similarity Queries

Basic idea of feature-based similarity search

[Figure: complex data objects → feature transformation → high-dim. feature vectors → insert → high-dim. index; the index is queried by ε-search or NN-search]

• Range query
• Nearest-neighbor query
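As a baseline for the two query types, here is a minimal sketch that answers both by a linear scan over the feature vectors, with no index at all. The function names and the toy vectors are assumptions for illustration only; an index structure aims to return the same answers while reading far fewer vectors.

```python
import math

def range_query(vectors, q, eps):
    """Return all feature vectors within Euclidean distance eps of q."""
    return [v for v in vectors if math.dist(v, q) <= eps]

def nearest_neighbor(vectors, q):
    """Return the feature vector closest to q (linear scan)."""
    return min(vectors, key=lambda v: math.dist(v, q))

# Hypothetical 2-d feature vectors and query point.
vectors = [(0.1, 0.2), (0.8, 0.9), (0.15, 0.25)]
q = (0.1, 0.21)
print(range_query(vectors, q, eps=0.1))
print(nearest_neighbor(vectors, q))
```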

Effects in High-Dimensional Space

Curse of dimensionality
• Can you imagine a 5- or 10-dimensional space?
• “Every d-dimensional sphere touching (or intersecting) the (d-1)-dimensional boundaries of the data space also contains the center point c.”
• What happens if d = 16?
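The claim about spheres can be checked numerically. The sketch below assumes the unit data space [0, 1]^d with center c = (0.5, ..., 0.5) and uses one particular sphere, centered at p = (0.3, ..., 0.3) with radius 0.7; these values and the function name are chosen for illustration. Exact fractions are used so that the boundary test is not disturbed by floating-point rounding.

```python
from fractions import Fraction

def check_lemma(d, offset=Fraction(3, 10), radius=Fraction(7, 10)):
    """Sphere around p = (offset, ..., offset) in the unit cube [0, 1]^d.

    Returns (touches_all_faces, contains_center).
    """
    half = Fraction(1, 2)
    # Along every axis the sphere reaches both the face at 0 and the face at 1.
    touches_all_faces = offset - radius <= 0 and offset + radius >= 1
    # Squared distance from p to the center point c = (0.5, ..., 0.5).
    dist_sq = d * (half - offset) ** 2
    contains_center = dist_sq <= radius ** 2
    return touches_all_faces, contains_center

for d in (2, 3, 16):
    print(d, check_lemma(d))
```

For d = 2 and d = 3 the sphere contains c, but for d = 16 the distance from p to c is 0.8, which exceeds the radius 0.7, so the intuitive statement fails.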

Effects in High-Dimensional Space

Issues

Exponential growth of volume

Space partitioning
• The majority of the data pages are located at the surface of the data space rather than in the interior
• Coarse partitioning

[Figure: a hypercube covering a volume of 0.25 has edge length $\sqrt{0.25} = 0.5$ for d = 2, but $\sqrt[16]{0.25} \approx 0.917$ for d = 16]
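The exponential growth of volume can be made concrete with a short calculation: the edge length of a hypercube that covers a given fraction of the unit data space is the d-th root of that fraction. The function name below is an assumption for illustration.

```python
def edge_length(volume, d):
    """Edge length of a d-dimensional hypercube with the given volume."""
    return volume ** (1.0 / d)

for d in (2, 4, 16, 100):
    print(d, round(edge_length(0.25, d), 3))
```

Even a query region covering only a quarter of the data space spans more than 90% of each axis once d reaches 16.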

Common Principles

Structure & Regions
• Hierarchical clustering
• Spatially adjacent vectors are likely to reside in the same node

Basic Algorithms

Index construction
• Insert, Delete, and Update

Query processing
• Exact match query
• Range query
• Nearest-neighbor query
• Ranking query (generalized k-nearest-neighbor query)
• Reverse nearest-neighbor query

Nearest-Neighbor Query

• No fixed criterion, known a priori, to exclude branches of the indexing structure
• The criterion is the nearest-neighbor distance
• But it is not known until the algorithm has terminated
• Pessimistic estimation: the distance to the closest point among all points visited so far (closest point candidate)

Nearest-Neighbor Query

RKV algorithm
• MINDIST: the actual distance between the query point and the page region
• MINMAXDIST: a pessimistic estimate (upper bound) of the nearest-neighbor distance within the page region
• ‘Depth-first’ and ‘branch and bound’ traversal

[Figure: MINDIST and MINMAXDIST]
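The two bounds can be sketched for a query point and a minimum bounding rectangle given by its lower and upper corners. This follows the usual definitions for rectangular page regions under the Euclidean metric; the function and parameter names are assumptions, and the slide itself gives no formulas.

```python
import math

def mindist(q, lower, upper):
    """Smallest possible distance from q to any point of the rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(q, lower, upper)))

def minmaxdist(q, lower, upper):
    """Upper bound on the distance from q to the nearest data point in a
    minimum bounding rectangle (every face of an MBR touches a point)."""
    d = len(q)
    # Per dimension: the nearer and the farther of the two face coordinates.
    near = [l if x <= (l + u) / 2 else u for x, l, u in zip(q, lower, upper)]
    far = [l if x >= (l + u) / 2 else u for x, l, u in zip(q, lower, upper)]
    best = min(
        sum((q[i] - (near[i] if i == k else far[i])) ** 2 for i in range(d))
        for k in range(d)
    )
    return math.sqrt(best)

q, lower, upper = (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)
print(mindist(q, lower, upper), minmaxdist(q, lower, upper))
```

In a depth-first traversal, a branch whose MINDIST exceeds the smallest MINMAXDIST (or the current closest point candidate distance) seen so far can be pruned.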

Nearest-Neighbor Query

HS algorithm
• Access all pages of the index in the order of increasing distance to the query point
• Active page list (APL)
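A minimal sketch of this best-first strategy, assuming rectangular page regions and an in-memory tree: the active page list is kept as a priority queue ordered by MINDIST, and the search stops once the closest unread page is farther away than the current closest point candidate. The `Node` class and all names are hypothetical, not taken from the survey.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lower: tuple
    upper: tuple
    children: list = field(default_factory=list)  # directory page entries
    points: list = field(default_factory=list)    # data page entries

def mindist(q, node):
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(q, node.lower, node.upper)))

def hs_nearest(root, q):
    """Return (nearest point, its distance, number of pages read)."""
    tie = itertools.count()  # tie-breaker so heapq never compares Nodes
    apl = [(mindist(q, root), next(tie), root)]  # active page list
    best, best_dist, pages_read = None, math.inf, 0
    while apl and apl[0][0] < best_dist:
        _, _, node = heapq.heappop(apl)
        pages_read += 1
        for p in node.points:
            dist = math.dist(q, p)
            if dist < best_dist:
                best, best_dist = p, dist
        for child in node.children:
            heapq.heappush(apl, (mindist(q, child), next(tie), child))
    return best, best_dist, pages_read

leaf_a = Node((0, 0), (1, 1), points=[(0.2, 0.2), (0.9, 0.9)])
leaf_b = Node((2, 2), (3, 3), points=[(2.1, 2.1)])
root = Node((0, 0), (3, 3), children=[leaf_a, leaf_b])
print(hs_nearest(root, (1.2, 1.2)))
```

In this toy example the second leaf is never read, because its MINDIST already exceeds the distance to the candidate found in the first leaf.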

Nearest-Neighbor Query

Comparison
• RKV: pr1 → pr12 → pr11 → …
• HS: pr1 → pr2 → pr21

Index Structures

Minimum bounding rectangles
• R-tree family
• X-tree

Bounding spheres
• SS-tree
• TV-tree

Combined regions
• SR-tree

Etc.
• Space filling curves
• Pyramid-tree

R, R*, R+-Tree

Overlap problem
• For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point
• Existence of such a point becomes less likely as the dimension of the data space increases
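The condition for an overlap-free split can be tested directly: project the page regions onto each dimension and look for a position that no projected interval straddles. The sketch below is an illustration under the assumption of rectangular regions given as (lower, upper) corner pairs; the function name is hypothetical.

```python
def overlap_free_split(rects):
    """Return (dimension, split position) if some dimension admits a split
    with no projected interval crossing it, else None."""
    if len(rects) < 2:
        return None
    d = len(rects[0][0])
    for dim in range(d):
        # Projections of all regions onto this dimension, sorted by lower end.
        intervals = sorted((lo[dim], up[dim]) for lo, up in rects)
        max_upper = intervals[0][1]
        for lo, up in intervals[1:]:
            if max_upper <= lo:
                # Everything seen so far ends before this interval starts.
                return dim, lo
            max_upper = max(max_upper, up)
    return None

separable = [((0, 0), (1, 3)), ((2, 0), (3, 3))]
entangled = [((0, 0), (2, 2)), ((1, 1), (3, 3))]
print(overlap_free_split(separable), overlap_free_split(entangled))
```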

R+-tree
• An overlap-free variant of the R-tree using a forced-split strategy
• High dimensionality leads to many forced-split operations
• Storage utilization < 50%

$\sqrt[4]{1/2} \approx 0.8409$

$\sqrt{1/2} \approx 0.7071$

[Formula relating a, A, and C_eff; not recoverable from the slide text]

X-Tree

• Extension of the R*-tree
• Designed for the management of high-dimensional objects
• Overlap-free split (split history)
• Supernodes (unbalanced split tree)

kd-Tree

Advantage
• Guarantee of no overlap

Disadvantages
• Complete partitioning: page regions are generally larger than necessary, which yields a higher access probability
• Unbalanced

kd-Tree

kd-B-tree
• Balanced kd-tree
• Forced split

hB-tree
• Splitting a node based on multiple attributes
• Forced split is avoided

LSDh-tree
• Coded region description, which reduces the space requirement

SS-Tree

• Spheres as page regions
• Split: the split axis is determined as the dimension yielding the highest variance
• Not amenable to an easy overlap-free split
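Choosing the split axis by variance can be sketched in a few lines. The example assumes the entries of the node are given as plain coordinate tuples and uses the population variance; both are illustrative choices, and the function name is hypothetical.

```python
from statistics import pvariance

def split_axis(points):
    """Index of the dimension with the highest coordinate variance."""
    d = len(points[0])
    variances = [pvariance([p[i] for p in points]) for i in range(d)]
    return max(range(d), key=variances.__getitem__)

points = [(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]
print(split_axis(points))
```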

Space Filling Curves

Range and nearest-neighbor queries based on distance calculations of page regions

lb: 47 = 101111
ub: 60 = 111100
longest common prefix: p = 1
s = <p100…000> = 110000 = 48

lb: 48 = 110000
ub: 60 = 111100
longest common prefix: p = 11
s = <p100…000> = 111000 = 56
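The split value s in the two examples above can be computed with bit operations: keep the longest common prefix of lb and ub, append a 1, and fill the remaining positions with zeros. The function name is an assumption, and the sketch requires lb < ub.

```python
def split_point(lb, ub):
    """Split value <p 1 0...0>, where p is the longest common binary prefix
    of lb and ub. Requires lb < ub."""
    if not lb < ub:
        raise ValueError("expected lb < ub")
    k = (lb ^ ub).bit_length() - 1      # position of the highest differing bit
    prefix = (ub >> (k + 1)) << (k + 1)  # common prefix, low bits cleared
    return prefix | (1 << k)            # prefix followed by 1 and zeros

print(split_point(47, 60))
print(split_point(48, 60))
```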

Pyramid Tree

• Divide the data space such that the resulting partitions are shaped like peels of an onion
• Pyramid mapping
• Optimized for range queries on high-dim. data
• Not affected by the curse of dimensionality
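The slide gives no formula for the pyramid mapping, so the following is only a sketch of the commonly stated definition of the pyramid technique, assuming the unit data space [0, 1]^d: the space is cut into 2d pyramids meeting at the center, and a point is mapped to its pyramid number plus its height within that pyramid. The function name is hypothetical.

```python
def pyramid_value(v):
    """Map a point of [0, 1]^d to a one-dimensional pyramid value."""
    d = len(v)
    # Dimension in which the point lies farthest from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie below the center, d..2d-1 above it.
    i = j_max if v[j_max] < 0.5 else j_max + d
    height = abs(0.5 - v[j_max])
    return i + height

print(pyramid_value((0.1, 0.6)))
print(pyramid_value((0.5, 0.9)))
```

Points with similar height in the same pyramid receive nearby values, so the resulting one-dimensional keys can be stored in an ordinary B+-tree.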

Summary & Comparison

Conclusions

• Effects occurring in indexing high-dim. spaces
• Principal ideas of the index structures that have been proposed to overcome the problems
• Research on high-dim. indexing has a major impact on many practical applications and commercial multimedia database systems

Future Research Issues

• Real case (not uniform and not independent data)
• Partitioning strategies that perform well in high-dim. spaces
• Approximate processing of NN queries
