Searching in High-Dimensional Spaces

Index Structures for Improving the Performance of Multimedia Databases

Christian Böhm, Stefan Berchtold, Daniel A. Keim

ACM Computing Surveys, 2001

Introduction

Multimedia databases have become increasingly important in many application areas
Content-based retrieval of similar objects
Similarity search
Feature transformation
  • Multimedia object → high-dimensional point (feature vector)
Search for points in the feature space that are close to a given query point


Traditional Databases

Point, range, partial match query

Multimedia Databases

Similarity search

Similarity Queries

Basic idea of feature-based similarity search

[Figure: complex data objects → feature transformation → high-dim. feature vectors → insert → high-dim. index, queried by ε-search or NN-search. Panels: range query, nearest-neighbor query]

Effects in High-Dimensional Space

Curse of dimensionality
  Can you imagine 5 or 10 dimensions?
  “Every d-dimensional sphere touching (or intersecting) the (d-1)-dimensional boundaries of the data space contains c” (c: the center point of the data space)
  What happens if d = 16?
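The quoted statement can be checked numerically. The sketch below assumes the data space is the unit hypercube [0, 1]^d with center c = (0.5, ..., 0.5), and uses a sphere of radius 0.7 around p = (0.3, ..., 0.3). The point, the radius, and the function names are illustrative choices, not taken from the slide.

```python
import math

def touches_all_boundaries(center, radius, eps=1e-12):
    """True if the sphere reaches every (d-1)-dimensional face of [0, 1]^d."""
    # Distance to the face x_i = 0 is c_i; distance to the face x_i = 1 is 1 - c_i.
    return all(c <= radius + eps and 1.0 - c <= radius + eps for c in center)

def contains_center(center, radius):
    """True if the sphere contains the center point (0.5, ..., 0.5)."""
    data_space_center = [0.5] * len(center)
    return math.dist(center, data_space_center) <= radius

for d in (2, 16):
    p = [0.3] * d  # assumed sphere center
    print(d, touches_all_boundaries(p, 0.7), contains_center(p, 0.7))
```

Under these assumptions the sphere touches all boundaries in both cases, but for d = 16 its center lies at distance sqrt(16 · 0.2²) = 0.8 from c, so it no longer contains c.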


Effects in High-Dimensional Space

Issues
  Exponential growth of volume
  Space partitioning
  • The majority of the data pages are located at the surface of the data space rather than in the interior
  • Coarse partitioning


[Figure: edge length of a hypercube with volume 0.25. For d = 2: $0.5 \cdot 0.5 = 0.25$. For d = 16: $\sqrt[16]{0.25} \approx 0.917$]
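The numbers in the figure follow from taking the d-th root of the volume. A minimal sketch, assuming a hypercube-shaped region inside the unit data space (the function name is illustrative):

```python
def edge_length(volume, d):
    """Edge length of a d-dimensional hypercube with the given volume."""
    return volume ** (1.0 / d)

for d in (2, 4, 16, 100):
    print(d, round(edge_length(0.25, d), 3))
```

A region covering only a quarter of the data space already spans more than 90% of each axis at d = 16, which is why partitioning becomes coarse.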

Common Principles

Structure & Regions
  Hierarchical clustering
  Spatially adjacent vectors are likely to reside in the same node


Basic Algorithms

Index construction
  Insert, Delete, and Update
Query processing
  Exact match query
  Range query
  Nearest-neighbor query
  Ranking query (generalized k-nearest-neighbor query)
  Reverse nearest-neighbor query
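To make two of the query types concrete, here is a brute-force sketch that defines their results by a linear scan. It assumes Euclidean distance and points stored as tuples. An index structure is meant to return the same answers while reading fewer pages.

```python
import heapq
import math

def range_query(points, q, eps):
    """All points within distance eps of the query point q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """The k points closest to q, in order of increasing distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
print(range_query(points, (0.0, 0.0), 1.5))
print(knn_query(points, (2.9, 2.9), 1))
```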


Nearest-Neighbor Query

No fixed criterion, known a priori, to exclude branches of the indexing structure
  The criterion is the nearest-neighbor distance
  But it is not known until the algorithm has terminated
  • Pessimistic estimation
  • The closest point among all points visited (closest point candidate)


Nearest-Neighbor Query

RKV algorithm
  MINDIST: the actual (smallest possible) distance between the query point and the page region
  MINMAXDIST: pessimistic (upper-bound) estimate of the nearest-neighbor distance within the page region
  ‘Depth-first’ and ‘Branch and bound’ traversal
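The two metrics can be sketched for rectangular page regions as follows. This assumes Euclidean distance and that each region is a minimum bounding rectangle given by corner vectors `lo` and `hi`, so that every face touches at least one data point. The function names are illustrative.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to any point of the rectangle."""
    s = 0.0
    for x, l, u in zip(q, lo, hi):
        if x < l:
            s += (l - x) ** 2
        elif x > u:
            s += (x - u) ** 2
    return math.sqrt(s)

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to the nearest data point in an MBR."""
    # Per dimension: the nearer edge and the farther edge of the rectangle.
    near = [l if x <= (l + u) / 2 else u for x, l, u in zip(q, lo, hi)]
    far = [l if x >= (l + u) / 2 else u for x, l, u in zip(q, lo, hi)]
    far_sq = [(x - f) ** 2 for x, f in zip(q, far)]
    total_far = sum(far_sq)
    # Pick one dimension k for the near edge, the far edge in all others.
    best = min(total_far - far_sq[k] + (q[k] - near[k]) ** 2
               for k in range(len(q)))
    return math.sqrt(best)

print(mindist((0, 0), (1, 1), (2, 2)), minmaxdist((0, 0), (1, 1), (2, 2)))
```

In a branch-and-bound traversal, a page whose MINDIST exceeds the smallest MINMAXDIST seen so far (or the current closest point candidate) can be pruned.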

[Figure: MINDIST and MINMAXDIST between a query point and a page region]

Nearest-Neighbor Query

HS algorithm
  Access all pages of the index in the order of increasing distance to the query point
  Active page list (APL)


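[Figure: successive states of the active page list (APL), ordered by increasing distance to the query point]
  1. p3, p1, p2
  2. p31, p1, p33, p2, p32
  3. p1, p311, p312, p33, p2, p32
  4. p11, p311, p312, p33, p2, p32, p13, p12
  5. p311, p312, p33, p111, p2, p112, p32, p13, p12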
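The page-ordering idea can be sketched with a priority queue playing the role of the APL. This is a minimal sketch under stated assumptions, not the algorithm exactly as published: pages are plain dicts with hypothetical keys `lo`, `hi`, and either `children` or `points`, distance is Euclidean, and page regions are rectangles.

```python
import heapq
import itertools
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(q, lo, hi)))

def hs_nearest(root, q):
    """Best-first nearest-neighbor search over a tree of pages."""
    tie = itertools.count()  # tie-breaker so page dicts are never compared
    apl = [(mindist(q, root["lo"], root["hi"]), next(tie), root)]
    best, best_dist = None, math.inf
    # Stop once the closest unvisited page is farther than the candidate.
    while apl and apl[0][0] < best_dist:
        _, _, page = heapq.heappop(apl)
        if "points" in page:  # data page: update the closest point candidate
            for p in page["points"]:
                d = math.dist(p, q)
                if d < best_dist:
                    best, best_dist = p, d
        else:  # directory page: replace it in the APL by its children
            for child in page["children"]:
                key = mindist(q, child["lo"], child["hi"])
                heapq.heappush(apl, (key, next(tie), child))
    return best, best_dist

leaf_a = {"lo": (0, 0), "hi": (1, 1), "points": [(0, 0), (1, 1)]}
leaf_b = {"lo": (4, 4), "hi": (5, 5), "points": [(4, 4), (5, 5)]}
root = {"lo": (0, 0), "hi": (5, 5), "children": [leaf_a, leaf_b]}
print(hs_nearest(root, (3.8, 3.9)))
```

In this toy tree only `leaf_b` is read: once (4, 4) is found, the remaining page `leaf_a` is farther away than the candidate, so the search stops.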

Nearest-Neighbor Query

Comparison
  RKV
  • pr1 → pr12 → pr11 → …
  HS
  • pr1 → pr2 → pr21


Index Structures

Minimum bounding rectangles
  R-tree family
  X-tree
Bounding spheres
  SS-tree
  TV-tree
Combined regions
  SR-tree
Etc.
  Space filling curves
  Pyramid-tree


R, R*, R+-Tree

Overlap problem
  For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point
  • Existence of such a point becomes less likely as the dimension of the data space increases
R+-tree
  An overlap-free variant of the R-tree using a forced-split strategy
  High dimensionality leads to many forced-split operations
  • Storage utilization < 50%


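[Figure: side length a of a page region with volume A]
  $A = 1/C_{\mathrm{eff}}$, $a = \sqrt[d]{A}$
  $\sqrt[2]{1/2} \approx 0.7071$, $\sqrt[4]{1/2} \approx 0.8409$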

X-Tree

Extension of the R*-tree
Designed for the management of high-dimensional objects
Overlap-free split (split history)
Supernodes (unbalanced split tree)


kd-Tree

Advantage
  Guarantee of no overlap
Disadvantages
  Complete partitioning
  • Page regions are generally larger than necessary, which yields a higher access probability
  Unbalanced


kd-Tree

kd-B-tree
  Balanced kd-tree
  Forced split
hB-tree
  Splitting a node based on multiple attributes
  Forced split is avoided
LSDh-tree
  Coded region description
  • Reduces the space requirement


SS-Tree

Spheres as page regions
Split
  Split axis is determined as the dimension yielding the highest variance
  Not amenable to an easy overlap-free split
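The split-axis rule can be sketched in a few lines. This assumes the node's entries are given as equal-length tuples and uses the population variance. The slide does not say which variance estimator is used, and the choice does not change the selected axis.

```python
from statistics import pvariance

def split_axis(vectors):
    """Index of the dimension with the highest variance among the vectors."""
    d = len(vectors[0])
    variances = [pvariance([v[i] for v in vectors]) for i in range(d)]
    return max(range(d), key=variances.__getitem__)

# The second coordinate spreads far more than the first, so axis 1 is chosen.
print(split_axis([(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]))
```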


Space Filling Curves

Range and nearest-neighbor queries based on distance calculations of page regions


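[Figure: query q and interval I on the space-filling curve; I is split into I1 and I2, then I2 into I21 and I22]

Interval I
  lb: 47 = 101111
  ub: 60 = 111100
  longest common prefix: p = 1
  s = <p100…000> = 110000 = 48

Interval I2
  lb: 48 = 110000
  ub: 60 = 111100
  longest common prefix: p = 11
  s = <p100…000> = 111000 = 56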
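The split point s in the two examples can be computed from the bit strings of lb and ub. A minimal sketch, assuming fixed-width binary codes of `bits` bits and lb < ub (the function name is illustrative):

```python
def split_point(lb, ub, bits):
    """Return s = <p100...000>, where p is the longest common prefix of lb and ub."""
    if not 0 <= lb < ub < 2 ** bits:
        raise ValueError("need 0 <= lb < ub < 2**bits")
    lb_bits = format(lb, f"0{bits}b")
    ub_bits = format(ub, f"0{bits}b")
    n = 0
    while lb_bits[n] == ub_bits[n]:  # terminates because lb != ub
        n += 1
    # Common prefix, then a 1, then zeros up to the full width.
    return int(lb_bits[:n] + "1" + "0" * (bits - n - 1), 2)

print(split_point(47, 60, 6))  # interval I
print(split_point(48, 60, 6))  # interval I2
```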

Pyramid Tree

Divide the data space such that the resulting partitions are shaped like peels of an onion
Pyramid mapping
  Optimized for range queries on high-dim. data
  Not affected by the curse of dimensionality
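The slide gives no formula for the pyramid mapping. The sketch below follows the usual description of the Pyramid-Technique, which is an assumption here: the unit data space is cut into 2d pyramids meeting at the center, and a point is mapped to its pyramid number plus its height within that pyramid.

```python
def pyramid_value(v):
    """Map a point in [0, 1]^d to a one-dimensional pyramid value."""
    d = len(v)
    # Dimension in which the point deviates most from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie on the low side, d..2d-1 on the high side.
    i = j_max if v[j_max] < 0.5 else j_max + d
    height = abs(0.5 - v[j_max])
    return i + height

print(pyramid_value((0.1, 0.5)))
print(pyramid_value((0.5, 0.9)))
```

The resulting values can be stored in an ordinary one-dimensional index such as a B+-tree, with each range query translated into intervals of pyramid values.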


Summary & Comparison


Conclusions

Effects occurring in indexing high-dim. spaces
Principal ideas of the index structures that have been proposed to overcome the problems
Research on high-dim. indexing has a major impact on many practical applications and commercial multimedia database systems

Future Research Issues

Real-world data (neither uniform nor independent)
Partitioning strategies that perform well in high-dim. spaces
Approximate processing of NN queries
