Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. Christian Böhm, Stefan Berchtold, Daniel A. Keim. ACM Computing Surveys, 2001.
Jan 10, 2016
Introduction
• Multimedia databases have become increasingly important in many application areas
• Content-based retrieval of similar objects
• Similarity search
  • Feature transformation: multimedia object → high-dimensional point (feature vector)
  • Search for points in the feature space that are close to a given query point
• Traditional databases: point, range, and partial match queries
• Multimedia databases: similarity search
Similarity Queries
• Basic idea of feature-based similarity search
• [Figure: complex data objects → feature transformation → high-dim. feature vectors → insert → high-dim. index, queried by ε-search or NN-search]
• [Figure: range query and nearest-neighbor query]
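The two query types can be sketched without any index, as a linear scan over feature vectors. This is only a baseline illustration, assuming Euclidean distance; the names `range_query`, `nn_query`, and the sample vectors are hypothetical and not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(points, q, eps):
    """Epsilon-search: all points within distance eps of the query point q."""
    return [p for p in points if euclidean(p, q) <= eps]

def nn_query(points, q):
    """NN-search: the single point closest to the query point q."""
    return min(points, key=lambda p: euclidean(p, q))

# Hypothetical 3-dimensional feature vectors.
features = [(0.1, 0.2, 0.9), (0.4, 0.4, 0.5), (0.8, 0.1, 0.3)]
query = (0.5, 0.5, 0.5)

print(range_query(features, query, 0.2))
print(nn_query(features, query))
```

An index structure aims to return the same answers while reading far fewer vectors than this scan does.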
Effects in High-Dimensional Space
• Curse of dimensionality
• Can you imagine a 5- or 10-dimensional space?
• "Every d-dimensional sphere touching (or intersecting) the (d-1)-dimensional boundaries of the data space contains c" (c being the center point of the data space)
• What happens if d = 16?
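The d = 16 question can be checked numerically. The sketch below assumes the data space is the unit cube [0, 1]^d with center c = (0.5, ..., 0.5); the sphere center (0.3, ..., 0.3) and radius 0.7 are chosen here for illustration. That sphere touches or intersects every boundary face in any dimension, so the quoted statement would require it to contain c.

```python
import math

def sphere_contains_center(d, coord=0.3):
    """Sphere around (coord, ..., coord) with radius 1 - coord in [0, 1]^d.

    The radius equals the distance to the faces x_i = 1 (touching them) and
    exceeds the distance to the faces x_i = 0 (intersecting them).
    Returns True if the sphere also contains the center (0.5, ..., 0.5).
    """
    radius = 1.0 - coord
    dist_to_center = math.sqrt(d * (0.5 - coord) ** 2)
    return dist_to_center <= radius

for d in (2, 3, 12, 13, 16):
    print(d, sphere_contains_center(d))
```

For this particular sphere the distance to c is 0.2·√d, so the intuition from 2 and 3 dimensions holds up to d = 12 and fails from d = 13 on, including d = 16.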
Effects in High-Dimensional Space
• Issues
  • Exponential growth of volume
  • Space partitioning
    • The majority of the data pages are located at the surface of the data space rather than in the interior
    • Coarse partitioning
• [Figure: edge length of a cube with volume 0.25: $\sqrt{0.25} = 0.5$ for d = 2, $\sqrt[16]{0.25} \approx 0.917$ for d = 16]
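The edge-length effect is simple arithmetic: a cube that covers a fixed fraction of the unit data space has edge length equal to the d-th root of that fraction. A minimal sketch (the function name `edge_length` is hypothetical):

```python
def edge_length(volume, d):
    """Edge length of a d-dimensional cube with the given volume."""
    return volume ** (1.0 / d)

# A cube covering 25% of the unit data space in various dimensions.
for d in (2, 4, 16, 100):
    print(d, round(edge_length(0.25, d), 3))
```

In 16 dimensions the cube already spans more than 90% of every axis, which is why even a moderately selective region reaches almost across the whole data space.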
Common Principles
• Structure & regions
• Hierarchical clustering
  • Spatially adjacent vectors are likely to reside in the same node
Basic Algorithms
• Index construction
  • Insert, delete, and update
• Query processing
  • Exact match query
  • Range query
  • Nearest-neighbor query
  • Ranking query (generalized k-nearest-neighbor query)
  • Reverse nearest-neighbor query
Nearest-Neighbor Query
• No fixed criterion, known a priori, to exclude branches of the indexing structure
  • The criterion is the nearest-neighbor distance
  • But it is not known until the algorithm has terminated
• Pessimistic estimation
  • The closest point among all points visited so far (closest point candidate)
Nearest-Neighbor Query
• RKV algorithm
  • MINDIST: the actual (minimum) distance between the query point and a page region
  • MINMAXDIST: an upper-bound estimate of the nearest-neighbor distance within a page region
  • Depth-first, branch-and-bound traversal
• [Figure: MINDIST and MINMAXDIST]
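The two distance measures can be sketched for rectangular page regions. The slide gives no formulas, so the code below assumes the commonly used definitions for minimum bounding rectangles given as `lo` and `hi` corner tuples; the function names are hypothetical.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to any location in the box [lo, hi]."""
    total = 0.0
    for qi, l, h in zip(q, lo, hi):
        if qi < l:
            total += (l - qi) ** 2
        elif qi > h:
            total += (qi - h) ** 2
        # q inside the box along this axis contributes nothing
    return math.sqrt(total)

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to its nearest point in a minimum
    bounding rectangle: every face of such a rectangle holds a data point."""
    d = len(q)
    # Nearer and farther box edge per axis, as seen from q.
    near = [l if qi <= (l + h) / 2 else h for qi, l, h in zip(q, lo, hi)]
    far = [l if qi >= (l + h) / 2 else h for qi, l, h in zip(q, lo, hi)]
    far_sq = [(qi - f) ** 2 for qi, f in zip(q, far)]
    total_far = sum(far_sq)
    # For each axis k: nearer edge on axis k, farther edge on all other axes.
    best = min(total_far - far_sq[k] + (q[k] - near[k]) ** 2 for k in range(d))
    return math.sqrt(best)

print(mindist((0, 0), (1, 1), (2, 2)))     # sqrt(2)
print(minmaxdist((0, 0), (1, 1), (2, 2)))  # sqrt(5)
```

In a branch-and-bound traversal, a page whose MINDIST exceeds the smallest MINMAXDIST seen so far cannot contain the nearest neighbor and can be skipped.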
Nearest-Neighbor Query
• HS algorithm
  • Access all pages of the index in the order of increasing distance to the query point
  • Active page list (APL)
• [Figure: successive states of the APL]
  • p3, p1, p2
  • p31, p1, p33, p2, p32
  • p1, p311, p312, p33, p2, p32
  • p11, p311, p312, p33, p2, p32, p13, p12
  • p311, p312, p33, p111, p2, p112, p32, p13, p12
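The APL behaves like a priority queue of pages ordered by their distance to the query point. The sketch below is one way to express that idea, assuming rectangular page regions and Euclidean distance; the `Page` class, the function names, and the toy tree are hypothetical and not taken from the slides.

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    name: str
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)  # filled for directory pages
    points: list = field(default_factory=list)    # filled for data pages

def mindist(q, lo, hi):
    """Minimum distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def hs_nearest(root, q):
    """Best-first search: always load the page closest to q next."""
    apl = [(mindist(q, root.lo, root.hi), 0, root)]  # the active page list
    counter = 1  # tie-breaker so the heap never compares Page objects
    best, best_dist, visited = None, math.inf, []
    # Stop once the closest unread page is farther than the candidate.
    while apl and apl[0][0] < best_dist:
        _, _, page = heapq.heappop(apl)
        visited.append(page.name)
        for p in page.points:
            dist = math.dist(p, q)
            if dist < best_dist:
                best, best_dist = p, dist
        for child in page.children:
            heapq.heappush(apl, (mindist(q, child.lo, child.hi), counter, child))
            counter += 1
    return best, best_dist, visited

left = Page("left", (0, 0), (1, 1), points=[(0.2, 0.2), (0.9, 0.9)])
right = Page("right", (2, 0), (3, 1), points=[(2.1, 0.5)])
root = Page("root", (0, 0), (3, 1), children=[left, right])
print(hs_nearest(root, (1.2, 0.8)))
```

In this toy tree the page named "right" is never read, because its minimum distance (0.8) is larger than the distance of the candidate found in "left".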
Nearest-Neighbor Query
• Comparison
  • RKV: pr1 → pr12 → pr11 → …
  • HS: pr1 → pr2 → pr21
Index Structures
• Minimum bounding rectangles
  • R-tree family
  • X-tree
• Bounding spheres
  • SS-tree
  • TV-tree
• Combined regions
  • SR-tree
• Etc.
  • Space-filling curves
  • Pyramid-tree
R, R*, R+-Tree
• Overlap problem
  • For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point
    • Existence of such a point becomes less likely as the dimension of the data space increases
• R+-tree
  • An overlap-free variant of the R-tree using a forced-split strategy
  • High dimensionality leads to many forced-split operations
    • Storage utilization < 50%
[Formula residue: example values $\sqrt[4]{1/2} \approx 0.8409$ and $\sqrt{1/2} \approx 0.7071$; an expression in a, A, C_eff, and d]
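The overlap-free split condition can be tested directly: look for a dimension and a position such that every page region lies entirely on one side. A minimal sketch, assuming regions are given as (lo, hi) corner tuples and that regions touching at the split position count as non-overlapping; the function name is hypothetical.

```python
def overlap_free_splits(rects):
    """Return (dimension, position) pairs at which the rectangles can be
    split into two non-empty groups without cutting any rectangle."""
    d = len(rects[0][0])
    result = []
    for dim in range(d):
        # Sweep the projections in order of their lower bounds.
        order = sorted(rects, key=lambda r: r[0][dim])
        max_hi = order[0][1][dim]
        for lo, hi in order[1:]:
            if lo[dim] >= max_hi:
                # Everything seen so far ends before this rectangle starts.
                result.append((dim, max_hi))
            max_hi = max(max_hi, hi[dim])
    return result

separable = [((0, 0), (1, 1)), ((2, 0.5), (3, 1.5))]
entangled = [((0, 0), (2, 2)), ((1, 1), (3, 3))]
print(overlap_free_splits(separable))  # [(0, 1)]
print(overlap_free_splits(entangled))  # []
```

When the result is empty, an R+-tree-style forced split has to cut at least one child region.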
X-Tree
• Extension of the R*-tree
• Designed for the management of high-dimensional objects
• Overlap-free split (split history)
• Supernodes (unbalanced split tree)
kd-Tree
• Advantage
  • Guarantee of no overlap
• Disadvantages
  • Complete partitioning
    • Page regions are generally larger than necessary, which yields a higher access probability
  • Unbalanced
kd-Tree
• kd-B-tree
  • Balanced kd-tree
  • Forced split
• hB-tree
  • Splitting a node based on multiple attributes
  • Forced split is avoided
• LSDh-tree
  • Coded region description
    • Reduces the space requirement
SS-Tree
• Spheres as page regions
• Split
  • Split axis is determined as the dimension yielding the highest variance
  • Not amenable to an easy overlap-free split
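The split-axis rule can be sketched as a variance computation per dimension. This covers only the choice of axis, not the rest of the split; the function name and the sample points are hypothetical.

```python
from statistics import pvariance

def split_axis(points):
    """Index of the dimension along which the points vary the most."""
    d = len(points[0])
    variances = [pvariance([p[i] for p in points]) for i in range(d)]
    return max(range(d), key=variances.__getitem__)

# Hypothetical entries: widely spread along axis 0, tightly packed along axis 1.
entries = [(0, 0), (10, 1), (20, 0), (30, 1)]
print(split_axis(entries))  # 0
```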
Space Filling Curves
• Range and nearest-neighbor queries based on distance calculations of page regions
• [Figure: query point q and interval I, split into I1 and I2]
  • lb: 47 = 101111, ub: 60 = 111100
  • Longest common prefix: p = 1
  • s = <p100…000> = 110000 = 48
• [Figure: interval I2, split into I21 and I22]
  • lb: 48 = 110000, ub: 60 = 111100
  • Longest common prefix: p = 11
  • s = <p100…000> = 111000 = 56
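The split-point rule from the example (longest common prefix of the bounds, then a 1, then zeros) can be sketched as follows. The function name `split_point` and the explicit `bits` parameter are assumptions made for this illustration.

```python
def split_point(lb, ub, bits):
    """Split value s = <p 1 0...0>, where p is the longest common prefix
    of the bits-wide binary representations of lb and ub (lb < ub)."""
    if not lb < ub:
        raise ValueError("lb must be smaller than ub")
    lb_bits = format(lb, f"0{bits}b")
    ub_bits = format(ub, f"0{bits}b")
    prefix = ""
    for a, b in zip(lb_bits, ub_bits):
        if a != b:
            break
        prefix += a
    # Append a 1 to the prefix and pad with zeros to the full width.
    return int((prefix + "1").ljust(bits, "0"), 2)

print(split_point(47, 60, 6))  # 48
print(split_point(48, 60, 6))  # 56
```

Because lb and ub first differ at a bit where lb has 0 and ub has 1, the result always satisfies lb < s ≤ ub.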
Pyramid Tree
• Divide the data space such that the resulting partitions are shaped like the peels of an onion
• Pyramid mapping
• Optimized for range queries on high-dim. data
• Not affected by the curse of dimensionality
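The slide gives no formula for the pyramid mapping, so the sketch below is an assumption based on the usual formulation: the unit cube is cut into 2d pyramids that meet at the center, and a point is mapped to its pyramid number plus its height inside that pyramid. The function name is hypothetical.

```python
def pyramid_value(v):
    """Map a point of [0, 1]^d to a single number: pyramid index + height."""
    d = len(v)
    # The dimension in which the point is farthest from the center
    # decides which pyramid it falls into.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie below the center along j_max, d..2d-1 above it.
    index = j_max if v[j_max] < 0.5 else j_max + d
    height = abs(0.5 - v[j_max])
    return index + height

print(pyramid_value((0.1, 0.6)))  # pyramid 0, height about 0.4
print(pyramid_value((0.6, 0.9)))  # pyramid 3, height about 0.4
```

The resulting one-dimensional values can be stored in an ordinary B+-tree, and points of equal height within a pyramid form the onion-peel slices mentioned above.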
Summary & Comparison
Conclusions
• Effects occurring in indexing high-dim. spaces
• Principal ideas of the index structures that have been proposed to overcome the problems
• Research on high-dim. indexing has a major impact on many practical applications and commercial multimedia database systems
Future Research Issues
• Real-world data (not uniform and not independent)
• Partitioning strategies that perform well in high-dim. spaces
• Approximate processing of NN queries