High-dimensional indexing techniques

Kesheng John Wu, Ekow Otoo, Arie Shoshani

July, 2001

The big picture

[Diagram labels: Data mining; Request Interpreter; Large distributed dataset; grid storage; MPI-IO file]

The big picture

[Diagram labels: Logical request; Request interpreter; Qualified objects; Request planning/execution; Sub-task schedule; Execution services; grid; LBNL; PPDG; MPI-IO, …]

Problem statement

• Main objective: map a logical request to qualified objects

— A logical request: 20001015<=eventTime & 200<energy<300 …

— Objects:

    • set of object IDs;

    • set of files containing the objects;

    • offsets within the files, …
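The mapping from a logical request to object IDs can be sketched as a brute-force scan, which is the baseline an index is meant to beat. This is a minimal illustration, not the authors' software: the column layout, the sample values, and the names `columns`, `request`, and `qualified_objects` are all assumptions, and only the two predicates visible in the slide are encoded.

```python
# Hypothetical column store: one list per attribute, indexed by object ID.
columns = {
    "eventTime": [20001014, 20001015, 20001016, 20001020],
    "energy":    [250.0,    150.0,    299.5,    300.0],
}

# The logical request 20001015 <= eventTime & 200 < energy < 300,
# written as one predicate per attribute.
request = {
    "eventTime": lambda t: 20001015 <= t,
    "energy":    lambda e: 200 < e < 300,
}

def qualified_objects(columns, request):
    """Return the IDs of objects that satisfy every predicate in the request."""
    num_objects = len(next(iter(columns.values())))
    return [
        oid for oid in range(num_objects)
        if all(pred(columns[attr][oid]) for attr, pred in request.items())
    ]

print(qualified_objects(columns, request))  # [2]
```

Only the attributes named in the request are touched, which is why storing each attribute separately pays off for partial range queries.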

Requirements & Status

• General requirements

— Users request data in terms of their scientific domain, not file names or offsets in files

— Each object may be described by hundreds of attributes

— Each request is expressed as range predicates on a handful of attributes (a partial range query)

• Status

— Initially motivated by a HENP experiment: STAR

— Software originally developed under GC and is currently in use at BNL

Large high-dimensional datasets

• Number of attributes / columns: 200 – 500

• Number of objects / events: 10^8 – 10^9

• File containing one attribute: 400MB – 4GB

• Total size over all attributes: 80GB – 2TB
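The sizes above are consistent with one another if the object counts are read as 10^8 to 10^9 and each attribute value takes 4 bytes. The 4-byte width is an assumption made here to check the arithmetic; the slide does not state it.

```python
BYTES_PER_VALUE = 4  # assumption: each attribute value is stored in 4 bytes

def column_bytes(num_objects):
    """Size of the file holding one attribute for all objects."""
    return num_objects * BYTES_PER_VALUE

def total_bytes(num_objects, num_attributes):
    """Size of the whole dataset over all attributes."""
    return column_bytes(num_objects) * num_attributes

MB, GB, TB = 10**6, 10**9, 10**12
print(column_bytes(10**8) / MB)      # 400.0  (MB per attribute, low end)
print(column_bytes(10**9) / GB)      # 4.0    (GB per attribute, high end)
print(total_bytes(10**8, 200) / GB)  # 80.0   (GB total, low end)
print(total_bytes(10**9, 500) / TB)  # 2.0    (TB total, high end)
```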

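[Figure: the dataset as a table with one row per object (object IDs 0, 1, 2, ... up to 10^8 – 10^9) and one column per attribute (A1, A2, A3, A4, …)]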

Goal: develop an index that:

• Reads as little as possible from disk

• Minimizes computation in memory

Curse of dimensionality

Well known indexing methods

• B-tree based indices

— One or a small number of attributes

— Index size may be up to 3 times the data size

• R-tree based indices

— Small number of attributes, say, < 10

• UB-tree

— Uses space-filling curves to map high-dimensional data to one dimension

— One range query is mapped into many queries on the B-tree based index

• Even sequential scan

— Better than B-tree and R-tree if dimension > 10

— Simply reading all data and comparing takes too long
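The UB-tree point can be made concrete with a small sketch. It uses the Z-order (bit-interleaving) curve as the space-filling curve; the function names and the tiny 4 x 4 grid are illustrative assumptions. A two-dimensional box query turns into several disjoint key intervals, each of which would be a separate lookup in the one-dimensional index.

```python
def z_order(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one Z-order key."""
    key = 0
    ndim = len(coords)
    for b in range(bits):
        for d, c in enumerate(coords):
            key |= ((c >> b) & 1) << (b * ndim + d)
    return key

def key_intervals(lo, hi, bits):
    """Contiguous key ranges covering the 2-D box lo <= (x, y) <= hi."""
    keys = sorted(
        z_order((x, y), bits)
        for x in range(lo[0], hi[0] + 1)
        for y in range(lo[1], hi[1] + 1)
    )
    runs = []
    for k in keys:
        if runs and k == runs[-1][1] + 1:
            runs[-1][1] = k          # extend the current run
        else:
            runs.append([k, k])      # start a new run
    return [tuple(r) for r in runs]

# The box 1 <= x <= 2, 0 <= y <= 1 holds only 4 cells but needs 3 key ranges.
print(key_intervals((1, 0), (2, 1), bits=2))  # [(1, 1), (3, 4), (6, 6)]
```

The number of intervals tends to grow with the number of dimensions and the size of the query box, which is the cost the slide refers to.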

Another class of indexes: Bitmap index

• Example queries on the attribute, say, A

• One-sided range query: A < 2

— b0 OR b1

• Two-sided range query: 2<A<5

— b3 OR b4

• Basic steps of building a bitmap index

— Binning

— Encoding

— Compressing

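Data values and their bitmaps (one row per object, one bitmap per value):

Data value   b0 (=0)   b1 (=1)   b2 (=2)   b3 (=3)   b4 (=4)   b5 (=5)

0            1         0         0         0         0         0

1            0         1         0         0         0         0

5            0         0         0         0         0         1

3            0         0         0         1         0         0

1            0         1         0         0         0         0

2            0         0         1         0         0         0

0            1         0         0         0         0         0

4            0         0         0         0         1         0

1            0         1         0         0         0         0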
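The example on this slide can be reproduced with a few lines of code. The sketch below builds one bitmap per value for the data 0, 1, 5, 3, 1, 2, 0, 4, 1 and answers the two example queries with bitwise OR. It skips binning and compression, and the helper names are illustrative assumptions.

```python
def build_equality_bitmaps(values, num_values):
    """One bitmap per distinct value; bit i of bitmap v is 1 when values[i] == v."""
    bitmaps = [[0] * len(values) for _ in range(num_values)]
    for i, v in enumerate(values):
        bitmaps[v][i] = 1
    return bitmaps

def bit_or(a, b):
    """Bitwise OR of two bitmaps of equal length."""
    return [x | y for x, y in zip(a, b)]

def to_str(bitmap):
    return "".join(map(str, bitmap))

values = [0, 1, 5, 3, 1, 2, 0, 4, 1]
b = build_equality_bitmaps(values, 6)

print(to_str(b[0]))                # 100000100
print(to_str(bit_or(b[0], b[1])))  # A < 2     -> 110010101
print(to_str(bit_or(b[3], b[4])))  # 2 < A < 5 -> 000100010
```

Each 1 in a result marks the position (object ID) of a qualifying object.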

How many bins?

[Figure: objects scattered over Range(x) and Range(y), with the edge bins of a query marked]

More bins: fewer objects in edge bins
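The trade-off behind the bin count can be sketched as follows. Bins that lie entirely inside a query range are answered from the index alone, while bins that only partly overlap it (edge bins) force a check of the raw values. The uniform bin boundaries, the sample values, and the function names are illustrative assumptions; sets of object IDs stand in for bitmaps.

```python
import bisect

def build_binned_index(values, boundaries):
    """Bin k holds the IDs of objects with boundaries[k] <= x < boundaries[k+1]."""
    bins = [set() for _ in range(len(boundaries) - 1)]
    for oid, x in enumerate(values):
        k = bisect.bisect_right(boundaries, x) - 1
        bins[k].add(oid)
    return bins

def range_query(values, boundaries, bins, lo, hi):
    """Answer lo <= x < hi. Returns (hits, number of raw values checked)."""
    hits, checked = set(), 0
    for k, members in enumerate(bins):
        b_lo, b_hi = boundaries[k], boundaries[k + 1]
        if lo <= b_lo and b_hi <= hi:
            hits |= members            # bin lies entirely inside the query
        elif b_lo < hi and lo < b_hi:  # edge bin: partly overlaps the query
            for oid in members:
                checked += 1
                if lo <= values[oid] < hi:
                    hits.add(oid)
    return hits, checked

values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
coarse = [0, 4, 8]            # 2 bins
fine = list(range(9))         # 8 bins

print(range_query(values, coarse, build_binned_index(values, coarse), 2.5, 6.5))
# ({2, 3, 4, 5}, 8): every object sits in an edge bin and must be checked
print(range_query(values, fine, build_binned_index(values, fine), 2.5, 6.5))
# ({2, 3, 4, 5}, 2): only the two edge bins are checked
```

Both bin counts return the same answer; the finer binning simply reads fewer raw values, at the price of more bitmaps to store.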

How to encode

[Figure: bitmaps for 6 bins (0, 1, 2, 3, 4, 5) under equality encoding, range encoding, and interval encoding]
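Since the figure does not survive in this text, here is a hedged sketch of the three encodings for 6 bins. The definitions are assumptions based on the way these schemes are commonly described: equality bitmap E_j marks bin j, range bitmap R_j marks bins 0 through j, and interval bitmap I_j marks bins j through j + m with m = n/2 - 1 (rounded down). The slides may define them slightly differently.

```python
def equality_bitmaps(bins, n):
    # E_j marks objects whose bin number equals j (n bitmaps).
    return [[int(v == j) for v in bins] for j in range(n)]

def range_bitmaps(bins, n):
    # R_j marks objects whose bin number is <= j
    # (n - 1 bitmaps; R_{n-1} would be all ones and is not stored).
    return [[int(v <= j) for v in bins] for j in range(n - 1)]

def interval_bitmaps(bins, n):
    # I_j marks objects whose bin number lies in [j, j + m], m = n // 2 - 1
    # (ceil(n / 2) bitmaps).
    m = n // 2 - 1
    return [[int(j <= v <= j + m) for v in bins] for j in range((n + 1) // 2)]

def and_not(a, b):
    """Bitwise a AND (NOT b)."""
    return [x & (1 - y) for x, y in zip(a, b)]

bins = [0, 1, 5, 3, 1, 2, 0, 4, 1]   # bin number of each object
n = 6
E = equality_bitmaps(bins, n)
R = range_bitmaps(bins, n)
I = interval_bitmaps(bins, n)

print(len(E), len(R), len(I))        # 6 5 3
print(R[1])                          # A <= 1 with one bitmap
print(and_not(R[4], R[2]))           # 3 <= A <= 4 with two range bitmaps
print(and_not(I[2], I[0]))           # 3 <= A <= 4 with two interval bitmaps
```

Under these definitions a one-sided query needs a single range bitmap, and a two-sided query needs at most two bitmaps in either range or interval encoding, while interval encoding stores about half as many bitmaps.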

Advantages of bitmap indices

• Fast operations

— The most common operations are the bitwise logical operations

— They are well supported by hardware

• Easy to compress, potentially small index size

• Each individual bitmap is small and frequently used ones can be cached in memory

• Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups

• Available in most major commercial DBMS

Why our own bitmap index

• Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)

• Vertical partition: allows one to only read data of the attributes involved in a query

• New compression method

— Best known: Byte-aligned Bitmap Code (BBC)

— Developed two word-aligned schemes: WAH and WBC

• Different encoding schemes under compression

— Equality encoding – used in ORACLE and others

— Range encoding – one-sided range queries

— Interval encoding – two-sided range queries
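The slides do not spell out the WAH or WBC formats, so the following is only a generic sketch of the word-aligned idea, under assumptions chosen here: the bitmap is cut into 31-bit groups, a run of identical all-0 or all-1 groups is stored as one 32-bit fill word (top bit 1, next bit the fill value, low 30 bits the run length in groups), and any other group is stored as a literal word (top bit 0).

```python
GROUP = 31                      # payload bits per 32-bit word
ALL_ONES = (1 << GROUP) - 1
COUNT_MASK = 0x3FFFFFFF         # low 30 bits of a fill word hold the run length

def encode(bits):
    """Compress a list of 0/1 values whose length is a multiple of 31."""
    assert len(bits) % GROUP == 0
    words = []
    for start in range(0, len(bits), GROUP):
        group = int("".join(map(str, bits[start:start + GROUP])), 2)
        if group in (0, ALL_ONES):
            fill = 1 if group else 0
            header = (1 << 31) | (fill << 30)
            prev = words[-1] if words else None
            # Extend the previous fill word if it has the same fill bit and room left.
            if (prev is not None and (prev >> 30) == (header >> 30)
                    and (prev & COUNT_MASK) < COUNT_MASK):
                words[-1] = prev + 1
            else:
                words.append(header | 1)
        else:
            words.append(group)  # literal word: top bit is 0
    return words

def decode(words):
    """Expand the words produced by encode back into a list of bits."""
    bits = []
    for w in words:
        if w >> 31:              # fill word
            fill = (w >> 30) & 1
            bits.extend([fill] * (GROUP * (w & COUNT_MASK)))
        else:                    # literal word
            bits.extend(int(c) for c in format(w, "031b"))
    return bits

bitmap = [0] * (31 * 100) + [1] + [0] * 30 + [1] * (31 * 50)
words = encode(bitmap)
print(len(words))                # 3 words instead of 151 groups
print(decode(words) == bitmap)   # True
```

Because every word boundary lines up with a machine word, logical operations can work a word at a time without extracting bytes or bits, which is the likely source of the speed advantage over a byte-aligned code at the cost of some extra space.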

Information about the test machines

• Hardware and system

— Sun Enterprise 450 (UltraSPARC II, 400 MHz)

— 4GB RAM

— VERITAS volume manager (striped disk)

• Real application data from STAR

— More than 2 million objects

— Picked 12 attributes with varying distributions

• Measures:

— Logical operation time without IO

— Logical operation time with IO

— Query processing time

Logical operation time (no IO)

Logical operation time (including IO)

New compression schemes

• Overall, use about 50% more space than BBC

• On average, 12 times faster than BBC

• Faster than the uncompressed scheme in more cases:

— New schemes are faster than the uncompressed scheme when the compression ratio is less than 0.3

— BBC is faster than the uncompressed scheme when the compression ratio is less than 0.03

Sizes of bitmap indices

Conclusion:

— Equality encoding is the most space efficient

— Compression gain is at least a factor of 2.5

Average query processing time

Conclusion:

— Interval and range encoding are the best

— For these cases, there is practically no penalty to compression

Interval encoding is better overall

Sequential scan time: 0.557 sec

Summary

• Better compression scheme

— 50% more space, but 10-12 times faster!

• Among the different encoding schemes

— Interval encoding is better than equality encoding and range encoding

• The number of bins determines the bitmap index size and operation efficiency. For example:

— Index at 10% of data size => 3x the speed of sequential scan

— Index at 20% of data size => 6x the speed of sequential scan

• Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

Future work

• Support NULL values and categorical values

• On-line update: add new data and update index without interrupting request processing

• Recovery mechanism for robustness

• Potential new applications: climate, astrophysics, biology

• Study different non-uniform binning strategies

• Integrate with a conventional database system to better handle metadata and to provide a more versatile front end
