July, 2001 High-dimensional indexing techniques Kesheng John Wu Ekow Otoo Arie Shoshani

Dec 28, 2015

Page 1:

July, 2001

High-dimensional indexing techniques

Kesheng John Wu

Ekow Otoo

Arie Shoshani

Page 2:

The big picture

[Diagram labels: Data mining; Request Interpreter; Large Distributed dataset; grid storage; MPI-IO file]

Page 3:

The big picture

[Diagram labels: Request interpreter; Logical request; Qualified objects; Request planning/execution; Execution services; grid; LBNL; PPDG; MPI-IO, …; Sub-task schedule]

Page 4:

Problem statement

• Main objective: map a logical request to qualified objects

— a logical request:

    • 20001015<=eventTime & 200<energy<300 …

— objects:

    • set of object IDs;

    • set of files containing the objects;

    • offsets within the files, …
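As a minimal sketch of this mapping (not the authors' implementation), the example request can be evaluated by brute force over per-attribute columns. The attribute names `eventTime` and `energy` come from the example request; the sample values and the helper `qualified_objects` are hypothetical.

```python
# Hypothetical sample data: one list per attribute, indexed by object ID.
columns = {
    "eventTime": [20001014, 20001015, 20001016, 20001020],
    "energy":    [250.0,    199.0,    250.0,    299.5],
}

def qualified_objects(cols):
    """Return IDs of objects with 20001015 <= eventTime and 200 < energy < 300."""
    n = len(cols["eventTime"])
    return [
        oid for oid in range(n)
        if 20001015 <= cols["eventTime"][oid] and 200 < cols["energy"][oid] < 300
    ]

result = qualified_objects(columns)
print(result)
```

This scan touches every value of every attribute named in the request, which is the cost an index is meant to avoid.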

Page 5:

Requirements & Status

• General requirements

— Users request data in terms of their scientific domain, not file names or offsets in files

— Each object may be described by hundreds of attributes

— Each request is expressed as range predicates on a handful of attributes (partial range query)

• Status

— Initially motivated by a HENP experiment: STAR

— Software was originally developed under GC and is currently in use at BNL

Page 6:

Large high-dimensional datasets

• Number of attributes / columns: 200 – 500

• Number of objects / events: 10^8 – 10^9

• File containing one attribute: 400MB – 4GB

• Total size over all attributes: 80GB – 2TB
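The sizes above are mutually consistent if the object counts are read as 10^8 to 10^9 and each attribute value occupies 4 bytes. The 4-byte width is an assumption, not stated on the slide; the short check below only reproduces the arithmetic.

```python
BYTES_PER_VALUE = 4  # assumption: each attribute value is stored in 4 bytes

def file_size_bytes(num_objects):
    """Size of a file holding one attribute for all objects."""
    return num_objects * BYTES_PER_VALUE

def total_size_bytes(num_objects, num_attributes):
    """Size of all attribute files together."""
    return file_size_bytes(num_objects) * num_attributes

low_file = file_size_bytes(10**8)          # low end of the object count
high_file = file_size_bytes(10**9)         # high end of the object count
low_total = total_size_bytes(10**8, 200)   # fewest objects, fewest attributes
high_total = total_size_bytes(10**9, 500)  # most objects, most attributes

print(low_file / 1e6, "MB to", high_file / 1e9, "GB per attribute file")
print(low_total / 1e9, "GB to", high_total / 1e12, "TB in total")
```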

[Figure: table with one row per object (Object ID 0, 1, 2, ..., up to 10^8 to 10^9) and one column per attribute (A1, A2, A3, A4, …)]

Goal: develop an index, so that:

• Read as little as possible from disk

• Minimize computation in memory

Curse of dimensionality

Page 7:

Well known indexing methods

• B-tree based indices

— One or a small number of attributes

— Index size may be up to 3 times the data size

• R-tree based indices

— Small number of attributes, say, < 10

• UB-tree

— Uses space-filling curves to map high-dimensional data to one dimension

— One range query is mapped into many queries on the B-tree based index

• Even sequential scan

— Better than B-tree and R-tree if dimension > 10

— Simply reading all data and comparing takes too long
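To illustrate why a space-filling curve turns one range query into many one-dimensional queries, here is a small sketch using the Z-order (Morton) curve, the curve usually associated with UB-trees. The function names `z_order_key` and `key_runs` are hypothetical, and the 2-D example is chosen only for readability.

```python
def z_order_key(coords, bits_per_dim):
    """Interleave the bits of each coordinate into one integer key."""
    key = 0
    ndim = len(coords)
    for bit in range(bits_per_dim):
        for dim, value in enumerate(coords):
            key |= ((value >> bit) & 1) << (bit * ndim + dim)
    return key

def key_runs(keys):
    """Count the maximal runs of consecutive integers in a non-empty key set."""
    keys = sorted(keys)
    runs = 1
    for prev, cur in zip(keys, keys[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

# A 2 x 2 query box: 1 <= x <= 2 and 1 <= y <= 2, with 2 bits per dimension.
box_keys = sorted(z_order_key((x, y), 2) for x in (1, 2) for y in (1, 2))
num_runs = key_runs(box_keys)
print(box_keys, num_runs)
```

Each run of consecutive keys corresponds to a separate range lookup in the underlying one-dimensional index, so even this tiny box needs several lookups.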

Page 8:

Another class of indexes: Bitmap index

• Example queries on the attribute, say, A

• One-sided range query: A < 2

— b0 OR b1

• Two-sided range query: 2<A<5

— b3 OR b4

• Basic steps of building a bitmap index

— Binning

— Encoding

— Compressing

Data values: 0 1 5 3 1 2 0 4 1

Bitmap  Condition  Bits (one per data value)
b0      =0         100000100
b1      =1         010010001
b2      =2         000001000
b3      =3         000100000
b4      =4         000000010
b5      =5         001000000
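The example on this slide can be reproduced with a few lines of code. This is a minimal sketch of an uncompressed, equality-encoded bitmap index with one bin per distinct value; the helper names are hypothetical, and bitmaps are kept as plain lists of 0/1 for readability rather than packed machine words.

```python
data = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # data values from the slide

def build_equality_bitmaps(values, cardinality):
    """bitmaps[v][i] is 1 when values[i] == v (equality encoding)."""
    return [[1 if x == v else 0 for x in values] for v in range(cardinality)]

def bitmap_or(a, b):
    """Bitwise OR of two bitmaps of equal length."""
    return [x | y for x, y in zip(a, b)]

def to_string(bitmap):
    """Render a bitmap the way the slide prints it."""
    return "".join(str(bit) for bit in bitmap)

b = build_equality_bitmaps(data, 6)
one_sided = bitmap_or(b[0], b[1])  # A < 2
two_sided = bitmap_or(b[3], b[4])  # 2 < A < 5

for v, bitmap in enumerate(b):
    print(f"b{v}", to_string(bitmap))
print("A < 2    ", to_string(one_sided))
print("2 < A < 5", to_string(two_sided))
```

A 1 at position i in a result bitmap means that the object with ID i satisfies the query.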

Page 9:

How many bins?

[Figure: a query region spanning Range(x) and Range(y) laid over a grid of bins, with the edge bins marked]

More bins => fewer objects in edge bins
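The trade-off can be sketched in one dimension, reading "edge bins" as the bins that a query range only partly covers, so that their objects must be checked against the raw values. The bin boundaries, sample values, and the function `range_query` are hypothetical.

```python
import bisect

def bin_index(value, boundaries):
    """Bin i covers boundaries[i] <= value < boundaries[i + 1]."""
    return bisect.bisect_right(boundaries, value) - 1

def range_query(values, boundaries, low, high):
    """Answer low <= value < high; return (hits, number of raw values checked).

    A real index would read bin membership from bitmaps. It is recomputed
    here from the values only to keep the sketch short.
    """
    hits, checked = [], 0
    for oid, value in enumerate(values):
        b = bin_index(value, boundaries)
        bin_lo, bin_hi = boundaries[b], boundaries[b + 1]
        if low <= bin_lo and bin_hi <= high:
            hits.append(oid)      # bin lies fully inside the query
        elif bin_hi <= low or high <= bin_lo:
            continue              # bin lies fully outside the query
        else:
            checked += 1          # edge bin: the raw value must be read
            if low <= value < high:
                hits.append(oid)
    return hits, checked

values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
coarse = [0, 4, 8]        # 2 bins
fine = [0, 2, 4, 6, 8]    # 4 bins

coarse_hits, coarse_checked = range_query(values, coarse, 3, 7)
fine_hits, fine_checked = range_query(values, fine, 3, 7)
print(coarse_hits, coarse_checked)
print(fine_hits, fine_checked)
```

Both binnings return the same objects, but the finer one reads fewer raw values, at the price of storing more bitmaps.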

Page 10:

How to encode

[Figure: bitmaps under equality encoding, range encoding, and interval encoding, illustrated with 6 bins (0 1 2 3 4 5)]
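Since the figure is not recoverable from the transcript, the sketch below builds the three encodings for 6 bins under one common formulation from the bitmap-index literature. The exact definitions are an assumption here: equality bitmap j marks value == j, range bitmap j marks value <= j, and interval bitmap j marks j <= value <= j + m with m = c // 2 - 1. All function names are hypothetical.

```python
def equality_encode(values, c):
    """E[j][i] = 1 when values[i] == j; uses c bitmaps."""
    return [[int(x == j) for x in values] for j in range(c)]

def range_encode(values, c):
    """R[j][i] = 1 when values[i] <= j; uses c - 1 bitmaps."""
    return [[int(x <= j) for x in values] for j in range(c - 1)]

def interval_encode(values, c):
    """I[j][i] = 1 when j <= values[i] <= j + m, m = c // 2 - 1; ceil(c / 2) bitmaps."""
    m = c // 2 - 1
    return [[int(j <= x <= j + m) for x in values] for j in range((c + 1) // 2)]

def and_not(a, b):
    """Bitwise a AND (NOT b)."""
    return [x & (1 - y) for x, y in zip(a, b)]

def to_string(bitmap):
    return "".join(str(bit) for bit in bitmap)

data = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # sample values in bins 0..5
E = equality_encode(data, 6)
R = range_encode(data, 6)
I = interval_encode(data, 6)

one_sided = R[1]                 # A < 2 is a single range-encoded bitmap
two_sided = and_not(I[2], I[0])  # 2 < A < 5: in [2, 4] but not in [0, 2]
print(len(E), len(R), len(I))
print(to_string(one_sided), to_string(two_sided))
```

Under these assumptions the one-sided query reads one bitmap with range encoding, and this two-sided query combines two bitmaps with interval encoding, while interval encoding stores about half as many bitmaps as the other two.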

Page 11:

Advantages of bitmap indices

• Fast operations

— The most common operations are the bitwise logical operations

— They are well supported by hardware

• Easy to compress, potentially small index size

• Each individual bitmap is small and frequently used ones can be cached in memory

• Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups

• Available in most major commercial DBMS

Page 12:

Why our own bitmap index

• Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)

• Vertical partition: allows one to only read data of the attributes involved in a query

• New compression method

— Best known: Byte-aligned Bitmap Code (BBC)

— Developed 2 Word-Aligned Schemes: WAH, WBC

• Different encoding schemes under compression

— Equality encoding – used in ORACLE and others

— Range encoding – one-sided range queries

— Interval encoding – two-sided range queries
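The slides do not give the WAH or WBC formats, so the following is only a simplified word-aligned run-length scheme in the spirit of WAH, not the authors' code. It assumes 32-bit words with 31 payload bits: a literal word has its top bit clear, and a fill word has its top bit set, then the fill bit, then a 30-bit count of identical 31-bit groups. All names are hypothetical.

```python
GROUP = 31                    # payload bits per 32-bit word (assumption)
ALL_ONES = (1 << GROUP) - 1
MAX_COUNT = (1 << 30) - 1     # largest run length one fill word can hold

def compress(bits):
    """Compress a list of 0/1 bits into a list of 32-bit words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))   # zero-pad the last group
        value = int("".join(map(str, chunk)), 2)
        if value in (0, ALL_ONES):
            fill_bit = 1 if value else 0
            header = (1 << 31) | (fill_bit << 30)
            same_fill = bool(words) and (words[-1] >> 30) == (header >> 30)
            if same_fill and (words[-1] & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1                       # extend the previous run
            else:
                words.append(header | 1)             # start a new run of length 1
        else:
            words.append(value)                      # literal word, top bit is 0
    return words

def decompress(words, nbits):
    """Invert compress(); nbits is the original bitmap length."""
    bits = []
    for word in words:
        if word >> 31:                               # fill word
            fill_bit = (word >> 30) & 1
            bits.extend([fill_bit] * (GROUP * (word & MAX_COUNT)))
        else:                                        # literal word
            bits.extend(int(ch) for ch in format(word, "031b"))
    return bits[:nbits]

bitmap = [0] * 1000
bitmap[5] = 1
bitmap[900] = 1
words = compress(bitmap)
print(len(words), "words instead of", -(-len(bitmap) // GROUP))
```

Because every word is either a literal or a whole number of groups, logical operations can proceed word by word without bit-level unpacking, which is the usual argument for trading some space for speed relative to a byte-aligned code.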

Page 13:

Information about the test machines

• Hardware and system

— Sun Enterprise 450 (UltraSPARC II 400MHz)

— 4GB RAM

— VERITAS Volume Manager (striped disk)

• Real application data from STAR

— More than 2 million objects

— Picked 12 attributes with varying distributions

• Measures:

— Logical operation time without IO

— Logical operation time with IO

— Query processing time

Page 14:

Logical operation time (no IO)

Page 15:

Logical operation time (including IO)

Page 16:

New compression schemes

• Overall, use about 50% more space than BBC

• On average, 12 times faster than BBC

• Faster than the uncompressed in more cases:

— New schemes are faster than the uncompressed scheme when the compression ratios are less than 0.3

— BBC is faster than the uncompressed when the compression ratios are less than 0.03

Page 17:

Sizes of bitmap indices

Conclusion:
- Equality encoding is the most space efficient
- Compression gain is at least a factor of 2.5

Page 18:

Average query processing time

Conclusion:
- Interval and range encoding are the best
- For these cases, there is practically no penalty to compression

Page 19:

Interval encoding is better overall

Sequential scan time: 0.557 sec

Page 20:

Summary

• Better compression scheme

— 50% more space, but 10-12 times faster

• Among the different encoding schemes

— interval encoding is better than equality encoding and range encoding

• The number of bins determines bitmap index size and operation efficiency. For example:

— index at 10% of data size => 3x the speed of sequential scan

— index at 20% of data size => 6x the speed of sequential scan

• Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

Page 21:

Future work

• Support NULL values and categorical values

• On-line update: add new data and update index without interrupting request processing

• Recovery mechanism for robustness

• Potential new applications: climate, astrophysics, biology

• Study different non-uniform binning strategies

• Integrate with a conventional database system to better handle metadata and to provide a more versatile front-end