COMP 430 Intro. to Database Systems Indexing

COMP 430 Intro. to Database Systems - Rice University · 2016-03-23 · •SQL gives us some control, so we should understand the options. •Concerned with user-visible effects,

Jul 15, 2020


COMP 430 Intro. to Database Systems

Indexing

How does DB find records quickly?

• Various forms of indexing

• An index is automatically created for primary key.

• SQL gives us some control, so we should understand the options.
  • Concerned with user-visible effects, not underlying implementation.

CREATE INDEX index_city_salary
ON Employees (city, salary);

CREATE UNIQUE CLUSTERED INDEX idx
ON MyTable (attr1 DESC, attr2 ASC);

Options vary.

Phone book model

Data is stored with search key.

Clustered index.

Organized by one search key:

• Last name, first name.

• Searching by any other key is slow.

Library card catalog model

Index stores pointers to data.

Non-clustered index.

Organized by search key(s).

• Author last name, first name.

• Title

• Subject

Evaluating an indexing scheme

• Access type flexibility
  • Specific key value – e.g., “John”, “Smith”
  • Key value range – e.g., salary between $50K and $60K

Advantage:

• Access time

Disadvantages:

• Update time

• Space overhead

Creating an index can slow down the system!

Ordered indices

Sorted, as in previous human-oriented examples

Types of indices we’ll see

• Dense vs. sparse

• Primary vs. secondary

• Unique vs. not-unique

• Single- vs. multi-level

These ideas can be combined in various ways.

Primary index – the clustering index

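[Figure: two primary indexes, each drawn as an Index column with pointers into a sorted Data column (10, 20, 30, 40, 50, 60, …). Dense: one index entry per search key (10, 20, 30, 40, 50, …). Sparse: index entries for only some search keys (10, 30, 80, 100, 140, …).]

Typically unique index also, since primary search key typically same as primary key.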

What were primary search keys in phone book & library card catalog?

Primary index – trade-off dense vs. sparse

• Dense – faster access

• Sparse – less update time, less space overhead

[Figure: sparse primary index (10, 40, 70, 100, 130) with pointers into a sorted Data column (10, 20, 30, 40, 50, 60, 70, …).]

Good trade-off: Sparse, but link to first key of each file block.
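A minimal Python sketch of this trade-off, assuming a hypothetical data file of sorted three-record blocks. The names `BLOCKS`, `SPARSE_INDEX`, and `lookup` are illustrative only. The index holds the first key of each block, so a lookup is a binary search of the index followed by a scan of a single block.

```python
from bisect import bisect_right

# Hypothetical data file: sorted records grouped into fixed-size blocks.
BLOCKS = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
    [100, 110, 120],
    [130, 140, 150],
]

# Sparse index: one (first key, block number) entry per file block.
SPARSE_INDEX = [(block[0], block_no) for block_no, block in enumerate(BLOCKS)]
INDEX_KEYS = [key for key, _ in SPARSE_INDEX]


def lookup(key):
    """Return (block number, offset) for key, or None if it is absent."""
    # Binary search for the last index entry whose key is <= the search key.
    pos = bisect_right(INDEX_KEYS, key) - 1
    if pos < 0:
        return None  # smaller than every key in the file
    block_no = SPARSE_INDEX[pos][1]
    # Only this one block has to be scanned.
    for offset, record_key in enumerate(BLOCKS[block_no]):
        if record_key == key:
            return (block_no, offset)
    return None
```

With this layout the index keys are 10, 40, 70, 100, 130, and `lookup(50)` searches only the second block.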

Secondary index – a non-clustering index

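[Figure: dense secondary index (10, 20, 30, 50, 100, …) with pointers into a Data column that is not sorted on this search key (30, 100, 20, 10, 50, 140, …).]

What were secondary search keys in phone book & library card catalog?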

Needs to be dense, otherwise we can’t find all search keys efficiently.

Secondary index typically not unique

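[Figure: non-unique secondary index. Each index entry (10, 20, 30, 80, 100, …) points to a bucket of pointers to all data records with that search key; the Data column contains duplicates (30, 100, 20, 10, 30, 10, …).]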

Index size

• Dense index – search key + pointer per record

• Sparse index – search key + pointer per file block (typically)

Many records, but want index to fit in memory

Solution: Multi-level index

Same problem & solution as for page tables in virtual memory.
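As a rough worked example of why extra levels help, the Python sketch below counts index blocks per level. The function name `index_levels` and all of the numbers are assumptions for illustration. It assumes a sparse bottom level with one entry per data block, and each higher level is a sparse index over the blocks of the level below.

```python
import math


def index_levels(num_records, records_per_block, entries_per_block):
    """Return the number of index blocks at each level, bottom level first.

    Levels are added until the top level fits in a single block.
    """
    data_blocks = math.ceil(num_records / records_per_block)
    levels = []
    entries = data_blocks  # bottom level: one entry per data block
    while True:
        blocks = math.ceil(entries / entries_per_block)
        levels.append(blocks)
        if blocks == 1:
            break
        entries = blocks  # next level: one entry per block of this level
    return levels


# Assumed sizes: 10 million records, 100 records per data block,
# 500 index entries per index block.
example = index_levels(10_000_000, 100, 500)
```

Under these assumed sizes the bottom level needs 200 blocks and one more level of a single block sits on top, so only that single top block has to stay in memory to begin every search.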

Multi-level primary index

[Figure: two-level primary index. A Sparse Index (10, 130, 250, …) points into a (Semi-)dense Index (10, 40, 70, 100, 130, 160, 190, 220, 250, …), which points into the sorted Data column (10, 20, 30, …, 100, …).]

Multi-level secondary index

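[Figure: two-level secondary index. A Sparse Index (10, 50, 90, 130, …) points into a Dense Index (10, 20, 30, 40, 50, 60, 70, 80, 90, …), which points to records in a Data column that is not sorted on this search key.]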

Multi-level index summary

• Access time now very locality-dependent

• Update time increased – must update multiple indices

• Total space overhead increased

Updating database & indices – inserting

[Figure: sparse index (10, 40, 70, 100, 130) over a sorted data file; a new record is inserted into the middle of the sorted file.]

Moving all records is too expensive.
Have same issue when updating indices.

Updating database & indices – deleting

[Figure: sparse index (10, 40, 70, 100, 130) over a sorted data file; a record is deleted from the middle of the sorted file.]

Moving all records is too expensive.
Have same issue when updating indices.

Updating file leads to fragmentation

Performance degrades with time.

Need to periodically reorganize data & indices.

Solution: B+-trees

B+-tree indices

B+-trees

• Balanced search trees optimized for disk-based access
  • Reorganizes a little on every update

• Shallow & wide (high fan-out)

• Typically, node size = disk block

• Slight differences from more commonly-known B-trees
  • All data in leaf nodes
  • Leaf nodes sequentially linked

Easily get all data in order.

CREATE INDEX … ON … USING BTREE;

Or, it is the default in many DBMSs.

B+-tree example

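[Figure: example B+-tree. Internal nodes hold search keys, the leaf nodes hold all keys in sorted order and are linked to one another, and each leaf entry points to its data record. The key values in the figure did not survive extraction.]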

B+-tree performance

• Very similar to multi-level index structure

− Slightly higher per-operation access & update time

+ No degradation over time

+ No periodic reorganization

B+-tree widely used in relational DBMSs

What about non-unique search keys?

• Allow duplicates in tree. Maintain ≤ order instead of <.
  • Slightly complicates tree operations.
• Make unique by adding record-ID.
  • Extra storage. But, record-ID useful for other purposes, too.
• List of duplicate records for each key.
  • Trivial to get all duplicates.
  • Inefficient when lists get long.

Indexing on VARCHAR keys

• Key size variable, so number of keys that fit into a node also varies.

• Techniques to maximize fan-out:
  • Key values at internal nodes can be prefixes of full key. E.g., “Johnson” and “Jones” can be separated by “Jon”.
  • Key values at leaf nodes can be compressed by sharing common prefixes. E.g., “Johnson” and “Jones” can be stored as “Jo” + “hnson”/“nes”.
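Both techniques can be sketched in a few lines of Python. The helper names `shortest_separator` and `prefix_compress` are hypothetical, and this is only one simple way to pick a separator, not necessarily what any particular DBMS does.

```python
import os


def shortest_separator(left, right):
    """Return the shortest prefix p of right such that left < p <= right."""
    if not left < right:
        raise ValueError("left must sort strictly before right")
    for length in range(1, len(right) + 1):
        prefix = right[:length]
        if prefix > left:
            return prefix
    return right  # not reached: right itself is greater than left


def prefix_compress(keys):
    """Split keys into (shared prefix, list of remaining suffixes)."""
    prefix = os.path.commonprefix(keys)
    return prefix, [key[len(prefix):] for key in keys]


separator = shortest_separator("Johnson", "Jones")
shared, suffixes = prefix_compress(["Johnson", "Jones"])
```

Here `separator` is "Jon", which sorts after "Johnson" and no later than "Jones", and the leaf keys compress to the shared prefix "Jo" plus the suffixes "hnson" and "nes".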

Hash indices

Basic idea of a hash function

[Figure: hash function h() maps value1, value2, value3, and value4 to buckets.]

h() distributes values uniformly among the buckets.

h() distributes typical subsets of values uniformly among the buckets.

Using hash function for indexing

CREATE INDEX … ON … USING HASH;

[Figure: value1, value3, and value4 are hashed into buckets; value1 and value4 land in the same bucket, value3 in another.]

Use hash function to find bucket.
Search/insert/delete from bucket.
Bucket items possibly sorted.

Motivation: Constant-time hash instead of multi-level/tree-based.
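A toy in-memory version of this idea, as a sketch only: the class name `HashIndex`, the fixed bucket count, and the use of Python's built-in `hash` are all assumptions, and a real DBMS stores buckets in disk blocks and must handle overflow.

```python
class HashIndex:
    """Toy hash index: each bucket is a list of (key, record pointer) pairs."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash function picks the bucket in constant time.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, pointer):
        self._bucket(key).append((key, pointer))

    def search(self, key):
        # Only one bucket is examined, however many keys are indexed.
        return [ptr for k, ptr in self._bucket(key) if k == key]

    def delete(self, key, pointer):
        bucket = self._bucket(key)
        if (key, pointer) in bucket:
            bucket.remove((key, pointer))
            return True
        return False


index = HashIndex(num_buckets=4)
index.insert("value1", 0)
index.insert("value3", 1)
index.insert("value4", 2)
index.insert("value1", 3)  # non-unique search keys share a bucket
```

Every operation touches exactly one bucket, which is the constant-time behaviour the slide contrasts with walking several index levels.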

Buckets can overflow

• Overflow some buckets – skewed usage

• Overflow many buckets – not enough space reserved

Solutions:

• Chain additional buckets – degrades to linear search

• Stop and reorganize

• Dynamic hashing (extendible or linear hashing) – techniques that allow the number of buckets to grow without rehashing all existing data

Advantages & disadvantages of hash indexing

+ One hash function vs. multiple levels of indexing or B+-tree

• Storage about the same.
  + Don’t need multiple levels.
  − But, need buckets about half empty for good performance.

− Can’t easily access a data range.

− Simple hashing degrades poorly when data is skewed relative to the hash function.

Multiple indices & Multiple keys

Using indices for multiple attributes

What if queries like the following are common?

SELECT …
FROM Employee
WHERE job_code = 2 AND performance_rating = 5;

Strategy 1 – Index one attribute

CREATE INDEX idx_job_code on Employee (job_code);

SELECT …
FROM Employee
WHERE job_code = 2 AND performance_rating = 5;

Internal strategy:
1. Use index to find Employees with job_code = 2.
2. Linear search of those to check performance_rating = 5.

Strategy 2 – Index both attributes

CREATE INDEX idx_job_code on Employee (job_code);
CREATE INDEX idx_perf_rating on Employee (performance_rating);

SELECT …
FROM Employee
WHERE job_code = 2 AND performance_rating = 5;

Internal strategy – chooses between
• Use job_code index, then linear search on performance_rating.
• Use performance_rating index, then linear search on job_code.
• Use both indices, then intersect resulting sets of pointers.
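The third option, intersecting pointer sets, might look like the following Python sketch. The table contents and the names `EMPLOYEES`, `build_index`, and `find` are made up for illustration, and record pointers are modeled as plain record numbers.

```python
from collections import defaultdict

# Hypothetical Employee rows: (record number, job_code, performance_rating).
EMPLOYEES = [
    (0, 2, 5),
    (1, 2, 3),
    (2, 1, 5),
    (3, 2, 5),
    (4, 3, 4),
]


def build_index(rows, column):
    """Map each value in the given column to the set of record numbers."""
    index = defaultdict(set)
    for row in rows:
        index[row[column]].add(row[0])
    return index


job_code_index = build_index(EMPLOYEES, 1)
perf_rating_index = build_index(EMPLOYEES, 2)


def find(job_code, rating):
    """Records matching job_code = ... AND performance_rating = ..."""
    by_job = job_code_index.get(job_code, set())
    by_rating = perf_rating_index.get(rating, set())
    # Intersect the two pointer sets before touching any data record.
    return sorted(by_job & by_rating)
```

With this sample data, `find(2, 5)` consults only the two indexes and returns record numbers 0 and 3.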

Strategy 3 – Index attribute set

CREATE INDEX idx_job_perf on Employee (job_code, performance_rating);

SELECT …
FROM Employee
WHERE job_code = 2 AND performance_rating = 5;

Attribute sets ordered lexicographically:

(jc1, pr1) < (jc2, pr2) iff either
• jc1 < jc2
• jc1 = jc2 and pr1 < pr2

Note that this prioritizes job_code over performance_rating!

This strategy typically uses ordered index, not hashing.

Strategy 3 – Index attribute set

CREATE INDEX idx_job_perf on Employee (job_code, performance_rating);

Efficient:

SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5;

SELECT … FROM Employee WHERE job_code = 2 AND performance_rating < 5;

Inefficient:

SELECT … FROM Employee WHERE job_code < 2 AND performance_rating = 5;

SELECT … FROM Employee WHERE job_code = 2 OR performance_rating = 5;
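One way to see the difference is to model the index as a sorted list of (job_code, performance_rating) pairs, as in the Python sketch below. The sample entries and the names `ENTRIES`, `index_range`, and `positions_with_rating` are assumptions, and ratings are assumed to be integers. Python tuples compare lexicographically, which matches the ordering of a multi-attribute index.

```python
from bisect import bisect_left

# Hypothetical index entries (job_code, performance_rating), kept sorted.
ENTRIES = sorted([(2, 5), (1, 5), (3, 2), (2, 1), (2, 5), (3, 5), (2, 3)])


def index_range(low, high):
    """Return (start, stop) positions of entries e with low <= e < high."""
    return bisect_left(ENTRIES, low), bisect_left(ENTRIES, high)


# job_code = 2 AND performance_rating = 5: one contiguous slice.
equal_range = index_range((2, 5), (2, 6))

# job_code = 2 AND performance_rating < 5: also one contiguous slice.
below_range = index_range((2, float("-inf")), (2, 5))


def positions_with_rating(rating):
    """Positions of all entries with the given rating (needs a full scan)."""
    return [i for i, (_, pr) in enumerate(ENTRIES) if pr == rating]


# A condition on performance_rating alone matches scattered positions.
scattered = positions_with_rating(5)
```

The two efficient queries each map to a single contiguous slice found by binary search. Entries with performance_rating = 5 are spread across the whole list, which is why conditions that do not pin down job_code first cannot use the ordering as well.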

Strategy 4 – Grid indexing

• 𝑛 attributes viewed as being in 𝑛-dimensional grid

• Numerous implementations
  • Grid file, K-D tree, R-tree, …

• Mainly used for spatial data

CREATE SPATIAL INDEX … ON …;

Strategy 5 – Bitmap indices

• Coming next…

Bitmap indices

Basic idea of bitmap indices

• Assume records numbered 0, 1, …, and easy to find record #𝑖.

CREATE BITMAP INDEX … ON …;

Record #   id      gender   income_level
0          51351   M        1
1          73864   F        2
2          13428   F        1
3          53718   M        4
4          83923   F        3

Bitmaps (each column is one value's bitmap, read top to bottom):

Record #   M   F   1   2   3   4   5
0          1   0   1   0   0   0   0
1          0   1   0   1   0   0   0
2          0   1   1   0   0   0   0
3          1   0   0   0   0   1   0
4          0   1   0   0   1   0   0

F’s bitmap: 01101

Queries use standard bitmap operations

SELECT … FROM … WHERE gender = 'F' AND income_level = 1;

F’s bitmap   01101
1’s bitmap   10100
&          = 00100

SELECT … FROM … WHERE gender = 'F' OR income_level = 1;

F’s bitmap   01101
1’s bitmap   10100
|          = 11101

SELECT … FROM … WHERE income_level <> 1;

1’s bitmap   10100
~          = 01011
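These three results can be reproduced with ordinary integer bit operations. The Python sketch below rebuilds the two bitmaps from the five sample records on the previous slide; the helper names are illustrative, and the leftmost bit is record 0, as in the slides.

```python
# Column values for records 0..4, taken from the sample table.
GENDER = ["M", "F", "F", "M", "F"]
INCOME_LEVEL = [1, 2, 1, 4, 3]
NUM_RECORDS = len(GENDER)


def build_bitmap(column, value):
    """Bitmap as a string with record 0 leftmost; 1 marks a matching record."""
    return "".join("1" if v == value else "0" for v in column)


def to_int(bits):
    return int(bits, 2)


def to_bits(number):
    return format(number, f"0{NUM_RECORDS}b")


F_BITMAP = build_bitmap(GENDER, "F")
ONE_BITMAP = build_bitmap(INCOME_LEVEL, 1)

# NOT must be masked so only the NUM_RECORDS real bits remain.
MASK = (1 << NUM_RECORDS) - 1

f_and_one = to_bits(to_int(F_BITMAP) & to_int(ONE_BITMAP))
f_or_one = to_bits(to_int(F_BITMAP) | to_int(ONE_BITMAP))
not_one = to_bits(~to_int(ONE_BITMAP) & MASK)


def matching_records(bits):
    """Record numbers whose bit is set."""
    return [i for i, bit in enumerate(bits) if bit == "1"]
```

The set bits of each result are the record numbers to fetch; for the AND query that is record 2 only.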

Space overhead

• Normal: #records × (#values + 1) bits, if nullable
  • Typically used when # of values for attribute is low.
• Encoded: #records × log₂(#values) bits
  • But need to use more bitmaps
• Compressed: ~50% of normal
  • Can do bitmap operations on compressed form.
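As a quick check of the first two formulas, the Python sketch below evaluates them for the income_level attribute of the sample table (5 records, 5 possible values). The function names are hypothetical, and rounding the logarithm up to a whole number of bits is an assumption not stated on the slide.

```python
def normal_bits(num_records, num_values, nullable=True):
    """One bitmap per value, plus one extra bitmap for NULL if nullable."""
    return num_records * (num_values + (1 if nullable else 0))


def encoded_bits(num_records, num_values):
    """Binary-encoded value per record: ceil(log2(#values)) bits each."""
    bits_per_record = (num_values - 1).bit_length()  # ceil(log2(num_values))
    return num_records * bits_per_record


# income_level in the sample table: 5 records, values 1..5.
normal = normal_bits(5, 5)
encoded = encoded_bits(5, 5)
```

For this attribute the normal form takes 30 bits and the encoded form 15, and the gap widens as the number of distinct values grows.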

A couple details

• When deleting records, either …
  • Delete bit from every bitmap for that table, or
  • Have an existence bitmap indicating whether that row exists.

• Can use idea in B+-tree leaves.

Summary of index types

Single-level   Index generally too large to fit in memory
Multi-level    Degrades due to fragmentation
B+-tree        General-purpose choice
Spatial        Good for spatial coordinates
Hash           Good for equality tests; potentially degrades due to skew & overflow
Bitmap         Good for multiple attributes each with few values