CS4432: Database Systems II. Basic indexing.

Dec 19, 2015


CS4432: Database Systems II

Basic indexing

Indexing: helps retrieve data more quickly for certain queries (Chapter 13)

SELECT * FROM Emp WHERE salary = 1000000;

[Figure: an index maps a search value (here 1,000,000) to the matching record.]

Topics

• Sequential Index Files (chap 13.1)
• Secondary Indexes (chap 13.2)

Sequential File

[Figure: sequential data file, two records per block: (10, 20), (30, 40), (50, 60), (70, 80), (90, 100).]

Dense Index

Every record is in index.

[Figure: sequential file with blocks (10, 20), (30, 40), (50, 60), (70, 80), (90, 100); dense index blocks (10, 20, 30, 40), (50, 60, 70, 80), (90, 100, 110, 120), one entry per record.]

Sparse Index

Only first record per block in index.

[Figure: sequential file with blocks (10, 20), (30, 40), (50, 60), (70, 80), (90, 100); sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150), (170, 190, 210, 230), one entry per data block.]
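A minimal sketch of how lookups differ between the two index types, assuming the five two-record blocks from the figure. The names `dense`, `sparse`, `dense_lookup` and `sparse_lookup` are illustrative and do not come from the slides; a "pointer" is modelled as a (block number, slot) pair or a bare block number.

```python
from bisect import bisect_right

# Data file from the figure: two records per block, sorted by key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]

# Dense index: one (key, record pointer) entry per record.
dense = [(key, (b, s)) for b, blk in enumerate(blocks) for s, key in enumerate(blk)]

# Sparse index: one (first key, block pointer) entry per block.
sparse = [(blk[0], b) for b, blk in enumerate(blocks)]


def dense_lookup(key):
    """Answer from the index alone; no data block is read on a miss."""
    keys = [k for k, _ in dense]
    i = bisect_right(keys, key) - 1
    if i >= 0 and keys[i] == key:
        return dense[i][1]
    return None


def sparse_lookup(key):
    """Find the last entry with first key <= key, then search that block."""
    firsts = [k for k, _ in sparse]
    i = bisect_right(firsts, key) - 1
    if i < 0:
        return None  # key is smaller than every key in the file
    b = sparse[i][1]
    if key in blocks[b]:  # the data block must be read to decide
        return (b, blocks[b].index(key))
    return None
```

Both functions return the same record pointer for a key that exists; only the sparse version has to touch a data block to report that a key is absent.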

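Sparse 2nd level

[Figure: sequential file with blocks (10, 20), (30, 40), (50, 60), (70, 80), (90, 100); first-level sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150), (170, 190, 210, 230); second-level sparse index blocks (10, 90, 170, 250), (330, 410, 490, 570), one entry per first-level index block.]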

Note: DATA FILE or INDEX are “ordered files”.

Question: How would we lay them out on disk?

- contiguous layout on disk?
- block-chained layout on disk?

Questions:

• Do we want to build a dense 2nd-level index for a dense index?
• Can we even do this?

[Figure: the sequential file with a candidate 1st-level index and a candidate 2nd-level index, each marked with a question mark.]

Notes on pointers:

(1) Block pointer (used in sparse index) can be smaller than record pointer (used in dense index)

[Figure: block pointer (BP) compared with record pointer (RP).]

[Figure: keys K1, K2, K3, K4 stored in consecutive blocks, for records R1, R2, R3, R4.]

Say: 1024 B per block

• If we want the K3 block: get it at offset (3-1)*1024 = 2048 bytes

Note: If file is contiguous, then we can omit pointers
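The offset arithmetic can be written as a one-line function. This is a small sketch, assuming 1-based block numbers and the 1024-byte block size used in the example; the function name `block_offset` is made up for illustration.

```python
BLOCK_SIZE = 1024  # bytes per block, as in the example


def block_offset(k: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of the k-th block (1-based) in a contiguous file."""
    if k < 1:
        raise ValueError("block numbers start at 1")
    return (k - 1) * block_size
```

Because the address is computed from the block number, the index does not need to store a pointer for each entry.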

Sparse vs. Dense Tradeoff

• Sparse: Less index space per record, so more of the index can be kept in memory
  (Later: sparse better for insertions)

• Dense: Can tell whether a record exists without accessing the file
  (Later: dense needed for secondary indexes)

Terms

• Index sequential file
• Search key (≠ primary key)
• Primary index (on sequencing field)
• Secondary index
• Dense index (contains all search key values)
• Sparse index
• Multi-level index


Next:

• Duplicate keys

• Deletion/Insertion

• Secondary indexes

Duplicate keys

[Figure: sequential file with duplicate keys, two records per block: (10, 10), (10, 20), (20, 30), (30, 30), (40, 45).]

Duplicate keys

Dense index! Point to each value!

[Figure: data blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45); dense index blocks (10, 10, 10, 20), (20, 30, 30, 30), with one entry per record, duplicates included.]

Duplicate keys

Dense index. Point to each distinct value!

[Figure: data blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45); index block (10, 20, 30, 40), with one entry per distinct key.]

Duplicate keys

Sparse index: point to start of block!

[Figure: data blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45); sparse index block (10, 10, 20, 30), holding the first key of each block.]

Careful if looking for 20 or 30!

Duplicate keys

Sparse index, another way?

– place first new key from block

[Figure: data blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45); sparse index block (10, 20, 30, 30). The last entry is annotated: should this be 40?]

Summary: Duplicate values, primary index

• Index may point to first instance of each value only

[Figure: File and Index side by side. The file holds records a, a, a, b; the index entries for a and b point to the first record with each value.]
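A sketch of a lookup when the index points only to the first instance of each value, assuming the duplicate-key file from the earlier figures flattened into one sorted list. The names `records`, `index` and `find_all` are illustrative only.

```python
from bisect import bisect_left

# Sorted file with duplicates, flattened to record positions 0..9.
records = [10, 10, 10, 20, 20, 30, 30, 30, 40, 45]

# Index: one (key, position of first instance) entry per distinct key.
index = []
for pos, key in enumerate(records):
    if not index or index[-1][0] != key:
        index.append((key, pos))


def find_all(key):
    """Jump to the first instance via the index, then scan forward."""
    keys = [k for k, _ in index]
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return []
    pos = index[i][1]
    hits = []
    while pos < len(records) and records[pos] == key:
        hits.append(pos)
        pos += 1
    return hits
```

The forward scan works only because the file is sorted on the search key, so all duplicates sit next to each other.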


Next:

• Duplicate keys

• Deletion/Insertion

• Secondary indexes

Deletion from sparse index

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150).]

Deletion from sparse index

– delete record 40

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150). Record 40 is removed from its block; the index does not change.]

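Deletion from sparse index

– delete record 30

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150). Record 30 is removed, record 40 becomes the first record of its block, and the index entry 30 is changed to 40.]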

Deletion from sparse index

– delete records 30 & 40

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); sparse index blocks (10, 30, 50, 70), (90, 110, 130, 150). The second block becomes empty, its index entry is removed, and the entries 50 and 70 move up.]
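The three deletion cases can be sketched in a few lines. This is an illustration under simplifying assumptions: blocks are Python lists, the index is a list of `[first_key, block_no]` pairs, and the target block is found by a linear scan rather than a binary search. The names `fresh` and `delete` are not from the slides.

```python
def fresh():
    """Rebuild the example file and its sparse index."""
    blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]
    index = [[blk[0], b] for b, blk in enumerate(blocks)]
    return blocks, index


def delete(blocks, index, key):
    """Delete key from the data file and keep the sparse index consistent."""
    for entry in index:
        b = entry[1]
        if key in blocks[b]:
            blocks[b].remove(key)
            if not blocks[b]:
                index.remove(entry)      # block is empty: drop its entry
            else:
                entry[0] = blocks[b][0]  # first key of the block may have changed
            return True
    return False
```

Deleting 40 leaves the index untouched, deleting 30 rewrites one entry to 40, and deleting both removes the entry so that 50 and 70 move up.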

Deletion from dense index

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); dense index blocks (10, 20, 30, 40), (50, 60, 70, 80).]

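Deletion from dense index

– delete record 30

[Figure: data blocks (10, 20), (30, 40), (50, 60), (70, 80); dense index blocks (10, 20, 30, 40), (50, 60, 70, 80). Record 30 is removed and record 40 moves up within its block; the index entry 30 is removed and the entry 40 moves up.]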

Insertion, sparse index case

[Figure: data blocks (10, 20), (30, free), (40, 50), (60, free); sparse index block (10, 30, 40, 60).]

Insertion, sparse index case

– insert record 34

[Figure: data blocks (10, 20), (30, free), (40, 50), (60, free); sparse index block (10, 30, 40, 60). Record 34 goes into the free slot after 30.]

• Our lucky day! We have free space where we need it!

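Insertion, sparse index case

– insert record 15

[Figure: data blocks (10, 20), (30, free), (40, 50), (60, free); sparse index block (10, 30, 40, 60). Record 15 takes the place of 20 in the first block, 20 moves to the second block to give (20, 30), and the index entry 30 is changed to 20.]

• Immediate reorganization
• Other variations?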

• Just illustrated:
  – Immediate reorganization

• Now variation:
  – Insert new block (chained file)

Insertion, sparse index case

– insert record 25

[Figure: data blocks (10, 20), (30, free), (40, 50), (60, free); sparse index block (10, 30, 40, 60). Record 25 is placed in an overflow block chained to the first block.]

Overflow blocks (reorganize later...)
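A rough sketch of the overflow variation, assuming a capacity of two records per block and an overflow area modelled as a dictionary keyed by block number. The names `CAPACITY`, `overflow` and `insert` are illustrative assumptions; immediate reorganization is deliberately not implemented here.

```python
from bisect import bisect_right, insort

CAPACITY = 2  # records per block, as in the figures


def insert(blocks, overflow, index, key):
    """Insert key in place if its block has room, else into an overflow block."""
    firsts = [k for k, _ in index]
    i = max(bisect_right(firsts, key) - 1, 0)  # block whose key range covers key
    b = index[i][1]
    if len(blocks[b]) < CAPACITY:
        insort(blocks[b], key)
        index[i] = (blocks[b][0], b)  # refresh in case key became the first key
        return "in place"
    overflow.setdefault(b, []).append(key)  # chained overflow, reorganize later
    return "overflow"


blocks = [[10, 20], [30], [40, 50], [60]]
index = [(blk[0], b) for b, blk in enumerate(blocks)]
overflow = {}
```

With this policy the index stays unchanged on overflow, but a later lookup for such a key has to follow the chain from its home block.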


Insertion, dense index case

• Similar

• Often more expensive . . .


Next:

• Duplicate keys

• Deletion/Insertion

• Secondary indexes

Secondary indexes

[Figure: data file ordered on a different sequence field, so the indexed keys are unsorted: blocks (30, 50), (20, 70), (80, 40), (100, 10), (90, 60).]

Can I make a secondary index sparse?

Secondary indexes

• Sparse index

[Figure: data blocks (30, 50), (20, 70), (80, 40), (100, 10), (90, 60), ordered on the sequence field; a sparse index holding the first key of each block: 30, 20, 80, 100, 90, ...]

Does not make sense!

Secondary indexes

• Must be dense index!

[Figure: data blocks (30, 50), (20, 70), (80, 40), (100, 10), (90, 60), ordered on the sequence field; dense index blocks (10, 20, 30, 40), (50, 60, 70, ...); sparse high-level index (10, 50, 90, ...).]

Sparse high level allowed?

With secondary indexes:

• Lowest level is dense
• Other levels are sparse

Also: Pointers are record pointers
(not block pointers; not computed)
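The two rules above can be sketched as follows, assuming the unsorted data blocks from the secondary-index figures and four entries per index block. `FANOUT`, `dense`, `second` and `lookup` are hypothetical names, and a record pointer is modelled as a (block number, slot) pair.

```python
from bisect import bisect_right

# Data file ordered on some other field, so these keys are unsorted.
blocks = [[30, 50], [20, 70], [80, 40], [100, 10], [90, 60]]

# Lowest level: dense, sorted by key, each entry holding a record pointer.
dense = sorted((key, (b, s)) for b, blk in enumerate(blocks) for s, key in enumerate(blk))

# Higher level: sparse, one entry per block of FANOUT dense entries.
FANOUT = 4
second = [(dense[i][0], i) for i in range(0, len(dense), FANOUT)]


def lookup(key):
    """Use the sparse level to pick a dense index block, then scan that block."""
    firsts = [k for k, _ in second]
    j = bisect_right(firsts, key) - 1
    if j < 0:
        return None
    start = second[j][1]
    for k, record_pointer in dense[start:start + FANOUT]:
        if k == key:
            return record_pointer
    return None
```

The sparse level works because the dense index itself is a sorted file, even though the data file is not sorted on this key.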

Duplicate values & secondary indexes

[Figure: data file with duplicate secondary-key values, two records per block: (20, 10), (20, 40), (10, 40), (10, 40), (30, 40).]

Duplicate values & secondary indexes

One option...

[Figure: data blocks (20, 10), (20, 40), (10, 40), (10, 40), (30, 40); dense index blocks (10, 10, 10, 20), (20, 30, 40, 40), (40, 40, ...), with one entry per record.]

Problem: excess overhead!
• disk space
• search time

Duplicate values & secondary indexes

Another option...

[Figure: data blocks (20, 10), (20, 40), (10, 40), (10, 40), (30, 40); each index entry holds one key (10, 20, 30, 40) followed by pointers to all records with that key.]

Problem: variable size records in index!

Duplicate values & secondary indexes

Another idea: Chain records with same key!

[Figure: data blocks (20, 10), (20, 40), (10, 40), (10, 40), (30, 40); index blocks (10, 20, 30, 40), (50, 60, ...). Each index entry points to one record, and records with the same key are chained together.]

Problems:
• Need to add fields to data records for each index
• Need to follow chain to know records

Summary: Conventional Indexes

– Basic Ideas: sparse, dense, multi-level…
– Duplicate Keys
– Deletion/Insertion
– Secondary indexes

Multi-level Index Structures

[Figure: data blocks (30, 50), (20, 70), (80, 40), (100, 10), (90, 60), ordered on the sequence field; first level (dense, if non-sequential): index blocks (10, 20, 30, 40), (50, 60, 70, ...); high level (always sparse): (10, 50, 90, ...).]

Sequential indexes: pros/cons?

Advantage:
- Simple
- Index is sequential file, good for scans
- Search efficient for static data

Disadvantage:
- Inserts expensive, and/or
- Lose sequentiality & balance
- Then search time unpredictable

Example Sequential Index

[Figure: a continuous sequential area with blocks (10, 20, 30), (40, 50, 60), (70, 80, 90) and free space; an overflow area (not sequential) holding 39, 31, 35, 36, 32, 38, 34, 33.]

Another type of index

• Give up “sequentiality” of index
• Predictable performance under updates
• Always keep the “tree” balanced
• Automate restructuring under updates

B+Tree Example, n=3

[Figure: Root with key 100. Left non-leaf node: key 30. Right non-leaf node: keys 120, 150, 180. Leaves, left to right: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]

Sample non-leaf

[Figure: non-leaf node with keys 57, 81, 95 and four pointers:]
• to keys k < 57
• to keys 57 ≤ k < 81
• to keys 81 ≤ k < 95
• to keys k ≥ 95

Sample leaf node:

[Figure: leaf node with keys 57, 81, 95, reached from a non-leaf node. It holds a pointer to the record with key 57, a pointer to the record with key 81, a pointer to the record with key 95, and a pointer to the next leaf in sequence.]

In textbook’s notation, n=3

[Figure: a leaf node with keys 30, 35 and a non-leaf node with key 30, each drawn in the textbook’s notation.]

Size of nodes: n+1 pointers, n keys (fixed)

Don’t want nodes to be too empty

• Use at least
  Non-leaf: ⌈(n+1)/2⌉ pointers
  Leaf: ⌊(n+1)/2⌋ pointers to data

Full node vs. min. node, n=3

[Figure: Non-leaf: full node with keys 120, 150, 180; minimum node with key 30. Leaf: full node with keys 3, 5, 11; minimum node with keys 30, 35. One pointer is annotated: counts even if null.]

Non-leaf: ⌈(n+1)/2⌉ pointers
Leaf: ⌊(n+1)/2⌋ pointers to data
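The minimum-occupancy rule is easy to check numerically. The sketch below assumes the usual textbook reading of the bounds, a ceiling for non-leaf nodes and a floor for leaves; `min_pointers` is an illustrative helper, not part of the slides.

```python
import math


def min_pointers(n: int) -> dict:
    """Minimum pointers in use for a B+tree whose nodes hold n keys."""
    return {
        "non_leaf": math.ceil((n + 1) / 2),  # pointers to children
        "leaf": (n + 1) // 2,                # pointers to data records
    }
```

For n=3 both bounds are 2, which matches the minimum nodes in the figure: a non-leaf node with one key has two child pointers, and a leaf with two keys has two data pointers.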

B+tree rules (tree of order n)

(1) All leaves at same lowest level (balanced tree)

(2) Pointers in leaves point to records, except for “sequence pointer”

B+Tree Example: Searches

[Figure: Root with key 100. Left non-leaf node: key 30. Right non-leaf node: keys 120, 150, 180. Leaves, left to right: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]
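A search over the example tree can be sketched as below. The `Node` class and `search` function are illustrative assumptions; the tree is hard-coded from the figure, and child i of a non-leaf node is taken to cover the keys between separator i-1 (inclusive) and separator i (exclusive).

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None marks a leaf


leaves = [Node([3, 5, 11]), Node([30, 35]), Node([100, 101, 110]),
          Node([120, 130]), Node([150, 156, 179]), Node([180, 200])]
root = Node([100], [Node([30], leaves[:2]),
                    Node([120, 150, 180], leaves[2:])])


def search(node, key):
    """Descend from the root; return (found, list of key lists visited)."""
    path = []
    while node.children is not None:
        path.append(node.keys)
        # keys >= separator go right, keys < separator go left
        node = node.children[bisect_right(node.keys, key)]
    path.append(node.keys)
    return key in node.keys, path
```

Every search visits exactly one node per level, so the cost is the height of the tree whether or not the key exists.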

Insert into B+tree

(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root

(a) Insert key = 32, n=3

[Figure: parent with keys 30, 100; leaves (3, 5, 11) and (30, 31). Key 32 goes into the free slot of the second leaf, giving (30, 31, 32).]

(b) Insert key = 7, n=3

[Figure: parent with keys 30, 100; leaves (3, 5, 11) and (30, 31). The first leaf overflows and splits into (3, 5) and (7, 11); key 7 is added to the parent.]

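(c) Insert key = 160, n=3

[Figure: root with key 100; non-leaf node (120, 150, 180); leaves (150, 156, 179) and (180, 200). The leaf (150, 156, 179) overflows and splits into (150, 156) and (160, 179). The non-leaf node then overflows and splits into (120, 150) and (180), and key 160 moves up to the parent.]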

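(d) New root, insert 45, n=3

[Figure: root (10, 20, 30); leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40). Inserting 45 splits the last leaf into (30, 32) and (40, 45). The old root overflows and splits into (10, 20) and (40), and a new root with key 30 is created.]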

Recap: Insert Data into B+ Tree

• Find correct leaf L.
• Put data entry onto L.
  – If L has enough space, done!
  – Else, must split L (into L and a new node L2)
    • Redistribute entries evenly, copy up middle key.
    • Insert index entry pointing to L2 into parent of L.
• This can happen recursively
  – To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
• Splits “grow” tree; root split increases height.
  – Tree growth: gets wider or one level taller at top.
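The difference between "copy up" and "push up" can be shown on bare key lists. This is a sketch only: child and record pointers are omitted, and the choice of which half receives the extra key is a convention picked here to reproduce the n=3 examples.

```python
def split_leaf(keys):
    """Split an overflowing leaf; the middle key is COPIED up and stays in the leaf."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right[0], right


def split_non_leaf(keys):
    """Split an overflowing index node; the middle key is PUSHED up and leaves the node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]
```

A leaf must keep every key because leaves hold the pointers to the records, whereas a separator in an index node only guides the search and is needed at one level only.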

Deletion from B+tree

(a) Simple case
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

(a) Delete key = 11, n=3

[Figure: parent with keys 30, 100; leaves (3, 5, 11) and (30, 31). Key 11 is removed from the first leaf, which remains at least half full.]

(b) Coalesce with sibling, n=4

– Delete 50

[Figure: parent with keys 10, 40, 100; leaves (10, 20, 30) and (40, 50). After 50 is deleted, key 40 moves into the left sibling.]

(c) Redistribute keys, n=4

– Delete 50

[Figure: parent with keys 10, 40, 100; leaves (10, 20, 30, 35) and (40, 50). After 50 is deleted, key 35 moves from the left sibling into the right leaf, and the parent key 40 is changed to 35.]

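(d) Coalesce and non-leaf coalesce, n=4

– Delete 37

[Figure: root with key 25 over non-leaf nodes (10, 20) and (30, 40); leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). After 37 is deleted, the leaf coalesces with its sibling, the two non-leaf nodes then coalesce, and the merged node becomes the new root.]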

B+tree deletions in practice

– Often, coalescing is not implemented
– Too hard and not worth it!

Delete Data from B+ Tree

• Start at root, find leaf L where entry belongs.
• Remove the entry.
  – If L is at least half-full, done!
  – If L has only d-1 entries (where d is the minimum number of entries per node),
    • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
    • If re-distribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.

Comparison: B-trees vs. static indexed sequential file

• Concurrency control harder in B-trees
• B-tree consumes more space
• DBA does not know when to reorganize
• DBA does not know how full to load pages of new index
• Buffering
  – B-tree: has fixed buffer requirements
  – Static index: must read several overflow blocks to be efficient (large & variable size buffers needed)

• Speaking of buffering… Is LRU a good policy for B+tree buffers?

Of course not!

Should try to keep root in memory at all times
(and perhaps some nodes from second level)

Comparison: B-tree vs. indexed seq. file

Indexed seq. file:
• Less space, so lookup faster
• Inserts managed by overflow area
• Requires temporary restructuring
• Unpredictable performance

B-tree:
• Consumes more space, so lookup slower
• Each insert/delete potentially restructures
• Built-in restructuring
• Predictable performance


Interesting problem:

For B+tree, how large should n be?

n is number of keys / node

Assumptions: n children per node and N records in database

(1) Time to read B-tree node from disk is (t_seek + t_read*n) msec.
(2) Once in main memory, use binary search to locate key: (a + b*log_2(n)) msec.
(3) Need to search (read) log_n(N) tree nodes.
(4) t_search = (t_seek + t_read*n + a + b*log_2(n)) * log_n(N)

Can get: f(n) = time to find a record

[Figure: plot of f(n) against n, with the minimum at n = nopt.]

Find nopt by f’(n) = 0

What happens to nopt as:
• Disk gets faster?
• CPU gets faster?
• …
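Instead of solving f’(n) = 0 by hand, nopt can be found numerically. The sketch below evaluates the search-time model from the assumptions slide for every integer n in a range; the function names, the range limit `n_max`, and any parameter values passed in are hypothetical, not measurements.

```python
import math


def t_search(n, N, t_seek, t_read, a, b):
    """Search time: (cost to read and search one node) * (number of levels)."""
    per_node = t_seek + t_read * n + a + b * math.log2(n)
    levels = math.log(N, n)
    return per_node * levels


def n_opt(N, t_seek, t_read, a, b, n_max=2000):
    """Integer n in [2, n_max] that minimizes t_search (brute force)."""
    return min(range(2, n_max + 1),
               key=lambda n: t_search(n, N, t_seek, t_read, a, b))
```

Under this model, working through f’(n) = 0 gives n(ln n - 1) = (t_seek + a) / t_read, so a smaller seek time pushes nopt down, while b and N drop out of the condition. Calling `n_opt` with two different `t_seek` values is a quick way to see the effect.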

Bulk Loading of B+ Tree

• For large collection of records, create B+ tree.
• Method 1: Repeatedly insert records: slow.
• Method 2: Bulk Loading: more efficient.

Bulk Loading of B+ Tree

• Initialization:
  – Sort all data entries
  – Insert pointer to first (leaf) page in new (root) page.

[Figure: an empty Root above the sorted pages of data entries, not yet in the B+ tree: 3* 4* | 6* 9* | 10* 11* | 12* 13* | 20* 22* | 23* 31* | 35* 36* | 38* 41* | 44*]

Bulk Loading (Contd.)

• Index entries for leaf pages always entered into right-most index page
• When this fills up, it splits. Split may go up right-most path to root.

[Figure, before the split: Root with keys 10, 20 over index pages with keys 6 | 12 | 23, 35; leaf pages 3* 4* | 6* 9* | 10* 11* | 12* 13* | 20* 22* | 23* 31* | 35* 36* | 38* 41* | 44*, with the remaining data entry pages not yet in the B+ tree.]

[Figure, after the split: Root with key 20 over index pages with keys 10 | 35, which sit over index pages with keys 6 | 12 | 23 | 38; the remaining data entry pages are not yet in the B+ tree.]

Summary of Bulk Loading

• Method 1: multiple inserts.
  – Slow.
  – Does not give sequential storage of leaves.
• Method 2: Bulk Loading
  – Has advantages for concurrency control.
  – Fewer I/Os during build.
  – Leaves will be stored sequentially (and linked)
  – Can control “fill factor” on pages.
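A simplified sketch of bulk loading, assuming the data entries are already sorted. It packs leaves left to right and then builds each index level from the one below. This is a level-by-level variant, not the right-most-page insertion with splits described on the slides, and it may leave the last node of a level under-full. The names `bulk_load`, `leaf_cap` and `fanout` are illustrative.

```python
def bulk_load(sorted_keys, leaf_cap=2, fanout=3):
    """Build a tree bottom-up; returns (root, height).

    Leaves are plain lists of keys; index nodes are dicts with
    'keys' (separators) and 'children'.
    """
    if not sorted_keys:
        raise ValueError("nothing to load")
    # Pack leaves sequentially; leaf_cap plays the role of a fill factor.
    leaves = [sorted_keys[i:i + leaf_cap]
              for i in range(0, len(sorted_keys), leaf_cap)]
    level = [(leaf[0], leaf) for leaf in leaves]  # (smallest key, node)
    height = 1
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            node = {"keys": [k for k, _ in group[1:]],
                    "children": [child for _, child in group]}
            parents.append((group[0][0], node))
        level = parents
        height += 1
    return level[0][1], height
```

Because the leaves are written once, in key order, they end up stored sequentially, and choosing a `leaf_cap` below the page capacity leaves room for later inserts.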