CS411 Database Systems Kazuhiro Minami 10: Indexing 2 11: Query Execution

Jan 22, 2016

Page 1: CS411 Database Systems

CS411 Database Systems

Kazuhiro Minami

10: Indexing 2
11: Query Execution

Page 2: CS411 Database Systems

Revisiting Sequential Indexes on a Sequential Data File

[Figure: an index file and a sequential data file (keys 10 through 160), with a main memory buffer]

Q: How many disk I/O’s do we need to get a record with key value ‘150’?

Q: If we want to avoid a binary search on index blocks, what can we do?

Page 3: CS411 Database Systems

Direct Addressing Approach

[Figure: an index file with one entry for every possible key value (10, 15, 20, 25, ..., 85), two entries per index block, pointing into the data file (keys 10 through 160). Entries for key values that have no record are NULL. Many more index blocks!]

• Suppose that a key value is a multiple of 5

• We add an entry for every possible key value in the index

• If we look up a record with key ‘50’, then we can figure out that we should look up the 5th index block

• Q: How many disk I/O’s do we need in this scheme?

• Q: Is there any problem?
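The block-number calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the parameters `min_key`, `step` and `entries_per_block` are assumptions read off the figure (keys start at 10, are multiples of 5, and each index block holds two entries).

```python
def index_block_for_key(key, min_key=10, step=5, entries_per_block=2):
    """Return the 1-based number of the index block holding the entry for `key`.

    Assumes a direct-addressing index with one entry per possible key value,
    laid out in key order (parameters are assumptions taken from the figure).
    """
    if key < min_key or (key - min_key) % step != 0:
        raise ValueError(f"{key} is not a possible key value")
    entry_position = (key - min_key) // step        # 0-based entry number
    return entry_position // entries_per_block + 1  # 1-based block number


print(index_block_for_key(50))  # 5
```

No search over index blocks is needed, but the index carries an entry (possibly NULL) for every possible key value.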

Page 4: CS411 Database Systems

Hashing-based Approach

[Figure: an index file with nine index blocks, numbered 0 to 8, pointing into the data file (keys 10 through 160). Index block contents: 0: 90; 1: 10, 100; 2: 20, 110; 3: 30; 4: 40, 85; 5: 50, 140; 6: 150; 7: 70, 160; 8: 80]

• Consider a hash function h(v) = v mod 9

• The pointer for value v goes to the h(v)th index block

• Note that we store pointers only to existing records

• Q: How many index blocks do we need?

• Q: How many disk I/O’s do we need to find a record with value ‘50’?

• Q: Any other observations?
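As a rough sketch of the hashing-based index, the Python below assigns each key of the data file to index block h(v) = v mod 9. The key list is read off the figure; the names `build_hash_index` and `DATA_FILE_KEYS` are made up for this illustration.

```python
from collections import defaultdict


def h(v, n_buckets=9):
    """Hash function from the slide: h(v) = v mod 9."""
    return v % n_buckets


def build_hash_index(keys, n_buckets=9):
    """Map each index block number to the keys whose pointers it stores."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[h(key, n_buckets)].append(key)
    return dict(buckets)


# Keys of the data file as they appear in the figure (assumed complete).
DATA_FILE_KEYS = [10, 20, 30, 40, 50, 70, 80, 85, 90, 100, 110, 140, 150, 160]

index = build_hash_index(DATA_FILE_KEYS)
print(h(50), index[h(50)])  # 5 [50, 140]
```

A lookup for key 50 reads index block 5 and then follows the pointer to the data block.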

Page 5: CS411 Database Systems

However, as we have more records, we need overflow blocks

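[Figure: the same hash index (h(v) = v mod 9) after more records are inserted. Index block 0 holds 90 and 180, followed by overflow entries 270, 360, 450, 540. Index block 1 holds 10 and 100, followed by overflow entries 190, 280, 370, 550. Index block 3 now holds 30 and 57.]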

Page 6: CS411 Database Systems

Hash Tables

• Secondary storage hash tables are much like main memory ones

• Recall basics:
– There are n buckets
– A hash function f(k) maps a key k to {0, 1, …, n-1}
– Store in bucket f(k) a pointer to the record with key k

• Secondary storage: bucket = block, use overflow blocks when needed

Page 7: CS411 Database Systems

Extensible Hash Table

• Allows the hash table (i.e., #buckets) to grow, to avoid performance degradation

• Assume a hash function h that returns numbers in {0, …, 2^k – 1}

• Instead of using a different hash function for each i = 1,…,k, we use the same hash function h

• How?

• The trick is to look only at the first i most significant bits of the hash value, where 2^i << 2^k and 2^i is #buckets n
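The "first i most significant bits" trick can be written as a single shift. The sketch below assumes the hash value is a k-bit integer; the function name `bucket_id` is made up for the example.

```python
def bucket_id(hash_value, i, k):
    """Return the bucket number given by the i most significant bits
    of a k-bit hash value (0 <= i <= k)."""
    if not 0 <= i <= k:
        raise ValueError("i must be between 0 and k")
    if not 0 <= hash_value < (1 << k):
        raise ValueError("hash value must fit in k bits")
    return hash_value >> (k - i)


# The same 4-bit hash value 1011 addresses 2, 4 or 8 buckets as i grows.
print(bucket_id(0b1011, 1, 4))  # 1  (bits: 1)
print(bucket_id(0b1011, 2, 4))  # 2  (bits: 10)
print(bucket_id(0b1011, 3, 4))  # 5  (bits: 101)
```

Growing i by one doubles the number of addressable buckets without changing h.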

Page 8: CS411 Database Systems

Linear Hash Table

• Idea: extend only one entry at a time

• Use the i bits at the end of a hash value as a bucket ID

• Problem: #buckets n is no longer a power of 2

• Let i be #bits necessary to address n buckets; that is,
– 2^(i-1) < n <= 2^i

• We don’t have a bucket for hash value v where n <= v < 2^i

• If n <= k, where k is the i-bit bucket ID, change the most significant bit of k from 1 to 0
– e.g., if i = 3, n = 5, k = 110 (= 6), entries for k go to the bucket for 010 (= 2)
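The addressing rule for linear hashing can be sketched as follows. The helper names `bits_needed` and `linear_bucket_id` are made up for this illustration; the rule itself (take the last i bits, and clear the leading bit when no such bucket exists yet) is the one stated on the slide.

```python
def bits_needed(n):
    """Smallest i such that 2**i >= n, i.e. 2**(i-1) < n <= 2**i (n >= 1)."""
    return (n - 1).bit_length()


def linear_bucket_id(hash_value, n):
    """Bucket for `hash_value` in a linear hash table with n buckets."""
    i = bits_needed(n)
    k = hash_value & ((1 << i) - 1)  # the last i bits of the hash value
    if k >= n:                       # bucket k does not exist yet
        k -= 1 << (i - 1)            # change the most significant bit from 1 to 0
    return k


print(linear_bucket_id(0b110, 5))  # 2, as in the slide's example (110 -> 010)
```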

Page 9: CS411 Database Systems

Linear Hash Table Example

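• n = 3, i = 2

[Figure: buckets 00, 01, 10. Bucket 00 holds (01)00 and (11)00; bucket 01 holds (01)11; bucket 10 holds (10)10. Record (01)11 ends in 11 but is stored in bucket 01 by a BIT FLIP, because we do not have a bucket for 11 yet.]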

Page 10: CS411 Database Systems

Linear Hash Table Example

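• Insert 1000: overflow blocks…

[Figure: n = 3, i = 2, buckets 00, 01, 10. Bucket 00 already holds (01)00 and (11)00, so the new record (10)00 goes to an overflow block of bucket 00; bucket 01 holds (01)11; bucket 10 holds (10)10.]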

Page 11: CS411 Database Systems

Linear Hash Tables

• Extension: independent of overflow blocks

• Extend n := n+1 when the average number of records per block exceeds (say) 80% of a block’s capacity

Page 12: CS411 Database Systems

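Linear Hash Table Extension

• From n=3 to n=4

[Figure, before (n = 3, i = 2): bucket 00 holds (01)00 and (11)00; bucket 01 holds (01)11; bucket 10 holds (10)10. After (n = 4, i = 2): bucket 00 holds (01)00 and (11)00; bucket 01 is empty; bucket 10 holds (10)10; the new bucket 11 holds (01)11.]

• Only need to touch one block (which one?)

• Current number of records r <= 1.6 * n.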

Page 13: CS411 Database Systems

Linear Hash Table Extension

• From n=3 to n=4 finished

• Insert 1001

• Need extension from n=4 to n=5 (new bit)

[Figure (n = 4, i = 2): bucket 00 holds (01)00 and (11)00; bucket 01 holds (10)01; bucket 10 holds (10)10; bucket 11 holds (01)11.]

Page 14: CS411 Database Systems

Linear Hash Table Extension

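• From n=3 to n=4 finished

• Extension from n=4 to n=5 (new bit)

• No change to the data structure is necessary

[Figure (n = 5, i = 3): buckets 000, 001, 010, 011, 100. Bucket 001 holds (1)001; bucket 010 holds (1)010; bucket 011 holds (0)111, and this record stays here because there is no bucket for ‘111’. Split records in this bucket: the records (0)100 and (1)100 from bucket 000 are redistributed between bucket 000 and the new bucket 100.]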
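The extension steps on the last few slides can be replayed with a small Python sketch. It is an illustration under simplifying assumptions: buckets are plain lists of hash values, overflow blocks and the 80% trigger are not modelled, and the names `bucket_for` and `extend` are made up here.

```python
def bits_needed(n):
    """Smallest i such that 2**i >= n."""
    return (n - 1).bit_length()


def bucket_for(hash_value, n):
    """Last i bits of the hash value, with the leading bit cleared
    when that bucket does not exist yet."""
    i = bits_needed(n)
    k = hash_value & ((1 << i) - 1)
    return k - (1 << (i - 1)) if k >= n else k


def extend(buckets):
    """Grow from n to n+1 buckets; only one existing bucket is touched."""
    new_n = len(buckets) + 1
    i = bits_needed(new_n)
    new_id = new_n - 1
    source_id = new_id - (1 << (i - 1))  # same bits as new_id except the leading 1
    grown = [list(b) for b in buckets] + [[]]
    records = grown[source_id]
    grown[source_id] = [r for r in records if bucket_for(r, new_n) == source_id]
    grown[new_id] = [r for r in records if bucket_for(r, new_n) == new_id]
    return grown


# n = 3: buckets 00, 01, 10 as in the example slides.
three = [[0b0100, 0b1100], [0b0111], [0b1010]]
four = extend(three)                        # 0111 moves from bucket 01 to bucket 11
four[bucket_for(0b1001, 4)].append(0b1001)  # insert 1001 into bucket 01
five = extend(four)                         # i becomes 3; bucket 000 is split
print(five)  # [[], [9], [10], [7], [4, 12]]
```

In this run, 0111 stays in bucket 011 after the last step because bucket 111 does not exist, which matches the note on the slide.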

Page 15: CS411 Database Systems

Components of Query Processor

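[Figure: SQL query → Query compilation → query plan → Query execution. Other labels in the figure: storage, data, Metadata.]

Query compilation consists of:

• Parse query: SQL query → query expression tree

• Select logical query plan: query expression tree → logical query plan tree

• Select physical plan: logical query plan tree → physical query plan tree

Selecting the logical and physical plans together is labeled query optimization.

We must supply detail regarding how the query is to be executed.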

Page 16: CS411 Database Systems

Outline

• Logical/physical operators

• Cost parameters

• One-pass algorithms

• Nested-loop joins

• Two-pass algorithms based on sorting

Page 17: CS411 Database Systems

Logical vs. Physical Operators

• Logical operators
– what they do
– e.g., union, selection, projection, join, grouping

• Physical operators
– how they do it
– Principal methods: scanning, hashing, sorting, and indexing
– Consider assumptions as to the amount of available main memory
– e.g., nested loop join, sort-merge join, hash join, index join

Page 18: CS411 Database Systems

Physical Query Plans

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND Q.city=‘urbana’

[Figure: physical query plan tree, from the leaves up: Purchase (Table scan) and Person (Index scan); join on P.buyer=Q.name (Simple Nested Loop Join); selection Q.city=‘urbana’; projection on P.buyer]

Query Plan:
• Logical tree
• Implementation choice at every node
• Scheduling of operations.

Some operators are from relational algebra, and others (e.g., scan, group) are not.

Page 19: CS411 Database Systems

The I/O Model of Computation

• In main memory algorithms, we care about CPU time

• In databases, time is dominated by I/O cost

• Assumption: cost is given only by I/O

• Consequence: need to redesign certain algorithms

Page 20: CS411 Database Systems

Cost Parameters

• Cost parameters
– M = number of blocks that fit in main memory
– B(R) = number of blocks holding R
– T(R) = number of tuples in R
– V(R,a) = number of distinct values of the attribute a

• Estimating the cost:
– Important in optimization (next topic)
– Compute I/O cost only
– We consider the cost to read the tables
– We don’t include the cost to write the result (because of pipelining)

Page 21: CS411 Database Systems

Scanning Tables

• The table is clustered (i.e., its blocks consist only of records from this table):
– Table scan: if we know where the blocks are
– Index scan: if we have a sparse index to find the blocks

• The table is unclustered (e.g., its records are placed on blocks with those of other tables)
– May need one block read for each record

Page 22: CS411 Database Systems

Scanning Clustered/Unclustered Tables

[Figure: a clustered table needs 2 block reads (B(R) = 2); an unclustered table needs 4 reads (T(R) = 4)]

Page 23: CS411 Database Systems

Cost of the Scan Operator

• Clustered relation:
– Table scan: B(R)
– Index scan: B(R), ignoring the cost of reading the index file

• Unclustered relation:
– T(R)

We assume clustered relations to estimate the costs of other physical operators.
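The scan cost rule fits in one line of Python. The function name and argument names below are made up for the example; the numbers come from the clustered/unclustered figure (B(R) = 2, T(R) = 4).

```python
def scan_cost(num_blocks, num_tuples, clustered=True):
    """I/O cost of scanning R: B(R) if R is clustered, T(R) otherwise.

    The cost of reading an index file, if any, is ignored.
    """
    return num_blocks if clustered else num_tuples


print(scan_cost(2, 4, clustered=True))   # 2
print(scan_cost(2, 4, clustered=False))  # 4
```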

Page 24: CS411 Database Systems

Classification of Physical Operators

• One-pass algorithms
– Read the data only once from disk
– Usually require at least one of the input relations to fit in main memory

• Nested-loop join algorithms
– Read one relation only once, while the other will be read repeatedly from disk

• Two-pass algorithms
– First pass: read data from disk, process it, write it to the disk
– Second pass: read the data for further processing

Page 25: CS411 Database Systems

One pass algorithms

Page 26: CS411 Database Systems

One-pass Algorithms

Selection σ(R), projection π(R)

• Both are tuple-at-a-time algorithms

• Cost: B(R)

[Figure: the B(R) blocks of R are read from disk one block at a time into an input buffer, passed through the unary operator, and written to an output buffer]

Page 27: CS411 Database Systems

One-pass Algorithms

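Duplicate elimination δ(R)

• Need to keep a dictionary in memory:
– balanced search tree
– hash table
– etc.

• Cost: B(R)

• Assumption: B(δ(R)) <= M

[Figure: R is read block by block into an input buffer; each tuple is checked (“Scan before?”) against the dictionary held in M-1 buffers, and new tuples go to the output buffer]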
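One-pass duplicate elimination can be sketched in Python with a set standing in for the in-memory dictionary. This is a minimal illustration: the sample blocks below are assumed values (6 blocks of 2 tuples each), and the check that the distinct tuples fit in M-1 buffers is not modelled.

```python
def one_pass_distinct(blocks):
    """Yield each distinct tuple of R once, reading every block exactly once."""
    seen = set()              # the in-memory dictionary
    for block in blocks:      # one disk read per block: B(R) reads in total
        for t in block:
            if t not in seen:
                seen.add(t)
                yield t


# Hypothetical relation R with B(R) = 6 blocks and T(R) = 12 tuples.
R_BLOCKS = [(8, 7), (5, 3), (6, 11), (5, 4), (4, 12), (10, 2)]
print(list(one_pass_distinct(R_BLOCKS)))
# [8, 7, 5, 3, 6, 11, 4, 12, 10, 2]
```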

Page 28: CS411 Database Systems

Duplicate elimination δ(R) when B(δ(R)) <= M

[Figure: worked example with B(R) = 6, T(R) = 12 and M = 8. R is read from disk one block at a time into the input buffer; each tuple is looked up (“Scan before?”) in a hash table with h(x) = x mod 7, buckets 0 to 6, kept in the M-1 buffers, and new values go to the output buffer.]

Cost: B(R)

Page 29: CS411 Database Systems

Grouping: γ city, sum(price) (R)

• Need to keep a dictionary in memory

• Also store the sum(price) for each city

• Cost: B(R)

• Assumption: number of cities fits in memory
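A one-pass grouping with sum(price) per city might look like the sketch below. The rows are made-up (city, price) pairs, and the dictionary `totals` plays the role of the in-memory structure; it must fit in memory, which is the assumption stated above.

```python
def group_sum(blocks):
    """Compute sum(price) per city in one pass over the blocks of R."""
    totals = {}                    # city -> running sum, kept in memory
    for block in blocks:           # each block is read once: cost B(R)
        for city, price in block:
            totals[city] = totals.get(city, 0) + price
    return totals


# Hypothetical relation R stored in two blocks.
SALES_BLOCKS = [
    [("urbana", 10), ("chicago", 5)],
    [("urbana", 7), ("peoria", 3)],
]
print(group_sum(SALES_BLOCKS))  # {'urbana': 17, 'chicago': 5, 'peoria': 3}
```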

Page 30: CS411 Database Systems

Binary Operations: R ∪ S, R – S

• Assumption: min(B(R), B(S)) <= M

• Read the smaller of R and S into main memory, then read the other one block at a time

• Cost: B(R)+B(S)

• Example: R ∩ S
– Read S into M-1 buffers and build a search structure
– Read each block of R, and for each tuple t of R, see if t is also in S
– If so, copy t to the output, and if not, ignore t
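The R ∩ S example can be sketched in Python as follows. This is an illustration under assumptions: a set stands in for the search structure, R and S are assumed to be sets (no duplicate tuples), and the function also counts block reads so the B(R)+B(S) cost is visible.

```python
def one_pass_intersection(r_blocks, s_blocks):
    """Return (tuples of R that are also in S, number of block reads)."""
    reads = len(s_blocks)                        # read all of S once
    s_tuples = {t for block in s_blocks for t in block}  # search structure
    output = []
    for block in r_blocks:                       # read R one block at a time
        reads += 1
        for t in block:
            if t in s_tuples:                    # t is also in S: copy to output
                output.append(t)
    return output, reads


# Hypothetical relations: B(R) = 3, B(S) = 2.
result, io_cost = one_pass_intersection([[1, 2], [3, 4], [5, 6]], [[2, 5], [7, 8]])
print(result, io_cost)  # [2, 5] 5
```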

Page 31: CS411 Database Systems

Nested loop join

Page 32: CS411 Database Systems

Tuple-based Nested Loop Joins

• Join R ⋈ S

for each tuple r in R do
    for each tuple s in S do
        if r and s join then output (r,s)

• Cost: T(R) · T(S), or T(R) · B(S) if S is clustered
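The tuple-based nested loop join translates almost directly into Python. In this sketch the join condition is passed in as a function `joins`, and the Purchase and Person rows are made-up sample data, not taken from the slides.

```python
def tuple_nested_loop_join(R, S, joins):
    """Yield every pair (r, s) with r in R and s in S that satisfies `joins`."""
    for r in R:
        for s in S:          # S is scanned once per tuple of R
            if joins(r, s):
                yield (r, s)


# Hypothetical rows: Purchase(buyer, product) and Person(name, city).
purchase = [("alice", "book"), ("bob", "pen")]
person = [("alice", "urbana"), ("carol", "chicago")]

pairs = list(tuple_nested_loop_join(purchase, person, lambda p, q: p[0] == q[0]))
print(pairs)  # [(('alice', 'book'), ('alice', 'urbana'))]
```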

Page 33: CS411 Database Systems

Block-based Nested Loop Joins

for each chunk bs of (M-1) blocks of S do
    for each block br of R do
        for each tuple s in bs do
            for each tuple r in br do
                if r and s join then output(r,s)

Page 34: CS411 Database Systems

Block-based Nested Loop Joins

[Figure: R and S are on disk. A chunk of S is held in memory in a hash table (k < B-1 pages), one input buffer holds the current block of R, and joined tuples go through the output buffer to the join result.]

Page 35: CS411 Database Systems

Block-based Nested Loop Joins

• Cost:
– Read S once: cost B(S)
– Outer loop runs B(S)/(M-1) times, and each time we need to read R: cost B(S)B(R)/(M-1)
– Total cost: B(S) + B(S)B(R)/(M-1)

• Notice: it is better to iterate over the smaller relation first

• S ⋈ R: S = outer relation, R = inner relation
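To check the cost formula, the block-based nested loop join can be simulated in Python while counting block reads. This is a sketch under assumptions: relations are lists of blocks, each block is a list of tuples, the function names are made up, and the number of outer iterations is rounded up to a whole number.

```python
import math


def block_nested_loop_join(s_blocks, r_blocks, M, joins):
    """Join S (outer) with R (inner) using M-1 buffers for S.

    Returns (joined pairs, number of block reads). Requires M >= 2.
    """
    output, reads = [], 0
    chunk_size = M - 1
    for start in range(0, len(s_blocks), chunk_size):
        chunk = s_blocks[start:start + chunk_size]
        reads += len(chunk)            # S is read once overall
        for br in r_blocks:
            reads += 1                 # R is re-read for every chunk of S
            for bs in chunk:
                for s in bs:
                    for r in br:
                        if joins(r, s):
                            output.append((r, s))
    return output, reads


def estimated_cost(b_s, b_r, M):
    """B(S) + ceil(B(S)/(M-1)) * B(R)."""
    return b_s + math.ceil(b_s / (M - 1)) * b_r


# Hypothetical data: B(S) = 4, B(R) = 3, M = 3.
S_BLOCKS = [[1, 2], [3, 4], [5, 6], [7, 8]]
R_BLOCKS_INNER = [[2, 4], [6, 9], [10, 1]]
pairs, io_cost = block_nested_loop_join(S_BLOCKS, R_BLOCKS_INNER, 3, lambda r, s: r == s)
print(sorted(pairs), io_cost, estimated_cost(4, 3, 3))
# [(1, 1), (2, 2), (4, 4), (6, 6)] 10 10
```

Swapping the roles so that the smaller relation is the outer one reduces the number of times the other relation is re-read.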