Select

Query Tree Question

• Should we do a pname, pnumber then pname = ‘Aquarius’ then pnumber ?

• No, since the operations are done together– the processor would read a row of project,

see if pname = ‘Aquarius’ then use pnumber to perform the join.

• Our query tree has only 2 groups, not 3

Select Operation Strategies

And Indexing

(Chapter 8)

*Some info on slides from Dr. S. Son, U. Va

Disk access

• DBs traditionally stored on disk

• Cheaper to store on disk than in memory

• Costs for:– Seek time, latency, data transfer time

• Disk access is page oriented

• 2 - 4 KB page size

Access time

• Time to randomly access a page :– 12-20 ms which is 50-83 I/O's per second

• Large disparity between disk access and memory access (10-200 ns)

• System initially determines if page in memory buffer (page tables, etc.)

Table scan

• Linear search - all data rows read in – I/O parallelism can be used

• multiple I/O read requests satisfied at the same time

• stripe the data across different disks

– Problems with parallelism?• must balance disk arm load to gain maximum

parallelism • requires the same total number of random I/O's,

but using devices for a shorter time

Sequential prefetch I/O

• Retrieve one disk page after another (on same track) - typically 32

• Seek time no longer a problem

• Must know in advance to read 32 successive pages

• Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)

Access time

• Seek time – 10-15ms

• Latency time – 2-5 ms

• Data transfer time – 10-200 ns

Access time for fast I/O

RIO Seq. Prefetch .010 .010 Seek - disk arm to cylinder .002 .002 Latency - platter to sector .0015 .048 Data transfer - Page .0135 .060 1 page vs. 32 pages

.43 seconds .060 seconds for 32 pages for both

Textbook access time

RIO Seq. Prefetch .008 .008 Seek - disk arm to cylinder .004 .004 Latency - platter to sector .0005 .016 Data transfer - Page .0125 .028 1 page vs. 32 pages

.40 seconds .028 seconds for 32 pages for both

Disk allocation

• Disk Resource Allocation for Databases (DBA has control)

• Goal – contiguous sectors on disk - want data as close together as possible to minimize seek time

• No standard SQL approach, but general way to deal with allocation

• Some OS allow specification of size of file and disk device

Tablespace

• Allocation medium for tables and indexes for ORACLE, DB2, etc.

• Usually relations (files) cannot span disk devices • Can put >1 table in a table space if accessed

together • Tablespace corresponds to 1 or more OS files

and can span disk devices

Query Language

• ORACLE DB's contain several tablespaces, including one called system - data description + indexes + user-defined tables

Create tablespace tspace1 datafile 'fname1', 'fname2';

• default tablespace given to each user • if multiple tablespaces - better control over load

balancing • can take some disk space off-line

Extent

• extent - contiguous storage on disk • when data segment or index segment first created,

given an initial extent from tablespace 10KB (5 pages) • if need more space given next contiguous extent

• can increase the size by a positive % (cannot decrease) initial n - size of initial extent next n - size of next max extents - maximum number of extents min extents - number of extents initially

allocated pct increase n - % by which next extent

grows over previous one

Create table

• Create table statement - can specify tablespace, no. of extents– When initial extent full, new extent allocated

• pctfree - determine how much space can be used for inserts of new rows – if pctfree =10%, inserts stop when page is 90% full

• pctused – determines when new inserts start again – if fall below certain percentage of total, default

pctused = 40% pctfree + pctused < 100

Rows

• Row layout on each disk page (see figure) • Row directory – row number and page byte

offset– Row number is row number in page – book calls it

slot#– Page byte offset – with varchar, row size not constant

• To identify a particular row use RID (RowID) –

page #, slot # [file#]

slot# is number in row directory (logical #)

Differences in DBMSs

• RID can be retrieved in ORACLE but not DB2 (violates relational model rule)

• ORACLE • rows can be slit between pages (row record

fragmentation) • Can have rows from multiple tables on same page,

more info

• DB2, no splitting, entire row moved to new page, need forwarding pointer

Binary Search

• ``Find all students with gpa > 3.0’’– If data is in sorted file, do binary search to find

first such student, then scan to find others.– Cost of binary search can be quite high.

• Simple idea: Create an `index’ file.

Page 1 Page 2 Page NPage 3 Data File

k2 kNk1 Index File

Binary Search

• Binary search on disk – optimal for comparisons - not optimal for disk-

based look-up – must keep data in order – may be reading values from same page at

different times

• Instead use B+-tree index

Indexing

• Keyed access retrieval method • index is a sorted file - sorted by index key • index entries:

index key pointer (RID)

• pointer is RID • index resides on disk, partially memory resident when

accessed

Indexing

• As for any index, 3 alternatives for data entries k*:– Data record with key value k– <k, rid of data record with search key value k>– <k, list of rids of data records with search key k>

• Choice is orthogonal to the indexing technique used to locate data entries k*.

• Tree-structured indexing techniques support both range searches and equality searches.

• B+ tree: dynamic, adjusts gracefully under inserts and deletes.

B+-tree

• Most commonly used index structure type in DBs today

• Based on B-tree

• Used to minimize disk I/O

• available in DB2, ORACLE also has hash cluster, Ingres has heap structure, B-tree, isam (chain together new nodes) Example

Structure of B+ Trees

• leaf level pointers to data (RIDs)

• the remaining are directory (index) nodes that point to other index nodes

Index Entries

Data Entries("Sequence set")

(Direct search)

Characteristics of B+ Tree

• Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)

• Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree.

• Supports equality and range-searches efficiently

Cost of I/O for B+-tree

• Assume number of entries in each index node fits on one page - one node is one page

• If tree with depth of 3, 3 I/Os to get pointer to data B+-tree structured to get most out of every disk page read

• Read in index node, can make multiple probes to same page if remains in memory

– likely since frequent access to upper -level nodes of actively used B+-trees

B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.– average fanout = 133

• Typical capacities:– Height 4: 1334 = 312,900,700 records– Height 3: 1333 = 2,352,637 records

• Can often hold top levels in buffer pool:– Level 1 = 1 page = 8 Kbytes– Level 2 = 133 pages = 1 Mbyte– Level 3 = 17,689 pages = 133 MBytes

B+-tree

• Index has a directory structure that allows retrieval of a range of values efficiently – search for leftmost index entry Si such that

X <= Si • Index entries always placed in sequence

by value - can use sequential prefetch on index

• Index entries shorter than data rows and require proportionately less I/O

B+-tree

• Balancing of B+-trees - insert, delete

• nodes usually not full

• utilities to reorganize to lower disk I/O

• most systems allow nodes to become depopulated- no automatic algorithm to balance

• average node below root level 71% full in active growing B+-trees

Inserting into B+ Tree

• Find correct leaf L. • Put data entry onto L.

– If L has enough space, done!– Else, must split L (into L and a new node L2)

• Redistribute entries evenly, copy up middle key.• Insert index entry pointing to L2 into parent of L.

• This can happen recursively– To split index node, redistribute entries evenly, but

push up middle key. (Contrast with leaf splits.)• Splits “grow” tree; root split increases height.

– Tree growth: gets wider or one level taller at top.– Algorithm from MSU

http://www.cs.msstate.edu/~cs2314/global/BTreeAnimation/insert.htm

Deleting from B+ tree

• Start at root, find leaf L where entry belongs.• Remove the entry.

– If L is at least half-full, done! – If L has only d-1 entries,

• Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

• If re-distribution fails, merge L and sibling.

• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.

• Merge could propagate to root, decreasing height.• Algorithm from MSU

http://www.cs.msstate.edu/~cs2314/global/BTreeAnimation/delete.htm

Duplicate key values

• Duplicate key values in index • leaf nodes have sibling pointers • but a delete of a row that has a heavily

duplicated key entails a long search through the leaf-level of the B+-tree

• Index compression - with multiple duplicates | header info | PrX keyval RID RID ... RID | PrX keyval RID…RID|

where PrX is count of RID values

Create Index

Options: multiple columns

tablespace storage - initial extents, etc. percent free default = 10

% of each page left unfilled free page (1 free page for every n

index pages) Can control % of B+-tree node pages left

unfilled when index created, refers to initial creation

Clustering

• Placing rows on disk in order by some common index key value (remember the index itself is always sorted)

• clustered (clustering) index - index with rows in the same order as the key values

• efficiency advantage read in a page, get all of the rows with

the same value • clustering is useful for range queries

e.g. between keyval1 and keyval2

Clustering

• can only cluster table by 1 clustering index at a time

• In DB2 – – if the table is empty, rows sorted as placed on

disk – subsequent insertions not clustered, must use

REORG

Indexes vs. table scan

• To illustrate the difference between table scan, secondary index (non clustered) and clustered index

Assume 10 M customers, 200 cities2KB/page, row = 100 bytes, 20 rows/page Select *

From Customers Where city = Birmingham

1/200 * 10M if assume selectivity = 1/200 50,000 customers in a city

Table Scan

Table Scan - read entire table

10,000,000/20 = 500,000 pages

If use prefetch?

500000/32 * .? =

Clustering Index

• Clustering Index –

• All entries for B'ham clustered on same pages

• 50,000/20 = 2500 pages (with 20 rows per page)

• (3 + 50 + 2500)*?=

Secondary Index

Secondary Index–

• In the worst case 1 entry for B'ham per page • 50,000 pages (10M/200)• 3 upper nodes of the tree • Assume 1000 index entries per leaf node, read

50000/1000 index pages • (3 + 50 + 50,000)*?=

List Prefetch

• Create list of data pages to access

• system orders pages to minimize disk I/O– E.g. elevator algorithm for disk request

scheduling

% Free

• Redo the previous calculations assuming relations created with 50% free option specified.

Multiple Indexes

• More than one index on a relation – e.g. class - one index, gender - one index

Composite Index

• One index based on more than one attribute Create Index index_name on Table (col1, col2,... coln)

• Composite index entry - values for each attribute class, gender entry in index is: C1, C2, RID

• What would B+ tree look like?

Creating Indexes

When determining what indexes to create consider: workload - mix of queries and frequencies of requests 20% of requests are updates, etc.

can create lots of indexes but: cost to create insertions initial load time high if a large table index entries can become longer and

longer as multiple columns included

Select

Documents

disk page

page n page

data transfer page

page size

new page

data transfer time disk

disk space

disk devices