Modern Database Systems - Lecture 01


Aristides Gionis Michael Mathioudakis

T.A.: Orestis Kostakis

Spring 2016

logistics

the assignment will be up by Monday (you will receive an email); it is due Feb 12th

if you’re not registered: I will also post material (slides and assignments) at

http://michalis.co/moderndb/

in this lecture...

review of past material: relational model and SQL

storage and indexing; access-cost analysis

hash index; B+ tree

relational model and SQL

what is the relational model?
tabular representation of data

why do we study it?
most widely used
supports simple and intuitive querying
good for educational purposes

definitions

relational database: a set of relations

relation: a schema together with an instance

schema: name of the relation + name and type of each field (fields as columns)

instance: a table with rows and columns

example relation: students

schema: students(sid: integer, name: string, username: string, age: integer, gpa: real)

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

cardinality (number of rows) = 3, degree (number of fields/columns) = 5

> can we have the same value twice in the same column?

querying

major strength of the relational model: simple, intuitive, precise querying of data

the DBMS is responsible for efficient evaluation

Structured Query Language (SQL): the standard language for relational queries

developed by IBM in the 1970s, first standardized in 1986

latest standard in 2011

example SQL query

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

to find student records of age 23

SELECT * FROM students WHERE age=23

sid    name         username  age  gpa
53650  Jon Edwards  jon       23   2.4

to find just names and usernames

SELECT name, username FROM students WHERE age=23

name         username
Jon Edwards  jon
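The two queries above can be tried out with a minimal sketch like the following. It assumes Python's built-in sqlite3 module as a stand-in DBMS and uses SQLite's own column types (TEXT instead of string); the rows are the example instance of students.

```python
import sqlite3

# In-memory database holding the example instance of the students relation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edwards", "jon", 23, 2.4),
    ],
)

# Selection: keep only the rows that satisfy the WHERE condition.
all_columns = conn.execute("SELECT * FROM students WHERE age=23").fetchall()

# Selection + projection: keep only the listed columns of those rows.
names_only = conn.execute(
    "SELECT name, username FROM students WHERE age=23"
).fetchall()

print(all_columns)
print(names_only)
```

Only the record of Jon Edwards has age 23, so both results contain a single row.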

creating, altering, and destroying relations in SQL

CREATE TABLE students (sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL);

the type of each column is enforced by the DBMS

CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2));

ALTER TABLE students ADD COLUMN firstYear INTEGER;

every tuple in the current instance is extended with a null value in the new column

DROP TABLE students;

destroys the relation students (schema and instance)

adding and deleting tuples

> what do the following statements do?

INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);

DELETE FROM students WHERE name = 'Jane Smith';

candidate keys

a set of fields is a candidate key (aka ‘key’) for a relation if...

1) no two distinct tuples can have the same values in all key fields, and

2) this is not true for any proper subset of the key

if only part (1) holds, we have a superkey

a relation may have many candidate keys; the DBMS admin chooses one of them as the primary key

a key is an integrity constraint: a condition that must be true for any instance of the database

> other integrity constraints?
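Conditions (1) and (2) can be checked mechanically on a given instance, as in the sketch below. The helper names are hypothetical. Note that a key is a constraint on every legal instance, so a check on one instance can only rule a key out, never confirm it.

```python
from itertools import combinations

def is_superkey(rows, fields):
    """True if no two rows of this instance agree on all the given fields."""
    seen = set()
    for row in rows:
        values = tuple(row[f] for f in fields)
        if values in seen:
            return False
        seen.add(values)
    return True

def is_candidate_key(rows, fields):
    """Superkey on this instance, and no proper subset is also a superkey."""
    if not is_superkey(rows, fields):
        return False
    for size in range(len(fields)):
        for subset in combinations(fields, size):
            if is_superkey(rows, subset):
                return False
    return True

students = [
    {"sid": 53666, "name": "Sam Jones", "username": "jones", "age": 22, "gpa": 3.4},
    {"sid": 53688, "name": "Alice Smith", "username": "smith", "age": 22, "gpa": 3.8},
    {"sid": 53650, "name": "Jon Edwards", "username": "jon", "age": 23, "gpa": 2.4},
]

print(is_superkey(students, ["age"]))              # two students share age 22
print(is_candidate_key(students, ["sid"]))
print(is_candidate_key(students, ["sid", "name"])) # superkey, but not minimal
```

On this small instance name alone also passes the check, although two students could well share a name; this is why keys are declared by the designer rather than inferred from the data.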

candidate keys

in SQL, use PRIMARY KEY to specify the primary key and UNIQUE to specify other candidate keys

example

relation enrolled holds information about student enrollment in courses

> compare the following ‘create table’ statements

CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid));

CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade));

use ICs carefully - they might forbid database instances that could arise in practice
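To see how differently the two definitions behave, here is a small sketch that assumes SQLite (through Python's sqlite3 module) as the DBMS. The helper try_inserts and the sample rows are made up for illustration.

```python
import sqlite3

def try_inserts(create_statement, rows):
    """Create Enrolled with the given definition; return the rows the DBMS accepts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_statement)
    accepted = []
    for row in rows:
        try:
            conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", row)
            accepted.append(row)
        except sqlite3.IntegrityError:
            pass  # the row violates PRIMARY KEY or UNIQUE
    conn.close()
    return accepted

rows = [
    ("53666", "db101", "A"),
    ("53666", "os201", "B"),  # same student, second course
    ("53688", "db101", "A"),  # another student, same course and same grade
]

first = ("CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
         "PRIMARY KEY (sid, cid))")
second = ("CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
          "PRIMARY KEY (sid), UNIQUE (cid, grade))")

accepted_first = try_inserts(first, rows)
accepted_second = try_inserts(second, rows)
print(len(accepted_first), len(accepted_second))
```

The first definition accepts all three rows. The second rejects the last two: a student cannot take a second course, and no two students can get the same grade in the same course.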

storage and indexing

storage

setting: the DBMS uses disks as external storage and stores relations in files of records

disks: retrieving a random page has a fixed cost; retrieving several consecutive pages is cheaper than retrieving each of them by random access

> why?

file organization: method of arranging a file of records on external storage

record: one row of a relation; each record is internally assigned a record id (rid)

the rid is sufficient to physically locate the record (address)

alternative file organizations

heap files: records in random order
suitable when typical access is a file scan to retrieve all records

sorted files: records are sorted - typically by column value(s)
suitable if records must be retrieved in the same order

indexes: data structures that allow organized access to records via search keys - typically column value(s)
updates are faster than in sorted files

> why?

indexes

data structures that allow us to find rids of records with specified column values

any subset of the columns of a relation can be the search key for an index

a search key is not the same as a primary / candidate key

an index contains a collection of data entries and supports efficient retrieval of the data entries k* with a given key value k

[figure: an index file (index entries above data entries) pointing to a data file (data records)]

types of data entries

three alternatives
1. data record with key value k
2. (k, rid of data record with search key k)
3. (k, list of rids of data records with search key k)

the type of data entries is orthogonal to the index structure

examples of index structures: B+ trees, hash tables

data entries of type 1

the index structure is a file organization for data records

we just have an ‘index file’

[figure: an index file that contains index entries and the data records themselves]

> how many indexes of a relation can be of type 1?

types of data entries - types 2 & 3

data entries are typically much smaller than data records

> why?

type 3 is more compact than type 2

> why?

[figure: an index file (index entries above data entries) pointing to a data file (data records)]

index classes

primary vs secondary
primary: the search key contains the primary key
unique index: the search key contains a candidate key

clustered vs unclustered
clustered: the order of data records is the same as that of data entries
makes a big difference for some queries!

> can alternative 1 indexes be unclustered?

[figure: a clustered index next to an unclustered index]

hash-based indexes

retrieve records with exactly specified search-key values
suitable for equality queries

the index is a collection of buckets; bucket = 1 or more disk pages

hashing function h: h(r) = bucket where record r belongs, based on its column values

data entries are...
... type 1: the buckets contain data records
... type 2 or 3: the buckets contain (key, rid) or (key, rids) pairs

hash-based indexes

relation employees(name CHAR(100), age INTEGER, salary INTEGER)

name   age  salary
Smith  44   3000
Jones  40   6003
Tracy  44   5004
Ashby  25   3000
Basu   33   4003
Kate   29   2007
Cass   50   5004
Basu   33   6003

[figure: a clustered (type 1) hash index on age, where hash function h1 maps age to buckets of data records; an unclustered (type 2) hash index on salary, where hash function h2 maps salary to buckets of data entries]
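A minimal sketch of an unclustered hash index with type 2 data entries on salary, using the employee records of this slide. The class HashIndex, the number of buckets, and the use of Python's built-in hash as hashing function are all assumptions made for illustration; a rid is taken to be the position of the record in the heap file.

```python
from collections import defaultdict

class HashIndex:
    """Toy hash index with type 2 data entries: (key, rid) pairs stored in buckets."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.buckets = defaultdict(list)  # bucket number -> list of (key, rid)

    def _h(self, key):
        # Hypothetical hashing function; a real DBMS would choose its own.
        return hash(key) % self.num_buckets

    def insert(self, key, rid):
        self.buckets[self._h(key)].append((key, rid))

    def lookup(self, key):
        """Equality search: inspect only the one bucket the key hashes to."""
        return [rid for k, rid in self.buckets[self._h(key)] if k == key]

# Heap file: the rid of a record is its position in this list.
employees = [
    ("Smith", 44, 3000), ("Jones", 40, 6003), ("Tracy", 44, 5004),
    ("Ashby", 25, 3000), ("Basu", 33, 4003), ("Kate", 29, 2007),
    ("Cass", 50, 5004), ("Basu", 33, 6003),
]

salary_index = HashIndex()
for rid, (_name, _age, salary) in enumerate(employees):
    salary_index.insert(salary, rid)

rids = salary_index.lookup(3000)
print([employees[rid] for rid in rids])
```

A range condition such as salary > 3000 gets no help from this structure, because consecutive salaries land in unrelated buckets.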

b+ tree indexes

leaf pages contain data entries (sorted by search key) and are chained (prev & next)

non-leaf pages contain index entries; they are only used to direct searches

layout of a non-leaf page: P0 K1 P1 K2 P2 ... Km Pm, where each pair (Ki, Pi) is an index entry

example b+ tree

root: 17 (entries < 17 to the left, entries >= 17 to the right)

non-leaf pages: [5 13] below the root on the left, [27 30] on the right

leaf pages: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

note that data entries at the leaf level are sorted

> find 28*? 29*? all > 15* and < 30*?

insert/delete: find the data entry in a leaf, then update it

the parent sometimes needs to be adjusted, and the change sometimes bubbles up the tree
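The searches asked about on this slide can be traced with a small sketch over the example tree. The representation is an assumption chosen for brevity: non-leaf pages are (keys, children) pairs, and the leaf chain is modeled by the order of the list LEAVES. Insertions and deletions are not covered.

```python
from bisect import bisect_right

# Leaf pages hold the sorted data entries; their list order models the leaf chain.
LEAVES = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Non-leaf pages are (keys, children); an integer child is a position in LEAVES.
LEFT = ([5, 13], [0, 1, 2])
RIGHT = ([27, 30], [3, 4, 5])
ROOT = ([17], [LEFT, RIGHT])

def find_leaf(key, node=ROOT):
    """Follow index entries from the root down to the leaf that may hold key."""
    keys, children = node
    child = children[bisect_right(keys, key)]  # entries >= a key go to its right
    return child if isinstance(child, int) else find_leaf(key, child)

def contains(key):
    """Equality search: one root-to-leaf descent, then look inside the leaf."""
    return key in LEAVES[find_leaf(key)]

def range_search(low, high):
    """Entries with low < entry < high: descend once, then walk the leaf chain."""
    result = []
    for leaf in LEAVES[find_leaf(low):]:
        for entry in leaf:
            if entry >= high:
                return result
            if entry > low:
                result.append(entry)
    return result

print(contains(28), contains(29))
print(range_search(15, 30))
```

The range search descends the tree only once; after that it reads consecutive leaves, which is what makes tree indexes suitable for range queries.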

access-cost analysis

access-cost model

● relation students
○ B: number of data pages, R: number of records per page
○ D: (average) time to read or write one disk page

● execute typical select-from-where query

SELECT * FROM students WHERE <...>

● estimate running time of query
○ ignore cpu costs
○ number of disk accesses (reads/writes) is the bottleneck

file organizations

heap file (random order; inserts at eof)

sorted file, sorted on <age, gpa>

clustered B+ tree file (type 1 data entries) on search key <age, gpa>

heap file with unclustered B+ tree index on search key <age, gpa>

heap file with unclustered hash index on search key <age, gpa>

queries to compare

scan - fetch all records
SELECT * FROM students

equality search
SELECT * FROM students WHERE age = 22 AND gpa = 4.0

range search
SELECT * FROM students WHERE age >= 20

insert record
INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Michael', 'mike', 32, 2.6)

cost analysis

what is the estimated time for each query to run?

under the simplified model: how many disk pages are accessed?

time = #disk-accesses x D

cost analysis

                     scan  equality  range  insert
heap
sorted
clustered
unclustered b+ tree
unclustered hash

heap file

operation: cost and explanation

scan: B; simply retrieve all pages

equality search: B in the worst case; if we know that exactly one such record exists, the cost is 0.5B in expectation

range search: B; must retrieve all records

insert: 2; fetch the last page of the file and store it back

sorted file

operation: cost and explanation

scan: B; simply retrieve all pages

equality search: log_2(B) + #qualifying-pages; since the condition matches the sort order, binary search finds the page of the first qualifying record by retrieving log_2(B) pages; if more than one record qualifies, sequentially retrieve the #qualifying-pages that follow

range search: log_2(B) + #qualifying-pages; as above, log_2(B) pages are retrieved to find the first matching record, possibly followed by a number (#qualifying-pages) of pages with qualifying records

insert: log_2(B) + B; find the position of the record in the file (log_2(B)); then read the second half of the file, insert the record, and write the second half back (0.5B + 0.5B in expectation)

clustered b+ tree

operation: cost and explanation

scan: 1.5B; simply retrieve all record pages

equality search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones

range search: log_F(1.5B) + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones

insert: log_F(1.5B) + 1; search for the record page (log_F(1.5B)) and add the record to it (1)

assumptions: 2/3 = 67% occupancy of record pages, i.e. 1.5B record pages; fanout F

unclustered b+ tree

operation: cost and explanation

scan: B(R+0.15); scan the leaf level of the index (0.15B); for each data entry, fetch the page with the corresponding data record (6.7R x 0.15B ≈ BR)

equality search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)

range search: log_F(0.15B) + #qualifying-records; locate the first data entry (log_F(0.15B)) and do one disk access for every qualifying record (#qualifying-records)

insert: 3 + log_F(0.15B); insert at the end of the heap file (2), find the page for the data entry (log_F(0.15B)) and update it (1)

assumptions: the size of one data entry is 10% the size of one record; also, index pages have 2/3 = 67% occupancy; therefore, the number of index leaf pages is 0.1 x 1.5B = 0.15B and the number of data entries in one page is 10 x 0.67R = 6.7R

unclustered hash index

operation: cost and explanation

scan: B(R+0.125); retrieve the pages that contain data entries (0.125B); for each data entry, fetch the page with the corresponding data record (8R x 0.125B = BR)

equality search: 2; retrieve the page with the data entry (1) and the page with the data record (1)

range search: 0.125B + #qualifying-records; the hash index offers no help - scan the index (0.125B) and retrieve the pages of matching records; typically it’s better to scan the entire heap file (B)

insert: 4; insert the record into the heap file (1 read + 1 write); insert the data entry into the hash index (1 read + 1 write)

assumptions: the size of one data entry is 10% the size of one record; static hashing, no overflow pages (one bucket is one page); 4/5 = 80% occupancy; therefore, 0.1 x 1.25B = 0.125B pages for data entries and the number of data entries in a page is 10 x 0.8R = 8R

cost analysis

                     scan        equality                            range                               insert
heap                 B           B                                   B                                   2
sorted               B           log_2(B) + #qualifying-pages        log_2(B) + #qualifying-pages        log_2(B) + B
clustered            1.5B        log_F(1.5B) + #qualifying-pages     log_F(1.5B) + #qualifying-pages     log_F(1.5B) + 1
unclustered b+ tree  B(R+0.15)   log_F(0.15B) + #qualifying-records  log_F(0.15B) + #qualifying-records  3 + log_F(0.15B)
unclustered hash     B(R+0.125)  2                                   0.125B + #qualifying-records        4

note: we made several assumptions to obtain these numbers
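To get a feeling for the magnitudes, the table can be turned into a small calculator, sketched below under the same occupancy and entry-size assumptions as the slides. The function names and the sample values of B, R, F and D are made up for illustration.

```python
from math import log, log2

def access_costs(B, R, F, qualifying_pages=1, qualifying_records=1):
    """Number of disk accesses per operation, following the cost table above.

    B: data pages, R: records per page, F: fanout of the b+ tree.
    """
    tree_clustered = log(1.5 * B, F)     # log_F(1.5B)
    tree_unclustered = log(0.15 * B, F)  # log_F(0.15B)
    return {
        "heap": {"scan": B, "equality": B, "range": B, "insert": 2},
        "sorted": {
            "scan": B,
            "equality": log2(B) + qualifying_pages,
            "range": log2(B) + qualifying_pages,
            "insert": log2(B) + B,
        },
        "clustered": {
            "scan": 1.5 * B,
            "equality": tree_clustered + qualifying_pages,
            "range": tree_clustered + qualifying_pages,
            "insert": tree_clustered + 1,
        },
        "unclustered b+ tree": {
            "scan": B * (R + 0.15),
            "equality": tree_unclustered + qualifying_records,
            "range": tree_unclustered + qualifying_records,
            "insert": 3 + tree_unclustered,
        },
        "unclustered hash": {
            "scan": B * (R + 0.125),
            "equality": 2,
            "range": 0.125 * B + qualifying_records,
            "insert": 4,
        },
    }

def estimated_time(disk_accesses, D):
    """time = #disk-accesses x D"""
    return disk_accesses * D

# Hypothetical sizes: 1000 data pages, 100 records per page, fanout 100.
table = access_costs(B=1000, R=100, F=100)
for organization, row in table.items():
    print(organization, {op: round(cost, 1) for op, cost in row.items()})
```

With these sample values, a full scan through an unclustered index costs about R times more than a scan of the heap file, while an equality search drops from B accesses to a handful.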

the moral

different queries have different costs for different file organizations

> how would you use this analysis as a db admin?

discuss

the moral

know your workload
what queries? how often? on what relations?

what file organizations?
what indexes would speed up response times for your workload?
hint: see the WHERE clause for index key candidates

> why?

what trade-offs will you face?
hint: queries are faster, but updates take time and the index takes space

we’ll see more complex cases in ‘query optimization’

indexes with composite search keys

composite search keys: search on a combination of fields

equality query: every field value is equal to a constant
e.g., age=20 and sal=75, wrt <sal,age> index

range query: some field value is not a constant
e.g., age=20; or age=20 and sal > 10, wrt <sal,age> index

data entries in the index are sorted by search key to support range queries (e.g., b+ trees)

examples of composite key indexes

data records, sorted by name

name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75

data entries, sorted by search key

<age,sal>:  11,80  12,10  12,20  13,75
<sal,age>:  10,12  20,12  75,13  80,11
<age>:      11  12  12  13
<sal>:      10  20  75  80

remember also: composite indexes are larger and updated more often

composite search keys

if the condition is 3000<sal<5000, an <age,sal> index does not help!

> why?

because the index does not match the selection condition

an index matches a selection (condition ∧ ... ∧ condition) when:

for a hash index: the selection has only equality conditions, one for every field of the search key

for a tree index: the selection includes an equality or range condition on a prefix of the search key
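The matching rule can be written down as a small check, sketched below. The function names and the encoding of a selection (a dict from field name to "=" or "range") are assumptions made for illustration.

```python
def matched_prefix(search_key, conditions):
    """Longest prefix of the search key whose fields all appear in the selection."""
    prefix = []
    for field in search_key:
        if field not in conditions:
            break
        prefix.append(field)
    return prefix

def tree_index_matches(search_key, conditions):
    """A tree index matches if the selection constrains a non-empty prefix."""
    return len(matched_prefix(search_key, conditions)) > 0

def hash_index_matches(search_key, conditions):
    """A hash index matches only if every field of the search key has an equality."""
    return all(conditions.get(field) == "=" for field in search_key)

# 3000 < sal < 5000
sal_range = {"sal": "range"}
print(tree_index_matches(("age", "sal"), sal_range))  # sal is not a prefix of <age,sal>
print(tree_index_matches(("sal", "age"), sal_range))

# age = 30 AND sal = 4000
both_equal = {"age": "=", "sal": "="}
print(hash_index_matches(("age", "sal"), both_equal))
print(hash_index_matches(("age", "sal"), {"age": "="}))
```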

composite search keys

to retrieve employee records with age=30 AND sal=4000, an index on <age,sal> or <sal,age> would be better than an index on <age> or an index on <sal>

if the condition is age=30 AND 3000<sal<5000: an <age,sal> index is much better than a <sal,age> index!

> why?

hint: it allows us to locate the answer in contiguous data entries

if the condition is 20<age<30 AND 3000<sal<5000: a tree index on <age,sal> or <sal,age> makes no difference if the selectivity of each condition is the same

the order can make a difference depending on the selectivity of each condition
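The hint can be checked on synthetic data. The sketch below builds one data entry for every combination of 20 ages and 15 salaries (made-up values), sorts the entries in both orders, and measures how many entries lie between the first and the last qualifying one. This span is only a rough proxy for the number of data entries a tree scan would touch.

```python
# Hypothetical data: one data entry per (age, sal) combination.
pairs = [(age, sal) for age in range(20, 40) for sal in range(1000, 8001, 500)]

by_age_sal = sorted(pairs)                               # <age,sal> index order
by_sal_age = sorted(pairs, key=lambda p: (p[1], p[0]))   # <sal,age> index order

def qualifies(pair):
    """age=30 AND 3000<sal<5000"""
    age, sal = pair
    return age == 30 and 3000 < sal < 5000

def scanned_span(entries):
    """Entries between the first and the last qualifying entry (inclusive)."""
    positions = [i for i, entry in enumerate(entries) if qualifies(entry)]
    return positions[-1] - positions[0] + 1 if positions else 0

span_age_sal = scanned_span(by_age_sal)
span_sal_age = scanned_span(by_sal_age)
print(span_age_sal, span_sal_age)
```

In <age,sal> order the three qualifying entries sit next to each other; in <sal,age> order they are interleaved with the entries of all other ages in the same salary range.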

index-only plans

some queries can be answered without retrieving any data records, if a suitable index is available

example

employees(name CHAR(100), depnum INTEGER, age INTEGER, salary INTEGER)

SELECT depnum, COUNT(*) FROM employees GROUP BY depnum

index on <depnum>

SELECT AVG(salary) FROM employees WHERE age=25 AND salary BETWEEN 3000 AND 5000

b+ tree index on <age,salary>
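A sketch of why these two queries need no data records: every value they use is already part of the index key. The employees instance is made up, and the indexes are modeled as sorted lists of type 2 data entries (key, rid).

```python
from collections import Counter

# Hypothetical employees instance: (name, depnum, age, salary).
employees = [
    ("ann", 1, 25, 3500),
    ("bob", 1, 25, 6000),
    ("cal", 2, 25, 4500),
    ("dee", 2, 31, 4000),
    ("eve", 3, 25, 2000),
]

# Type 2 data entries (key, rid), kept sorted by key as in a tree index.
depnum_index = sorted((e[1], rid) for rid, e in enumerate(employees))
age_salary_index = sorted(((e[2], e[3]), rid) for rid, e in enumerate(employees))

# SELECT depnum, COUNT(*) FROM employees GROUP BY depnum
# Only the keys of the <depnum> index are read; no rid is ever followed.
count_by_depnum = Counter(key for key, _rid in depnum_index)

# SELECT AVG(salary) FROM employees WHERE age=25 AND salary BETWEEN 3000 AND 5000
# Both the condition and the aggregated value are in the <age,salary> key.
salaries = [
    sal for (age, sal), _rid in age_salary_index
    if age == 25 and 3000 <= sal <= 5000
]
avg_salary = sum(salaries) / len(salaries)

print(dict(count_by_depnum), avg_salary)
```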

index-only plans

SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age=30
GROUP BY E.dno

index-only plans are possible with both <dno,age> and <age,dno>

a tree index on <age,dno> is better

> why?

summary

● relational model and SQL
○ tabular representation
■ one record per row
■ schema determines names and types of columns
○ simple, intuitive querying language
■ statements to select records that satisfy a condition
■ specify columns to project
■ statements to insert and delete tuples

● storage
○ a DBMS might use different file organizations to store relations
○ heap file, sorted file, index
○ different queries have different access costs for different file organizations
○ having the right index can make a big difference in execution time

● commonly used indexes
○ B+ tree and hash-based index

next

b+ trees and hash-based index

external sorting

joins

query optimization


references

● “cowbook”: database management systems, by ramakrishnan and gehrke
● “elmasri”: fundamentals of database systems, by elmasri and navathe
● other database textbooks

● disk access analysis
○ cowbook, chapter 8

● b+ tree and hashing algorithms
○ elmasri
■ section 18.2: hash indexes
■ section 18.3.2: b+ trees
○ cowbook
■ chapters 10 and 11

credits

slides based on material from database management systems, by ramakrishnan and gehrke


joins

students

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

course

sid    points  grade
53666  92      A
53688  35      D
53650  65      C

> what does this compute?

SELECT S.name, C.grade FROM Students S, Course C WHERE S.sid = C.sid AND C.points > 60

S.name       C.grade
Sam Jones    A
Jon Edwards  C
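The join can be run as a minimal sketch with Python's sqlite3 module. SQLite column types are an assumption here, and students is reduced to the two columns the query uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE Course (sid INTEGER, points INTEGER, grade TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [(53666, "Sam Jones"), (53688, "Alice Smith"), (53650, "Jon Edwards")],
)
conn.executemany(
    "INSERT INTO Course VALUES (?, ?, ?)",
    [(53666, 92, "A"), (53688, 35, "D"), (53650, 65, "C")],
)

# Pair every student with the course rows of the same sid,
# then keep only the pairs with more than 60 points.
result = conn.execute(
    "SELECT S.name, C.grade FROM Students S, Course C "
    "WHERE S.sid = C.sid AND C.points > 60"
).fetchall()

print(sorted(result))
```

The query returns the names and grades of the students with more than 60 points; Alice Smith has 35 points and is filtered out.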

index-only plans

SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age>30
GROUP BY E.dno

> what if we consider the second query?

we’ll come back to this after external sorting
