
Module 2: Storing Data: Disks and Files

Module Outline

2.1 Memory hierarchy

2.2 Disk space management

2.3 Buffer manager

2.4 File and record organization

2.5 Page formats

2.6 Record formats

2.7 Addressing schemes

[Figure: DBMS architecture — Web Forms / Applications / SQL Interface issue SQL commands to the Query Processor (Parser, Optimizer, Operator Evaluator, Plan Executor), which works on Files and Index Structures on top of the Buffer Manager and the Disk Space Manager; the Transaction Manager, Lock Manager, and Recovery Manager handle Concurrency Control; the Database holds Index Files, Data Files, and the System Catalog.]

2.1 Memory hierarchy

- Memory in off-the-shelf computer systems is arranged in a hierarchy (requests travel from the CPU downwards):

      CPU cache (L1, L2)    primary
      Main memory (RAM)     primary
      Magnetic disk         secondary
      Tape, CD-ROM, DVD     tertiary

- Cost of primary memory ≈ 100 × cost of secondary storage space of the same size.
- The size of the address space in primary memory (e.g., 2^32 byte = 4 GB) may not be sufficient to map the whole database (we might even have more than 2^32 records).
- The DBMS needs to make data persistent across DBMS (or host) shutdowns or crashes; only secondary/tertiary storage is nonvolatile.

The DBMS needs to bring in data from lower levels in the memory hierarchy as needed for processing.

2.1.1 Magnetic disks

- Tapes store vast amounts of data (≥ 100 GB; more for robotic tape farms) but they are sequential devices.
- Magnetic disks (hard disks) allow direct access to any desired location; hard disks dominate database system scenarios by far.

[Figure: hard disk anatomy — platters with concentric tracks, a cylinder of tracks, disk heads on a moving disk arm; arm movement and platter rotation indicated.]

1. Data on a hard disk is arranged in concentric rings (tracks) on one or more platters,
2. tracks can be recorded on one or both surfaces of a platter,
3. the set of tracks with the same diameter forms a cylinder,
4. an array (the disk arm) of disk heads, one per recorded surface, is moved as a unit,
5. a stepper motor moves the disk heads from track to track; the platters rotate steadily.

[Figure: a track divided into arc-shaped sectors; a block spans one or more sectors.]

1. Each track is divided into arc-shaped sectors (a characteristic of the disk's hardware),
2. data is written to and read from disk block by block (the block size is set to a multiple of the sector size when the disk is formatted),
3. typical disk block sizes are 4 KB or 8 KB.

Data blocks can only be written and read if disk heads and platters are positioned accordingly. This has implications on the disk access time:

1. Disk heads have to be moved to the desired track (seek time),
2. the disk controller waits for the desired block to rotate under the disk head (rotational delay),
3. the disk block data has to actually be written/read (transfer time).

access time = seek time + rotational delay + transfer time


Access time for the IBM Deskstar 14GPX
- 3.5 inch hard disk, 14.4 GB capacity
- 5 platters of 3.35 GB of user data each; platters rotate at 7200/min
- average seek time 9.1 ms (min: 2.2 ms [track-to-track], max: 15.5 ms)
- average rotational delay 4.17 ms
- data transfer rate 13 MB/s

access time for an 8 KB block ≈ 9.1 ms + 4.17 ms + 8 KB / (13 MB/s) ≈ 13.87 ms

N.B. Accessing a main memory location typically takes < 60 ns.
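To double-check the arithmetic, here is a minimal Python sketch of the access time formula, using the Deskstar figures from above (1 MB = 1024 × 1024 bytes assumed):

def access_time_ms(block_bytes, seek_ms, rot_delay_ms, mb_per_s):
    # access time = average seek time + average rotational delay + transfer time
    transfer_ms = block_bytes / (mb_per_s * 1024 * 1024) * 1000
    return seek_ms + rot_delay_ms + transfer_ms

# 8 KB block, 9.1 ms seek, 4.17 ms rotational delay, 13 MB/s transfer rate
print(round(access_time_ms(8 * 1024, 9.1, 4.17, 13), 2))   # -> 13.87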

- The unit of data transfer between disk and main memory is a block;
- if a single item (e.g., a record, an attribute) is needed, the whole containing block must be transferred:

Reading or writing a disk block is called an I/O operation. The time for I/O operations dominates the time taken for database operations.

- DBMSs take the geometry and mechanics of hard disks into account.
- Current disk designs can transfer a whole track in one platter revolution; the active disk head can be switched after each revolution.
- This implies a closeness measure for data records r1, r2 on disk:

1. place r1 and r2 inside the same block (single I/O operation!),
2. place r2 inside a block adjacent to r1's block on the same track,
3. place r2 in a block somewhere on r1's track,
4. place r2 in a track of the same cylinder as r1's track,
5. place r2 in a cylinder adjacent to r1's cylinder.

2.1.2 Accelerating Disk-I/O

Goals
- reduce the number of I/Os:
  - DBMS buffer, physical DB design
- reduce the duration of I/Os:
  - access neighboring disk blocks (clustering) and bulk I/O:
    - advantage: optimized seek time, optimized rotational delay, minimized overhead (e.g., interrupt handling)
    - disadvantage: I/O path busy for a long time (concurrency!)
    - bulk I/Os can be implemented on top of or inside the disk controller
    ... used for mid-sized data access (prefetching, sector buffering)
  - different I/O paths (declustering) with parallel access:
    - advantage: parallel I/Os, transfer time minimized by multiple bandwidth
    - disadvantage: avg. seek time and rotational delay increased, more hardware needed, blocking of parallel transactions
    - advanced hardware or disk arrays (RAID systems)
    ... used for large-size data access

Performance gains with parallel I/Os

Partition files into equally sized areas of consecutive blocks ("striping").

[Figure: intra-I/O parallelism vs. inter-I/O parallelism]

The striping unit (# of logically consecutive bytes on one disk) determines the degree of parallelism for a single I/O and the degree of parallelism between different I/Os:
- small chunks: high intra-access parallelism, but many devices busy ⇒ not many I/Os in parallel
- large chunks: low intra-access parallelism, but many I/Os in parallel
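A minimal sketch of the usual round-robin mapping from a logical block number to a (disk, block-on-disk) pair; measuring the stripe unit in blocks is an illustrative assumption:

def locate(logical_block, num_disks, stripe_unit=1):
    # Round-robin striping: consecutive stripe units go to consecutive disks.
    unit = logical_block // stripe_unit          # which stripe unit
    offset = logical_block % stripe_unit         # position inside the unit
    disk = unit % num_disks
    block_on_disk = (unit // num_disks) * stripe_unit + offset
    return disk, block_on_disk

# 4 disks, stripe unit = 1 block: logical blocks 0..4 land on disks 0,1,2,3,0.
print([locate(b, 4) for b in range(5)])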


RAID-Systems: Improving Availability

- Goal: maximize availability = MTTF / (MTTF + MTTR),
  where MTTF = mean time to failure, MTTR = mean time to repair.
- Problem: with N disks, we have an N times higher probability of problems! Thus:
  MTTDL = MTTF / N,
  where MTTDL = mean time to data loss.
- Solution: RAID = redundant array of inexpensive (independent) disks.
  Now we get
  MTTDL = (MTTF / N) · (MTTF / ((N − 1) · MTTR)),
  i.e., we only suffer from data loss if a second disk fails before the first failed disk has been replaced.
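Plugging illustrative numbers (not from the slide) into these formulas shows what the redundancy buys:

def mttdl(mttf, mttr, n):
    # Data loss requires a second failure among the remaining n-1 disks
    # while the first failed disk is being repaired.
    return (mttf / n) * (mttf / ((n - 1) * mttr))

MTTF, MTTR, N = 500_000.0, 24.0, 10      # assumed example values, in hours
print(MTTF / N)                          # without redundancy: 50,000 h
print(mttdl(MTTF, MTTR, N))              # with RAID: ~1.16e8 h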

Principle of Operation

Use data redundancy to be able to reconstruct lost data, e.g., compute parity information during normal operation
... here ⊕ denotes logical "xor" (exclusive or).

When one of the disks fails, use the parity to reconstruct the lost data during failure recovery
... typically, one extra disk is reserved as a "hot spare" to replace the failed one immediately.
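The parity computation itself is just a bytewise XOR over the data blocks; a minimal sketch of both normal operation and failure recovery:

def xor_blocks(blocks):
    # Bytewise XOR of equally sized blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"
parity = xor_blocks([d0, d1, d2])           # computed during normal operation
recovered = xor_blocks([d0, d2, parity])    # disk holding d1 failed
assert recovered == d1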

Executing I/Os from/to a RAID System

- Read access: to read block number k from disk j, execute a read(B_k^j, disk_j) operation (B_k^j denotes block k on disk j).
- Write access: to write block number k back to disk j, we have to update the parity information, too (let p be the number of the parity disk for block k):

    ∀ i ≠ j: read(B_k^i, disk_i);
    compute the new parity block B_k^p from the contents of all B_k^i;
    write(B_k^j, disk_j);
    write(B_k^p, disk_p);

  We can do better (i.e., more efficiently), though, since the old contents of B_k^j are at hand anyway:

    read(B_k^p, disk_p);
    compute the new parity block B_k^p := B_k^p ⊕ B_k^j (old) ⊕ B_k^j (new);
    write(B_k^j, disk_j);
    write(B_k^p, disk_p);

- Write access to blocks k on all disks i ≠ p:

    compute the new parity block B_k^p from the contents of all B_k^i, i ≠ p;
    ∀ i: write(B_k^i, disk_i);

- Reconstruction of block k on a failed disk j (let r be the number of the replacement disk):

    ∀ i ≠ j: read(B_k^i, disk_i);
    reconstruct B_k^j as "parity" from all B_k^i, i ≠ j;
    write(B_k^j, disk_r);
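A sketch of the cheaper read-modify-write parity update shown above; the read and write callables stand in for real block I/O on the named disk:

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(read, write, k, j, p, new_data):
    # 2 reads + 2 writes, independent of the number of disks in the array:
    # new parity = old parity XOR old data XOR new data.
    old_data = read(k, j)
    old_parity = read(k, p)
    write(k, j, new_data)
    write(k, p, xor(xor(old_parity, old_data), new_data))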


Recovery Strategies

- off-line: if a disk has failed, suspend normal I/O; reconstruct all blocks of the failed disk and write them to the replacement disk; only afterwards resume normal I/O traffic.
- on-line (needs a hot spare disk): resume normal I/O immediately
  - start reconstructing all blocks not yet reconstructed since the crash (in the background);
  - allow parallel "normal" writes: write the block to the replacement disk and update the parity;
  - allow parallel "normal" read I/O:
    - if the block has not yet been repaired: reconstruct the block
    - if the block has already been reconstructed: read the block from the replacement disk or reconstruct it (load balancing decision)

N.B. We can even wait with all reconstruction until the first "normal" read access!

2.1.3 RAID-Levels

There are a number of variants (→ RAID levels) differing w.r.t. the following characteristics:
- striping unit (data interleaving)
  - how to scatter (primary) data across the disks?
  - fine (bits/bytes) or coarse (blocks) grain?
- how to compute and distribute redundant information?
  - what kind of redundant information (parity, ECCs)?
  - where to allocate the redundant information (separate/few disks, or all disks of the array)?

5 RAID levels were introduced originally; more levels have been defined later.

- RAID Level 0: no redundancy, just striping
  - least storage overhead
  - no extra effort for write access
  - not the best read performance!
- RAID Level 1: mirroring
  - doubles the necessary storage space
  - doubles write accesses
  - optimized read performance due to the alternative I/O path
- RAID Level 2: memory-style ECC
  - compute error-correcting codes for the data of n disks
  - store them onto n − 1 additional disks
  - failure recovery: determine the lost disk by using the n − 1 extra disks; correct (reconstruct) its contents from 1 of those

- RAID Level 3: bit-interleaved parity
  - one parity disk suffices, since the controller can easily identify the faulty disk!
  - distribute (primary) data bit-wise onto the data disks
  - read and write accesses go to all disks; therefore no inter-I/O parallelism, but high bandwidth
- RAID Level 4: block-interleaved parity
  - like RAID 3, but distribute data block-wise (variable block size)
  - a small read I/O goes to only one disk
  - bottleneck: all write I/Os go to the one parity disk
- RAID Level 5: block-interleaved striped parity
  - like RAID 4, but distribute the parity blocks across all disks → load balancing
  - best performance for small and large reads as well as large write I/Os
  - variants w.r.t. the distribution of blocks

More recent levels combine aspects of the ones listed here, or add multiple parity blocks, e.g., RAID 6: two parity blocks per group.


[Figure: schematic comparison of the RAID levels — non-redundant striping (0), mirroring (1), memory-style ECC (2), bit-interleaved parity (3), block-interleaved parity (4), block-interleaved striped parity (5); data interleaving on the byte vs. the block level; shading = redundant info.]

Parity groups

Parity is not necessarily computed across all disks within an array; it is possible to define parity groups (of the same or of different sizes).

[Figure: parity groups across five disks — each group spans blocks on a subset of the disks, and the corresponding parity block (parity 1, parity 2, ...) is placed on a varying disk, spreading the parity load across the array.]

Selecting RAID levels

- RAID level 0: improves overall performance at lowest cost; no provision against data loss; best write performance, since there is no redundancy.
- RAID levels 0+1 (aka level 10): superior to level 1; the main application area is small storage subsystems, sometimes also write-intensive applications.
- RAID level 1: most expensive version; typically the two necessary I/Os for writes are serialized to avoid data loss in case of power failures, etc.
- RAID levels 2 and 4: are always inferior to levels 3 and 5, respectively. Level 3 is appropriate for workloads with large requests for contiguous blocks; bad for many small requests of a single block.
- RAID level 5: a good general-purpose solution. Best performance (with redundancy) for small and large reads as well as large write requests.
- RAID level 6: the choice for a higher level of reliability.

RAID logic can be implemented inside the disk subsystem/controller ("hardware RAID") or in the OS ("software RAID").

2.2 Disk space management

[Figure: DBMS architecture as before; "You are here!" marks the Disk Space Manager.]

- The disk space manager (DSM) encapsulates the gory details of hard disk access for the DBMS;
- the DSM talks to the disk controller and initiates I/O operations;
- once a block has been brought in from disk, it is referred to as a page (disk blocks and pages are of the same size);
- sequences of data pages are mapped onto contiguous sequences of blocks by the DSM.
- The DBMS issues allocate/deallocate and read/write commands to the DSM,
- which, internally, uses a mapping block-# ↔ page-# to keep track of page locations and block usage.


2.2.1 Keeping track of free blocks

- During database (or table) creation it is likely that blocks indeed can be arranged contiguously on disk.
- Subsequent deallocations and new allocations, however, will in general create holes.
- To reclaim space that has been freed, the disk space manager uses either
  - a free block list:
    1. keep a pointer to the first free block in a known location on disk,
    2. when a block is no longer needed, append/prepend this block to the free block list for future use,
    3. next pointers may be stored in the disk blocks themselves,
  - or a free block bitmap:
    1. reserve a block whose bytes are interpreted bit-wise (bit n = 0: block n is free),
    2. toggle bit n whenever block n is (de-)allocated.
- Free block bitmaps allow for fast identification of contiguous sequences of free blocks.
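A minimal sketch of the bitmap variant, including the scan for a contiguous run of free blocks just mentioned (bit n = 0 meaning "block n is free", as on the slide):

class FreeBlockBitmap:
    def __init__(self, num_blocks):
        self.bits = bytearray((num_blocks + 7) // 8)   # all bits 0 = all free
        self.num_blocks = num_blocks

    def set_allocated(self, n, allocated):
        if allocated:
            self.bits[n // 8] |= 1 << (n % 8)
        else:
            self.bits[n // 8] &= ~(1 << (n % 8)) & 0xFF

    def is_free(self, n):
        return not (self.bits[n // 8] >> (n % 8)) & 1

    def find_contiguous(self, k):
        # Return the start of the first run of k free blocks, or None.
        run = 0
        for n in range(self.num_blocks):
            run = run + 1 if self.is_free(n) else 0
            if run == k:
                return n - k + 1
        return None

bm = FreeBlockBitmap(64)
bm.set_allocated(3, True)
print(bm.find_contiguous(4))   # -> 4 (blocks 0-2 form only a run of 3)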

2.3 Buffer manager

[Figure: DBMS architecture as before; "You are here!" marks the Buffer Manager.]

Size of the database on secondary storage ≫ size of the available primary memory to hold user data.

- To scan the entire pages of a 20 GB table (SELECT ∗ FROM ...), the DBMS needs to
  1. bring in pages as they are needed for in-memory processing,
  2. overwrite (replace) such pages when they become obsolete for query processing and new pages require in-memory space.
- The buffer manager manages a collection of pages in a designated main memory area, the buffer pool;
- once all slots (frames) in this pool have been occupied, the buffer manager uses a replacement policy to decide which frame to overwrite when a new page needs to be brought in.

N.B. Simply overwriting a page in the buffer pool is not sufficient if this page has been modified after it has been brought in (i.e., the page is so-called dirty).

[Figure: buffer pool in main memory — pinPage/unpinPage operate on frames (some free, some holding disk pages); pages are read from and written back to the database on disk.]

Simple interface for a typical buffer manager

Indicate that page p is needed for further processing:

function pinPage(p):
    if buffer pool contains p already then
        pinCount(p) ← pinCount(p) + 1;
        return address of frame for p;
    select a victim frame holding some page p′ using the replacement policy;
    if dirty(p′) then
        write p′ to disk;
    read page p from disk into the selected frame;
    pinCount(p) ← 1;
    dirty(p) ← false;
    return address of frame for p;

Indicate that page p is no longer needed, as well as whether p has been modified by a transaction (d):

function unpinPage(p, d):
    pinCount(p) ← pinCount(p) − 1;
    dirty(p) ← d;
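The same interface as a compact, runnable Python sketch; FIFO victim selection stands in for the pluggable replacement policy, read_page/write_page abstract the disk, and the sketch assumes an unpinned frame exists when the pool is full:

class BufferManager:
    def __init__(self, num_frames, read_page, write_page):
        self.capacity = num_frames
        self.read_page, self.write_page = read_page, write_page
        self.frames = {}     # page id -> [data, pinCount, dirty]
        self.order = []      # FIFO order; a real system plugs in LRU, Clock, ...

    def pin_page(self, p):
        if p in self.frames:                        # already buffered
            self.frames[p][1] += 1
            return self.frames[p][0]
        if len(self.frames) == self.capacity:       # pool full: evict a victim
            victim = next(q for q in self.order if self.frames[q][1] == 0)
            if self.frames[victim][2]:              # dirty: write back first
                self.write_page(victim, self.frames[victim][0])
            del self.frames[victim]
            self.order.remove(victim)
        self.frames[p] = [self.read_page(p), 1, False]
        self.order.append(p)
        return self.frames[p][0]

    def unpin_page(self, p, dirty):
        self.frames[p][1] -= 1
        self.frames[p][2] = self.frames[p][2] or dirty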


N.B.
- The pinCount of a page indicates how many "users" (e.g., transactions) are working with that page,
- "clean" victim pages are not written back to disk,
- a call to unpinPage does not trigger any I/O operation, even if the pinCount for this page goes down to 0 (the page might become a suitable victim, though),
- a database transaction is required to properly "bracket" any page operation using pinPage and unpinPage, i.e.,

    a ← pinPage(p);
    ... read data (records) on the page at address a ...
    unpinPage(p, false);

  or

    a ← pinPage(p);
    ... read and modify data (records) on the page at address a ...
    unpinPage(p, true);

- A buffer manager typically offers at least one more interface call: flushPage(p), to force page p (synchronously) back to disk (for transaction mgmt. purposes).

Two strategic questions

1. How much precious buffer space to allocate to each of the active transactions (Buffer Allocation Problem)? Two principal approaches:
   - static assignment
   - dynamic assignment
2. Which page to replace when a new request arrives and the buffer is full (Page Replacement Problem)? Again, two approaches can be followed:
   - decide without knowledge of the reference pattern
   - presume knowledge of the (expected) reference pattern

Additional complexity is introduced when we take into account that the DBMS may manage "segments" of different page sizes:
- one buffer pool: good space utilization, but fragmentation problem
- many buffer pools: no fragmentation, but worse utilization; a global replacement/assignment strategy may get complicated

A possible solution could be to allow for set-oriented pinPages({p})-calls.

2.3.1 Buffer allocation policies

Problem: shall we allocate parts of the buffer pool to each transaction (TX), or let the replacement strategy alone decide who gets how much buffer space?

Properties of a "local" policy:
+ one TX cannot hurt others
+ TXs are treated equally
− possibly bad overall utilization of the buffer space
− some TXs may have vast amounts of buffer space occupied by "old" pages, while others experience "internal page thrashing", i.e., suffer from too little space

Problem with a "global" policy:
- Consider a TX executing a sequential read on a huge relation:
  - all page accesses are references to newly loaded pages;
  - hence, almost all other pages are likely to be replaced (following a standard replacement strategy);
  - other TXs cannot proceed without loading their pages again ("external page thrashing").

Typical allocation strategies include:
- global – one buffer pool for all transactions
- local – based on different kinds of data (e.g., catalog, index, data, ...)
- local – each transaction gets a certain fraction of the buffer pool:
  - static partitioning – assign a buffer budget once for each TX
  - dynamic partitioning – adjust a TX's buffer budget according to
    - its past reference pattern
    - some kind of semantic information

It is also possible to apply mixed strategies, e.g., have different pools working with different approaches. This complicates matters significantly, though.


Examples of dynamic allocation strategies

1. Local LRU (cf. LRU replacement, later)
   - keep a separate LRU stack for each active TX, and
   - a global freelist for pages not pinned by any TX
   Strategy:
   i. replace a page from the freelist
   ii. replace a page from the LRU stack of the requesting TX
   iii. replace a page from the TX with the largest LRU stack
2. Working Set Model (cf. operating systems' virtual memory management)
   Goal: avoid thrashing by allocating "just enough" buffer space to each TX
   Approach: observe the number of different page requests by each TX within a certain interval of time (window size τ),
   - deduce an "optimal" buffer budget from this observation,
   - allocate buffer budgets according to the ratio between those optimal sizes

Implementation of the Working Set Model

Let WS(T, τ) be the "working set" of TX T for window size τ, i.e.,

    WS(T, τ) = {pages referenced by T in the interval [now − τ, now]}.

The strategy is to keep, for each transaction Ti, its working set WS(Ti, τ) in the buffer.

Possible implementation: keep two counters, per TX and per page, respectively:
- trc(Ti) ... TX-specific reference counter,
- lrc(Ti, Pj) ... TX-specific last reference counter for each referenced page Pj.

Idea of the algorithm:
- Whenever Ti references Pj:
  → increment trc(Ti);
  → copy trc(Ti) to lrc(Ti, Pj).
- If a page has to be replaced for Ti, select among those with

    trc(Ti) − lrc(Ti, Pj) ≥ τ.
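A minimal sketch of this counter scheme for a single transaction; pages whose last reference lies at least τ references in the past have left the working set:

class WorkingSet:
    def __init__(self, tau):
        self.tau = tau
        self.trc = 0       # TX-specific reference counter
        self.lrc = {}      # page -> trc value at its last reference

    def reference(self, page):
        self.trc += 1
        self.lrc[page] = self.trc

    def replacement_candidates(self):
        return [p for p, last in self.lrc.items() if self.trc - last >= self.tau]

ws = WorkingSet(tau=3)
for p in ["A", "B", "A", "C", "D"]:
    ws.reference(p)
print(ws.replacement_candidates())   # -> ['B'] (last referenced 3 requests ago)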

2.3.2 Buffer replacement policies

- The choice of the victim frame selection (or buffer replacement) policy can considerably affect DBMS performance.
- A large number of policies exist in operating systems and database mgmt. systems.

Criteria for victim selection used in some strategies:

    References        Age of page in buffer:
    considered        none                        since last ref.    total age
    none              Random                      -                  FIFO
    last reference    -                           LRU, CLOCK         GCLOCK(V1)
    all references    LFU, GCLOCK(V2), DGCLOCK    LRD(V2)            LRD(V1)

Schematic overview of buffer replacement policies

[Figure: behavior on a reference to page A (already in buffer) vs. page C (not in buffer), and victim page selection; per-policy bookkeeping: "used" bit (CLOCK), reference count rc (possibly initialized with weights), age, global counter gc.]


Two policies found in a number of DBMSs:

1. LRU ("least recently used")
   - Keep a queue (often described as a stack) of pointers to frames.
   - In unpinPage(p, d), append p to the tail of the queue if pinCount(p) is decremented to 0.
   - To find the next victim, search through the queue from its head and find the first page p with pinCount(p) = 0.
2. Clock ("second chance")
   - Number the N frames in the buffer pool 0 ... N − 1, initialize a counter current ← 0, and maintain a bit array referenced[0 ... N − 1], initialized to all 0.
   - In pinPage(p), set referenced[p] ← 1.
   - To find the next victim, consider page current. If pinCount(current) = 0 and referenced[current] = 0, current is the victim. Otherwise, set referenced[current] ← 0, current ← (current + 1) mod N, and repeat.

Generalization: LRU(k) – take the timestamps of the last k references into account. Standard LRU is LRU(1).
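A minimal sketch of Clock victim selection as described above; pin_count and referenced are per-frame arrays, current is the clock hand, and the sketch assumes at least one frame is evictable:

def clock_victim(pin_count, referenced, current):
    n = len(pin_count)
    while True:
        if pin_count[current] == 0 and referenced[current] == 0:
            return current, (current + 1) % n   # victim and new hand position
        referenced[current] = 0                 # give the frame a second chance
        current = (current + 1) % n

# Frames 1 and 2 are unpinned; frame 1's referenced bit earns it a second
# chance, so frame 2 becomes the victim.
victim, hand = clock_victim([1, 0, 0, 2], [0, 1, 0, 0], 0)
print(victim)   # -> 2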

N.B. LRU as well as Clock are heuristics only. Any heuristic can fail miserably in certain scenarios:

A challenge for LRU
A number of transactions want to scan the same sequence of pages (e.g., SELECT ∗ FROM R) one after the other. Assume a buffer pool with a capacity of 10 pages.
1. Let the size of relation R be 10 or fewer pages. How many I/Os do you expect?
2. Let the size of relation R be 11 pages. What about the number of I/O operations in this case?

Other well-known replacement policies are, e.g.,
- FIFO ("first in, first out"), LIFO ("last in, first out"),
- LFU ("least frequently used"), MRU ("most recently used"),
- GCLOCK ("generalized clock"), DGCLOCK ("dynamic GCLOCK"),
- LRD ("least reference density"),
- WS, HS ("working set", "hot set") – see above,
- Random.

LRD – least reference density

Record the following three parameters:
- trc(t) ... total reference count of transaction t,
- age(p) ... value of trc(t) at the time of loading p into the buffer,
- rc(p) ... reference count of page p.

Update these parameters during a transaction's page references (pinPage calls). From those, compute the mean reference density of a page p at time t as:

    rd(p, t) := rc(p) / (trc(t) − age(p))   ... where trc(t) − age(p) ≥ 1

Strategy for victim selection: choose the page with the least reference density rd(p, t).
... many variants exist, e.g., for gradually disregarding old references.
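A minimal sketch of the LRD bookkeeping (for one transaction) and the resulting victim choice:

class LRD:
    def __init__(self):
        self.trc = 0      # total reference count
        self.age = {}     # page -> trc value when the page was loaded
        self.rc = {}      # page -> reference count

    def reference(self, p):
        self.trc += 1
        if p not in self.age:         # page loaded by this reference
            self.age[p] = self.trc - 1
            self.rc[p] = 0
        self.rc[p] += 1

    def victim(self):
        # Choose the page with the least reference density rd(p, t).
        return min(self.rc, key=lambda p: self.rc[p] / (self.trc - self.age[p]))

lrd = LRD()
for p in ["A", "B", "A", "A", "C"]:
    lrd.reference(p)
print(lrd.victim())   # -> 'B' (1 reference spread over 4 time units)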

Exploiting semantic knowledge

- The query compiler/optimizer ...
  - selects the access plan, e.g., sequential scan vs. index,
  - estimates the number of page I/Os for cost-based optimization.
- Idea: use this information to determine a query-specific, optimal buffer budget → Query Hot Set model.

Goals:
- optimize overall system throughput;
- avoiding thrashing is the most important goal.


Hot Set with disjoint page sets

1. Only those queries are activated whose Hot Set buffer budget can be satisfied immediately.
2. Queries with higher demands have to wait until their budget becomes available.
3. Within its own buffer budget, each transaction applies a local LRU policy.

Properties:
- No sharing of buffered pages between transactions.
- Risk of "internal thrashing" when the Hot Set estimates are wrong.
- Queries with large Hot Sets block subsequent small queries. (Or, if bypassing is permitted, many small queries can lead to starvation of large ones.)

Hot Set with non-disjoint page sets

1. Queries allocate their budget stepwise, up to the size of their Hot Set.
2. Local LRU stacks are used for replacement.
3. Request for a page p:
   (i) If found in the TX's own LRU stack: update the LRU stack.
   (ii) If found in another transaction's LRU stack: access the page, but don't update the other LRU stack.
   (iii) If found in the freelist: push the page onto the own LRU stack.
4. unpinPage: push the page onto the freelist stack.
5. Filling empty buffer frames: taken from the bottom of the freelist stack.

N.B.
- As long as a page is in a local LRU stack, it cannot be replaced.
- If a page drops out of a local LRU stack, it is pushed onto the freelist stack.
- A page is replaced only if it reaches the bottom of the freelist stack before some transaction pins it again.

Priority Hints

- Idea: with unpinPage, a transaction gives one of two possible indications to the buffer manager:
  - "preferred page" ... those are managed in a TX-local partition,
  - "ordinary page" ... managed in a global partition.
- Strategy: when a page needs to be replaced,
  1. try to replace an ordinary page from the global partition using LRU;
  2. replace a preferred page of the requesting TX according to MRU.
- Advantages:
  - much simpler than DBMIN ("Hot Set"),
  - similar performance,
  - easy to deal with "too small" partitions.

Prefetching

... when the buffer manager receives requests for (single) page(s), it may decide to (asynchronously) read ahead:
- on-demand, asynchronous read-ahead:
  e.g., when traversing the sequence set of an index, during a sequential scan of a relation, ...
- heuristic (speculative) prefetching:
  e.g., sequential n-block lookahead (cf. drive or controller buffers in hard disks), semantically determined supersets, index prefetch, ...


2.3.3 Buffer management in DBMSs vs. OSs

Buffer management for a DBMS curiously "tastes" like the virtual memory¹ concept of modern operating systems. Both techniques provide access to more data than will fit into primary memory.

So: why don't we use the OS virtual memory facilities to implement DBMSs?
- A DBMS can predict certain reference patterns for pages in a buffer a lot better than a general-purpose OS.
- This is mainly because page references in a DBMS are initiated by higher-level operations (sequential scans, relational operators) of the DBMS itself.

Reference pattern examples in a DBMS
1. Sequential scans call for prefetching.
2. Nested-loop joins call for page fixing and hating.

- Concurrency control protocols often prescribe the order in which pages are written back to disk. Operating systems usually do not provide hooks for that.

¹Generally implemented using a hardware interrupt mechanism called page faulting.

Double Buffering

If the DBMS uses its own buffer manager (within the virtual memory of the DBMS server process), independently from the OS VM manager, we may experience the following:
- Virtual page fault: the page resides in the DBMS buffer, but its frame has been swapped out of physical memory by the OS VM manager.
  An I/O operation is necessary that is not visible to the DBMS.
- Buffer fault: the page does not reside in the DBMS buffer; the frame is in physical memory.
  Regular DBMS page replacement, requiring an I/O operation.
- Double page fault: the page does not reside in the DBMS buffer, and the frame has been swapped out of physical memory by the OS VM manager.
  Two I/O operations are necessary: one to bring in the frame (OS)²; another one to replace the page in that frame (DBMS).

⇒ The DBMS buffer needs to be memory resident in the OS.

²The OS VM does not know "dirty flags", hence it brings in pages that could simply be overwritten.

2.4 File and record organization

[Figure: DBMS architecture as before; "You are here!" marks Files and Index Structures.]

- We will now turn away from page management and will instead focus on page usage in a DBMS.
- On the conceptual level, a relational DBMS manages tables of tuples³, e.g.,

      A     B      C
      ...   ...    ...
      42    true   'foo'
      ...   ...    ...

- On the physical level, such tables are represented as files of records (tuple = record); each page holds one or more records (in general, |record| ≪ |page|).
- A file is a collection of records that may reside on several pages.

³More precisely, table actually means bag here (set of elements with multiplicity ≥ 0).

2.4.1 Heap files

- The simplest file structure is the heap file, which represents an unordered collection of records.
- As in any file structure, each record has a unique record identifier (rid).
- A typical heap file interface supports the following operations:
  - create/destroy heap file f named n: createFile(n) / deleteFile(f)
  - insert record r and return its rid: insertRecord(f, r)
  - delete a record with a given rid: deleteRecord(f, rid)
  - get a record with a given rid: getRecord(f, rid)
  - initiate a sequential scan over the whole heap file: openScan(f)
- N.B. Record ids (rids) are used like record addresses (or pointers). Internally, the heap file structure must be able to map a given rid to the page containing the record.


- To support openScan(f), the heap file structure has to keep track of all pages in file f; to support insertRecord(f, r) efficiently, we need to keep track of all pages with free space in file f.
- Let us have a look at two simple structures which can offer this support.

2.4.2 Linked list of pages

- When createFile(n) is called,
  1. the DBMS allocates a free page (the file header) and writes an appropriate entry ⟨n, header page⟩ to a known location on disk;
  2. the header page is initialized to point to two doubly linked lists of pages:

     [Figure: the header page points to a linked list of full pages and a linked list of data pages with free space.]

  3. Initially, both lists are empty.

Remarks:
- For insertRecord(f, r),
  1. try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p;
  2. record r is written to page p;
  3. since generally |r| ≪ |p|, p will belong to the list of pages with free space;
  4. a unique rid for r is computed and returned to the caller.
- For openScan(f), both page lists have to be traversed.
- A call to deleteRecord(f, rid)
  1. may result in moving the containing page from the full to the free page list,
  2. or even lead to page deallocation if the page is completely free after the deletion.

Finding a page with sufficient free space ...
... is an important problem to solve inside insertRecord(f, r). How does the heap file structure support this operation? (How many pages of a file do you expect to be in the list of free pages?)

2.4.3 Directory of pages

- An alternative to the linked list approach is to maintain a directory of pages in a file.
- The header page contains the first page of a chain of directory pages; each entry in a directory page identifies a page of the file:

  [Figure: the header page chains directory pages whose entries point to the data pages.]

Remarks:
- |page directory| ≪ |data pages|
- Free space management is also done via the directory: each directory entry is actually of the form ⟨page addr p, nfree⟩, where nfree indicates the actual amount of free space (e.g., in bytes) on page p.

I/O operations and free space management
For a file of 10000 pages, give lower and upper bounds for the number of page I/O operations during an insertRecord(f, r) call for a heap file organized using
1. a linked list of pages,
2. a directory of pages (1000 directory entries/page).

linked list
lower bound: header page + first page in free list + write r = 3 page I/Os
upper bound: header page + 10000 pages in free list + write r = 10002 page I/Os

directory (1000 entries/page)
lower bound: directory header page + write r = 2 page I/Os
upper bound: 10 directory pages + write r = 11 page I/Os


2.5 Page formats

- Locating the containing data page for a given rid is not the whole story when the DBMS needs to access a specific record: the internal structure of pages plays a crucial role.
- For the following discussion, we consider a page to consist of a sequence of slots, each of which contains a record.
- A complete record identifier then has the unique form ⟨page addr, nslot⟩, where nslot denotes the slot number on the page.

2.5.1 Fixed-length records

Life is particularly easy if all records on the page (in the file) are of the same size s:
- getRecord(f, ⟨p, n⟩): given the rid ⟨p, n⟩, we know that the record is to be found at (byte) offset n × s on page p.
- deleteRecord(f, ⟨p, n⟩): copy the bytes of the last occupied slot on page p to offset n × s, mark the last slot as free; all occupied slots thus appear together at the start of the page (the page is packed).
- insertRecord(f, r): find a page p with free space ≥ s (see the previous section); copy r to the first free slot on p, mark the slot as occupied.

Packed pages and deletions:
One problem with packed pages remains, though:
- calling deleteRecord(f, ⟨p, n⟩) modifies the rid of a different record (the one moved into slot n) on the same page.
- If any external reference to this record exists, we need to chase the whole database and update its rid reference ⟨p, n′⟩ → ⟨p, n⟩. Bad!
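A minimal sketch of the packed fixed-length scheme, with a Python list standing in for the byte range of a page; note how delete gives the moved record a new slot number (and thus a new rid):

RECORD_SIZE = 16                      # s: assumed fixed record size

class PackedPage:
    def __init__(self):
        self.records = []             # slot n -> record bytes, kept packed

    def get(self, n):
        return self.records[n]       # record sits at byte offset n * RECORD_SIZE

    def insert(self, r):
        assert len(r) == RECORD_SIZE
        self.records.append(r)
        return len(self.records) - 1  # slot number, part of the rid

    def delete(self, n):
        last = self.records.pop()     # move the last record into the hole:
        if n < len(self.records):     # its rid changes from <p, n'> to <p, n>!
            self.records[n] = last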

- To avoid record copying (and thus rid modifications), we could simply use a free slot bitmap on each page:

  [Figure: packed page — N records and the number of records kept in the page header, free space behind the last slot; unpacked page with bitmap — M slots, a free slot bitmap and the number of slots kept in the page header.]

- Calling deleteRecord(f, ⟨p, n⟩) then simply means setting bit n in the bitmap to 0; no other rids are affected.

Page header or trailer?
In both page organization schemes we have positioned the page header at the end of its page. How would you justify this design decision?

2.5.2 Variable-length records

If records on a page are of varying size (cf. the SQL datatype VARCHAR(n)), we have to deal with page fragmentation:
- In insertRecord(f, r) we have to find an empty slot of size ≥ |r|; at the same time we want to try to minimize waste.
- To get rid of the holes produced by deleteRecord(f, rid), compact the remaining records to maintain a contiguous area of free space on the page.

A solution is to maintain a slot directory on each page (compare this with a heap file directory!):

[Figure: slotted page p — records fill the data area, the slot directory sits at the page end; each of its N entries holds ⟨offset of record from start of data area, length⟩; the page also stores a pointer to the start of free space and the number N of directory entries; rids are ⟨p, 0⟩, ⟨p, 1⟩, ..., ⟨p, N − 1⟩.]
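A minimal sketch of such a slotted page; offset −1 marks a deleted directory entry, and record compaction is left out for brevity:

class SlottedPage:
    def __init__(self):
        self.data = b""               # contiguous data area
        self.slots = []               # slot n -> (offset, length)

    def insert(self, r):
        for n, (off, _) in enumerate(self.slots):
            if off == -1:             # reuse a deleted directory entry
                self.slots[n] = (len(self.data), len(r))
                self.data += r
                return n
        self.slots.append((len(self.data), len(r)))
        self.data += r
        return len(self.slots) - 1

    def get(self, n):
        off, length = self.slots[n]
        return None if off == -1 else self.data[off:off + length]

    def delete(self, n):
        self.slots[n] = (-1, 0)       # no other rid on the page is affected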


Remarks:
- The slot directory contains entries ⟨offset, length⟩, where offset is measured in bytes from the data page start.
- In deleteRecord(f, ⟨p, n⟩), simply set the offset of directory entry n to −1; such an entry can be reused during subsequent insertRecord(f, r) calls which hit page p.

Directory compaction ...
... is not allowed in this scheme (again, this would modify the rids of all records ⟨p, n′⟩ with n′ > n)!
If insertions are much more common than deletions, the directory size will nevertheless be close to the actual number of records stored on the page.

N.B. Record compaction (defragmentation) is performed, of course.

2.6 Record formats

- This section zooms into the record internals themselves, thus discussing access to single record fields (conceptually: attributes). Attribute values are considered atomic by an RDBMS.
- Depending on the field types, we are dealing with fixed-length or variable-length fields in a record, e.g.:

      SQL datatype   fixed-length?   length (# of bytes)⁴
      INTEGER        yes             4
      BIGINT         yes             8
      CHAR(n)        yes             n, 1 ≤ n ≤ 254
      VARCHAR(n)     no              1 ... n, 1 ≤ n ≤ 32672
      CLOB(n)        no              1 ... n, 1 ≤ n ≤ 2^31
      DATE           yes             4
      ...

- The DBMS computes and then saves the field size information for the records of a file in the system catalog when it processes the corresponding CREATE TABLE ... command.

⁴Datatype lengths valid for DB2 V7.1.

2.6.1 Fixed-length fields

If all fields in a record are of fixed length, offsets for field access can simply be read off the DBMS system catalog (field fi of size li):

[Figure: record f1 f2 f3 f4 at base address b; field f3 starts at address b + l1 + l2.]

2.6.2 Variable-length fields

If a record contains one or more variable-length fields, other record representations are needed:
1. Use a special delimiter symbol ($) to separate the record fields. Accessing field fi then means scanning over the bytes of fields f1 ... f(i − 1);
2. for a record of n fields, use an array of n + 1 offsets pointing into the record (the last array entry marks the end of field fn):

[Figure: variant 1 — f1 $ f2 $ f3 $ f4 $; variant 2 — offset array followed by the field bytes f1 f2 f3 f4.]
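A minimal sketch of variant 2, assuming 32-bit offsets for illustration; any field can then be sliced out without scanning its predecessors:

import struct

def pack_record(fields):
    # n + 1 offsets (the last one marks the end of the last field),
    # followed by the concatenated field bytes.
    n = len(fields)
    pos = 4 * (n + 1)
    offsets = []
    for f in fields:
        offsets.append(pos)
        pos += len(f)
    offsets.append(pos)
    return struct.pack(f"<{n + 1}I", *offsets) + b"".join(fields)

def get_field(record, i):
    # Field i lives between offsets i and i + 1.
    lo, hi = struct.unpack_from("<2I", record, 4 * i)
    return record[lo:hi]

rec = pack_record([b"42", b"true", b"foo"])
print(get_field(rec, 2))   # -> b'foo', via two offset lookups only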

Final remarks:
Variable-length record formats seem to be more complicated but, among other advantages, they allow for a compact representation of SQL NULL values (field f3 is NULL below):

[Figure: f1 $ f2 $ $ f4 — the empty span between adjacent delimiters (or two equal consecutive offsets) encodes the NULL value of f3.]

Growing a record
Consider an update on a field (e.g., of type VARCHAR(n)) which lets the record grow beyond the size of the free space on its containing page.
How could the DBMS handle this situation efficiently?

Really growing a record
For fields of type VARCHAR(n) or CLOB(n) with n > |page size| we are in trouble whenever a record actually grows beyond the page size (the record won't fit on any one page).
How could the DBMS file manager cope with this?


2.7 Addressing schemes

What makes a "good" record id (rid)?
- given a rid, it should ideally not take more than 1 page I/O to get to the record itself
- rids should be stable under all circumstances, such as
  - a record being moved within a page
  - a record being moved across pages

Why are these goals important to achieve?
Consider the fact that rids are used as "persistent pointers" in a DBMS (indexes, directories, implementation of CODASYL sets, ...).

Conflicting goals!
Efficiency calls for a "direct disk address", while stability calls for some kind of indirection.

Direct addressing
- RBA – relative byte address:
  Consider the disk file as a persistent virtual address space and use the byte offset as rid.
  pro: very efficient access to the page and to the record within the page
  con: no stability at all w.r.t. moving records
- PP – page pointers:
  Use disk page numbers as rid.
  pro: very efficient access to the page; locating the record within the page is cheap (an in-memory operation)
  con: stable w.r.t. moving records within a page, but not when moving across pages

Indirect addressing
- LSN – logical sequence numbers:
  Assign logical numbers to records. An address translation table maps them to PPs (or even RBAs).
  pro: full stability w.r.t. all relocations of records
  con: additional I/O to the translation table (often in the buffer)
  CODASYL systems call this the "DBTT" (database key translation table):

  [Figure: DBTT mapping logical record numbers to page pointers.]

Indirect addressing – fancy variant
- LSN/PPP – LSN with probable page pointers:
  Try to avoid the extra I/O by adding a "probable" PP (PPP) to LSNs. The PPP is the PP at the time of insertion into the database. If the record is moved across pages, PPPs are not updated!
  pro: full stability w.r.t. all record relocations; the PPP can save the extra I/O to the translation table, iff it is still correct
  con: 2 additional page I/Os in case the PPP is no longer valid: the "old" page to notice that the record has moved, a second I/O to the translation table to look up the new page number


TID addressing
- TID – tuple identifier with forwarding:
  Use a ⟨PP, Slot#⟩ pair as rid (see above). To guarantee stability, leave a forward address on the original page if a record has to be moved across pages.
  For example, access the record with the given rid = ⟨17, 2⟩:

  [Figure: page 17, slot 2 holds a forward address pointing to the record's new page and slot.]

  Avoid chains of forward addresses!
  When the record has to be moved again: do not leave another forward address; rather, update the forward on the original page!

  pro: full stability w.r.t. all relocations of records; no extra I/O due to indirection
  con: 1 additional page I/O in case of a forward pointer on the original page
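A minimal sketch of TID lookup with forwarding; read_page stands in for a real page read, and a slot holds either the record or a one-hop forward address:

def get_record(read_page, rid):
    page_no, slot_no = rid
    kind, payload = read_page(page_no)[slot_no]        # 1st page I/O
    if kind == "forward":                              # record moved across pages
        fwd_page, fwd_slot = payload
        kind, payload = read_page(fwd_page)[fwd_slot]  # 2nd and last page I/O
    return payload

pages = {
    17: {2: ("forward", (23, 0))},                 # original page left a forward
    23: {0: ("record", b"...payload...")},
}
print(get_record(lambda n: pages[n], (17, 2)))     # -> b'...payload...'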
