Storage and File Structure
1. Classification of physical storage media
2. Storage access
3. File organization
4. Indexing
5. B+-trees
6. Static hashing
Dec 31, 2015
1. Classification of Physical Storage Media
Criteria:
speed with which data can be accessed
cost per unit of data
reliability
data loss on power failure or system crash
physical failure of the storage device
Volatile storage: loses contents when power is switched off
Non-volatile storage: contents persist even when power is switched off
Includes secondary and tertiary storage, as well as battery-backed main memory
Cache: the fastest and most costly form of storage
volatile
managed by the hardware/operating system
Main memory: sometimes referred to as core memory
volatile: contents of main memory are usually lost if a power failure or system crash occurs
general-purpose machine instructions operate on data resident in main memory
fast access, but generally too small to store the entire database
Flash memory: non-volatile memory (data survives power failure)
reads are roughly as fast as main memory
can support only a limited number of write/erase cycles
Magnetic-disk storage
primary medium for the long-term storage of data
typically stores entire database
data must be moved from disk to main memory for access and written back for storage
direct access: possible to read data on disk in any order
usually survives power failures and system crashes
disk failure can destroy data, but is much less frequent than system crashes
Optical storage: non-volatile
the most popular: CD-ROM
write-once, read-many (WORM) optical disks: used for archival storage
Tape storage:
non-volatile
used primarily for
backup (to recover from disk failure)
archival data
sequential access: much slower than direct-access disk
very high capacity (5 GB is common)
tapes can be removed from the drive, so storage costs are much cheaper than disk
Storage hierarchy (fastest to slowest):
cache
main memory
flash memory
magnetic disk
optical disk
magnetic tape
Primary storage: fastest media but volatile
CACHE, MAIN MEMORY
Secondary storage: moderately fast access time, non-volatile
also called on-line storage
FLASH MEMORY, MAGNETIC DISKS
Tertiary storage: slow access time, non-volatile
also called off-line storage
MAGNETIC TAPES, OPTICAL STORAGE
Magnetic disks:
Read-write head: device positioned close to the platter surface
reads or writes magnetically encoded information
Surface of platter is divided into circular tracks
Each track is divided into sectors
A sector is the smallest unit of data that can be read or written
Cylinder j consists of the j-th track of all the platters
Head-disk assemblies: multiple disk platters on a single spindle, with multiple heads (one per platter) mounted on a common arm
To read/write a sector:
disk arm swings to position head on the right track
platter spins continually; data is read/written when sector comes under head
[Figure: disk mechanism — spindle, platters, tracks, sectors, cylinders, arm assembly, and read-write heads]
Disk subsystem
[Figure: disks attached through a disk controller to the system bus]
Disk controller:
Interfaces between the computer system and the disk drive hardware
Accepts high-level commands to read or write a sector
Initiates actions such as moving the disk arm to the right track and actually reading or writing the data
Performance measures of disks
Access time: the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
seek time: time it takes to reposition the arm over the correct track. Average seek time is one-third of the worst-case seek time.
rotational latency: time it takes for the sector to be accessed to appear under the head. Average latency is one-half of the worst-case latency.
Data-transfer rate: the rate at which data can be retrieved from or stored to the disk
Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure
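The two access-time components above can be combined into a back-of-the-envelope estimate; a minimal Python sketch, with hypothetical drive parameters:

```python
# Rough estimate of average disk access time from the measures above.
# The drive parameters (18 ms worst-case seek, 7200 RPM) are hypothetical.
def avg_access_time_ms(worst_seek_ms, rpm):
    seek = worst_seek_ms / 3            # average seek ~ 1/3 of worst case
    full_rotation_ms = 60_000 / rpm     # one revolution in milliseconds
    latency = full_rotation_ms / 2      # average latency ~ 1/2 rotation
    return seek + latency

print(round(avg_access_time_ms(18, 7200), 2))  # 10.17 (6.0 seek + 4.17 latency)
```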
Optimisation of Disk-Block Access
Block: a contiguous sequence of sectors from a single track
data is transferred between disk and main memory in blocks; sizes range from 512 bytes to several KB
Disk-arm-scheduling algorithms order accesses to tracks so that disk arm movement is minimized (e.g. the elevator algorithm)
File organization: optimise block access time by organizing the blocks to correspond to how data will be accessed. Store related information on the same or nearby cylinders
Non-volatile write buffers: speed up disk writes by writing blocks to a non-volatile RAM buffer immediately. The controller then writes to disk whenever the disk has no other requests
Log disk: a disk devoted to writing a sequential log of block updates; this eliminates seek time. Used like non-volatile RAM
RAID = Redundant Arrays of Inexpensive Disks
a disk organization technique that takes advantage of large numbers of inexpensive, mass-market disks
originally a cost-effective alternative to large, expensive disks
today RAIDs are used for their higher reliability and bandwidth rather than for economic reasons, hence the I is now interpreted as independent instead of inexpensive
The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
For instance, a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of only 1,000 hours (approx. 41 days)
Improvement of Reliability via Redundancy
Redundancy: store extra information that can be used to rebuild information lost in a disk failure.
EX: Mirroring (Shadowing): duplicate every disk.
Logical disk consists of two physical disks
every write is carried out on both disks
if one disk in a pair fails, data is still available on the other
Improvement in Performance via Parallelism
Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time
Improve transfer rate by striping data across multiple disks:
1. Bit-level striping: split the bits of each byte across multiple disks
in an array of 8 disks, write bit j of each byte on disk j
each access can read data at 8 times the rate of a single disk
but seek/access time worse than for a single disk
2. Block-level striping: with n disks, block j of a file goes to disk (j mod n) + 1
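The block-placement rule above can be sketched in a couple of lines; the disk count is illustrative:

```python
# Block-level striping: with n disks, block j of a file goes to
# disk (j mod n) + 1 (disks numbered 1..n), as described above.
def disk_for_block(j, n):
    return (j % n) + 1

# With 4 disks, blocks 0..7 round-robin across disks 1..4:
print([disk_for_block(j, 4) for j in range(8)])  # [1, 2, 3, 4, 1, 2, 3, 4]
```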
RAID levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
Different RAID organizations (RAID levels) have differing cost, performance and reliability characteristics
Level 0: striping at the level of blocks
non-redundant
used in high-performance applications where data loss is not critical
Level 1: mirrored disks
offers best write performance
popular for applications such as storing log files in a database system
Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping
Level 3: Bit-Interleaved Parity: a single parity bit can be used for error correction, not just detection
When writing data, parity bit must also be computed and written
faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O
subsumes Level 2 (provides all its benefits, at lower cost)
Level 4: Block-Interleaved Parity
uses block-level striping
keeps a parity block on a separate disk for corresponding blocks from N other disks
provides higher I/O rates for independent block reads than Level 3 (a block read goes to a single disk, so blocks stored on different disks can be read in parallel)
provides high transfer rates for reads of multiple blocks
however, parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
Level 5: Block-Interleaved Distributed Parity
partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk
e.g. with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks
higher I/O rates than level 4. (block writes occur in parallel if the blocks and their parity blocks are on different disks).
Subsumes Level 4
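The parity placement and recovery idea behind Level 5 can be sketched as follows; the block contents are made-up bytes, for illustration only:

```python
# Sketch of RAID 5 distributed parity: with 5 disks, the parity block for
# the n-th set of blocks lives on disk (n mod 5) + 1 (as in the example
# above). Parity is the bytewise XOR of the set's data blocks, so any one
# lost block can be rebuilt by XOR-ing the surviving blocks with the parity.
def parity_disk(n, num_disks=5):
    return (n % num_disks) + 1

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"\x01\x02", b"\x0f\x10", b"\xff\x00", b"\x10\x20"]
parity = xor_blocks(data)
# lose data[2]; rebuild it from the other data blocks plus the parity block
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
print(rebuilt == data[2])  # True
```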
Level 6: P + Q Redundancy Scheme
similar to Level 5, but stores extra redundant information to guard against multiple disk failures
better reliability than Level 5, at a higher cost
not used as widely
Optical disks
Compact disk read-only memory (CD-ROM)
disks can be loaded into or removed from a drive
high storage capacity (500 MB)
high seek times and latency
lower data-transfer rates than magnetic disks
Digital Video Disk (DVD): a newer optical format
holds 4.7 to 17 GB
WORM disk (write-once, read-many)
can be written using the same drive from which it is read
high capacity and long lifetime
used for archival storage
WORM jukebox
Magnetic Tapes
hold large volumes of data (5 GB usual)
currently the cheapest storage medium
very slow access time in comparison to magnetic and optical disks
limited to sequential access
used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another
tape jukeboxes used for very large capacity (terabyte, 10^12, to petabyte, 10^15, storage)
2. Storage Access
Block: a database file is partitioned into fixed-length storage units
unit of both storage allocation and data transfer
DBMS seeks to minimize the number of block transfers between the disk and memory by keeping as many blocks as possible in main memory
Buffer: portion of main memory available to store copies of disk blocks
Buffer manager: subsystem responsible for allocating buffer space in main memory
Buffer Manager
Programs call on the BM when they need a block from disk
if it is already present in the buffer, the requesting program is given the address of the block in main memory
if the block is not in the buffer, the BM allocates space in the buffer for the block, replacing (throwing out) some other block, if required, to make space for the new block
the block that is thrown out is written back to disk only if it was modified since the most recent time it was written to/fetched from the disk
once space is allocated in the buffer, the BM reads in the block from the disk to the buffer, and passes the address of the block in main memory to the requester
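The buffer manager logic above can be sketched in a few lines; here the disk is simulated with a plain dict and LRU is used as the replacement policy, both purely for illustration:

```python
# Minimal sketch of the buffer manager described above: blocks are served
# from an in-memory buffer; on a miss, an LRU victim is evicted and written
# back to disk only if it is dirty (was modified while buffered).
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # simulated disk: block_id -> bytes
        self.buffer = OrderedDict()   # block_id -> [data, dirty_flag]

    def get_block(self, block_id):
        if block_id in self.buffer:   # hit: return the in-memory copy
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id][0]
        if len(self.buffer) >= self.capacity:   # miss: evict LRU victim
            victim, (data, dirty) = self.buffer.popitem(last=False)
            if dirty:                 # write back only if modified
                self.disk[victim] = data
        self.buffer[block_id] = [self.disk[block_id], False]
        return self.buffer[block_id][0]

    def write_block(self, block_id, data):
        self.get_block(block_id)      # make sure the block is buffered
        self.buffer[block_id] = [data, True]

disk = {1: b"a", 2: b"b", 3: b"c"}
bm = BufferManager(capacity=2, disk=disk)
bm.write_block(1, b"A")
bm.get_block(2)
bm.get_block(3)          # evicts block 1, flushing the modified copy
print(disk[1])  # b'A'
```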
Buffer-Replacement Policies
Most operating systems replace the least recently used block (LRU)
LRU uses the past pattern of block references as a predictor of future references
Queries have well-defined access patterns (such as sequential scans), and a DBMS can use the information in a user’s query to predict future references
LRU can be a bad strategy for certain access patterns involving repeated scans of data
A mixed strategy, with hints on the replacement strategy provided by the query optimizer, is preferable
Pinned block: a memory block that is not allowed to be written back to disk
Toss-immediate strategy: frees the space occupied by a block as soon as the final tuple of that block has been processed
Most Recently Used (MRU) strategy: the system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the MRU block.
The buffer manager (BM) can use statistical information regarding the probability that a request will reference a particular relation.
EX: the data dictionary is frequently accessed
Heuristic: keep data-dictionary blocks in the main-memory buffer
3. File Organization
The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
To delete record j, alternatives:
1. Move records j+1, ..., n to j, ..., n-1
2. Move record n to j
3. Link all free records on a free list
The simplest approach (fixed-length records):
record size is fixed, say L bytes
store record j starting from byte L*(j-1)
record access is simple, but records may cross blocks
each file has records of one particular type only
different files are used for different relations
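The fixed-length layout above reduces record lookup to one multiplication; the record size used below is illustrative:

```python
# With fixed-length records of L bytes, record j (counting from 1)
# starts at byte offset L * (j - 1), as described above.
def record_offset(j, record_size):
    return record_size * (j - 1)

# 40-byte records: record 1 starts at byte 0, record 4 at byte 120
print(record_offset(1, 40), record_offset(4, 40))  # 0 120
```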
Free list
Store the address of the first record whose contents are deleted in the file header
Use this first record to store the address of the second available record, and so on
Can think of these stored addresses as pointers since they “point” to the location of a record
[Figure: file of account records with a free list — the file header points to the first deleted record, each deleted record points to the next free one, and the last free-list pointer is nil; in-use records hold account data such as (John, A-100, 300)]
More space-efficient representation: reuse the space for normal attributes of free records to store pointers (no pointers are stored in in-use records)
Dangling pointers occur if we move or delete a record to which another record contains a pointer
That pointer no longer points to the desired record
Avoid moving or deleting records that are pointed to by other records.
Such records are pinned.
Variable-Length Records
Arise in database systems in several ways:
storage of multiple record types in a file
record types that allow variable lengths for one or more fields
record types that allow repeating fields (used in some older data models)
Byte-string representation:
attach an end-of-record control character to the end of each record
difficulty with deletion
difficulty with growth
Header contains:
number of record entries
end of free space in the block
location and size of each record
Page structure:
[Figure: slotted-page layout — header (number of entries, end-of-free-space pointer, location and size of each record), free space in the middle, and the records packed at the end of the block]
Records can be moved around within a page to keep them contiguous, with no empty space between them; the corresponding entry in the header must then be updated
Pointers should not point directly to a record; instead they should point to the entry for the record in the header
Fixed-length representation:
reserved space: use fixed-length records of a known maximum length
unused space in shorter records is filled with a null or end-of-record symbol
pointers: a variable-length record is represented by a list of fixed-length records, chained together via pointers
Disadvantage to pointer structure:
space is wasted in all records, except the first in a chain
Solution is to allow two kinds of block in a file:
1. Anchor block: contains the first record of a chain
2. Overflow block: contains records other than those that are the first records of chains
Organization of Records in Files
Heap: a record can be placed anywhere in the file where there is space
Sequential: store records in sequential order, based on the value of the search key of each record
Hashing: a hash function is computed on some attribute of each record
the result specifies in which block of the file the record should be placed
Clustering: records of several different relations can be stored in the same file
related records are stored on the same block
Sequential File Organization
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
INSERTION: must locate the position in the file where the record is to be inserted:
if there is free space, insert there
if no free space, insert the record in an overflow block
in either case, pointer chain must be updated
DELETION: use pointer chains
Need to reorganize the file from time to time to restore sequential order
[Figure: sequential account file ordered by branch name — Brighton, Downtown, Downtown, Linberg, Perryton, Perryton, Redwood, Roundhill — with each record pointing to the next in search-key order]
Clustering File Organization
Simple file structure: stores each relation in a separate file
Alternative: store several relations in one file using a clustering file organization
Good for queries involving all relations
Bad for queries involving a single relation
Results in variable size records
Data Dictionary Storage (System Catalog)
It stores metadata (data about data):
information about relations
names of relations, names and types of attributes, physical file organization structure, statistical data (e.g. number of tuples in each relation)
integrity constraints
view definitions
user and accounting information
information about indices
Catalog structure: can use either
specialized data structures designed for efficient access, OR
a set of relations, with existing system features used to ensure efficient access (preferred)
EXAMPLE.
System-catalog-schema = (relation-name, number-of-attributes)
Attribute-schema = (attribute-name, relation-name, domain-type, position, length)
User-schema = (user-name, encrypted-password, group)
Index-schema = (index-name, relation-name, index-type, index-attributes)
View-schema = (view-name, definition)
4. Indexing
Indexing mechanisms are used to speed up access to desired data
ex: author catalog in a library
Search key: (set of) attribute(s) used to look up records in a file
Index file: consists of records (index entries) of the form (search-key, pointer)
Index files are typically much smaller than the original file
Two basic kinds of indices:
ordered indices: search keys are stored in sorted order
hash indices: search keys are distributed uniformly across “buckets” using a “hash function”
Index Evaluation metrics
Indexing techniques are evaluated on basis of:
• access types supported efficiently, e.g. records with a specified value in an attribute, or records with an attribute value falling in a specified range of values
• access time
• insertion time
• deletion time
• space overhead
Ordered Indices
Index entries are stored sorted on the search-key value
Primary index: in a sequentially ordered file:
the index whose search key specifies the sequential order of the file
also called clustering index
the search key of a primary index is usually, but not necessarily, the primary key
Secondary index: an index whose search key specifies an order different from the sequential order of the file
also called non-clustering index
Index-sequential file: an ordered sequential file with a primary index
Dense Index Files:
an index record appears for every search-key value in the file
Sparse Index Files:
• index records appear for some search-key values -->
• less space & maintenance overhead for insertions & deletions
• generally slower than dense index for locating records
• to locate a record with search-key value K we
1. Find index record with largest search-key value < K
2. Search file sequentially starting at the record to which the index record points
• good tradeoff: sparse index with an index entry for every block in file (the least search-key value in the block)
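The two-step sparse lookup above can be sketched with a binary search over the per-block index entries; the branch names and balances below are illustrative:

```python
# Sketch of the sparse-index lookup above: keep one (search-key, block)
# entry per block; to find key K, binary-search for the last index entry
# with key <= K, then scan that block sequentially.
import bisect

index_keys = ["Brighton", "Linberg", "Redwood"]   # least key in each block
blocks = [
    [("Brighton", 750), ("Downtown", 540), ("Downtown", 400)],
    [("Linberg", 200), ("Perryton", 100), ("Perryton", 500)],
    [("Redwood", 700), ("Roundhill", 112)],
]

def sparse_lookup(key):
    # index of the last index entry whose key is <= the search key
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return []
    return [rec for rec in blocks[i] if rec[0] == key]

print(sparse_lookup("Perryton"))  # [('Perryton', 100), ('Perryton', 500)]
```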
Example of dense index: an index record appears for every search-key value in the file
[Figure: dense index on branch name — one index entry each for Brighton, Downtown, Linberg, Perryton, Redwood, and Roundhill, pointing into the sequential account file]
Example of sparse index: index records appear only for some search-key values in the file
[Figure: sparse index on branch name — entries only for Brighton, Linberg, and Redwood, pointing into the same sequential account file]
Multilevel Index
If the primary index does not fit in memory, access becomes expensive
To reduce the number of disk accesses to index records:
treat the primary index (inner index) kept on disk as a sequential file
construct a sparse index (outer index) on it
If even the outer index is too large to fit in main memory:
another level of index can be created, and so on
Indices at all levels must be updated on insertion to or deletion from the file
[Figure: two-level index — an outer (sparse) index points to blocks of the inner index, whose entries point to the data blocks of the file]
Index Update: Deletion
If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index too.
Single-level index deletion:
1. Dense indices:
deletion of the search key is similar to file record deletion
2. Sparse indices:
if an entry for the search key exists in the index:
it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order)
if the next search-key value already has an index entry, the entry is deleted instead of being replaced
Index Update: Insertion
Single-level index insertion:
1. Perform a lookup using the search-key value appearing in the record to be inserted
2a. Dense indices: if the search-key value does not appear in the index, insert it
2b. Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.
Multi-level index insertion (as well as deletion): algorithms are simple extensions of the single-level algorithms
Secondary Indices
Problem: find all the records whose values in a certain field satisfy some condition
1. if the field = search key of the primary index: no problem
2. if the field <> search key of the primary index: use a secondary index
Examples:
in the account database stored sequentially by account number, we may want to find all accounts in a particular branch
as above, but we want to find all accounts with a specified balance or range of balances
Secondary index: an index record for each search-key value
the index record points to a bucket that contains pointers to all the actual records with that particular search-key value
Secondary index on the balance field of account
[Figure: index entries on balance values, each pointing to a bucket of pointers to all account records with that balance]
Primary vs. Secondary indices
•Secondary indices have to be dense
•Indices offer substantial benefits when searching for records
•When a file is modified, every index on the file must be updated.
•Updating indices imposes overhead on database modification.
•Sequential scan using primary index is efficient.
•Sequential scan using secondary index is expensive: each record access may fetch a new block from disk
5. B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
Advantage of the B+-tree index file: it automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
Disadvantage of the B+-tree index file: extra insertion and deletion overhead, space overhead
A B+-tree is a rooted tree satisfying the following properties:
1. All paths from root to leaves are of the same length
2. Each node that is not a root or a leaf has between ⌈n/2⌉ and n children
3. A leaf node has between ⌈(n-1)/2⌉ and n-1 values
4. Special cases:
4a. If the root is not a leaf, it has at least 2 children
4b. If the root is a leaf (it is the single node in the tree), it can have between 0 and n-1 values
Node structure: P1 | K1 | P2 | K2 | ... | Pn-1 | Kn-1 | Pn
Pi: pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes)
Ki: the search-key values, in ascending order
Leaf Nodes in B+-trees
For j = 1, 2, ..., n-1, pointer Pj either points to a file record with search-key value Kj, or to a bucket of pointers to file records, each record having search-key value Kj. The bucket structure is needed only if the search key does not form a primary key.
If Li and Lj are leaf nodes and i < j, then Li's search-key values are less than Lj's search-key values
Pn points to the next leaf node in search-key order
[Figure: a leaf node holding search-key values Brighton and Downtown, with pointers to the corresponding account records and a next-leaf pointer]
Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes.
For a non-leaf node with m pointers:
1. All the search keys in the subtree to which P1 points are less than K1
2. For 2 ≤ j ≤ m-1, all the search keys in the subtree to which Pj points have values greater than or equal to Kj-1 and less than Kj
3. All the search keys in the subtree to which Pm points are greater than or equal to Km-1
Node structure: P1 | K1 | P2 | K2 | ... | Pm-1 | Km-1 | Pm
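A lookup that follows these invariants can be sketched as below; nodes are plain tuples and the branch names are illustrative, not a real disk layout:

```python
# Sketch of a B+-tree lookup using the node invariants above: in a
# non-leaf node, follow the child whose key range contains the search key
# (bisect_right picks the child holding keys in [K[i-1], K[i])).
import bisect

# A node is ("leaf", keys, values) or ("internal", keys, children).
leaf1 = ("leaf", ["Brighton", "Downtown"], [750, 540])
leaf2 = ("leaf", ["Linberg", "Perryridge"], [200, 100])
leaf3 = ("leaf", ["Redwood", "Roundhill"], [700, 112])
root = ("internal", ["Linberg", "Redwood"], [leaf1, leaf2, leaf3])

def bptree_search(node, key):
    kind, keys, rest = node
    if kind == "internal":
        return bptree_search(rest[bisect.bisect_right(keys, key)], key)
    if key in keys:
        return rest[keys.index(key)]
    return None

print(bptree_search(root, "Perryridge"))  # 100
```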
Example of a B+-tree, n = 3
[Figure: root node containing Perryridge; its two children contain Linberg and Redwood; the leaves hold Brighton, Downtown, Linberg, Perryridge, Redwood, and Roundhill]
Example of a B+-tree, n = 5
[Figure: root node containing Perryridge; leaves hold Brighton, Downtown, Linberg, Perryridge, Redwood, and Roundhill]
Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1)
Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n)
The root must have at least 2 children
Observations about B+ - Trees
Since the internode connections are done by pointers, there is no assumption that in the tree the “logically” close blocks are “physically” close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices
The B+-tree contains a relatively small number of levels thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time
6. Static Hashing
A bucket is a unit of storage containing one or more records (typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
The worst hash function maps all search-key values to the same bucket; then access time is proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e. each bucket is assigned the same number of search-key values from the set of all possible values.
An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the search key.
Examples: key mod p (p prime); folding; adding.
Bucket overflow can occur because of insufficient buckets or skew in the distribution of records (multiple records have the same search-key value, or the hash function is non-uniform). It is handled by using overflow buckets, usually chained in a linked list.
Hash indices
Hashing can be used not only for file organization but also for index-structure creation.
A hash index organizes the search keys with their associated record pointers into a hash file structure.
Hash indices are always secondary indices:
if the file itself is organized using hashing, a separate primary index on it using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures and hash-organized files
Hashing Functions
Several kinds of uniform hashing functions are in use.
1. Direct hashing:
the key is the address, without any algorithmic manipulation. The data structure must therefore contain an element for every possible key.
While the situations where direct hashing can be used are limited, it is very powerful because it guarantees that there are no collisions.
Limitation: large key values.
2. Mid-Square (middle of square)
Square the key and take the middle digits as the address: 9452 * 9452 = 89340304, middle digits → 3403
As a variation on the mid-square method, we can select a portion of the key, such as the middle three digits, and then use them rather than the whole key. This allows the method to be used when the key is too large to square.
379452: 379 * 379 = 143641, middle digits → 364
121267: 121 * 121 = 014641, middle digits → 464
3. Modulo-Division
Also known as division-remainder.
Address = Key MOD Table_size
While this algorithm works with any table size, a table size that is a prime number produces fewer collisions than other sizes.
Table size = 11
keys:      4  7  12  33  64  75  89
addresses: 4  7   1   0   9   9   1
collisions: 64 and 75 both map to 9; 12 and 89 both map to 1
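The modulo-division example above, checked in code:

```python
# Modulo-division hashing with the prime table size 11 used above.
TABLE_SIZE = 11

def mod_hash(key):
    return key % TABLE_SIZE

keys = [4, 7, 12, 33, 64, 75, 89]
print([mod_hash(k) for k in keys])  # [4, 7, 1, 0, 9, 9, 1]
```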
4. Folding
There are two folding methods in use: fold shift and fold boundary.
In fold shift, the key value is divided into parts whose size matches the size of the required address. The left and right parts are then shifted and added to the middle part.
In fold boundary, the left and right numbers are folded on a fixed boundary between them and the center number.
a. Fold shift, key 123456789: 123 + 456 + 789 = 1368; the leading 1 is discarded, giving address 368
b. Fold boundary, key 123456789: 321 + 456 + 987 = 1764 (left and right parts digit-reversed); the leading 1 is discarded, giving address 764
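Both folding methods above can be sketched for three-digit addresses (the part width is a parameter; any carry beyond the address width is discarded):

```python
# Fold shift and fold boundary for 3-digit addresses: split the key into
# 3-digit parts, add them, and keep only the low 3 digits of the sum.
def fold_shift(key, width=3):
    digits = str(key)
    parts = [digits[i:i + width] for i in range(0, len(digits), width)]
    return sum(int(p) for p in parts) % 10 ** width

def fold_boundary(key, width=3):
    digits = str(key)
    parts = [digits[i:i + width] for i in range(0, len(digits), width)]
    parts[0] = parts[0][::-1]      # digit-reverse the left part
    parts[-1] = parts[-1][::-1]    # digit-reverse the right part
    return sum(int(p) for p in parts) % 10 ** width

print(fold_shift(123456789), fold_boundary(123456789))  # 368 764
```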
5. Digit-Extraction
Using digit extraction, selected digits are extracted from the key and used as the address.
For example, using a six-digit employee number to hash to a three-digit address (000-999), we could select the first, third, and fourth digits (from the left) and use them as the address:
379452 → 394
121267 → 112
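The digit-extraction example above, checked in code:

```python
# Pick the 1st, 3rd and 4th digits (from the left) of a six-digit key
# as a three-digit address, as in the example above.
def digit_extract(key, positions=(0, 2, 3)):
    digits = str(key)
    return int("".join(digits[p] for p in positions))

print(digit_extract(379452), digit_extract(121267))  # 394 112
```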
6. Non-Numeric Keys
Collision Resolution
• With the exception of the direct method, none of the methods used for hashing is a one-to-one mapping. This means that when we hash a new key to an address, we may create a collision.
• There are several methods for handling collisions, each of them independent of the hashing algorithm.
• Before we discuss the collision resolution methods, we need to cover a few basic concepts:
Load Factor
The load factor alpha of a hash table of size M with N occupied entries is defined by
alpha = N/M
Clustering
Some hashing algorithms tend to cause data to group within the list. This tendency of data to build up unevenly across a hashed table is known as clustering.
1. Primary clustering: occurs when data become clustered around a home address.
2. Secondary clustering: occurs when data become grouped along a collision path throughout the list.
Open Addressing
The first collision resolution method, open addressing, resolves collisions in the home area. When a collision occurs, the home-area addresses are searched for an open or unoccupied element where the new data can be placed.
Examples of open addressing methods:
1. Linear Probe
i = H(key) is the home address. If it is available we store the record; otherwise, we probe i = (H(key) + k) mod M for k = 1, 2, 3, ...
Linear probing gives rise to a phenomenon called primary clustering.
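A minimal open-addressing insert with linear probing, reusing the size-11 table from the modulo-division example:

```python
# Linear probing: on a collision, try successive slots (home + k) mod M
# until an empty one is found.
M = 11
table = [None] * M

def lp_insert(key):
    home = key % M
    for k in range(M):
        slot = (home + k) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full")

# 75 and 64 both hash to 9; the second lands in the next free slot, 10
print(lp_insert(75), lp_insert(64))  # 9 10
```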
2. Quadratic Probe
If there is a collision at hash address h, this method probes the table at locations h+1, h+4, h+9, ..., that is, at locations h + i^2 (mod table_size) for i = 1, 2, .... The increment function is i^2.
• Quadratic probing substantially reduces clustering, but it is not obvious that it will probe all locations in the table, and in fact it does not.
• For some values of hash_size the function will probe relatively few positions in the table.
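The limited-coverage behaviour noted above is easy to demonstrate; the table sizes below are illustrative:

```python
# Quadratic probing visits h + i^2 (mod table_size); for some table
# sizes it reaches only a few distinct slots, as noted above.
def quadratic_slots(table_size, h=0):
    return sorted({(h + i * i) % table_size for i in range(table_size)})

print(quadratic_slots(16))       # [0, 1, 4, 9] — only 4 of 16 slots
print(len(quadratic_slots(11)))  # 6 — a prime size p reaches (p+1)/2 slots
```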
3. Double Hashing
• Double hashing uses nonlinear probing by computing different probe increments for different keys.
• It uses two functions. The first function computes the original address; if the slot is available (or the record is found) we stop there. Otherwise, we apply the second hashing function to compute the step value:
i = H1(key) to compute the home address
step = H2(key) = Max(1, Key DIV M) MOD M
i = i + step
We repeat this until we find a place or we find the record.
Double hashing avoids both primary and secondary clustering.
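A sketch of the procedure, using one reading of the step formula above (the MOD is applied before the Max so the step is never zero; this interpretation is an assumption):

```python
# Double hashing: home address from H1, per-key probe step from H2.
# Step formula read as max(1, (key DIV M) MOD M) to guarantee step >= 1.
M = 11
table = [None] * M

def dh_insert(key):
    i = key % M                    # H1: home address
    step = max(1, (key // M) % M)  # H2: per-key probe increment
    while table[i] is not None:
        i = (i + step) % M
    table[i] = key
    return i

# 75 and 64 collide at home address 9, but 64 probes with step 5
print(dh_insert(75), dh_insert(64))  # 9 3
```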
Chaining
One way of resolving collisions is to maintain M linked lists, one for each possible address in the hash table.
A key K hashes to an address i = h(K) in the table.
At address i, we find the head of a list containing all records whose keys have hashed to i.
This list is then searched for a record containing key K.
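A minimal separate-chaining table, with Python lists standing in for the linked lists:

```python
# Separate chaining: each slot holds a list acting as the chain of keys
# that hashed to that address.
M = 11
table = [[] for _ in range(M)]

def chain_insert(key):
    table[key % M].append(key)

def chain_search(key):
    return key in table[key % M]

for k in (4, 7, 12, 33, 64, 75, 89):
    chain_insert(k)
print(table[9], chain_search(89))  # [64, 75] True
```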
Buckets
• Suppose we divide a table into M groups of records, with each group containing exactly b records.
• Each group of b records is called a bucket.
• The hash function h computes a bucket number from the key K, and the record containing K is stored in the bucket whose bucket number is h(K).
If a particular bucket overflows, an overflow policy is invoked.
A chaining technique can be used to link to an "overflow" bucket; this link can be planted at the end of the overflowed bucket.
It is convenient to keep overflow buckets on the same cylinder, or we may have a separate cylinder for overflows.
Performance Formulas
Suppose hash table T of size M has exactly N occupied entries, so that its load factor, alpha, is N/M.
Let's now define two quantities, Cn and C'n, where
Cn is the average number of probe addresses examined during a successful search, and
C'n is the average number of probe addresses examined during an unsuccessful search.
Efficiency of Linear Probing
Successful search: Cn = (1 + 1/(1-alpha))/2
Unsuccessful search: C'n = (1 + (1/(1-alpha))^2)/2
Double Hashing
Successful search: Cn = ln(1/(1-alpha))/alpha
Unsuccessful search: C'n = 1/(1-alpha)
Separate Chaining
Successful search: Cn = 1 + alpha/2
Unsuccessful search: C'n = alpha
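The successful-search formulas above, evaluated at a couple of load factors to show how linear probing degrades fastest as alpha grows:

```python
# Expected probes for a successful search, per the formulas above.
from math import log

def linear_success(a):
    return (1 + 1 / (1 - a)) / 2

def double_success(a):
    return log(1 / (1 - a)) / a

def chain_success(a):
    return 1 + a / 2

for a in (0.5, 0.9):
    print(f"alpha={a}: linear={linear_success(a):.2f}, "
          f"double={double_success(a):.2f}, chain={chain_success(a):.2f}")
# alpha=0.5: linear=1.50, double=1.39, chain=1.25
# alpha=0.9: linear=5.50, double=2.56, chain=1.45
```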
Index Definition in SQL
Create an index:
create index <index_name> on <relation_name> (<attribute_list>)
Example: create index b-index on branch (branch-name)
use create unique index to indirectly specify and enforce the condition that the search key is a candidate key
To drop an index:
drop index <index_name>