15-826: Multimedia Databases and Data Mining
Lecture #3: Primary key indexing – hashing
C. Faloutsos, CMU SCS
Copyright: C. Faloutsos (2014)

Reading Material
• [Litwin] Litwin, W., (1980), Linear Hashing: A New Tool for File and Table Addressing, VLDB, Montreal, Canada, 1980
• textbook, Chapter 3
• Ramakrishnan+Gehrke, Chapter 11

Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining
Linear hashing - overview
• Motivation
• main idea
• search algo
• insertion/split algo
• deletion
• performance analysis
• variations
Linear hashing - insertion?
Algo: insert key ‘k’
• compute appropriate bucket ‘b’
• if the overflow criterion is true
  – split the bucket of ‘split-ptr’
  – split-ptr ++ (*)
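The insertion steps above can be sketched in Python as follows. This is only a minimal in-memory illustration, not the lecture's or Litwin's implementation: the class name, the `capacity` and `u_max` parameters, and the choice of "utilization > u_max" as the overflow criterion are all assumptions made for the example, and buckets are plain lists rather than disk pages with overflow chains.

```python
class LinearHashTable:
    """Toy in-memory linear hash file; every bucket is a Python list."""

    def __init__(self, n0=4, capacity=2, u_max=0.85):
        self.n0 = n0              # N: number of buckets before any split
        self.level = 0            # n: number of completed full expansions
        self.split_ptr = 0        # next bucket to be split
        self.capacity = capacity  # records per primary page (assumed)
        self.u_max = u_max        # utilization threshold (assumed)
        self.count = 0
        self.buckets = [[] for _ in range(n0)]

    def _address(self, key):
        size = self.n0 * 2 ** self.level
        b = key % size                 # h_n
        if b < self.split_ptr:         # bucket was already split this round
            b = key % (2 * size)       # h_{n+1}
        return b

    def _overflow(self):
        # Assumed criterion: utilization of the primary pages exceeds u_max.
        slots = len(self.buckets) * self.capacity
        return self.count / slots > self.u_max

    def _split(self):
        size = self.n0 * 2 ** self.level
        old = self.buckets[self.split_ptr]
        # Rehash with h_{n+1}: a key either stays or moves to the new bucket.
        stay = [k for k in old if k % (2 * size) == self.split_ptr]
        move = [k for k in old if k % (2 * size) != self.split_ptr]
        self.buckets[self.split_ptr] = stay
        self.buckets.append(move)      # new bucket has index split_ptr + size
        self.split_ptr += 1
        if self.split_ptr == size:     # reached the right edge
            self.split_ptr = 0
            self.level += 1

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self._overflow():
            self._split()
```

Note that the bucket being split is the one under the split pointer, which is not necessarily the bucket that just received the key; that is what lets the file grow one bucket at a time without a directory.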
Linear hashing - insertion?
notice: overflow criterion is up to us!!
Q: suggestions?
A1: space utilization > u-max
A2: avg length of overflow (ovf) chains > max-len
A3: ....
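The two suggested criteria can be written as small predicates. The sketch below is an illustration under stated assumptions: the function names are hypothetical, utilization is measured against primary pages only (other definitions also count overflow pages), and a bucket holding `size` records on pages of `capacity` records is assumed to need `ceil(size / capacity) - 1` overflow pages.

```python
def utilization(bucket_sizes, capacity):
    """Records stored divided by primary-page slots (may exceed 1.0)."""
    return sum(bucket_sizes) / (len(bucket_sizes) * capacity)


def avg_overflow_chain_len(bucket_sizes, capacity):
    """Average number of overflow pages per bucket."""
    # -(-a // b) is integer ceiling division.
    pages = [max(0, -(-size // capacity) - 1) for size in bucket_sizes]
    return sum(pages) / len(bucket_sizes)


def should_split(bucket_sizes, capacity, u_max=0.85, max_len=None):
    """A1 by default; A2 when a max_len threshold is supplied."""
    if max_len is not None:
        return avg_overflow_chain_len(bucket_sizes, capacity) > max_len
    return utilization(bucket_sizes, capacity) > u_max
```

For example, with bucket sizes `[2, 5, 1, 0]` and two records per page, criterion A1 with `u_max=0.85` asks for a split, while criterion A2 with `max_len=1` does not.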
Linear hashing - insertion?
(*) what if split-ptr reaches the right edge??
Linear hashing - split now?
$h_0(x) = x \bmod N$ (for the un-split buckets)
$h_1(x) = x \bmod (2N)$ (for the split ones)
[Figure: buckets 0 to 6, with the split pointer marking the next bucket to split]
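Linear hashing - split now?
[Figure: buckets 0 to 7; every original bucket has been split]
this state is called ‘full expansion’
After a full expansion the split pointer returns to bucket 0, and h1 takes over the role of h0 for the next round of splits.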
Linear hashing - observations
In general, at any point in time, we have at most two hash functions (h.f.) active, of the form:
• $h_n(x) = x \bmod (N \cdot 2^n)$
• $h_{n+1}(x) = x \bmod (N \cdot 2^{n+1})$
(after a full expansion, we have only one h.f.)
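To make the role of the two hash functions concrete, here is a hedged sketch of the address computation a search would use. The function name and argument names are assumptions for illustration; `N`, `n` and the split pointer are the quantities named on the slides.

```python
def bucket_address(x, N, n, split_ptr):
    """Bucket holding key x, given N initial buckets, level n, and split_ptr."""
    size = N * 2 ** n
    b = x % size              # h_n(x)
    if b < split_ptr:         # that bucket has already been split this round
        b = x % (2 * size)    # h_{n+1}(x)
    return b
```

With `N=4`, `n=0` and `split_ptr=3`, key 5 hashes to bucket 1 under h_n, which is already split, so h_{n+1} is applied and the key lives in bucket 5; key 7 hashes to bucket 3, which is not yet split, so it stays there. Because h_{n+1}(x) is always either h_n(x) or h_n(x) + N·2^n, a split only ever moves records into one new bucket.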
Linear hashing - deletion?
• reverse of insertion:
• if the underflow criterion is met
  – contract!
Linear hashing - how to contract?
$h_0(x) = x \bmod N$ (for the un-split buckets)
$h_1(x) = x \bmod (2N)$ (for the split ones)
[Figure: before contraction, buckets 0 to 6 with the split pointer; after contraction, buckets 0 to 5. The last bucket is merged back into the bucket it was split from, and the split pointer moves back one position.]
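A contraction step can be sketched as the exact inverse of a split. The function below is an assumption-laden illustration (hypothetical name and signature, buckets as Python lists), not code from the lecture: it moves the split pointer back one position and folds the last bucket into the bucket it was split from.

```python
def contract(buckets, N, n, split_ptr):
    """Undo the most recent split in place; return the new (n, split_ptr)."""
    if n == 0 and split_ptr == 0:
        return n, split_ptr            # already at the initial N buckets
    if split_ptr == 0:                 # step back into the previous round
        n -= 1
        split_ptr = N * 2 ** n
    split_ptr -= 1
    # The last bucket is the image of bucket split_ptr; merge it back.
    buckets[split_ptr].extend(buckets.pop())
    return n, split_ptr
```

For instance, with `N=4`, `n=0`, `split_ptr=3` and seven buckets, one call removes bucket 6, appends its records to bucket 2, and returns `(0, 2)`, matching the 0-6 to 0-5 transition in the figure.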
Linear hashing - performance
• [Larson, TODS 1982]
[Figure: search time (avg # of disk accesses, d.a.) versus # of records, from R to 2R; split: if u > u0 (say u0 = 0.85). Annotations on the curve: e.g., 1.01 d.a. and e.g., 1.3 d.a.]
Other hashing variations
• ‘order preserving’
• ‘perfect hashing’ (no collisions!) [Ed. Fox, et al]
Primary key indexing - conclusions
• hashing is O(1) on the average for search
• linear hashing: elegant way to grow a hash table
• B-trees: industry work-horse for primary-key indexing (O(log N) worst case!)
References for primary key indexing
• [Fagin+] Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, H. Raymond Strong: Extendible Hashing - A Fast Access Method for Dynamic Files. TODS 4(3): 315-344 (1979)
• [Fox] Fox, E. A., L. S. Heath, Q.-F. Chen, and A. M. Daoud. "Practical Minimal Perfect Hash Functions for Large Databases." Communications of the ACM 35.1 (1992): 105-21.
• [Knuth] D.E. Knuth. The Art Of Computer Programming, Vol. 3, Sorting and Searching, Addison Wesley
• [Larson] Per-Ake Larson: Performance Analysis of Linear Hashing with Partial Expansions. ACM TODS 7(4), Dec. 1982, pp. 566-587
• [Litwin] Litwin, W., (1980), Linear Hashing: A New Tool for File and Table Addressing, VLDB, Montreal, Canada, 1980