Contribution to the Design & Implementation of the Highly Available Scalable and Distributed Data Structure: LH*RS
Rim Moussa
[email protected]
http://ceria.dauphine.fr/rim/rim.html
Thesis Presentation in Computer Science * Distributed Databases
Thesis Supervisor: Pr. Witold Litwin
Examiners: Pr. Thomas J.E. Schwarz, Pr. Toré Risch
Jury President: Pr. Gérard Lévy
Paris Dauphine University * CERIA Lab. * 4th October 2004
k = 0 to k = 1: Perf. Degradation of 37%
k = 1 to k = 2: Perf. Degradation of 10%
[Figure: File Creation Time (sec) vs. Number of Inserted Keys (0 to 25000), for k = 0, k = 1, k = 2. Labeled values: 4.349 s (k = 0), 6.940 s (k = 1), 7.720 s (k = 2)]
04 Oct. 04 * Thesis Presentation, R. Moussa, U. Paris Dauphine
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experiments
6. File Creation
7. Bucket Recovery: Scenario, Performances
8. Parity Bucket Creation
Scenario
Failure Detection
Coordinator to Data Buckets and Parity Buckets: Are you Alive?
Scenario (2)
Waiting for Responses …
Replies from Data Buckets and Parity Buckets to the Coordinator: OK, OK, OK, OK
Scenario (3)
Searching Spare Buckets …
Coordinator to Multicast Group of Blank Data Buckets: Wanna be Spare?
Scenario (4)
Waiting for Replies …
Multicast Group of Blank Data Buckets to Coordinator: I would (three replies)
Each candidate: Launch UDP Listening, Launch TCP Listening, Launch Working Threads
*Waiting for Confirmation* If Time-out elapsed, cancel everything
Scenario (5)
Spare Selection
Messages exchanged between the Coordinator and the Multicast Group of Blank Data Buckets: You are Hired, Confirmed (twice), Cancellation
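The spare-search handshake of Scenario (3) to (5) can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class `BlankBucket`, the function `select_spares`, and the "hire the first volunteers" policy are hypothetical names and choices, not taken from the LH*RS prototype, and the multicast, UDP/TCP listeners and timers are replaced by plain method calls.

```python
from dataclasses import dataclass


@dataclass
class BlankBucket:
    """Hypothetical stand-in for a blank bucket listening on the multicast group."""
    name: str
    willing: bool = True
    state: str = "blank"  # blank -> waiting -> hired, or back to blank

    def on_offer(self) -> bool:
        # "Wanna be Spare?": a willing bucket answers "I would", then waits.
        # A real bucket would launch its UDP/TCP listeners and working threads here.
        if self.willing and self.state == "blank":
            self.state = "waiting"
            return True
        return False

    def on_hired(self) -> str:
        # "You are Hired": the bucket commits and answers "Confirmed".
        self.state = "hired"
        return "Confirmed"

    def on_cancel_or_timeout(self) -> None:
        # "Cancellation" received, or the confirmation time-out elapsed.
        if self.state == "waiting":
            self.state = "blank"


def select_spares(group, needed):
    """Coordinator side: offer, collect volunteers, hire `needed`, cancel the rest."""
    volunteers = [b for b in group if b.on_offer()]
    if len(volunteers) < needed:
        for b in volunteers:
            b.on_cancel_or_timeout()
        raise RuntimeError("not enough spare candidates")
    hired = volunteers[:needed]  # assumption: first volunteers are selected
    for b in hired:
        b.on_hired()
    for b in volunteers[needed:]:
        b.on_cancel_or_timeout()
    return hired


group = [BlankBucket("b0"), BlankBucket("b1", willing=False),
         BlankBucket("b2"), BlankBucket("b3")]
spares = select_spares(group, needed=2)
print([b.name for b in spares])
```

Candidates that are not selected return to the blank state, either on an explicit Cancellation or when their own time-out elapses, so a lost message does not leave a bucket reserved forever.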
Scenario (6)
Recovery Manager Selection
Coordinator to Parity Buckets: Recover Failed Buckets
Scenario (7)
Query Phase
Recovery Manager to Buckets participating in Recovery (Data Buckets, Parity Buckets): Send me Records of rank in [r, r+slice-1] …
Diagram elements: Data Buckets, Parity Buckets, Recovery Manager, Spare Buckets
Scenario (8)
Reconstruction Phase
Buckets participating in Recovery (Data Buckets, Parity Buckets) to Recovery Manager: Requested Buffers …
Recovery Manager: Decoding Phase, in parallel with the Query Phase
Recovery Manager to Spare Buckets: Recovered Slices
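The slice-by-slice recovery of Scenario (7) and (8) can be illustrated for the simplest case, one lost data bucket and an XOR parity bucket. This is a minimal sketch, assuming in-memory lists of equal-length records indexed by rank; the function names are hypothetical, and the query and decoding phases run one after the other here, whereas the slides overlap them.

```python
def xor_records(records):
    """XOR equal-length byte records together."""
    out = bytearray(len(records[0]))
    for rec in records:
        for i, byte in enumerate(rec):
            out[i] ^= byte
    return bytes(out)


def recover_bucket(survivors, n_records, slice_size):
    """Rebuild one lost bucket from the surviving data buckets and the XOR parity bucket."""
    recovered = []
    for r in range(0, n_records, slice_size):
        # Query phase: ask each survivor for its records of rank in [r, r+slice-1].
        buffers = [bucket[r:r + slice_size] for bucket in survivors]
        # Decoding phase: XOR the records that share the same rank.
        for same_rank in zip(*buffers):
            recovered.append(xor_records(same_rank))
    return recovered


# Toy group: m = 4 data buckets of 10 records, 8 bytes each, plus one XOR parity bucket.
data = [[bytes((17 * b + 3 * r + i) % 256 for i in range(8)) for r in range(10)]
        for b in range(4)]
parity = [xor_records([bucket[r] for bucket in data]) for r in range(10)]

lost = 2
survivors = [bucket for b, bucket in enumerate(data) if b != lost] + [parity]
rebuilt = recover_bucket(survivors, n_records=10, slice_size=4)
print(rebuilt == data[lost])
```

The slice size only bounds how much is held in memory at once; the rebuilt bucket is the same for any slice size.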
Performances: Config.
File Info
* File of 125 000 records
* Record Size = 100 bytes
* Bucket Size = 31250 records, i.e. 3.125 MB
* Group of 4 Data Buckets (m = 4), k-Available with k = 1, 2, 3
Decoding
* GF(2^16)
* RS+ Decoding (RS + log Pre-calculus of H^-1 and OK Symbols Vector)
Recovery per Slice (adaptive to PCs' storage & computing capacities)
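Galois-field multiplication through precomputed log and antilog tables, the kind of pre-calculus the configuration above refers to, can be sketched as follows. The sketch is an assumption-laden illustration: it defaults to GF(2^8) with the primitive polynomial 0x11D so that the tables stay small, and the function names are hypothetical. For GF(2^16) the caller would pass w = 16 and a primitive polynomial of degree 16.

```python
def build_tables(w=8, poly=0x11D):
    """Return (exp, log) tables for GF(2^w) generated by the element 2."""
    order = (1 << w) - 1            # number of non-zero field elements
    exp = [0] * (2 * order)         # doubled so log[a] + log[b] never wraps
    log = [0] * (order + 1)
    x = 1
    for i in range(order):
        exp[i] = x
        exp[i + order] = x
        log[x] = i
        x <<= 1                     # multiply by the generator
        if x >> w:                  # reduce modulo the field polynomial
            x ^= poly
    return exp, log


def gf_mul(a, b, exp, log):
    """Multiply two symbols with one addition of logarithms and one lookup."""
    if a == 0 or b == 0:
        return 0
    return exp[log[a] + log[b]]


def gf_inv(a, exp, log):
    """Multiplicative inverse of a non-zero symbol."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in a Galois field")
    order = len(log) - 1
    return exp[order - log[a]]


EXP, LOG = build_tables()
print(hex(gf_mul(0x80, 2, EXP, LOG)))
```

Storing the logarithms of the decoding-matrix coefficients ahead of time means each symbol multiplication costs a single table lookup for the data symbol, one addition, and one antilog lookup.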
Performances: 1 DB XOR
Slice (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
1250             0.625             0.266           0.348
3125             0.588             0.255           0.323
6250             0.552             0.240           0.312
15625            0.562             0.255           0.302
31250            0.578             0.250           0.328
Slice: from 4% to 100% of a bucket content
Total Time is almost constant: about 0.58 sec
Performances: 1 DB RS
Slice (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
1250             0.734             0.349           0.365
3125             0.688             0.359           0.323
6250             0.656             0.354           0.297
15625            0.667             0.360           0.297
31250            0.688             0.360           0.328
Slice: from 4% to 100% of a bucket content
Total Time is almost constant: about 0.67 sec
Performances: XOR vs. RS
Time to Recover 1 DB with XOR: 0.58 sec
Time to Recover 1 DB with RS: 0.67 sec
XOR in GF(2^16) realizes a gain of 13% in Total Time (and 30% in CPU Time)
Performances: 2 DBs
Slice (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
1250             0.976             0.577           0.375
3125             0.932             0.589           0.338
6250             0.883             0.562           0.321
15625            0.875             0.562           0.281
31250            0.875             0.562           0.313
Slice: from 4% to 100% of a bucket content
Total Time is almost constant: about 0.9 sec
Performances: 3 DBs
Slice (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
1250             1.281             0.828           0.406
3125             1.250             0.828           0.390
6250             1.211             0.852           0.352
15625            1.188             0.823           0.361
31250            1.203             0.828           0.375
Slice: from 4% to 100% of a bucket content
Total Time is almost constant: about 1.23 sec
Performances: Summary
f        Bucket Size (MB)  Total Time (sec)  Recovery Speed (MB/sec)
1 (XOR)  3.125             0.58              5.38
1 (RS)   3.125             0.67              4.66
2        6.250             0.9               6.94
3        9.375             1.23              7.62
Time to Recover f Buckets < f × Time to Recover 1 Bucket
The Query Phase is factorized; the additional cost (the +) is the Decoding Time & the Time to send Recovered Buffers
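The recovery speeds and the XOR gain can be rechecked directly from the reported sizes and total times. The sketch below only redoes that arithmetic; the variable and function names are illustrative, and the figures are the ones given on the slides.

```python
# (f, recovered size in MB, total time in sec), as reported in the summary table
ROWS = [
    ("1 (XOR)", 3.125, 0.58),
    ("1 (RS)", 3.125, 0.67),
    ("2", 6.250, 0.90),
    ("3", 9.375, 1.23),
]


def recovery_speed(size_mb, total_s):
    """Recovery speed in MB/sec."""
    return size_mb / total_s


def relative_gain(slow, fast):
    """Fraction of the slower time that the faster variant saves."""
    return (slow - fast) / slow


speeds = {f: recovery_speed(size, t) for f, size, t in ROWS}
xor_gain = relative_gain(0.67, 0.58)

# Recovering f buckets costs less than f separate single-bucket RS recoveries.
t1_rs = 0.67
sublinear = all(t < int(f) * t1_rs for f, _, t in ROWS if f.isdigit())

for f, speed in speeds.items():
    print(f"{f}: {speed:.2f} MB/sec")
print(f"XOR gain on Total Time: {xor_gain:.1%}")
```

The speed grows with f because the survivors are queried once for all lost buckets, so only the decoding and the sending of the recovered buffers are paid per bucket.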
Performances: GF(2^8)
XOR in GF(2^8) improves decoding performance by 60% compared to RS in GF(2^8).
RS/RS+ decoding in GF(2^16) realizes a gain of 50% compared to decoding in GF(2^8).
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experiments
6. File Creation
7. Bucket Recovery
8. Parity Bucket Creation: Scenario, Performances
Scenario
Searching for a new Parity Bucket
Coordinator to Multicast Group of Blank Parity Buckets: Wanna Join Group g ?
Scenario (2)
Waiting for Replies …
Multicast Group of Blank Parity Buckets to Coordinator: I Would (three replies)
Each candidate: Launch UDP Listening, Launch TCP Listening, Launch Working Threads
*Waiting for Confirmation* If Time-out elapsed, cancel everything
Scenario (3)
New Parity Bucket Selection
Messages exchanged between the Coordinator and the Multicast Group of Blank Parity Buckets: You are Hired, Confirmed, Cancellation (twice)
Scenario (4)
Auto-creation * Query Phase
New Parity Bucket to Group of Data Buckets: Send me your contents ! …
Scenario (5)
Auto-creation * Encoding Phase
Group of Data Buckets to New Parity Bucket: Requested Buffers …
New Parity Bucket: Buffer Processing
Performances: Config.
Max Bucket Size: 5000 .. 50000 records
Bucket Load Factor: 62.5%
Record Size: 100 bytes
Group of 4 Data Buckets
Encoding
* GF(2^16)
* RS++ (Log Pre-calculus & Row of '1's: XOR encoding to process the 1st DB buffer)
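The idea of pairing log pre-calculus with a plain XOR wherever the coefficient is 1 can be sketched as follows. This is an illustration under stated assumptions, not the thesis's encoder: it works in GF(2^8) with polynomial 0x11D to keep the tables small, and `COEFFS` is a hypothetical coefficient matrix (one row per data bucket, one column per parity bucket) chosen only because its first row and first column are all ones.

```python
def build_tables(poly=0x11D):
    """Log/antilog tables for GF(2^8) generated by the element 2."""
    exp, log = [0] * 510, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = exp[i + 255] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= poly
    return exp, log


EXP, LOG = build_tables()

# Hypothetical coefficients: row i = data bucket i, column j = parity bucket j.
COEFFS = [[1, 1], [1, 2], [1, 3], [1, 4]]


def encode(data_buffers, coeffs, use_xor_shortcut=True):
    """Compute one parity buffer per column of `coeffs` (all coefficients non-zero)."""
    k, n = len(coeffs[0]), len(data_buffers[0])
    parity = [bytearray(n) for _ in range(k)]
    for i, buf in enumerate(data_buffers):
        for j in range(k):
            c = coeffs[i][j]
            if use_xor_shortcut and c == 1:
                # Coefficient 1: no field multiplication, just XOR the buffer in.
                for t, s in enumerate(buf):
                    parity[j][t] ^= s
            else:
                log_c = LOG[c]  # logarithm looked up once per buffer
                for t, s in enumerate(buf):
                    if s:
                        parity[j][t] ^= EXP[log_c + LOG[s]]
    return [bytes(p) for p in parity]


data_buffers = [bytes((31 * b + 7 * t) % 256 for t in range(16)) for b in range(4)]
fast = encode(data_buffers, COEFFS)
slow = encode(data_buffers, COEFFS, use_xor_shortcut=False)
print(fast == slow)
```

With this matrix the whole first data buffer and the whole first parity buffer are produced with XOR alone, while the result is identical to the generic multiply-and-add path.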
Performances: XOR
Bucket Size (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
5000                   0.190             0.140           0.029
10000                  0.429             0.304           0.066
25000                  1.007             0.738           0.144
50000                  2.062             1.484           0.322
Same Encoding Rate for every Bucket Size: CPU Time is about 74% of Total Time
Performances: RS
Bucket Size (records)  Total Time (sec)  CPU Time (sec)  Com. Time (sec)
5000                   0.193             0.149           0.035
10000                  0.446             0.328           0.059
25000                  1.053             0.766           0.153
50000                  2.103             1.531           0.322
Same Encoding Rate for every Bucket Size: CPU Time is about 74% of Total Time
Performances: XOR vs. RS
For Bucket Size = 50000 records:
XOR encoding time: 2.062 sec
RS encoding time: 2.103 sec
XOR realizes a performance gain of 5% in CPU Time (only about 2% in Total Time)
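The CPU share and the XOR vs. RS gains can be recomputed from the two encoding tables. The sketch below only redoes that arithmetic on the reported figures; the names are illustrative. From these figures the Total Time gain at 50000 records comes out near 2% and the CPU Time gain at that size near 3%, while the mean CPU Time gain over the four bucket sizes is about 5%. Reading the 5% on the slide as that mean is an assumption.

```python
# bucket size (records) -> (Total Time, CPU Time) in sec, from the encoding tables
XOR = {5000: (0.190, 0.140), 10000: (0.429, 0.304),
       25000: (1.007, 0.738), 50000: (2.062, 1.484)}
RS = {5000: (0.193, 0.149), 10000: (0.446, 0.328),
      25000: (1.053, 0.766), 50000: (2.103, 1.531)}


def relative_gain(slow, fast):
    """Fraction of the slower time that the faster variant saves."""
    return (slow - fast) / slow


# Share of the total encoding time spent on CPU, per bucket size.
cpu_share = {
    name: {size: cpu / total for size, (total, cpu) in table.items()}
    for name, table in (("XOR", XOR), ("RS", RS))
}

total_gain_50000 = relative_gain(RS[50000][0], XOR[50000][0])
cpu_gain_50000 = relative_gain(RS[50000][1], XOR[50000][1])
mean_cpu_gain = sum(relative_gain(RS[s][1], XOR[s][1]) for s in XOR) / len(XOR)

print(f"Total Time gain at 50000 records: {total_gain_50000:.1%}")
print(f"CPU Time gain at 50000 records:   {cpu_gain_50000:.1%}")
print(f"Mean CPU Time gain:               {mean_cpu_gain:.1%}")
```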
Performances: GF(2^8)
As in GF(2^16), CPU Time = 3/4 of Total Time
XOR in GF(2^8) improves CPU Time by 22%
Performance (Wintel P4, 1.8GHz, 1Gbps)
File Creation Rate: 0.33 MB/s for k = 0; 0.25 MB/s for k = 1; 0.23 MB/s for k = 2
Record Insert Time: 0.29 ms for k = 0; 0.33 ms for k = 1; 0.36 ms for k = 2
Bucket Recovery Rate: 4.66 MB/s from 1-unavailability; 6.94 MB/s from 2-unavailability; 7.62 MB/s from 3-unavailability
Record Recovery Time: About 1.3 ms
Key Search Time: Individual > 0.24 ms; Bulk > 0.056 ms
Conclusion
Experiments prove:
* Optimizations (Encoding/Decoding, Architecture): Impact on Performance
* Good Recovery Performances
Future Work
Update Propagation to Parity Buckets: Reliability, Performance
Reduce Coordinator Tasks
« Parity Declustering »
Investigation of New Erasure-Resilient Codes
Publications
[ML02] R. Moussa, W. Litwin, Experimental Performance Analysis of LH*RS Parity Management, Carleton Scientific Records of the 4th International Workshop on Distributed Data & Structures: WDAS 2002, p.87-97.
[MS04] R. Moussa, T. Schwarz, Design and Implementation of LH*RS – A Highly-Available Scalable Distributed Data Structure, Carleton Scientific Records of the 6th International Workshop on Distributed Data & Structures: WDAS 2004.
[LMS04] W. Litwin, R. Moussa, T. Schwarz, Prototype Demonstration of LH*RS: A Highly Available Distributed Storage System, Proc. of VLDB 2004 (Demo Session), p.1289-1292.
[LMS04-a] W. Litwin, R. Moussa, T. Schwarz, LH*RS: A Highly Available Distributed Storage System, journal version submitted, under revision.
Thank You For Your Attention