New Protocols for Remote File Synchronization Based on Erasure Codes Utku Irmak Svilen Mihaylov Torsten Suel Polytechnic University

Dec 26, 2015

New Protocols for Remote File Synchronization Based on Erasure Codes

Utku Irmak

Svilen Mihaylov

Torsten Suel

Polytechnic University

Outline

Introduction and Common Applications
Problem Formalization
Contributions
An Approach Based on Erasure Codes
  A Simple Multi-Round Protocol
  An Efficient Single-Round Protocol
  A Practical Protocol Based on Erasure Codes
Implementation Overview
Preliminary Results
Conclusions

Introduction

Remote File Synchronization Problem: how to update an outdated version of a file over a network with a minimal amount of communication.

When the versions are very similar, the total data transmitted should be significantly smaller than the file size.

[Figure: Machine A holds the current version; Machine B holds the outdated version.]

Common Applications

Synchronization of User Files

Synchronization between different machines that may only be connected over a slow network (for example, a home and a work machine).

Both rsync and unison are widely used tools.

Web and FTP Site Mirroring

Significant similarities exist between successive versions.

This includes sites distributing new versions of software.

rsync is widely used.

Common Applications

Content Distribution Networks

File synchronization is a natural approach for updating content replicated at the network edge.

Web Access over Slow Links

A user revisiting a web page may already have a previous version in the browser cache.

It would be desirable to avoid transmitting the entire page again.

This idea is implemented in rproxy, which uses the rsync algorithm.

Problem Formalization

We have two files (strings) over some alphabet Σ: f_new (the current file) and f_old (the outdated file).

We have two machines, C (the client) and S (the server), connected by a communication link.

C only has a copy of f_old, and S only has a copy of f_new.

Goal: design a protocol between the two parties that results in C holding a copy of f_new while minimizing the total communication cost.

Problem Formalization

The communication cost should depend on the degree of similarity between the two files. Possible distance measures:

The Hamming distance

The edit distance

The edit distance with block moves

We focus mainly on the edit distance with block moves. We assume that each block move operation adds 3 to the distance, while every other operation adds 1.

Problem Formalization

We focus on single-round protocols between client and server:

Single-round protocols can be more easily integrated into existing tools that currently rely on rsync.

Multiple rounds are undesirable in many scenarios involving small files or large latencies.

Multi-round protocols can introduce other complications due to state that may have to be kept at the server for best performance.

Assumptions

The collection consists of unstructured files.

We are not concerned with issues of consistency in between synchronization steps.

We consider a simple two-party scenario where it is known which files need to be updated and which version is the current one.

Contributions

We describe a new approach to single-round file synchronization based on erasure codes.

We derive a protocol that communicates at most O(k lg(n) lg(n/k)) bits on files whose edit distance with block moves is at most k.

We derive another, practical algorithm with an optimized implementation that achieves very promising improvements over rsync.

A Simple Multi-Round Protocol

The protocol runs in a number of rounds.

In the first round, the server partitions the file into blocks of size b_max and sends a hash (MD5) for each block.

The client attempts to match the received hashes against all possible alignments in the outdated file.

The client responds with a bit vector that tells the server which of the hashes were matched.

The server repeats the process, with smaller blocks, for the blocks whose hashes did not find a match.

Once block size b_min is reached, the server sends all the unmatched blocks.
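
The rounds described above can be simulated in a few lines. This is only a sketch under stated assumptions: both parties live in one process, unmatched blocks are split in half each round (the slide only says the block size shrinks from b_max to b_min), full MD5 digests are used rather than truncated hashes, and the function name `multi_round_sync` is hypothetical.

```python
import hashlib

def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def multi_round_sync(f_new: bytes, f_old: bytes, b_max: int, b_min: int):
    """Simulate the protocol; return (rebuilt file, hashes sent, literal bytes sent)."""
    # Server: first-round blocks of size b_max, kept as (offset, length) pairs.
    pending = [(o, min(b_max, len(f_new) - o)) for o in range(0, len(f_new), b_max)]
    pieces = {}              # offset in f_new -> bytes the client has resolved
    hashes_sent = 0
    literal_bytes = 0
    size = b_max
    while pending:
        # Server -> client: one hash per pending block.
        sent = [(o, n, md5(f_new[o:o + n])) for o, n in pending]
        hashes_sent += len(sent)
        # Client: hash every alignment of f_old for each block length in play.
        index = {}
        for n in {n for _, n, _ in sent}:
            for i in range(len(f_old) - n + 1):
                index.setdefault((n, md5(f_old[i:i + n])), i)
        # Client -> server: bit vector saying which hashes found a match.
        bits = [(n, digest) in index for _, n, digest in sent]
        next_pending = []
        for (o, n, digest), matched in zip(sent, bits):
            if matched:
                i = index[(n, digest)]
                pieces[o] = f_old[i:i + n]
            elif size > b_min and n > 1:
                half = (n + 1) // 2          # split for the next round
                next_pending += [(o, half), (o + half, n - half)]
            else:
                pieces[o] = f_new[o:o + n]   # server sends the block itself
                literal_bytes += n
        pending = next_pending
        size //= 2
    rebuilt = b"".join(pieces[o] for o in sorted(pieces))
    return rebuilt, hashes_sent, literal_bytes
```

For instance, calling `multi_round_sync(f_new, f_old, 64, 4)` on two files that differ by one block move and a short insertion should rebuild `f_new` while sending only a few literal bytes. Hashing every alignment with MD5 is slow; it is done here only to keep the sketch short.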

A Simple Multi-Round Protocol

Given two files whose edit distance with block moves is k, we choose:

b_max = n/k rounded down to a power of 2

b_min = lg(n)

hash size = 4 lg(n) bits

Lemma: If we partition f_new into some number of blocks, then at most k of these blocks do not occur in f_old. Hence, on each level, at most k hashes do not find a match.

The algorithm transmits at most O(k lg(n) lg(n/k)) bits and correctly updates the file with probability at least 1-1/n.
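
A rough accounting of where the communication bound comes from may help. The sketch below is not taken from the slides: it assumes that block sizes halve from b_max down to b_min, that the alphabet has constant size, and that rounding can be ignored.

```latex
\begin{align*}
\text{top-level hashes} &\le \lceil n / b_{\max} \rceil \le 2k
  && \text{since } b_{\max} > n/(2k), \\
\text{hashes on each later level} &\le 2k
  && \text{two children per unmatched block, at most } k \text{ unmatched}, \\
\text{number of levels} &\le \lg(b_{\max}/b_{\min}) + 1 \le \lg(n/k) + 1, \\
\text{hash bits in total} &\le 2k \cdot 4\lg n \cdot \bigl(\lg(n/k) + 1\bigr)
  = O\bigl(k \lg n \,\lg(n/k)\bigr), \\
\text{literals at size } b_{\min} &\le k \cdot \lg n \text{ symbols} = O(k \lg n) \text{ bits}.
\end{align*}
```

The client's bit vectors cost one bit per hash received, so they do not change the asymptotic total.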

An Efficient Single-Round Protocol

First, we define the complete multi-round algorithm, which sends hashes for all blocks.

Second, we briefly describe systematic erasure codes.

Erasure Code

An erasure code encodes k source data items of size s into n > k encoded items of the same size s.

If any n-k of the encoded items are lost, the source data can still be recovered from the remaining k items.

A systematic erasure code is one where the encoded data items consist of the k source items plus n-k additional items.

[Figure by Luigi Rizzo]
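
The smallest instance of a systematic erasure code is a single XOR parity item, that is, n = k + 1. The sketch below illustrates the definition only; it is not the code used by the authors, and the names `encode` and `recover` are hypothetical.

```python
from functools import reduce
from operator import xor

def encode(source):
    """Systematic code with n = k + 1: the k source items followed by their XOR."""
    return list(source) + [reduce(xor, source, 0)]

def recover(received):
    """Rebuild the k source items from an encoded list with exactly one item set to None."""
    gap = received.index(None)
    repaired = list(received)
    # XOR of all surviving items equals the missing item.
    repaired[gap] = reduce(xor, (item for item in received if item is not None), 0)
    return repaired[:-1]  # drop the parity item, keep the k source items

source = [0x12, 0x34, 0x56, 0x78]
coded = encode(source)          # first four items are the source itself
damaged = list(coded)
damaged[2] = None               # lose any one of the n = 5 items
restored = recover(damaged)
```

Tolerating more than one loss needs a stronger code, such as a Reed-Solomon style construction.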

An Efficient Single-Round Protocol

Any hash value sent in the complete multi-round algorithm that would not be sent in the simple multi-round algorithm is not transmitted

Any hash value that would be sent by the simple multi-round algorithm is also not sent to the client, but considered lost

On each level there can be at most 2k lost hashes.

The client can recreate the entire level of hashes by using the 2k erasure hashes to recover the lost hashes.
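
The recovery step for one level can be sketched as follows. The assumptions are loud ones: hashes are treated as integers below a prime P, the erasure code is a Reed-Solomon style polynomial code evaluated by Lagrange interpolation (the slides do not say which code is used), and all function names are hypothetical.

```python
P = 2**61 - 1  # a Mersenne prime; hashes are assumed to be integers below P

def lagrange_eval(points, x):
    """Value at x of the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num = den = 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def server_erasure_hashes(level_hashes, k):
    """Server side: 2k extra code symbols for one level of block hashes."""
    m = len(level_hashes)
    points = list(enumerate(level_hashes))
    return [lagrange_eval(points, x) for x in range(m, m + 2 * k)]

def client_recover_level(known, erasure_hashes, m):
    """Client side: known maps position -> hash the client could compute itself."""
    points = dict(known)
    for i, value in enumerate(erasure_hashes):
        points[m + i] = value
    if len(points) < m:
        raise ValueError("more hashes lost than erasure hashes received")
    chosen = list(points.items())[:m]
    return [lagrange_eval(chosen, x) for x in range(m)]

level = [101, 202, 303, 404, 505, 606, 707, 808]   # the server's hashes for one level
parity = server_erasure_hashes(level, k=1)          # 2k = 2 erasure hashes
known = {i: h for i, h in enumerate(level) if i not in (2, 5)}  # two hashes are "lost"
recovered = client_recover_level(known, parity, len(level))
```

Because the server never needs to know which hashes are lost, the same erasure hashes serve any client that is missing at most 2k hashes on that level.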

An Efficient Single-Round Protocol

Theorem: Given a bound k on the edit distance between f_old and f_new, the erasure-based file synchronization algorithm correctly updates f_old to f_new with probability at least 1-1/n, using a single message of O(k lg(n) lg(n/k)) bits.

We note that there are highly efficient single-message protocols for estimating the file distance k.

Another property of the protocol is that by broadcasting a single message, the current version can be communicated to several clients that have different outdated versions.

A Practical Protocol Based on Erasure Codes

The previous protocol has two main shortcomings:

The protocol requires us to estimate an upper bound on the file distance k. An underestimate would make recovery impossible at the client.

More importantly, the algorithm does not support compression of unmatched literals.

To address these problems, we design another erasure-based algorithm that works better in practice.

A Practical Protocol Based on Erasure Codes

The hashes are sent from client to server.

For level i, m_i erasure hashes are sent.

The server identifies the common blocks and then sends the unmatched literals in compressed form.

Implementation Overview

We included three additional optimizations over rsync:

1) We replace the gzip algorithm used for transmission of the unmatched literals and match tokens with an optimized delta compressor.

The server now transmits the resulting delta and a bit vector to allow the client to create the same reference file.

2) We make a better choice of the number of bits per hash:

We assume some upper bound on the probability of a collision, say 1/2^d for some d, and then use lg(n)+lg(y)+d bits per hash, where n is the file size and y is the total number of hashes sent from client to server.

3) We integrate decomposable hashes:

This technique allows the hash of a child block to be computed from the hashes of its parent and sibling, halving the number of erasure hashes transmitted.
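
The decomposable-hash property in optimization 3 can be illustrated with a polynomial hash. This is an assumption chosen for illustration: the slides do not say which hash function the implementation uses, and the base B, modulus P, and function names are hypothetical.

```python
P = 2**61 - 1   # prime modulus (assumption)
B = 257         # polynomial base (assumption)

def poly_hash(block: bytes) -> int:
    """Horner-style polynomial hash of a block, mod P."""
    value = 0
    for byte in block:
        value = (value * B + byte) % P
    return value

def right_from_parent(parent_hash: int, left_hash: int, right_len: int) -> int:
    """hash(right child) from hash(parent) and hash(left sibling)."""
    return (parent_hash - left_hash * pow(B, right_len, P)) % P

def left_from_parent(parent_hash: int, right_hash: int, right_len: int) -> int:
    """hash(left child) from hash(parent) and hash(right sibling)."""
    inverse = pow(pow(B, right_len, P), P - 2, P)   # modular inverse of B^right_len
    return (parent_hash - right_hash) * inverse % P

parent = b"synchronize!"
left, right = parent[:6], parent[6:]
```

With such a hash, only one hash per sibling pair has to be transmitted, because the receiver already holds the parent's hash and can derive the other child's hash itself.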

Preliminary Results

For the experiments we used the gcc and emacs data sets, consisting of versions 2.7.0 and 2.7.1 of gcc and versions 19.28 and 19.29 of emacs.

Conclusions

We have described a new approach to remote file synchronization based on erasure codes.

Using this approach, we derived a single-round protocol that is feasible and communication-efficient with respect to a common file distance measure.