Distributed Collaborative Systems
Pascal Molli, Nantes University, GDD Team, LINA
Polytech 2014
Distributed Collaborative Systems
• “Distributed collaborative systems allow users from different computers to share applications and interact with each other.”
– Prasun Dewan, University of North Carolina
• Many software systems match this definition:
– Google Docs, Slides, Sheets, Office Live
– Google Drive, Dropbox
– Git, SVN, Mercurial
– But also wikis, email, workflows, chats, IRC
– and many others…
Engelbart 1997 : Turing Award
“For an inspiring vision of the future of interactive computing and the invention of key technologies to help realize this vision.”
« For an inspiring vision … »
• 1962: “The complexity and urgency of the problems we face are growing faster than our ability to understand and solve them. This is a critical problem; we can take strategic action, collectively.”
• Engelbart's approach:
– “Augment human intellectual capabilities by augmenting our collective intelligence...”
« For an inspiring vision … »
• In the 1960s, Engelbart developed the NLS tool, based on two principles:
1. “Develop knowledge collection”: codify and structure knowledge.
• In 1962, an approach centered on hypertext
2. “Use networked improvement communities”
• Communities to develop and maintain knowledge.
• Creation of the first “groupware” (“collecticiel” in French)
« … and key technologies… »
http://sloan.stanford.edu/MouseSite
First Shared editor
• 1968: “The Mother of All Demos” (Engelbart)
– First collaborative editor, http://web.stanford.edu/dept/SUL/library/extra4/sloan/MouseSite/1968Demo.html, see clip 26
• Hangouts, wikis, Semantic Web -> Engelbart's vision is still in progress… Let's hope it works for humankind ;)
CSCW Matrix
http://en.wikipedia.org/wiki/Computer-supported_cooperative_work
The Time/Space Groupware Matrix
• same place (colocated), same time (synchronous): face to face interactions
– decision rooms
– single display groupware
– shared table / wall displays
– roomware…
• same place (colocated), different times (asynchronous): continuous task
• different places (remote), same time (synchronous): remote interactions
• different places (remote), different times (asynchronous): communication+coordination
Saul Greenberg, University of Calgary
Group Decision Rooms
• Embeds decision making process
– dedicated computer-based conference facility
– real-time large group support (5-50)
– typically facilitated
– embeds a structured meeting process
– domain of MIS
• Typical functions
– explore unstructured problems
– brainstorm ideas
– organize/prioritize results
– voting…
– good for brainstorming, but…
The COLAB meeting room, Xerox PARC http://www2.parc.com/istl/members/stefik/colab.htm
Single Display Groupware
• Multiple people using a single display
– multiple input devices
– simultaneous input
– new interaction widgets
– technical issues (O/S)
– conflict with conventional applications
– supporting social conventions of simultaneous work
– mice vs. direct touch…
Edward Tse http://grouplab.cpsc.ucalgary.ca/papers/2004/04-SDGToolkit-MScThesis/SDGToolkit-MSc.pdf
Shared Table / Wall Displays
– device characteristics
– social affordances of tables/walls
InteracTable and Dynawall, from the GMD Darmstadt web site on I-Land
Roomware
• Computer-augmented room elements
– integrated desk/wall displays for collaboration
– inter-operation between devices
From the GMD Darmstadt web site on I-Land
The Time/Space Groupware Matrix
• same place (colocated), same time (synchronous): face to face interactions
• same place (colocated), different times (asynchronous): continuous task
• different places (remote), same time (synchronous): remote interactions
– video conferencing
– instant messaging
– chats/muds/virtual worlds
– shared screens
– multi-user editors
• different places (remote), different times (asynchronous): communication+coordination
Video / Audio conferencing
• Desktop conferencing
– bandwidth/latency issues
– what is the value of talking heads?
VoiceToVideo, http://www.voicetovideo.com/images/video_lge.gif
Xerox PARC video link
Rich Instant Messaging
• Can do much more than text
– How does one handle complexity?
– How does one handle interruption?
Community Bar, by Gregor McEwan, U Calgary
Chat rooms/MUDS/Virtual worlds
• Space for meeting and interacting with people
– from text to 3D spaces
– can move between ‘rooms’ and/or around space
– seeing/manipulating artifacts
– self-representation (avatars)
– community of strangers
– shared purpose…
Fred Hutchinson Cancer Research Center: Social Support for Cancer Patients
Shared Screens/Windows
• Share unaltered single user applications
– technical concerns
• how regions are captured/transmitted
• architectural limitations
• controlling input
• access control…
– social limitations
• turn-taking
• control
• privacy
Richardson, T., Stafford-Fraser, Q., Wood, K. and Hopper, A. Virtual Network Computing. IEEE Internet Computing, Vol. 2, No. 1, pp. 33-39, January/February 1998.
Multi-user editors
• True groupware for visual artifacts
– structured documents (e.g., text paper)
– visual workspace (2D graphics)
– awareness
– conflicting actions
– tight vs loose coupling
– relaxed WYSIWIS
The Time/Space Groupware Matrix
• same place (colocated), same time (synchronous): face to face interactions
• same place (colocated), different times (asynchronous): continuous task
• different places (remote), same time (synchronous): remote interactions
• different places (remote), different times (asynchronous): communication+coordination
– email
– bulletin boards, blogs
– asynchronous conferencing
– group calendars
– workflow
– version control
– wikis
Email
• Many styles
– vanilla email
– threaded mail
– intelligent mail (routing / sorting)
– structured mail (by speech acts)
– multimedia mail
– object-oriented mail
– distribution lists / elist servers
• Social
– managing complexity and overloads
– spam
– archiving
Email – Information Lens
• Structured email
– messages as inherited object types
• Rules
Communal Messaging
• Many types
– bulletin boards
– computer conferencing
– discussion groups
– blogs
– e.g., Usenet
Group Calendars
• common calendar
• meeting scheduling
• resource use
• privacy
• who keeps things up to date?
• how do you stop people scheduling your meetings?
http://www.americusglobal.com/images/groupcalender.gif
Workflow
• “Integration and harmonious adjustment of individual work efforts toward the accomplishment of a larger goal” – B. Singh
• Codified procedures and processes
– PeopleSoft
– forms management and routing
– coordination theory (speech acts)
– notifications triggering user actions
– triggering automated actions
– standard operations
– exceptions management
Wikis
• Group-viewable / editable web site
• community of strangers to community of collaborators
• culture of what is allowed vs. hard-coded access control
The Time/Space Groupware Matrix
• same place (colocated), same time (synchronous): face to face interactions
• same place (colocated), different times (asynchronous): continuous task
– team rooms
– large public displays
– shift work groupware
– project management
• different places (remote), same time (synchronous): remote interactions
• different places (remote), different times (asynchronous): communication+coordination
Community Bulletin Boards
• Post information from various sources to a public place
– who posts?
– how to personalize?
– relevance?
From Multimedia Fliers, Churchill, Nelson, Denoue, Communities and Technologies 2003
Collage, Greenberg2001
ScrumBoard
Focus on shared editors…
• OK: many systems, many issues, many approaches -> focus on shared editors
How does it work?
• Two ways to build one today:
– Using the Operational Transformation (OT) approach
– Using the Conflict-free Replicated Data Types (CRDT) approach
OPERATIONAL TRANSFORMATIONS
Operational Transformation (OT)
• Designed for real-time collaborative editing
• C. A. Ellis and S. J. Gibbs. Concurrency control in groupware systems. Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, pages 399–407, 1989.
• Used in Google Docs and Google Wave
http://en.wikipedia.org/wiki/Operational_transformation
OT
• To achieve high responsiveness (latency issue):
• Data (text) is replicated on all sites
• Each time a user types a character, the operation is:
– Executed locally and immediately (no lock, no communication…)
– Broadcast to the other sites (oh yeah? How?)
– Received by the others (eventually, but with no ordering guarantee and no time limit)
– Integrated (possibly modified) and re-executed
OT
• The system is correct if it ensures the CCI model
– Causality (Lamport's definition)
– Convergence (eventual consistency)
– Intention (not formally defined)
OT Causality
[Figure: three sites s1, s2, s3. s1 generates op1=mkdir(/). s2 receives op1, then generates op2=mkfile(/a): op1->op2, causality established. s3 receives mkfile(/a) before mkdir(/): it cannot execute. Causality violated.]
OT Causality
• If op1->op2 on one site (happened-before), op1 should precede op2 on all sites
• Vector clocks are usually used to implement causal reception in OT systems.
Lamport, Leslie (1978). "Time, Clocks and the Ordering of Events in a Distributed System", Communications of the ACM, 21(7), 558-565.
F. Mattern, "Virtual Time and Global States of Distributed Systems", Proc. of the International Workshop on Parallel and Distributed Algorithms, Bonas, France, October 1988, North-Holland 1989.
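OT and Causality
[Figure: s1, s2 and s3 all start with vector clock [0,0,0]. s1 generates op1=mkdir(/) stamped [1,0,0]. s2 receives op1, then generates op2=mkfile(/a) stamped [1,1,0]. s3 receives mkfile(/a) [1,1,0] first: not causally ready, delayed until mkdir(/) [1,0,0] has been delivered.]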
OT and Causality
• For vectors u and v:
– u ≤ v iff ∀ i, u[i] ≤ v[i]
– u < v iff u ≤ v and u ≠ v
– u || v iff ¬(u < v) and ¬(v < u)
• So two operations that are not causally related are concurrent…
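The vector comparisons above translate almost literally into code. The sketch below is only an illustration: `causally_ready` is the usual causal-delivery test (an assumption, the slides do not spell it out), and sites are identified by their index in the vector.

```python
from typing import List

def leq(u: List[int], v: List[int]) -> bool:
    """u <= v iff every component of u is <= the matching component of v."""
    return all(a <= b for a, b in zip(u, v))

def lt(u: List[int], v: List[int]) -> bool:
    """u < v iff u <= v and u != v (u happened before v)."""
    return leq(u, v) and u != v

def concurrent(u: List[int], v: List[int]) -> bool:
    """u || v iff neither vector is strictly smaller than the other."""
    return not lt(u, v) and not lt(v, u)

def causally_ready(op_clock: List[int], sender: int, local_clock: List[int]) -> bool:
    """Assumed delivery rule: the op is the next one from `sender`,
    and every op it depends on from other sites was already delivered."""
    return all(
        c == local_clock[k] + 1 if k == sender else c <= local_clock[k]
        for k, c in enumerate(op_clock)
    )

# Scenario of the slides: s3 receives mkfile(/a) [1,1,0] before mkdir(/) [1,0,0].
s3 = [0, 0, 0]
mkfile_ready_first = causally_ready([1, 1, 0], 1, s3)   # must be delayed
mkdir_ready = causally_ready([1, 0, 0], 0, s3)          # can be delivered
mkfile_ready_later = causally_ready([1, 1, 0], 1, [1, 0, 0])
```

A site would keep delayed operations in a buffer and re-test them each time its local clock advances.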
OT and Concurrency
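[Figure: s1 and s2 both start at [0,0]. s1 generates op1=mkdir(/a) stamped [1,0]; s2 generates op2=mkdir(/b) stamped [0,1]. Neither vector is smaller than the other: op1||op2 and op2||op1.]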
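OT and Convergence (or eventual consistency)
[Figure: S1 and S2 both hold « ab ». S1 executes ins(x,1) and gets « axb »; S2 executes ins(y,1) and gets « ayb ». Each site then executes the other operation unchanged: S1 ends with « ayxb », S2 with « axyb ». Divergence.]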
OT and « Intention »
• The intention of an operation is the effect observed when it is executed on its generation state
• On « ab », for ins(1,x), we can observe:
– ins(1,x)
– ins(a<x)
– ins(x<b)
– ins(a<x<b)
C. Sun, X. Jia, Y. Zhang, Y. Yang, and D. Chen. Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems. ACM Transactions on Computer-Human Interaction (TOCHI), 5(1):63–108, 1998.
OT and Intention Preservation
• Intention preservation:
– The effect of an operation op on all sites must be the same as the intention of op
– The effect of an operation must not change the effect of a concurrent operation
• if intention(ins(1,x))=ins(1,x) -> cannot preserve
• if intention(ins(1,x))=ins(a<x<b) -> can preserve
OT and Intention
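OT General Idea
[Figure: both sites start from « efect » (state S1). Op1=Ins(2,f) gives « effect »; Op2=Ins(5,s) gives « efects ». Executing Ins(5,s) unchanged on S1 o Op1 gives « effecst »; once transformed against Op1 it gives « effects ».]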
OT Transformations
Transformation Functions
• T(op1,op2) = op'1
• Pre-conditions for applying T:
– op1 and op2 are concurrent (op1||op2)
– op1 and op2 are defined on the same state S
• Returns op'1 such that:
– op'1 is defined on S.op2
– op'1 has the same effect as op1 (intentions)
OT General Framework
• n sites, each site runs an integration algorithm
• Causally ready operations are delivered to the integration algorithm
• The algorithm is INDEPENDENT of shared data types; it relies on transformation functions
• The integration algorithm calls the appropriate T when the pre-conditions of T are verified
OT
• An OT algorithm is proven correct if the underlying transformation functions ensure at least property TP1:
– op1.T(op2,op1) ≡ op2.T(op1,op2)
• OT algorithms requiring only TP1 are SOCT4 [Vidot00] and COT [Sun06]
M. Ressel, D. Nitsche-Ruhland, and R. Gunzenhäuser. An integrating, transformation-oriented approach to concurrency control and undo in group editors. In CSCW ’96: Proceedings of the 1996 ACM conference on Computer supported cooperative work, pages 288–297, New York, NY, USA, 1996. ACM.
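TP1 can be checked on the two-insert example « ab » with ins(x,1) and ins(y,1). The sketch below is a minimal illustration restricted to insertions on a string, with 0-based positions; the `Ins` class and the site-id tie-break are assumptions made for the example, not the functions of any cited algorithm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ins:
    pos: int    # 0-based insertion index
    char: str
    site: int   # used only to break ties between equal positions

def apply(state: str, op: Ins) -> str:
    return state[:op.pos] + op.char + state[op.pos:]

def transform(op1: Ins, op2: Ins) -> Ins:
    """T(op1, op2): rewrite op1 so that it can be executed after op2."""
    if op1.pos < op2.pos or (op1.pos == op2.pos and op1.site < op2.site):
        return op1
    return Ins(op1.pos + 1, op1.char, op1.site)

op1 = Ins(1, "x", site=1)   # generated on S1
op2 = Ins(1, "y", site=2)   # generated on S2, concurrently

# Without transformation the replicas diverge.
naive_s1 = apply(apply("ab", op1), op2)
naive_s2 = apply(apply("ab", op2), op1)

# TP1: op1 . T(op2, op1) and op2 . T(op1, op2) lead to the same state.
ot_s1 = apply(apply("ab", op1), transform(op2, op1))
ot_s2 = apply(apply("ab", op2), transform(op1, op2))
```

Deletions need additional cases; the TP1 figure that follows shows how an incorrect insert/delete case breaks the property.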
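OT and TP1
[Figure: Site 1 (user 1) and Site 2 (user 2) both start from « abc ». Op1=Ins(2,x) on Site 1 gives « axbc »; Op2=Del(2) on Site 2 gives « ac ». Site 1 then executes Op'2=Del(3) and obtains « axc ». With the functions of [Ellis, Sigmod89], Site 2 executes Op'1=Ins(1,x) and obtains « xac »: TP1 is violated (TP1 ko). The functions of [Ressel96] satisfy TP1 (TP1 ok).]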
SOCT4
• SOCT4:
– Operations are totally ordered!
• Sending an operation requires obtaining a timestamp
– An operation cannot be sent while concurrent, already published operations are pending (deferred broadcast)
• (cannot commit if not up to date)
N. Vidot, M. Cart, J. Ferrié, and M. Suleiman. Copies convergence in a distributed real-time collaborative environment. Proceedings of the 2000 ACM conference on Computer supported cooperative work, pages 171–180, 2000.
[Figure: SOCT4 principle. Before integration, the history holds op1 … opi-1 followed by the local operations opL1 … opLm waiting for broadcast. Integration of opi: opi is forward transposed with the local operations. After integration, the history holds op1 … opi-1, opi, followed by the transformed local operations.]
SOCT4 principle
SO6 : SOCT4 for Version Control System
P. Molli, G. Oster, H. Skaf-Molli, and A. Imine. Using the transformational approach to build a safe and generic data synchronizer. Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work,pages 212–220, 2003.
Site « Hala », Ns=2, Synchronize!
1. Local log: log[0]=opl1, log[1]=opl2
2. getOp(Ns+1)=op3
3. T(opl1,op3)=opl’1 and T(op3,opl1)=op’3
4. T(opl2,op’3)=opl’2 and T(op’3,opl2)=op’’3
5. Execute(op’’3), Ns=Ns+1; getOp? No more remote op
6. Send(opl’1), Send(opl’2)
OT and SOCT4
• Total ordering of operations requires:
– A central timestamper:
• can be shut down or too busy
– Or a consensus for delivering totally ordered timestamps (Paxos):
• very costly
• requires the sites involved in the consensus to stay connected during the consensus (problem with churn in P2P networks)
Jupiter (the algorithm behind Google Docs)
• Requires a central server
– Client-server architecture, but…
– Transformations are done on both the server side and the client side
• Requires TP1
– S o Op1 o T(op2,op1) ≅ S o Op2 o T(op1,op2)
• Does not need version vectors (or just 2-entry vectors)
Nichols, David A., et al. "High-latency, low-bandwidth windowing in the Jupiter collaboration system." Proceedings of the 8th annual ACM symposium on User interface and software technology. ACM, 1995.
Centralized OT
• Pros:
– Quite efficient, mature technology…
• Cons:
– Single point of failure (or costly consensus)
– Economic intelligence, privacy…
– Service limitation
• Try 51 connected users on Google Docs??
DECENTRALIZED OPERATIONAL TRANSFORMATION
Decentralized OT
• Other OT algorithms require no particular order on the reception of operations:
– adOPTed [Ressel96], GOTO [Sun98], SOCT2 [Suleiman98]
• But they require a new property on T, TP2:
– T(op3,op1.T(op2,op1)) = T(op3,op2.T(op1,op2))
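OT and TP2
Ressel – [CSCW96]: TP1 ok, but…
[Figure: users 1, 2 and 3 all start from « abc ». Op1=Ins(2,x) gives « axbc », Op2=Del(2) gives « ac », Op3=Ins(3,y) gives « abyc ». One site executes Op2 then Op'3=Ins(2,y), another executes Op3 then Op'2=Del(2); both reach « ayc ». Op1 is then transformed along the two paths (Op'1, Op''1) into Ins(3,x) on one and Ins(2,x) on the other, giving « ayxc » and « axyc »: TP2 is violated.]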
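Sun – [TOCHI98]
[Figure: Sites 1, 2 and 3 all start from « abc ». Op1=Ins(2,y) gives « aybc », Op2=Del(2) gives « ac », Op3=Ins(3,y) gives « abyc ». After Op2 and Op3 (Op'3=Ins(2,y), Op'2=Del(2)) both paths reach « ayc ». Op1 is then transformed into Ins(3,y) on one path and Ins(2,y) on the other (Op'1, Op''1): both states read « ayyc », but the two transformed operations differ.]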
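Imine – [ECSCW03]
[Figure: users 1, 2 and 3 all start from « abc »; insertions carry an extra position parameter. Op1=Ins(2,2,x) gives « axbc », Op2=Del(2) gives « ac », Op3=Ins(3,3,y) gives « abyc ». After Op2 and Op3 (Op'3=Ins(2,3,y), Op'2=Del(2)) both paths reach « ayc ». Op1 is transformed into Ins(2,2,x) on both paths (Op'1, Op''1), giving « axyc » on both.]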
TP2 verification
• TP2 is quite difficult to verify “by hand”
• Use a theorem prover to validate TP2
• Specify T in SPIKE; a counterexample is generated in case of violation...
• We proved that all existing T are incorrect.
– [Ressel96, Sun98, Suleiman97, Li04, Imine03...]
• Do functions ensuring TP2 exist?
A. Imine, P. Molli, G. Oster, and M. Rusinowitch. Proving Correctness of Transformation Functions in Real-Time Groupware. ECSCW 2003: Proceedings of the Eighth European Conference on Computer Supported Cooperative Work, 14-18 September 2003, Helsinki, Finland, 2003.
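Common Problem (“false-tie”)
[Figure: Sites 1, 2 and 3 all start from “abc”. Site 1 executes op1=Insert(2,x) and gets “axbc”; Site 2 executes op2=Delete(2,b) and gets “ac”; Site 3 executes op3=Insert(3,y) and gets “abyc”. Once b is deleted, op'1=Insert(2,x) and op'3=Insert(2,y) target the same position: “axyc”? “ayxc”?]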
TTF Approach [CollCom06]
• Keep “tombstones” of deleted elements
[Figure: the view shows « a b c »; the model also keeps the deleted characters h and n as tombstones: « h a b n c ». Insert(3,y) on the view is translated by view2model() into Insert(5,y) on the model, giving the view « a b y c » and the model « h a b n y c ».]
G. Oster, P. Urso, P. Molli, and A. Imine. Tombstone transformation functions for ensuring consistency in collaborative editing systems. In The Second International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2006), Atlanta, Georgia, USA, November 2006. IEEE Press.
Tombstone Transformation Functions (TTF)
T(Insert(p1,el1,sid1), Insert(p2,el2,sid2)){
  if(p1<p2) return Insert(p1,el1,sid1)
  else if(p1=p2 and sid1<sid2) return Insert(p1,el1,sid1)
  else return Insert(p1+1,el1,sid1)
}
T(Insert(p1,el1,sid1), Delete(p2,el2,sid2)){
  return Insert(p1,el1,sid1)
}
T(Delete(p1,sid1), Insert(p2,sid2)){
  if(p1<p2) return Delete(p1,sid1)
  else return Delete(p1+1,sid1)
}
T(Delete(p1,sid1), Delete(p2,sid2)){
  return Delete(p1,sid1)
}
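The four functions above can be exercised on the “false-tie” scenario. The sketch below is a hedged Python transcription: the model is assumed to be a list of (character, visible) pairs, positions are 0-based model positions, and the class and helper names are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Insert:
    pos: int
    elem: str
    sid: int

@dataclass(frozen=True)
class Delete:
    pos: int
    sid: int

Op = Union[Insert, Delete]
Model = List[Tuple[str, bool]]   # (character, visible); invisible = tombstone

def transform(op1: Op, op2: Op) -> Op:
    """T(op1, op2) following the four TTF cases."""
    if isinstance(op2, Delete):
        return op1                      # a delete never shifts model positions
    if isinstance(op1, Insert):
        if op1.pos < op2.pos or (op1.pos == op2.pos and op1.sid < op2.sid):
            return op1
        return Insert(op1.pos + 1, op1.elem, op1.sid)
    if op1.pos < op2.pos:
        return op1
    return Delete(op1.pos + 1, op1.sid)

def apply(model: Model, op: Op) -> Model:
    new = list(model)
    if isinstance(op, Insert):
        new.insert(op.pos, (op.elem, True))
    else:
        new[op.pos] = (new[op.pos][0], False)   # keep the tombstone
    return new

def view(model: Model) -> str:
    return "".join(ch for ch, visible in model if visible)

base = [("a", True), ("b", True), ("c", True)]
op1, op2, op3 = Insert(1, "x", 1), Delete(1, 2), Insert(2, "y", 3)

# Path A integrates op1, op2, op3; path B integrates op2, op1, op3.
path_a = apply(apply(base, op1), transform(op2, op1))
path_a = apply(path_a, transform(transform(op3, op1), transform(op2, op1)))
path_b = apply(apply(base, op2), transform(op1, op2))
path_b = apply(path_b, transform(transform(op3, op2), transform(op1, op2)))
```

Because the deleted b stays in the model, x and y keep distinct model positions and both paths produce the same model.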
Compacted Storage Model (C-TTF)
[Figure: same example as for the basic model: Insert(3,y) on the view « a b c » becomes Insert(5,y) through view2model(). The compacted model stores only the visible characters with their absolute positions in the basic model: (a,2) (b,3) (y,5) (c,6).]
C-Model = sequence of (character, abs_pos)
Delta Storage Model (D-TTF)
[Figure: same Insert(3,y) / Insert(5,y) example. Compacted model: (a,2) (b,3) (y,5) (c,6). Delta model: (a,+2) (b,+1) (y,+2) (c,+1) (Ce,+1), each offset being relative to the previous stored character.]
D-Model = sequence of (character, offset)
Models Comparison
• Basic Model
– Deleted characters are kept
– The size of the model grows indefinitely
• Compacted Model
– Update the absolute position of all characters located after the effect position
• Delta Model
– Update the offset of the next character
• Our observations
– view2model can be optimised (caret position)
– The overhead of view2model is not significant
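To make the three storage models concrete, here is a small sketch of view2model() for each of them, using the 1-based positions of the figures. The data layouts and function names are assumptions chosen for the illustration; only in-range positions are handled for the compacted and delta models.

```python
from typing import List, Tuple

def view2model_basic(model: List[Tuple[str, bool]], view_pos: int) -> int:
    """Basic model: list of (char, visible). Returns the 1-based model
    position of the view_pos-th visible character."""
    seen = 0
    for i, (_, visible) in enumerate(model, start=1):
        if visible:
            seen += 1
            if seen == view_pos:
                return i
    return len(model) + 1   # insertion after the last element

def view2model_compact(cmodel: List[Tuple[str, int]], view_pos: int) -> int:
    """Compacted model: visible characters with their absolute position."""
    return cmodel[view_pos - 1][1]

def view2model_delta(dmodel: List[Tuple[str, int]], view_pos: int) -> int:
    """Delta model: visible characters with an offset from the previous one."""
    return sum(offset for _, offset in dmodel[:view_pos])

# State before Insert(3,y): view « a b c », tombstones h and n in the model.
basic = [("h", False), ("a", True), ("b", True), ("n", False), ("c", True)]
compact = [("a", 2), ("b", 3), ("c", 5)]
delta = [("a", 2), ("b", 1), ("c", 2)]

positions = (
    view2model_basic(basic, 3),
    view2model_compact(compact, 3),
    view2model_delta(delta, 3),
)
```

All three layouts translate the view operation Insert(3,y) into the model position 5; they differ in what has to be updated afterwards.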
Partial Summary
• We have a solution with TP1 and SOCT4
– Central timestamper or consensus
– Not compatible with P2P
– Decentralization -> TP2
• We now have transformation functions that satisfy TP2
– We need an algorithm: GOTO, SOCT2, adOPTed, COT
Partial Concurrency Problem
[Figure: S1 and S2 start at [0,0]. S1 generates op1 [1,0]; S2 generates op2 [0,1], then op3 [0,2]. S1 integrates op2 as op'2 [1,1], then receives op3.]
Cannot transform!
On S1: op3 // op1, but op3 and op1 are not defined on the same state: T(op3,op1) cannot be applied!
Partial Concurrency
• What has been done can be undone
• T-1(op'2,op1)=op2
• op2//op1
[Figure: on S1, starting from [0,0], the history op1 [1,0]; op'2 [0,1] is rewritten as op2 [0,1]; op'1 [1,0].]
Partial Concurrency
• op2 precedes op3
– No transformation
• op3 // op'1 and both are defined on the same state:
– T(op3,op'1)
[Figure: on S1, the history op1 [1,0]; op'2 [0,1] has been rewritten as op2 [0,1]; op'1 [1,0]. op3 [0,2] needs no transformation against op2; T(op3,op'1)=op'3 is then executed.]
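Respecting User’s Intention in a Partial Concurrency Situation
(From J. Ferrié)
[Figure: Site 1 and Site 2 both start from "telefone". Site 1 executes op1=delete(5) and gets "teleone". Site 2 executes op2=insert(5,'p'), giving "telepfone", then op3=insert(6,'h'), giving "telephfone"; it integrates op1 as delete(7) and obtains "telephone". Site 1 integrates op2 as insert(5,'p') and gets "telepone". To integrate op3, Site 1 applies Transpose_bk to its history, which becomes insert(5,'p'); delete(6), then Transpose_fd(delete(6), insert(6)), and executes insert(6,'h'): "telephone".]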
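[Figure: SOCT2 principle. On causal reception of opn+1, the history op1 … opn is reordered so that the operations preceding opn+1 come first and the operations concurrent with it come last; opn+1 is then transformed against the concurrent ones.]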
The SOCT2 Algorithm
M. Suleiman, M. Cart, and J. Ferrie. Serialization of concurrent operations in a distributed collaborative environment. In GROUP ’97 : Proceedings of the international ACM SIGGROUP conference on Supporting group work, pages 435–445, New York, NY, USA, 1997. ACM.
Correctness of TTF
➲ Reversibility property:
➲ T-1(T(op1,op2), op2) = op1
[Figure: op'1=T(op1,op2) and, conversely, op1=T-1(op'1,op2).]
Inverses of Transformation Functions
T-1(Insert(p1,el1,sid1), Insert(p2,el2,sid2)){
  if(p1<p2) return Insert(p1,el1,sid1)
  else if(p1=p2 and sid1<sid2) return Insert(p1,el1,sid1)
  else return Insert(p1-1,el1,sid1)
}
T-1(Insert(p1,el1,sid1), Delete(p2,el2,sid2)){
  return Insert(p1,el1,sid1)
}
T-1(Delete(p1,sid1), Insert(p2,sid2)){
  if(p1<p2) return Delete(p1,sid1)
  else return Delete(p1-1,sid1)
}
T-1(Delete(p1,sid1), Delete(p2,sid2)){
  return Delete(p1,sid1)
}
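The reversibility property can be checked by brute force on small positions. The sketch below encodes operations as plain tuples (kind, position, site id), which is an assumption made to keep the example short; concurrent operations are taken to come from different sites.

```python
from itertools import product

def t(op1, op2):
    """T(op1, op2) for the tombstone model; ops are (kind, pos, sid) tuples."""
    kind1, p1, sid1 = op1
    kind2, p2, sid2 = op2
    if kind2 == "del":
        return op1
    if kind1 == "ins":
        keep = p1 < p2 or (p1 == p2 and sid1 < sid2)
    else:
        keep = p1 < p2
    return op1 if keep else (kind1, p1 + 1, sid1)

def t_inv(op1, op2):
    """T-1(op1, op2): same case analysis, shifting back by one."""
    kind1, p1, sid1 = op1
    kind2, p2, sid2 = op2
    if kind2 == "del":
        return op1
    if kind1 == "ins":
        keep = p1 < p2 or (p1 == p2 and sid1 < sid2)
    else:
        keep = p1 < p2
    return op1 if keep else (kind1, p1 - 1, sid1)

def reversible_everywhere(max_pos: int = 6) -> bool:
    """Check T-1(T(op1, op2), op2) == op1 for all small concurrent pairs."""
    kinds = ("ins", "del")
    for k1, k2, p1, p2 in product(kinds, kinds, range(max_pos), range(max_pos)):
        for sid1, sid2 in ((1, 2), (2, 1)):
            op1, op2 = (k1, p1, sid1), (k2, p2, sid2)
            if t_inv(t(op1, op2), op2) != op1:
                return False
    return True

all_reversible = reversible_everywhere()
```

This is a finite sanity check, not a proof; the cited work relies on a theorem prover for that.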
OT and Garbage Collection
• Basically, each site needs to know the state of the other sites
– It can guess this state from the received state vectors
• Each site maintains a State Vector Table
– SVT[i]
• Each site maintains a Minimum State Vector
– MSV = min SVT[i], for all i
• Local operation with SV < MSV => garbage collected
Sun, C., Jia, X., Zhang, Y., Yang, Y., and Chen, D. 1998. Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems. ACM Trans. Comput.-Hum. Interact. 5, 1 (Mar. 1998), 63-108. DOI= http://doi.acm.org/10.1145/274444.274447
OT and Garbage Collection
[Figure: three sites s0, s1, s2 exchanging op1 [1,0,0], op2 [0,1,0], o3 [1,2,0] and o4 [0,1,1]. State vector tables shown in the figure, in order of progress:]
SVT[0]=[1,0,0] SVT[1]=[0,0,0] SVT[2]=[0,0,0] MSV=[0,0,0]
SVT[0]=[1,1,0] SVT[1]=[0,1,0] SVT[2]=[0,0,0] MSV=[0,0,0]
SVT[0]=[1,1,1] SVT[1]=[0,1,0] SVT[2]=[0,1,1] MSV=[0,1,0]
SVT[0]=[1,2,1] SVT[1]=[1,2,0] SVT[2]=[0,1,1] MSV=[0,1,0]
Can garbage collect op2 on site 0!
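The MSV is simply the component-wise minimum of the table. The sketch below reproduces one row of the example; note that it uses a component-wise ≤ test for collection, an assumption made because the worked example collects op2 [0,1,0] against MSV [0,1,0], which a strict < would not allow.

```python
from typing import List

def minimum_state_vector(svt: List[List[int]]) -> List[int]:
    """MSV[k] = min over all sites i of SVT[i][k]."""
    return [min(column) for column in zip(*svt)]

def can_collect(op_sv: List[int], msv: List[int]) -> bool:
    """Assumed rule: every site has already seen the op (component-wise <=)."""
    return all(a <= b for a, b in zip(op_sv, msv))

# Table held by site 0 once it knows that s1 and s2 have both seen op2.
svt_site0 = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]
msv = minimum_state_vector(svt_site0)

op1_sv, op2_sv = [1, 0, 0], [0, 1, 0]
collect_op1 = can_collect(op1_sv, msv)   # s1 and s2 have not confirmed op1 yet
collect_op2 = can_collect(op2_sv, msv)
```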
OT and P2P networks
• OT:
– CCI model; intentions are maintained by the transformation functions
– Generic: the integration algorithm is independent of the data type.
– Can be fully decentralized, but does not cope with churn...
• Problem:
– Mainly uses state vectors (which do not scale) to detect concurrency...
– MOT2 [CollCom07] is an exception
MOT2: OT for P2P [CollCom07]
• MOT2 is clearly the successor of SOCT4
– Without central sites
– Without state vectors
• MOT2 does not broadcast operations
– MOT2 uses pairwise reconciliations
M. Cart and J. Ferrié. Asynchronous reconciliation based on operational transformation for P2P collaborative environments. In The Third International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2007), White Plains, New York, USA, November 2007. IEEE Press.
MOT2 Pair reconciliations
[Figure: s1, s2 and s3 generate op1, op2 and op3. A first reconciliation leaves the two sites involved with op1+op2; further pairwise reconciliations propagate op1+op2+op3 until every site holds op1+op2+op3.]
MOT2 Scenario
[Figure: three sites S1, S2, S3. op1 reaches all sites through reconciliations. op2 is then generated and reconciled, so two sites hold [op1,op2]; op3 is generated on top of it, while S1 generates op4 on top of op1. S1 and S3 then reconcile:]
[op1;op4] VS [op1,op2,op3] ???
MOT2 Scenario
[Figure: same scenario, after the reconciliation between S1 and S3:]
S1: Stored [op1;op4;op2';op3'], Exec [op1;op4;op2';op3']
S3: Stored [op1;op4;op2';op3'], Exec [op1;op2;op3;op4']
MOT2 Principle
• S1 connects to S3 for reconciliation and sends its whole log: [op1;op4]
• S3 detects op1 as the common prefix and infers
– op4 // [op2;op3]
– S3 computes
• op4'=T(op4,[op2;op3])
• [op2';op3'] (as in SOCT4)
– [op2;op3;op4'] <=> [op4;op2';op3'] by TP1 and TP2
MOT2 idea
• MOT2 maintains a totally ordered history!!
• op4 must appear first after the common prefix (S1<S3)
• So both sites will store [op1;op4;op2';op3']
– S3 executes op4', stores [op1;op4;op2';op3'] and sends [op2';op3'] to S1 to be executed immediately
– and S1 executes [op2';op3'] and stores [op1;op4;op2';op3']
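The reconciliation step can be sketched as "find the common prefix, then cross-transform the two concurrent suffixes". The code below is not MOT2 itself: it is a hedged illustration on insert-only string operations, and `Ins`, `split_logs` and `cross_transform` are names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Ins:
    pos: int
    char: str
    site: int

def apply(state: str, op: Ins) -> str:
    return state[:op.pos] + op.char + state[op.pos:]

def transform(op1: Ins, op2: Ins) -> Ins:
    if op1.pos < op2.pos or (op1.pos == op2.pos and op1.site < op2.site):
        return op1
    return Ins(op1.pos + 1, op1.char, op1.site)

def split_logs(log_a: List[Ins], log_b: List[Ins]) -> Tuple[List[Ins], List[Ins], List[Ins]]:
    """Return (common prefix, suffix of a, suffix of b)."""
    n = 0
    while n < min(len(log_a), len(log_b)) and log_a[n] == log_b[n]:
        n += 1
    return log_a[:n], log_a[n:], log_b[n:]

def cross_transform(seq_a: List[Ins], seq_b: List[Ins],
                    t: Callable[[Ins, Ins], Ins]) -> Tuple[List[Ins], List[Ins]]:
    """Return (seq_a', seq_b') such that seq_a + seq_b' and seq_b + seq_a'
    reach the same state, assuming t satisfies TP1."""
    seq_b = list(seq_b)
    out_a = []
    for a in seq_a:
        new_b = []
        for b in seq_b:
            new_b.append(t(b, a))   # b rewritten to run after a
            a = t(a, b)             # a rewritten to run after b
        seq_b = new_b
        out_a.append(a)
    return out_a, seq_b

def run(ops: List[Ins]) -> str:
    state = ""
    for op in ops:
        state = apply(state, op)
    return state

op1, op4 = Ins(0, "a", 2), Ins(1, "d", 1)
op2, op3 = Ins(0, "b", 2), Ins(2, "c", 3)
log_s1, log_s3 = [op1, op4], [op1, op2, op3]

prefix, s1_suffix, s3_suffix = split_logs(log_s1, log_s3)
s1_prime, s3_prime = cross_transform(s1_suffix, s3_suffix, transform)
stored = prefix + s1_suffix + s3_prime        # [op1; op4; op2'; op3']
state_s1 = run(prefix + s1_suffix + s3_prime)
state_s3 = run(prefix + s3_suffix + s1_prime)
```

S1 only executes the transformed remote suffix and S3 only executes op4', yet both end up storing the same totally ordered log.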
MOT2
• A very nice algorithm
• MOT2 requires TP1 and TP2 (CE in CollabCom08)
• Generic, no state vectors, no central sites
• Only problem:
– No garbage collection of the log today
– The log grows indefinitely
– In the worst case, the entire log must be sent at each reconciliation
– The communication cost grows indefinitely
Conclusions on OT
• Quite mature technology with a centralized server
• Decentralized approaches are more difficult
• Problems:
– The complexity is located at reception
– What about an operation generated once and integrated 1M times??
– Requires detecting concurrency (transformation of concurrent operations)
FROM OT TO CONFLICT FREE REPLICATED DATATYPE (CRDT)
WOOT
• Focuses on linear structures (sacrifices the genericity of OT)
– Linear structures: strings, files, ordered trees...
• Changes the profile of operations
• ins(p<c<n): (p<c<n) are the intentions to be preserved
– c is the new entry, identified by wid=(siteid, lclock++)
– p and n are the wids of the previous and next entries
• del(c:wid): c is marked as invisible (tombstone)
Data Consistency for P2P Collaborative Editing. Gérald Oster, Pascal Urso, Pascal Molli and Abdessamad Imine. In Proceedings of the 2006 ACM Conference on Computer Supported Cooperative Work, CSCW 2006, Banff, Alberta, Canada, November 4-8, 2006.
WOOT
[Figure: OT and WOOT side by side; WOOT operations refer to element ids.]
WOOT and commutativity
➲ ins//del and del//ins commute
➲ del//del commute
➲ ins//ins do not commute...
WOOT and commutativity
WOOT and posets
➲ Each ins generates 2 relations
➲ Each site maintains a poset
WOOT and linearization
• Each site has to linearize this partial order! But...
• Each site can receive operations in a different order...
• Linearization must be « monotonic »
WOOT Linearization
Linearization violation
[Figure: using <id when the choice is arbitrary. In one case there is no problem; in the other, the same arbitrary choice makes the sites diverge.]
WOOT Idea
➲ Compare incoming characters to concurrent characters in the causal order.
➲ Concurrent characters = characters between the incoming relations
➲ Causal order = 1 happens-before 3 ...
➲ Compare ’2’ first with ’1’!
WOOT Algorithm
L = first rank in the causal order
[Figure: a series of cases with an incoming character X and an existing character Y, showing when Y is in L (“X incoming: Y in L if...”) and when it is not (“X incoming: Y not in L if...”).]
WOOT Example...
• IntegrateIns(a<2<b): S=[3;1]
• L=[1]
• 2 >id 1
• IntegrateIns(1<2<b) -> 2 after 1
More difficult now : Cb12034Ce
More difficult now(2) : Cb120634Ce
Result : Cb1206354Ce
WOOT
• The algorithm is well founded
• The algorithm ensures intentions
• The algorithm ensures eventual consistency
• No garbage collection solution compatible with P2P networks...
– The state grows indefinitely
• Implementation available (Richard Dallaway):
– https://bitbucket.org/d6y/woot
– Talk at https://speakerdeck.com/d6y/woot-for-lift
CONFLICT FREE REPLICATED DATATYPE
Eventual Consistency / Strong Eventual Consistency
Strong Eventual Consistency:
• Eventual delivery – An update executed on some correct replica is eventually executed on all correct replicas
• Termination – All updates terminate
• Strong convergence – Correct replicas that have executed the same updates have equivalent state
Eventual Consistency:
• Eventual delivery – An update executed on some correct replica is eventually executed on all correct replicas
• Termination – All updates terminate
• Convergence – Correct replicas that have executed the same updates eventually reach equivalent state
M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski. Conflict-free Replicated Data Types. 13th Int. Symp. on Stabilization, Safety, and Security of Distributed Systems (SSS). Grenoble, France, 10-12 October 2011.
CmRDT and CvRDT
• Two ways to build a CRDT
– CvRDTs are state-based: a monotonic semilattice ensures SEC
– CmRDTs are op-based: commutativity of concurrent operations ensures SEC
• Both are equivalent
– One can be expressed in terms of the other
– But the resulting systems differ -> impact on architecture and performance
A comprehensive study of Convergent and Commutative Replicated Data Types. Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. INRIA Technical Report RR-7506, January 2011
State Based CRDT
• Monotonic semilattice -> CRDT
• Ex: integer with merge=max
CvRDT: Integer + max
State Based Object
• An object is a tuple (S, s0, q, u, m).
• The replica at process pi has state si ∈ S, called its payload
• The initial state is s0.
• A client of the object may read the state of the object via the query method q
• and modify it via the update method u.
• Method m serves to merge the state from a remote replica
State equivalence
• We define state equivalence s~s’ if all queries return the same result for s and s’.
• A query has no side effects
Monotonic semilattice object
• The merge operation of the semilattice must be associative, commutative and idempotent.
CvRDT specification
Example: G-Set
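A grow-only set is probably the smallest state-based example: the payload is a set, update adds an element, and merge is set union. The sketch below is illustrative only (the class and method names are assumptions); it checks the three properties required of merge on a few sample states.

```python
from typing import FrozenSet, Hashable

class GSet:
    """Grow-only set as a state-based CRDT (CvRDT)."""

    def __init__(self, items: FrozenSet[Hashable] = frozenset()):
        self.items = frozenset(items)          # payload

    def add(self, element: Hashable) -> "GSet":
        """Update: the state only ever grows."""
        return GSet(self.items | {element})

    def lookup(self, element: Hashable) -> bool:
        """Query: no side effects."""
        return element in self.items

    def merge(self, other: "GSet") -> "GSet":
        """Merge = set union, the least upper bound of the two states."""
        return GSet(self.items | other.items)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, GSet) and self.items == other.items

a = GSet().add("x")
b = GSet().add("y")
c = GSet().add("x").add("z")

commutative = a.merge(b) == b.merge(a)
associative = a.merge(b).merge(c) == a.merge(b.merge(c))
idempotent = a.merge(a) == a
```

The integer-with-max example works the same way, with `max` in place of union.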
Op-Based CRDT
• If all concurrent operations commute, replicas converge
Op-based object
• An op-based object is a tuple (S, s0, q, t, u, P), where S is the state domain, s0 the initial state, q the query method and P the delivery precondition.
• An update is split into a pair (t, u), where t is a side-effect-free prepare-update method and u is an effect-update method.
Commutativity
CmRDT Specification
Example: Op-based counter
+ and – are commutative in Z
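The op-based counter can be sketched with the prepare/effect split described above. The class below is an illustrative assumption, not a reference implementation: `prepare` builds the message without touching the state, and `effect` is applied at every replica, in whatever order messages arrive.

```python
from itertools import permutations

class OpCounter:
    """Op-based counter (CmRDT): concurrent increments and decrements commute."""

    def __init__(self) -> None:
        self.value = 0

    def prepare(self, kind: str) -> int:
        """Side-effect-free: turn a user request into a message to broadcast."""
        return 1 if kind == "increment" else -1

    def effect(self, delta: int) -> None:
        """Applied at the source and at every replica that delivers the message."""
        self.value += delta

    def query(self) -> int:
        return self.value

source = OpCounter()
messages = [source.prepare(k) for k in ("increment", "increment", "decrement")]

# Deliver the same three messages in every possible order.
results = set()
for order in permutations(messages):
    replica = OpCounter()
    for delta in order:
        replica.effect(delta)
    results.add(replica.query())
```

Every delivery order yields the same value, which is exactly what commutativity of + and – in Z guarantees.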
CVRDT AND CMRDT EQUIVALENCE
Operation-based emulation of state-based object
State-based emulation of operation-based object
Portfolio of CRDTs
• Register
– Last Writer Wins
– Multi-value
• Set
– Grow Only
– 2P
– Observed-removed
• Map
– Set of Registers
• Counter
– Unlimited
– Non-negative
• Graph
– Directed
– Monotonic DAG
– Edit graph
• Sequence
– Logoot, Growable Array, Treedoc
CRDT FOR SEQUENCES: LOGOOT
Logoot (1/3)
Logoot: Generate Ids
Logoot (3/3)
Logoot Idea
• A document is a sequence of elements (characters or lines)
• Each element has a unique id:
– <p1, s1, h1><p2, s2, h2>...<pk, sk, hk>
• pi is an integer (0 < pi < BASE)
• si is a site id
• hi is a logical clock of site i
• Ids are compared with the lexicographic order
• Ids are unique in space and time
• The set of ids is dense
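Because ids are ordered lexicographically, integrating a remote insert is just a sorted insertion. The sketch below represents an id as a tuple of (p, s, h) triples so that Python's built-in tuple comparison gives the lexicographic order; encoding the "NA" fields of the begin and end markers as -1 is an assumption, and id generation is left out.

```python
import bisect
from typing import List, Tuple

Ident = Tuple[Tuple[int, int, int], ...]

NA = -1                        # assumption: stand-in for the "NA" fields
BEGIN: Ident = ((0, NA, NA),)
END: Ident = ((29, NA, NA),)

def new_doc() -> List[Tuple[Ident, str]]:
    return [(BEGIN, ""), (END, "")]

def integrate_insert(doc: List[Tuple[Ident, str]], ident: Ident, char: str) -> None:
    """Place the element at the position given by the order on ids."""
    bisect.insort(doc, (ident, char))

def text(doc: List[Tuple[Ident, str]]) -> str:
    return "".join(char for _, char in doc)

id_a: Ident = ((15, 1, 1),)               # generated by site 1
id_b: Ident = ((15, 2, 1),)               # generated by site 2
id_x: Ident = ((15, 1, 1), (15, 1, 2))    # generated between A and B

site1, site2 = new_doc(), new_doc()
for ident, char in [(id_a, "A"), (id_b, "B"), (id_x, "X")]:
    integrate_insert(site1, ident, char)
for ident, char in [(id_b, "B"), (id_x, "X"), (id_a, "A")]:
    integrate_insert(site2, ident, char)
```

Both replicas end with the same sequence whatever the delivery order, with no transformation and no tombstone needed for insertions.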
Initial state (Site 1 and Site 2):
  Ids                   Doc
  <0,NA,NA>
  <29,NA,NA>

Site 1: Ins(1,A)->Insert(<15,1,1>,A)
  Ids                   Doc
  <0,NA,NA>
  <15,1,1>              A
  <29,NA,NA>

Site 2: Ins(1,B)->Insert(<15,2,1>,B)
  Ids                   Doc
  <0,NA,NA>
  <15,2,1>              B
  <29,NA,NA>

Site 1 receives Insert(<15,2,1>,B) and Site 2 receives Insert(<15,1,1>,A); both sites hold:
  Ids                   Doc
  <0,NA,NA>
  <15,1,1>              A
  <15,2,1>              B
  <29,NA,NA>

Site 1: Ins(2,X)->Insert(<15,1,1><15,1,2>,X)
  Ids                   Doc
  <0,NA,NA>
  <15,1,1>              A
  <15,1,1><15,1,2>      X
  <15,2,1>              B
  <29,NA,NA>

Site 2 receives Insert(<15,1,1><15,1,2>,X) and reaches the same state.
IDs can be long…
• O(k) at generation
• O(k.log(n)) at integration
• If k is small, Logoot is ok
Logoot on Wikipedia pages
• Boundary = 1 000 000
• Standard pages: k minimal
• Extreme pages:
– Articles: k < 2
– Most edited pages:
• 12,2 -> 3
• 624,9 -> 3,4
Logoot+
Strategy of IDs allocation matters !
CRDT FOR SEQUENCES: LSEQ
LSEQ Approach
• A Sequence abstract data type is a CRDT if
– Any Insert(id1,x1,id2) operation commutes with any other Insert and preserves the partial order relation
– Any Delete operation commutes with any other Delete
– Any Insert(x1) commutes with any Delete(x2) if x1<>x2
• Insert(x1)|Delete(x1) do not commute, but this cannot happen if causality is ensured.
Problem: Order of Insertions
Typed: Q;W;E;R;T;Y Typed: Y;T;R;E;W;Q
Combine Exponential tree & random allocation
Evaluation
Complexities
Ahmed-Nacer, M., Ignat, C. L., Oster, G., Roh, H. G., & Urso, P. (2011, September). Evaluating CRDTs for real-time document editing. In Proceedings of the 11th ACM symposium on Document engineering (pp. 103-112). ACM.
We denote by R the number of replicas and by H the number of operations that have affected the document.
Apply WOOT to a P2P Wiki : WOOKI
• WOOKI data model:
• A sequence of (idl, line content, degree, visibility)
– Degree and idl are set at generation time and never change.
– Visibility is just a boolean
• Storage overhead = 1 id + 1 integer + 1 boolean per line
S. Weiss, P. Urso, and P. Molli. Wooki: a P2P wiki-based collaborative writing tool. In Web Information Systems Engineering, Nancy, France, December 2007. Springer.
WOOKI Model
• What we see:
– riri
– fifi
– loulou
• What we store:
– ((0,1),riri,0,true)
– ((0,2),fifi,1,true)
– ((0,3),loulou,2,true)
WOOKI Operations
• Insert(idl,CP,CN,degree,data):
– idl=(siteid,(Lclock++))
– CP=idl of the previous line (CB if first line)
– CN=idl of the next line (CE if last line)
– degree(c) = max(degree(CP), degree(CN)) + 1
• del(idl):
– The line identified by idl is marked invisible
WOOKI Algorithm
• Insert(idl,CP,CN,degree,data) is integrated only once CP and CN both exist locally.
• IntegrateIns(c, cp, cn)
– let S0 := subseq(S, cp, cn)
• if S0 = Empty then
– insert(S, c, position(cn))
• else
– let dmin := min(degree, S0), i := 1
– let L := cp d0 d1 . . . dm cn where ∀ i. degree(di) = dmin
– while (i < |L| − 1) and (L[i] <id c) do i := i + 1
– IntegrateIns(c, L[i − 1], L[i])
• endif
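The integration procedure above can be sketched in a few lines. This is an illustrative reading of the pseudocode, not the WOOKI source: the `Line` class, the marker ids for CB and CE, and the helper names are assumptions, and the scan index starts at 1 because L[0] is cp.

```python
from dataclasses import dataclass

@dataclass
class Line:
    idl: tuple            # (siteid, clock)
    content: str
    degree: int
    visible: bool = True

# Begin and end markers; their ids are hypothetical and never compared.
CB = Line((-1, 0), "", 0)
CE = Line((-1, 1), "", 0)

def pos(S, line):
    """Index of a line in S, looked up by its id."""
    return next(i for i, x in enumerate(S) if x.idl == line.idl)

def integrate_ins(S, c, cp, cn):
    """Place c between cp and cn, breaking ties by degree and then by id."""
    sub = S[pos(S, cp) + 1 : pos(S, cn)]
    if not sub:
        S.insert(pos(S, cn), c)
        return
    dmin = min(d.degree for d in sub)
    L = [cp] + [d for d in sub if d.degree == dmin] + [cn]
    i = 1
    while i < len(L) - 1 and L[i].idl < c.idl:
        i += 1
    integrate_ins(S, c, L[i - 1], L[i])

def run(order):
    """Integrate (line, previous, next) triples into an empty document."""
    S = [CB, CE]
    for c, cp, cn in order:
        integrate_ins(S, c, cp, cn)
    return [x.content for x in S[1:-1] if x.visible]

riri = Line((1, 1), "riri", 1)       # inserted between CB and CE on site 1
fifi = Line((2, 1), "fifi", 1)       # concurrently inserted on site 2
loulou = Line((1, 2), "loulou", 2)   # later inserted between riri and fifi

site_a = run([(riri, CB, CE), (fifi, CB, CE), (loulou, riri, fifi)])
site_b = run([(fifi, CB, CE), (riri, CB, CE), (loulou, riri, fifi)])
```

Both sites receive the two concurrent inserts in opposite orders and still end with the same document.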
WOOKI
• wooki.sf.net
• A P2P Wiki
– Based on an optimized WOOT
– LpbCast + AntiEntropy support
– Ensures eventual consistency and intentions
• Future work
– Include awareness, undo
– and much more... (-:
Agenda
• Introduction • RFC677 (Thomas) • OT
– SOCT4 – TP1/TP2 – MOT2
• WOOT • Comparison MOT2/WOOT • Conclusion
Compare MOT2/WOOKI on...
Concurrent operations...
[Figure: a document containing Line 8, Line 9 and Line 10 is replicated across User1, User2 and user3. User1 issues Insert(9<l<10) while User2 concurrently issues update(9).]
MOT2 vs WOOKI
• MOT2 uses TTF • and WOOKI • What is the expected result ??
C. Ignat, G. Oster, P. Molli, M. Cart, J. Ferrié, A.-M. Kermarrec, P. Sutra, M. Shapiro, L. Benmouffok, J.-M. Busca, and R. Guerraoui. A comparison of optimistic approaches to collaborative editing of wiki pages. In International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), number 3, White Plains, NY, USA, Nov. 2007.
Criteria
• Merging concurrent states: manual or automatic
• Communication complexity: number of messages exchanged to achieve convergence
• Time complexity: complexity of the integration algorithm
• Space complexity: memory required per site
• Convergence latency: number of rounds necessary to converge
– Send(others), Receive(from-others), Process()
• Semantic expressiveness: any operation, any operation + constraints, specific data type
• Determinism
• Dynamic membership
MediaWiki
➲ The classical Wiki : Concurrent operations are merged manually.
MediaWiki characteristics
• m = number of sites
• L = number of lines in the doc
• l = number of lines that have appeared in the doc
• n = number of operations
• Round:
– Send(toServer) !!

Criterion     MediaWiki
Merging       Manual
Membership    N/A
Comm.         m^2+3m-2
Time Comp     N/A
Space Comp    O(L)
Convergence   2m
Semantic      N/A
Determinism   No
Centralized vs P2P wikis
• In a centralized Wiki:
– Concurrent changes are detected when a user saves a page
• In a P2P wiki:
– Merging has to be done in the background
– Two people can save a page concurrently (with no conflict)
– Updates are propagated
– and the merge is performed by the server when concurrent operations are received
MOT2 Result with TTF
[Figure: the same scenario merged by MOT2 with TTF. User1 issues Insert(9<l<10) and User2 issues update(9); the merged document shows Line 8, Line 9 (updated to Line 9'), Line l and Line 10.]
MOT2
➲ m = number of sites
➲ L = number of lines in the doc
➲ l = number of lines that have appeared in the doc
➲ n = number of operations
➲ Round: Send(toServer) !!

Criterion     MediaWiki   MOT2
Merging       Manual      Auto
Membership    N/A         Yes
Comm.         m^2+3m-2    2(2m-3)
Time Comp     N/A         O(n^2+nm)
Space Comp    O(L)        O(n+l)
Convergence   2m          2m-3
Semantic      N/A         Any op
Determinism   No          Yes
WOOT result
[Figure: the same scenario merged by WOOT. User1 issues Insert(9<l<10) and User2 issues update(9); the merged document shows Line 8, Line 9 (updated to Line 9'), Line l and Line 10.]
WOOT results

Criterion     MediaWiki   MOT2        WOOTO
Merging       Manual      Auto        Auto
Membership    N/A         Yes         Yes
Comm.         m^2+3m-2    2(2m-3)     m
Time Comp     N/A         O(n^2+nm)   O(n l^2)
Space Comp    O(L)        O(n+l)      O(l)
Convergence   2m          2m-3        1
Semantic      N/A         Any op      Insert/delete
Determinism   No          Yes         Yes

• m = number of sites
• L = number of lines in the doc
• l = number of lines that have appeared in the doc
• n = number of operations
CRDT COUNTERS
Counters
• Replicated integer supporting increment and decrement
• value queries it
• The value converges to the global number of increments minus the number of decrements.
Op-based counter
+ and − are commutative in Z
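A minimal sketch of the op-based counter, assuming a reliable channel that delivers every operation exactly once (the class and function names are illustrative). Because addition in Z is commutative, every delivery order gives the same value.

```python
from itertools import permutations

class OpCounter:
    """Op-based counter: each replica applies the deltas it receives."""

    def __init__(self):
        self.value = 0

    def apply(self, delta):
        self.value += delta

def deliver(deltas):
    """Apply a sequence of deltas to a fresh replica and return its value."""
    counter = OpCounter()
    for delta in deltas:
        counter.apply(delta)
    return counter.value

ops = [+1, +1, -1, +1]   # three increments and one decrement
outcomes = {deliver(order) for order in permutations(ops)}
```

`outcomes` holds a single value: the number of increments minus the number of decrements.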
State-based increment-only counter: G-Counter
Only increment -> a monotonic semi-lattice: ok
Sean Cribbs, Berlin Buzzwords 2012
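A possible G-Counter sketch, assuming a fixed number of replicas known in advance (names are illustrative). Each replica increments only its own slot; merge takes the entry-wise maximum, which is the least upper bound of the semi-lattice.

```python
class GCounter:
    """State-based grow-only counter: one slot per replica."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1      # a replica only touches its own slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Entry-wise max: commutative, associative and idempotent.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment()
a.increment()
b.increment()
a.merge(b)
b.merge(a)
a.merge(b)   # merging again changes nothing
```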
State-based PN-Counter: 2 G-Counters
Sean Cribbs, Berlin Buzzwords 2012
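The PN-Counter can be sketched as two grow-only vectors, one for increments (P) and one for decrements (N). Again a fixed set of replicas is assumed and the names are illustrative.

```python
class PNCounter:
    """State-based counter built from two grow-only vectors."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.p = [0] * n_replicas   # increments per replica
        self.n = [0] * n_replicas   # decrements per replica

    def increment(self):
        self.p[self.id] += 1

    def decrement(self):
        self.n[self.id] += 1

    def value(self):
        return sum(self.p) - sum(self.n)

    def merge(self, other):
        self.p = [max(x, y) for x, y in zip(self.p, other.p)]
        self.n = [max(x, y) for x, y in zip(self.n, other.n)]

a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment()
a.increment()
b.decrement()
a.merge(b)
b.merge(a)
```

Decrements never lower an entry, so both vectors stay monotonic and the merge remains an entry-wise max.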
CRDT REGISTERS
Intro
• Registers
– A memory cell storing an opaque object
– Supports assign to update its value
– Supports value to query it
• Concurrent updates do not commute
– LWW-Register: last writer wins
– MV-Register: concurrent assignments are retained
Op-based LWW-Register
State-based LWW-Register
LWW register : state based
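A state-based LWW-Register might look like the sketch below. It assumes the caller supplies timestamps and that ties are broken by replica id; both choices, and all names, are assumptions of this example.

```python
class LWWRegister:
    """State-based last-writer-wins register."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")   # (timestamp, replica id), totally ordered

    def assign(self, value, timestamp, replica_id):
        self._update(value, (timestamp, replica_id))

    def merge(self, other):
        self._update(other.value, other.stamp)

    def _update(self, value, stamp):
        # Keep the value carrying the largest stamp; older writes are dropped.
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

a, b = LWWRegister(), LWWRegister()
a.assign("x", 1, "A")
b.assign("y", 2, "B")
a.merge(b)
b.merge(a)

# Equal timestamps: the replica id decides.
c, d = LWWRegister(), LWWRegister()
c.assign("left", 5, "A")
d.assign("right", 5, "B")
c.merge(d)
d.merge(c)
```

The concurrent write with the smaller stamp is silently lost, which is the price of LWW.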
MV-Register (Dynamo)
MV-Register (State based)
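A state-based MV-Register can be sketched with version vectors: assign produces a vector that dominates everything seen locally, and merge keeps only the pairs that no other pair dominates. The fixed replica count, hashable values and all names are assumptions of this sketch.

```python
def dominates(a, b):
    """True when version vector a is >= b entry-wise and differs from b."""
    return all(x >= y for x, y in zip(a, b)) and a != b

class MVRegister:
    """State-based multi-value register: a set of (value, version vector)."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.n = n_replicas
        self.state = set()

    def assign(self, value):
        vv = [0] * self.n
        for _, seen in self.state:                 # join of all known vectors
            vv = [max(x, y) for x, y in zip(vv, seen)]
        vv[self.id] += 1
        self.state = {(value, tuple(vv))}

    def value(self):
        return {v for v, _ in self.state}

    def merge(self, other):
        union = self.state | other.state
        self.state = {(v, vv) for v, vv in union
                      if not any(dominates(o, vv) for _, o in union)}

a, b = MVRegister(0, 2), MVRegister(1, 2)
a.assign("x")
b.assign("y")          # concurrent with a's assignment
a.merge(b)
b.merge(a)
concurrent = a.value()

a.assign("z")          # causally after both "x" and "y"
b.merge(a)
resolved = b.value()
```

Concurrent assignments are both retained; a later assignment that has seen them replaces both.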
CRDT FOR SETS
CRDT for sets
• Containers, Maps, Graphs are all based on sets.
• Consider adding and removing elements.
• Add(x) and remove(x) do not commute
• ins⋄ins⋄del ≠ ins⋄del⋄ins
Set is NOT a CRDT
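[Figure: site1, site2 and site3 start with {}. op1=ins(x) and op2=ins(x) are issued concurrently, followed by op3=del(x). Depending on the order in which the operations are received, a site ends with {x} or {}: the replicas diverge (!=).]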
Grow-only set: G-Set
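A G-Set sketch, with illustrative names: the only update is add, and merge is set union, so the state can only grow.

```python
class GSet:
    """State-based grow-only set."""

    def __init__(self):
        self.items = set()

    def add(self, element):
        self.items.add(element)

    def lookup(self, element):
        return element in self.items

    def merge(self, other):
        self.items |= other.items   # union is the least upper bound

a, b = GSet(), GSet()
a.add("x")
b.add("y")
a.merge(b)
b.merge(a)
```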
2P-SET
Cannot re-add a removed element.
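A 2P-Set can be sketched as two grow-only sets, one for added elements and one for removed elements (names are illustrative). The removed set acts as a permanent tombstone set, which is why re-adding has no visible effect.

```python
class TwoPSet:
    """State-based two-phase set: an added set plus a removed set."""

    def __init__(self):
        self.added = set()
        self.removed = set()

    def lookup(self, element):
        return element in self.added and element not in self.removed

    def add(self, element):
        self.added.add(element)

    def remove(self, element):
        if self.lookup(element):        # only visible elements can be removed
            self.removed.add(element)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

s = TwoPSet()
s.add("x")
s.remove("x")
s.add("x")                  # re-add after removal
readded_visible = s.lookup("x")

t = TwoPSet()
t.add("y")
t.merge(s)
```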
OR-SET
• Concurrent adds commute since each one is unique.
• Concurrent removes commute because any common pairs have the same effect, and any disjoint pairs have independent effects.
• Concurrent add(e) and remove(f) also commute:
– if e<>f they are independent,
– if e = f the remove has no effect.
• Do Add(e) and Remove(e) commute?
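The question above can be explored with an op-based OR-Set sketch. It assumes causal delivery and tags each add with a hypothetical `(replica id, counter)` pair; a remove only carries the tags its source replica has observed, so a concurrent add survives.

```python
class ORSet:
    """Op-based observed-remove set: elements are stored with unique tags."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counter = 0
        self.pairs = set()            # {(element, tag)}

    def lookup(self, element):
        return any(e == element for e, _ in self.pairs)

    def prepare_add(self, element):
        self.counter += 1
        return ("add", element, (self.replica_id, self.counter))

    def prepare_remove(self, element):
        observed = frozenset(t for e, t in self.pairs if e == element)
        return ("remove", element, observed)

    def apply(self, op):
        kind, element, payload = op
        if kind == "add":
            self.pairs.add((element, payload))
        else:
            self.pairs -= {(element, tag) for tag in payload}

s1, s2 = ORSet("s1"), ORSet("s2")
add1 = s1.prepare_add("x")
for site in (s1, s2):
    site.apply(add1)

rem = s1.prepare_remove("x")    # observes only add1's tag
add2 = s2.prepare_add("x")      # concurrent with rem

s1.apply(rem)
s1.apply(add2)                  # s1 sees remove, then add
s2.apply(add2)
s2.apply(rem)                   # s2 sees add, then remove
```

Both sites converge and still contain x: the concurrent add wins over the remove.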
C-SET
C-Set Proof
• On Site1 for an element x: (+1) + (+2) + (-1) + (+1) = (+3)
• On Site2 for the same element x: (-1) + (+1) + (+2) + (+1) = (+3)
• Addition on Z is commutative ⇒ C-Set is a CRDT
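The proof can be replayed with a small sketch that keeps one integer counter per element. The class name and the membership rule (an element is present when its counter is positive) are assumptions of this example.

```python
from collections import defaultdict

class CSet:
    """Counter-based set sketch: each element maps to an integer counter."""

    def __init__(self):
        self.counters = defaultdict(int)

    def apply(self, element, delta):
        self.counters[element] += delta

    def lookup(self, element):
        # Assumed membership rule: present when the counter is positive.
        return self.counters[element] > 0

site1, site2 = CSet(), CSet()

# Same four remote operations on x, received in different orders.
for delta in (+1, +2, -1, +1):
    site1.apply("x", delta)
for delta in (-1, +1, +2, +1):
    site2.apply("x", delta)
```

Both sites end with the same counter for x, as the proof states.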
C-Set - Example
• One Set
• (e, counter)
• insert operation overrides del operation
[Figure: site1, site2 and site3 start with {}. op1=ins(x) and op2=ins(x) are issued concurrently and propagated as rins(x,+1); op3=del(x) is propagated as rdel(x,-1). States shown include {(x,+1)}, {(x,+2)} and {(x,0)}.]
Discussion
• Tombstones exist
• They disappear when the counter is 0
• No concurrent del means no tombstone
• Better space complexity than TWR
[Figure: site1, site2 and site3 start with {}. After op1=ins(x) and op2=ins(x), the states shown are {(x,+1)} and {(x,+2)}; after op3=del(x), the state shown is {(x,0)}.]
Discussion
[Figure: site1 and site2 exchange op1=ins(x), op2=del(x), op3=ins(x) and op4=del(x), propagated as rins(x,+4) and rdel(x,-1). States shown include {(x,+1)}, {(x,0)}, {(x,+4)}, {(x,+3)} and {(x,-3)}.]
It converges but ... intention violation
Eventual consistency is not sufficient
CRDT FOR GRAPHS
CRDT for Graph
• A graph is a pair of sets (V,E) such that E is included in V x V.
• Any of the Set implementations described above can be used for V and E.
• What should happen upon concurrent addEdge(u, v) || removeVertex(u)? – Can give precedence to addEdge(u,v)
2P2P graph: 2 2P-sets
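A 2P2P graph can be sketched with one 2P-set for vertices and one for edges. Note the design choice assumed here: an edge is visible only while both endpoints are visible, so in this sketch a concurrent removeVertex(u) wins over addEdge(u, v), unlike the addEdge precedence mentioned above. All names are illustrative.

```python
class TwoPTwoPGraph:
    """State-based graph built from two 2P-sets (vertices and edges)."""

    def __init__(self):
        self.va, self.vr = set(), set()   # vertices added / removed
        self.ea, self.er = set(), set()   # edges added / removed

    def has_vertex(self, v):
        return v in self.va and v not in self.vr

    def has_edge(self, u, v):
        # Assumption: an edge needs both endpoints to be visible.
        return (self.has_vertex(u) and self.has_vertex(v)
                and (u, v) in self.ea and (u, v) not in self.er)

    def add_vertex(self, v):
        self.va.add(v)

    def add_edge(self, u, v):
        if self.has_vertex(u) and self.has_vertex(v):
            self.ea.add((u, v))

    def remove_vertex(self, v):
        incident = any(self.has_edge(a, b) for a, b in self.ea if v in (a, b))
        if self.has_vertex(v) and not incident:
            self.vr.add(v)

    def merge(self, other):
        self.va |= other.va
        self.vr |= other.vr
        self.ea |= other.ea
        self.er |= other.er

g1, g2 = TwoPTwoPGraph(), TwoPTwoPGraph()
g1.add_vertex("u")
g1.add_vertex("v")
g2.merge(g1)

g1.add_edge("u", "v")      # concurrent with ...
g2.remove_vertex("u")      # ... the removal of an endpoint

g1.merge(g2)
g2.merge(g1)
```

After merging, both replicas agree: u is gone and the edge (u, v) is no longer visible, although it remains in the edge set.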
Growable Array
Conclusions
• Distributed collaborative editors include many popular systems such as Google Docs, Dropbox, Git, etc.
• From the collaboration point of view
– They follow the original vision of Douglas Engelbart…
– Edit anywhere, any time, any kind of data, in communities
• From the distributed systems point of view
– They highlight problems of weak consistency and scalability