Unicamp MC714 Distributed Systems Slides by Maarten van Steen, adapted from Distributed Systems, 3rd edition Chapter 04: Communication
Revision: Threads and Distributed Systems
Improve performance
Starting a thread is typically much cheaper than starting a new process.
Having a single-threaded server prohibits simple scale-up to a multiprocessor system.
As with clients: hide network latency by reacting to the next request while the previous one is being replied to.
Better structure
Most servers have high I/O demands. Using simple, well-understood blocking calls simplifies the overall structure.
Multithreaded programs tend to be smaller and easier to understand due to simplified flow of control.
Revision: Ways of virtualization
(a) Process VM, (b) Native VMM, (c) Hosted VMM
[Figure: three layer stacks, listed top to bottom.
(a) Application/Libraries, Runtime system, Operating system, Hardware
(b) Application/Libraries, Operating system, Virtual machine monitor, Hardware
(c) Application/Libraries, Operating system, Virtual machine monitor, Operating system, Hardware]
Differences
(a) Separate set of instructions, an interpreter/emulator, running atop an OS.
(b) Low-level instructions, along with a bare-bones minimal operating system.
(c) Low-level instructions, but delegating most work to a full-fledged OS.
Revision: Servers and state
Stateless servers
Never keep accurate information about the status of a client after having handled a request:
Don’t record whether a file has been opened (simply close it again after access)
Don’t promise to invalidate a client’s cache
Don’t keep track of your clients
Consequences
Clients and servers are completely independent
State inconsistencies due to client or server crashes are reduced
Possible loss of performance because, e.g., a server cannot anticipate client behavior (think of prefetching file blocks)
Question
Does connection-oriented communication fit into a stateless design?
Revision: Server Clusters
Common organization
[Figure: three-tiered server cluster. Client requests arrive at the first tier, a logical switch (possibly multiple), which dispatches each request to the second tier of application/compute servers; the third tier is a distributed file/database system.]
Crucial element
The first tier is generally responsible for passing requests to an appropriate server: request dispatching.
Revision: Request Handling
Observation
Having the first tier handle all communication from/to the cluster may lead to a bottleneck.
A solution: TCP handoff
[Figure: the client sends a request to the switch, which hands it off to one of the servers; that server sends the response to the client. Logically, there is a single TCP connection.]
Models for code migration
Model  Before execution                                After execution
       Client                 Server                   Client                 Server
CS     -                      code, exec, resource     -                      code, exec*, resource
REV    code                   exec, resource           -                      code, exec*, resource
CoD    exec, resource         code                     code, exec*, resource  -
MA     code, exec, resource   resource                 resource               code, exec*, resource
Code moves from client to server in REV and MA, and from server to client in CoD.
CS: Client-Server, REV: Remote evaluation, CoD: Code-on-demand, MA: Mobile agents
Strong and weak mobility
Object components
Code segment: contains the actual code
Data segment: contains the state
Execution state: contains context of thread executing the object’s code
Weak mobility: Move only code and data segment (and reboot execution)
Relatively simple, especially if code is portable
Distinguish code shipping (push) from code fetching (pull)
Strong mobility: Move component, including execution state
Migration: move entire object from one machine to the other
Cloning: start a clone, and set it in the same execution state.
Revision: Exercises
1 Consider a service that takes a total of 10 ms to handle a request, provided the required data is in a cache in main memory. When the data is not in the cache, a disk operation that takes 90 ms is needed before the request can be completed, and during this time the thread handling the request is suspended. Assume the data is in the cache for 50% of the requests. How many requests per second can the server handle if it is implemented with a single thread? And if the server uses multiple threads?
2 Does it make sense to limit the number of threads in a server process? Explain.
3 Are there cases in which a single-threaded server performs better than a multithreaded server? Explain.
4 A multiprocess server has some advantages and disadvantages compared with a multithreaded server. Give some examples.
5 Is a server that maintains a TCP/IP connection to a client stateful or stateless?
Exercises
1 Describe the process of establishing a connection between client and server with TCP/IP sockets.
2 Distinguish synchronous from asynchronous communication, and persistent from transient communication. Give examples of each combination.
3 Describe a scalability problem with transient synchronous communication.
4 What is the role of a broker in message-oriented communication?
5 In Figure 4.35, what is the stretch factor of the overlay network on the route A→C?
6 Explain the principle of anti-entropy used in epidemic protocols.
7 Describe the problem of removing data in epidemic protocols and present a solution.
8 Describe an epidemic algorithm that computes the size of a network.
Communication: Foundations Layered Protocols
Basic networking model
[Figure: the OSI reference model. Two hosts, each with seven layers, connected by a network: 1 Physical, 2 Data link, 3 Network, 4 Transport, 5 Session, 6 Presentation, 7 Application. Peer layers communicate via the physical, data link, network, transport, session, presentation, and application protocol, respectively.]
Drawbacks
Focus on message-passing only
Often unneeded or unwanted functionality
Violates access transparency
Low-level layers
Recap
Physical layer: contains the specification and implementation of bits, and their transmission between sender and receiver
Data link layer: prescribes the transmission of a series of bits into a frame to allow for error and flow control
Network layer: describes how packets in a network of computers are to be routed
Observation
For many distributed systems, the lowest-level interface is that of the network layer.
Transport Layer
Important
The transport layer provides the actual communication facilities for most distributed systems.
Standard Internet protocols
TCP: connection-oriented, reliable, stream-oriented communication
UDP: unreliable (best-effort) datagram communication
Middleware layer
Observation
Middleware is invented to provide common services and protocols that can be used by many different applications:
A rich set of communication protocols
(Un)marshaling of data, necessary for integrated systems
Naming protocols, to allow easy sharing of resources
Security protocols for secure communication
Scaling mechanisms, such as for replication and caching
Note
What remains are truly application-specific protocols... such as?
An adapted layering scheme
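[Figure: adapted layering with four layers per host, top to bottom: Application, Middleware, Operating system, Hardware. Peers communicate via the application protocol, the middleware protocol, the host-to-host protocol, and the physical/link-level protocol, respectively, across the network.]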
Communication: Foundations Types of Communication
Types of communication
Distinguish...
[Figure: timeline of a client sending a request to a server via a storage facility, with a transmission interrupt, and the server returning a reply. Marked synchronization points: at request submission, at request delivery, and after processing by the server.]
Transient versus persistent communication
Asynchronous versus synchronous communication
Transient versus persistent
Transient communication: a communication server discards a message when it cannot be delivered to the next server or to the receiver.
Persistent communication: a message is stored at a communication server for as long as it takes to deliver it.
Places for synchronization
At request submission
At request delivery
After request processing
Client/Server
Some observations
Client/Server computing is generally based on a model of transient synchronous communication:
Client and server have to be active at time of communication
Client issues request and blocks until it receives reply
Server essentially waits only for incoming requests, and subsequently processes them
Drawbacks of synchronous communication
Client cannot do any other work while waiting for reply
Failures have to be handled immediately: the client is waiting
The model may simply not be appropriate (mail, news)
Communication: Message-oriented communication Simple transient messaging with sockets
Transient messaging: sockets
Berkeley socket interface
Operation   Description
socket      Create a new communication end point
bind        Attach a local address to a socket
listen      Tell operating system what the maximum number of pending connection requests should be
accept      Block caller until a connection request arrives
connect     Actively attempt to establish a connection
send        Send some data over the connection
receive     Receive some data over the connection
close       Release the connection
[Figure: connection-oriented communication with sockets. Server: socket, bind, listen, accept, receive, send, close. Client: socket, connect, send, receive, close. The client's connect and the server's accept form a synchronization point; send/receive pairs carry the communication.]
Sockets: Python code
Server
from socket import socket, AF_INET, SOCK_STREAM

# Placeholder address; the slide leaves HOST and PORT undefined.
HOST, PORT = "localhost", 5000

s = socket(AF_INET, SOCK_STREAM)
s.bind((HOST, PORT))
s.listen(1)
(conn, addr) = s.accept()      # returns new socket and address of client
while True:                    # forever
    data = conn.recv(1024)     # receive data from client
    if not data: break         # stop if client stopped
    conn.send(data + b"*")     # return sent data plus an "*" (bytes in Python 3)
conn.close()                   # close the connection
Client
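from socket import socket, AF_INET, SOCK_STREAM

# Placeholder address; must match the server's HOST and PORT.
HOST, PORT = "localhost", 5000

s = socket(AF_INET, SOCK_STREAM)
s.connect((HOST, PORT))        # connect to server (block until accepted)
s.send(b"Hello, world")        # send some data (bytes in Python 3)
data = s.recv(1024)            # receive the response
print(data.decode())           # print the result
s.close()                      # close the connection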
Messaging
Message-oriented middleware
Aims at high-level persistent asynchronous communication:
Processes send each other messages, which are queued
Sender need not wait for immediate reply, but can do other things
Middleware often ensures fault tolerance
Communication: Remote procedure call Basic RPC operation
Basic RPC operation
Observations
Application developers are familiar with the simple procedure model
Well-engineered procedures operate in isolation (black box)
There is no fundamental reason not to execute procedures on a separate machine
Conclusion
Communication between caller & callee can be hidden by using the procedure-call mechanism.
[Figure: timeline of an RPC. The client calls the remote procedure, sending a request, and waits for the result; the server calls the local procedure and returns the results in a reply; the client then returns from the call.]
[Figure: steps of the call r = doit(a, b) between the client machine (client process, client stub, client OS) and the server machine (server process with the implementation of doit, server stub, server OS): 1. client call to procedure; 2. stub builds message; 3. message is sent across the network; 4. server OS hands message to server stub; 5. stub unpacks message; 6. stub makes local call to “doit”. The message contains proc: “doit”, type1: val(a), type2: val(b).]
1 Client procedure calls client stub.
2 Stub builds message; calls local OS.
3 OS sends message to remote OS.
4 Remote OS gives message to stub.
5 Stub unpacks parameters; calls server.
6 Server does local call; returns result to stub.
7 Stub builds message; calls OS.
8 OS sends message to client’s OS.
9 Client’s OS gives message to stub.
10 Client stub unpacks result; returns to client.
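The ten steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the network and both operating systems are replaced by a direct function call (`transport`), JSON is chosen arbitrarily as the message encoding, and the body of `doit` (an addition) is made up, since the slides do not say what it computes.

```python
import json

def doit(a, b):
    """Server-side implementation of the procedure (hypothetical body)."""
    return a + b

PROCEDURES = {"doit": doit}   # procedures the server stub can dispatch to

def server_stub(request: bytes) -> bytes:
    """Steps 5-7: unpack parameters, make the local call, pack the result."""
    msg = json.loads(request.decode("utf-8"))
    result = PROCEDURES[msg["proc"]](*msg["args"])
    return json.dumps({"result": result}).encode("utf-8")

def transport(request: bytes) -> bytes:
    """Stand-in for steps 3-4 and 8-9 (OS and network); here a direct call."""
    return server_stub(request)

def client_stub_doit(a, b):
    """Steps 1-2 and 10: build the message, send it, unpack the reply."""
    request = json.dumps({"proc": "doit", "args": [a, b]}).encode("utf-8")
    reply = transport(request)
    return json.loads(reply.decode("utf-8"))["result"]

r = client_stub_doit(2, 3)    # to the caller this looks like a local call
print(r)
```

The caller only ever sees `client_stub_doit`; everything about messages and transport stays inside the stubs.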
Communication: Remote procedure call Parameter passing
RPC: Parameter passing
There’s more than just wrapping parameters into a message
Client and server machines may have different data representations (think of byte ordering)
Wrapping a parameter means transforming a value into a sequence of bytes
Client and server have to agree on the same encoding:
How are basic data values represented (integers, floats, characters)?
How are complex data values represented (arrays, unions)?
Conclusion
Client and server need to properly interpret messages, transforming them into machine-dependent representations.
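The byte-ordering issue can be made concrete with Python's standard `struct` module. The sketch below packs the same integer in both byte orders, shows what happens when the receiver assumes the wrong one, and then fixes a single wire encoding. The helper names `marshal` and `unmarshal` are illustrative, and the choice of big-endian (network byte order) is an assumption, not something the slides prescribe.

```python
import struct

value = 1
big = struct.pack(">i", value)       # big-endian 32-bit integer
little = struct.pack("<i", value)    # little-endian 32-bit integer

# Interpreting big-endian bytes as little-endian yields a different number.
misread = struct.unpack("<i", big)[0]

def marshal(n: int) -> bytes:
    """Encode a 32-bit integer in the agreed wire format (network byte order)."""
    return struct.pack("!i", n)

def unmarshal(data: bytes) -> int:
    """Decode a 32-bit integer from the agreed wire format."""
    return struct.unpack("!i", data)[0]

print(big.hex(), little.hex(), misread, unmarshal(marshal(-42)))
```

As long as both stubs use the same pair of functions, the native byte order of either machine no longer matters.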
Some assumptions
Copy in/copy out semantics: while procedure is executed, nothing can be assumed about parameter values.
All data that is to be operated on is passed by parameters. Excludes passing references to (global) data.
Conclusion
Full access transparency cannot be realized.
A remote reference mechanism enhances access transparency
Remote reference offers unified access to remote data
Remote references can be passed as parameter in RPCs
Note: stubs can sometimes be used as such references
Communication: Remote procedure call Variations on RPC
Asynchronous RPCs
Essence
Try to get rid of the strict request-reply behavior, and let the client continue without waiting for an answer from the server.
[Figure: timeline of an asynchronous RPC. The client calls the remote procedure, waits only for acceptance of the request, and returns from the call; the server calls the local procedure and later returns the results through a callback to the client.]
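The control flow of an asynchronous RPC with a callback can be imitated locally with `concurrent.futures`. In this sketch a worker thread plays the server, `remote_procedure` is a made-up stand-in for the remote work, and no real network is involved; it only shows the client continuing while the call is in progress.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_procedure(x):
    """Stand-in for the server-side work (hypothetical)."""
    time.sleep(0.1)
    return x * 2

results = []

def callback(future):
    """Runs once the result is available (the 'callback to client')."""
    results.append(future.result())

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_procedure, 21)   # returns once the request is queued
    future.add_done_callback(callback)
    local = sum(range(10))                       # client continues with local work
# Leaving the with-block waits for outstanding work, so the callback has run.
print(local, results)
```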
Sending out multiple RPCs
Essence
Sending an RPC request to a group of servers.
[Figure: timeline of a multicast RPC. The client calls remote procedures at two servers; each server calls its local procedure and responds with a callback to the client.]
Communication: Remote procedure call Example: DCE RPC
RPC in practice
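[Figure: steps in writing a client and a server in DCE RPC. Uuidgen produces an interface definition file, which the IDL compiler turns into a client stub, a header, and a server stub. The client code and the server code each #include the header. The C compiler produces the client, client stub, server stub, and server object files; the linker combines the client object files with the runtime library into the client binary, and the server object files with the runtime library into the server binary.]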
Client-to-server binding (DCE)
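Issues
(1) Client must locate server machine, and (2) locate the server.
[Figure: binding in DCE, involving a client machine, a server machine (server, DCE daemon, port table), and a directory machine (directory server). 1. Server registers port with the DCE daemon. 2. Server registers service with the directory server. 3. Client looks up server at the directory server. 4. Client asks the DCE daemon for the port. 5. Client does RPC.]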
Communication: Remote procedure call Message-oriented persistent communication
Message-oriented middleware
Essence
Asynchronous persistent communication through support of middleware-level queues. Queues correspond to buffers at communication servers.
Operations
Operation   Description
put         Append a message to a specified queue
get         Block until the specified queue is nonempty, and remove the first message
poll        Check a specified queue for messages, and remove the first. Never block
notify      Install a handler to be called when a message is put into the specified queue
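The four operations map naturally onto a small class. The following is a toy, in-memory sketch built on Python's `queue.Queue`; the class name `MessageQueue` is made up, and the persistence and routing that a real message-queuing system provides are deliberately left out.

```python
import queue

class MessageQueue:
    """Toy in-memory queue offering put, get, poll, and notify."""

    def __init__(self):
        self._q = queue.Queue()
        self._handler = None

    def put(self, msg):
        """Append a message; call the installed handler, if any."""
        self._q.put(msg)
        if self._handler is not None:
            self._handler(msg)

    def get(self):
        """Block until the queue is nonempty, then remove the first message."""
        return self._q.get()

    def poll(self):
        """Remove and return the first message, or None; never blocks."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def notify(self, handler):
        """Install a handler to be called when a message is put."""
        self._handler = handler

mq = MessageQueue()
seen = []
mq.notify(seen.append)      # handler just records each message
mq.put("m1")
mq.put("m2")
first = mq.get()
second = mq.poll()
empty = mq.poll()           # queue is empty now, so this returns None
```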
General model
Queue managers
Queues are managed by queue managers. An application can put messages only into a local queue. Getting a message is possible by extracting it from a local queue only ⇒ queue managers need to route messages.
Routing
[Figure: the source queue manager looks up the contact address of the destination queue manager in an address lookup database, mapping a logical queue-level address (name) to a contact address; the message then travels via the local OS and the network to the destination queue manager.]
Message broker
Observation
Message queuing systems assume a common messaging protocol: all applications agree on message format (i.e., structure and data representation)
Broker handles application heterogeneity in an MQ system
Transforms incoming messages to target format
Very often acts as an application gateway
May provide subject-based routing capabilities (i.e., publish-subscribe capabilities)
Message broker: general architecture
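[Figure: a source application and a destination application, each with an interface on top of its local OS, communicate through a message broker; the broker consists of broker plugins and rules on top of a queuing layer and its local OS.]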
Communication: Remote procedure call Advanced transient messaging
Making sockets easier to work with
Observation
Sockets are rather low level and programming mistakes are easily made. However, the way that they are used is often the same (such as in a client-server setting).
Alternative: ZeroMQ
Provides a higher level of expression by pairing sockets: one for sending messages at process P and a corresponding one at process Q for receiving messages. All communication is asynchronous.
Three patterns
Request-reply
Publish-subscribe
Pipeline
Request-reply
Server
import zmq

# Placeholder addresses; the slide leaves HOST, PORT1, and PORT2 undefined.
HOST, PORT1, PORT2 = "127.0.0.1", "5555", "5556"

context = zmq.Context()

p1 = "tcp://" + HOST + ":" + PORT1    # how and where to connect
p2 = "tcp://" + HOST + ":" + PORT2    # how and where to connect
s = context.socket(zmq.REP)           # create reply socket

s.bind(p1)                            # bind socket to address
s.bind(p2)                            # bind socket to address
while True:
    message = s.recv_string()         # wait for incoming message
    if "STOP" not in message:         # if not to stop...
        s.send_string(message + "*")  # append "*" to message
    else:                             # else...
        break                         # break out of loop and end
Client
import zmq

# Placeholder address; must match one of the server's bound addresses.
HOST, PORT = "127.0.0.1", "5555"

context = zmq.Context()

php = "tcp://" + HOST + ":" + PORT    # how and where to connect
s = context.socket(zmq.REQ)           # create socket

s.connect(php)                        # block until connected
s.send_string("Hello World")          # send message
message = s.recv_string()             # block until response
s.send_string("STOP")                 # tell server to stop
print(message)                        # print result
Publish-subscribe
Server
import zmq, time

# Placeholder address; the slide leaves HOST and PORT undefined.
HOST, PORT = "127.0.0.1", "5555"

context = zmq.Context()
s = context.socket(zmq.PUB)                  # create a publisher socket
p = "tcp://" + HOST + ":" + PORT             # how and where to communicate
s.bind(p)                                    # bind socket to the address
while True:
    time.sleep(5)                            # wait 5 seconds between messages
    s.send_string("TIME " + time.asctime())  # publish the current time
Client
import zmq

# Placeholder address; must match the publisher's address.
HOST, PORT = "127.0.0.1", "5555"

context = zmq.Context()
s = context.socket(zmq.SUB)                  # create a subscriber socket
p = "tcp://" + HOST + ":" + PORT             # how and where to communicate
s.connect(p)                                 # connect to the server
s.setsockopt_string(zmq.SUBSCRIBE, "TIME")   # subscribe to TIME messages

for i in range(5):                           # five iterations
    msg = s.recv_string()                    # receive a message
    print(msg)
Pipeline
Source
import zmq, pickle, sys, random

# Placeholder addresses; the slide leaves SRC1, SRC2, PORT1, and PORT2 undefined.
SRC1, SRC2, PORT1, PORT2 = "127.0.0.1", "127.0.0.1", "5555", "5556"

context = zmq.Context()
me = str(sys.argv[1])
s = context.socket(zmq.PUSH)               # create a push socket
src = SRC1 if me == '1' else SRC2          # check task source host
prt = PORT1 if me == '1' else PORT2        # check task source port
p = "tcp://" + src + ":" + prt             # how and where to connect
s.bind(p)                                  # bind socket to address

for i in range(100):                       # generate 100 workloads
    workload = random.randint(1, 100)      # compute workload
    s.send(pickle.dumps((me, workload)))   # send workload to worker
Worker
import zmq, time, pickle, sys

# Placeholder addresses; must match the addresses used by the two sources.
SRC1, SRC2, PORT1, PORT2 = "127.0.0.1", "127.0.0.1", "5555", "5556"

context = zmq.Context()
me = str(sys.argv[1])
r = context.socket(zmq.PULL)               # create a pull socket
p1 = "tcp://" + SRC1 + ":" + PORT1         # address first task source
p2 = "tcp://" + SRC2 + ":" + PORT2         # address second task source
r.connect(p1)                              # connect to task source 1
r.connect(p2)                              # connect to task source 2

while True:
    work = pickle.loads(r.recv())          # receive work from a source
    time.sleep(work[1] * 0.01)             # pretend to work
Example: RabbitMQ
Objective
RabbitMQ is a message broker. It accepts and forwards messages.
Persistent, asynchronous communication
Sender
Create a communication channel and declare a message queue
Publish data to the queue
Receiver
Create a communication channel and declare a message queue
Define a callback function to handle incoming information
Start consuming data from the channel
Features
Round-robin dispatching
Durable (persistent) messages
Publish/Subscribe (fanout)
Topic-based exchange (filtering)
RPC interface
Multiple language bindings
MPI: When lots of flexibility is needed
Representative operations
Operation Description
MPI_bsend     Append outgoing message to a local send buffer
MPI_send      Send a message and wait until copied to local or remote buffer
MPI_ssend     Send a message and wait until transmission starts
MPI_sendrecv  Send a message and wait for reply
MPI_isend     Pass reference to outgoing message, and continue
MPI_issend    Pass reference to outgoing message, and wait until receipt starts
MPI_recv      Receive a message; block if there is none
MPI_irecv     Check if there is an incoming message, but do not block
Communication: Remote procedure call Example: IBM’s WebSphere message-queuing system
IBM’s WebSphere MQ
Basic concepts
Application-specific messages are put into, and removed from, queues
Queues reside under the regime of a queue manager
Processes can put messages only in local queues, or through an RPC mechanism
Message transfer
Messages are transferred between queues
Message transfer between queues at different processes requires a channel
At each end point of a channel is a message channel agent
Message channel agents are responsible for:
Setting up channels using lower-level network communication facilities (e.g., TCP/IP)
(Un)wrapping messages from/in transport-level packets
Sending/receiving packets
Schematic overview
[Figure: a sending client and a receiving client, each with an MQ interface, stub, server stub, queue manager, and message channel agents (MCAs). A client talks to its queue manager by synchronous RPC over the local network; queue managers exchange messages by asynchronous message passing over the enterprise network, from the send queue (selected via a routing table) to the client's receive queue, and to other remote queue managers.]
Channels are inherently unidirectional
Automatically start MCAs when messages arrive
Any network of queue managers can be created
Routes are set up manually (system administration)
Message channel agents
Some attributes associated with message channel agents
Attribute          Description
Transport type     Determines the transport protocol to be used
FIFO delivery      Indicates that messages are to be delivered in the order they are sent
Message length     Maximum length of a single message
Setup retry count  Specifies maximum number of retries to start up the remote MCA
Delivery retries   Maximum times MCA will try to put received message into queue
Routing
By using logical names, in combination with name resolution to local queues, it is possible to put a message in a remote queue.
[Figure: four queue managers, QMA, QMB, QMC, and QMD, each with a routing table that maps destination queue managers to local send queues (SQ1, SQ2), and alias tables that map logical names (LA1, LA2) to queue managers.]
Communication: Multicast communication Application-level tree-based multicasting
Application-level multicasting
Essence
Organize nodes of a distributed system into an overlay network and use that network to disseminate data:
Oftentimes a tree, leading to unique paths
Alternatively, also mesh networks, requiring a form of routing
Application-level multicasting in Chord
Basic approach
1 Initiator generates a multicast identifier mid.
2 Lookup succ(mid), the node responsible for mid.
3 Request is routed to succ(mid), which will become the root.
4 If P wants to join, it sends a join request to the root.
5 When request arrives at Q:
Q has not seen a join request before ⇒ it becomes forwarder; P becomes child of Q. Join request continues to be forwarded.
Q knows about tree ⇒ P becomes child of Q. No need to forward join request anymore.
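Step 5 can be sketched as follows. The sketch assumes that the route a join request takes toward the root is already known and handed in as a list; in Chord it would come from routing toward succ(mid). The node names, the `tree` dictionary, and the function `join` are all illustrative, and a forwarder is treated as the joining child at the next hop.

```python
def join(tree, path):
    """Process a join request travelling along `path` (joining node first,
    root last). `tree` maps each node in the tree to its set of children."""
    child = path[0]
    for q in path[1:]:
        known = q in tree                   # does Q already know about the tree?
        tree.setdefault(q, set()).add(child)
        if known:
            return                          # no need to forward the request further
        child = q                           # Q became a forwarder; keep forwarding

tree = {"root": set()}                      # the root knows about the tree from the start
join(tree, ["P1", "Q", "root"])             # Q is new: becomes forwarder, request continues
join(tree, ["P2", "Q", "root"])             # Q is known: request stops at Q
print(tree)
```

After both joins, Q is the only child of the root and forwards to P1 and P2, so the second request never reaches the root.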
ALM: Some costs
Different metrics
[Figure: end hosts A, B, C, D, and E attached to routers Ra, Rb, Rc, Rd, and Re in the Internet, with link delays, and an overlay network among the end hosts.]
Link stress: How often does an ALM message cross the same physical link? Example: message from A to D needs to cross ⟨Ra, Rb⟩ twice.
Stretch: Ratio in delay between ALM-level path and network-level path. Example: messages B to C follow path of length 73 at ALM, but 47 at network level ⇒ stretch = 73/47.
Communication: Multicast communication Flooding-based multicasting
Flooding
Essence
P simply sends a message m to each of its neighbors. Each neighbor will forward that message, except to P, and only if it had not seen m before.
Performance
The more edges, the more expensive!
[Figure: the size of a random overlay as function of the number of nodes. Number of edges (× 1000, 0 to 300) against number of nodes (100 to 1000), for p_edge = 0.2, 0.4, and 0.6.]
Variation
Let Q forward a message with a certain probability p_flood, possibly even dependent on its own number of neighbors (i.e., node degree) or the degree of its neighbors.
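To see why more edges means higher cost, the basic flooding rule can be simulated on a random overlay. The sketch below is an illustration only: the graph size, edge probability, and seed are arbitrary choices, and the probabilistic variation with p_flood is not modeled.

```python
import random
from collections import deque

def random_overlay(n, p_edge, rng):
    """Undirected random graph: each pair is linked with probability p_edge."""
    nbrs = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_edge:
                nbrs[u].add(v)
                nbrs[v].add(u)
    return nbrs

def flood(nbrs, source):
    """Flood from `source`; return (nodes reached, messages sent)."""
    seen = {source}
    sent = 0
    todo = deque([(source, None)])      # (node, neighbor it received m from)
    while todo:
        node, sender = todo.popleft()
        for nb in nbrs[node]:
            if nb == sender:
                continue                # do not send m back to the sender
            sent += 1
            if nb not in seen:          # forward only on first receipt
                seen.add(nb)
                todo.append((nb, node))
    return seen, sent

rng = random.Random(1)                  # fixed seed, arbitrary
g = random_overlay(50, 0.2, rng)
reached, sent = flood(g, 0)
print(len(reached), sent)
```

Every reached node sends m over all of its links except the one it arrived on, so the message count grows with the number of edges rather than with the number of nodes.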
Communication: Multicast communication Gossip-based data dissemination
Epidemic protocols
Assume there are no write–write conflicts
Update operations are performed at a single server
A replica passes updated state to only a few neighbors
Update propagation is lazy, i.e., not immediate
Eventually, each update should reach every replica
Two forms of epidemics
Anti-entropy: Each replica regularly chooses another replica at random, and exchanges state differences, leading to identical states at both afterwards
Rumor spreading: A replica which has just been updated (i.e., has been contaminated), tells a number of other replicas about its update (contaminating them as well).
Anti-entropy
Principle operations
A node P selects another node Q from the system at random.
Pull: P only pulls in new updates from Q
Push: P only pushes its own updates to Q
Push-pull: P and Q send updates to each other
Observation
For push-pull it takes O(log(N)) rounds to disseminate updates to all N nodes(round = when every node has taken the initiative to start an exchange).
Anti-entropy: analysis
Basics
Consider a single source, propagating its update. Let $p_i$ be the probability that a node has not received the update after the $i$-th round.
Analysis: staying ignorant
With pull, $p_{i+1} = (p_i)^2$: the node was not updated during the $i$-th round and should contact another ignorant node during the next round.
With push, $p_{i+1} = p_i \left(1 - \frac{1}{N}\right)^{N(1-p_i)} \approx p_i e^{-1}$ (for small $p_i$ and large $N$): the node was ignorant during the $i$-th round and no updated node chooses to contact it during the next round.
With push-pull: $(p_i)^2 \cdot (p_i e^{-1})$
Anti-entropy performance
[Figure: probability of not yet being updated (0 to 1.0) against round (0 to 25) for push, pull, and push-pull, with N = 10,000.]
Rumor spreading
Basic model
A server S having an update to report contacts other servers. If a server is contacted to which the update has already propagated, S stops contacting other servers with probability $p_{stop}$.
Observation
If $s$ is the fraction of ignorant servers (i.e., which are unaware of the update), it can be shown that with many servers
$s = e^{-(1/p_{stop} + 1)(1 - s)}$
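The equation defines s only implicitly, but it can be solved numerically. A minimal sketch, assuming simple fixed-point iteration from the arbitrary starting value 0.5 (s = 1 also satisfies the equation, but it is the trivial solution in which no server is ever updated):

```python
import math

def ignorant_fraction(inv_p_stop, iterations=200):
    """Iterate s <- exp(-(1/p_stop + 1) * (1 - s)), starting from s = 0.5."""
    k = inv_p_stop + 1
    s = 0.5
    for _ in range(iterations):
        s = math.exp(-k * (1 - s))
    return s

for inv in range(1, 8):
    s = ignorant_fraction(inv)
    print(inv, round(s, 6), round(10_000 * s))   # 1/p_stop, s, and Ns for N = 10,000
```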
The effect of stopping
[Figure: s (0.00 to 0.20) against p_stop (0.0 to 1.0).]
Consider 10,000 nodes
1/p_stop   s          Ns
1          0.203188   2032
2          0.059520   595
3          0.019827   198
4          0.006977   70
5          0.002516   25
6          0.000918   9
7          0.000336   3
Note
If we really have to ensure that all servers are eventually updated, rumor spreading alone is not enough.
Deleting values
Fundamental problem
We cannot remove an old value from a server and expect the removal to propagate. Instead, mere removal will be undone in due time using epidemic algorithms.
Solution
Removal has to be registered as a special update by inserting a death certificate.
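The difference between plain removal and a death certificate shows up in a two-replica sketch. The assumptions here are not from the slides: each store maps a key to a (timestamp, value) pair, the newest timestamp wins during an exchange, and a death certificate is modeled as a `TOMBSTONE` marker stored in place of the value.

```python
TOMBSTONE = object()   # marker stored in place of a deleted value

def merge(a, b):
    """Anti-entropy exchange: both stores keep the newest entry per key.
    Each store maps key -> (timestamp, value)."""
    for key in set(a) | set(b):
        newest = max((s[key] for s in (a, b) if key in s), key=lambda e: e[0])
        a[key] = b[key] = newest

def lookup(store, key):
    """Return the value for key, or None if absent or deleted."""
    entry = store.get(key)
    if entry is None or entry[1] is TOMBSTONE:
        return None
    return entry[1]

# Plain removal: the old value comes back after the next exchange.
s1, s2 = {"x": (1, "old")}, {"x": (1, "old")}
del s1["x"]
merge(s1, s2)
resurrected = lookup(s1, "x")

# Removal recorded as a death certificate: the deletion spreads instead.
t1, t2 = {"x": (1, "old")}, {"x": (1, "old")}
t1["x"] = (2, TOMBSTONE)
merge(t1, t2)
print(resurrected, lookup(t1, "x"), lookup(t2, "x"))
```

The certificates themselves still occupy space in every store, which is why they eventually have to be collected.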
When to remove a death certificate (it is not allowed to stay for ever)
Run a global algorithm to detect whether the removal is known everywhere, and then collect the death certificates (looks like garbage collection)
Assume death certificates propagate in finite time, and associate a maximum lifetime for a certificate (can be done at risk of not reaching all servers)
Note
It is necessary that a removal actually reaches all servers.