Distributed Algorithms for Failure Detection in Crash Environments

UPV / EHU

R. Cortiñas, A. Lafuente, M. Larrea

Distributed Systems Group, University of the Basque Country UPV/EHU

Master SIA – Distributed Systems

Guest Stars: P, S and Omega

• P (eventually perfect, written ◇P in [CT96]): strong completeness, eventual strong accuracy
– Eventually every process that crashes is permanently suspected by every correct process
– There is a time after which correct processes are not suspected by any correct process

• S (eventually strong, written ◇S in [CT96]): strong completeness, eventual weak accuracy
– There is a time after which some correct process is never suspected by any correct process

• Omega: eventual leader election
– There is a time after which all the correct processes always trust the same correct process

The First P Algorithm [CT96]

Communication Optimality

A ring arrangement of processes

[Figure: processes p1 to p6 arranged in a ring]

Communication-efficient algorithms: n links are used forever

[Figure: ring of processes p1 to p6]

Communication-optimal algorithms: C links are used forever (C: number of correct processes)

[Figure: ring of processes p1 to p6]
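
As a rough illustration of the two link counts, the sketch below counts the links that keep carrying messages in a ring. It rests on assumptions that are not taken from the slides: the ring order is p1 → p2 → ... → pn → p1, C stands for the number of correct processes, and each correct process eventually sends only to its nearest correct successor. The function name ring_links is hypothetical.

```python
def ring_links(n, crashed):
    """Directed links (sender, receiver) that stay in use forever.

    Assumption for illustration: processes 1..n form a ring, and every
    correct process ends up sending only to its nearest correct successor.
    """
    correct = [p for p in range(1, n + 1) if p not in crashed]
    links = set()
    for i, p in enumerate(correct):
        # Wrap around the ring, skipping crashed processes.
        succ = correct[(i + 1) % len(correct)]
        if succ != p:  # a lone correct process has nobody to send to
            links.add((p, succ))
    return links


print(len(ring_links(6, crashed=set())))   # 6
print(len(ring_links(6, crashed={2, 5})))  # 4
```

With no crashes the count equals n; once p2 and p5 crash, only the 4 links between the remaining correct processes stay in use.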

Communication-optimal P

Communication-optimal Omega

• We also propose an optimal implementation of S, the weakest failure detector for solving Consensus:

– processes ordered: p1, ..., pn
– heartbeat strategy
– communication pattern: one-to-successors
– based on a trusted process (instead of a list of suspected processes)

i) Initially, p1 starts sending messages periodically to the rest of processes, and all processes trust p1

[Figure: processes p1, p2, p3, p4, p5]

trusted1 = p1, trusted2 = p1, trusted3 = p1, trusted4 = p1, trusted5 = p1


ii) If a process does not receive a message within some timeout period from its trusted process pi, then it suspects pi and takes the next process pi+1 as its new trusted process

[Figure: processes p1, p2, p3, p4, p5; p4 times out on p1]

trusted1 = p1, trusted2 = p1, trusted3 = p1, trusted4 = p2 (timeout on p1), trusted5 = p1


iii) If a process trusts itself, then it starts sending messages periodically to its successors

[Figure: processes p1, p2, p3, p4, p5; p2 times out on p1]

trusted1 = p1, trusted2 = p2 (timeout on p1), trusted3 = p1, trusted4 = p2, trusted5 = p1


iv) If a process receives a message from a process pi preceding its trusted process, then it will trust pi again, increasing its timeout period with respect to pi

[Figure: processes p1, p2, p3, p4, p5; p2 and p4 receive a message from p1]

trusted1 = p1
trusted2 = p1 (message from p1; timeout_period21++)
trusted3 = p2
trusted4 = p1 (message from p1; timeout_period41++)
trusted5 = p1
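
Steps i) to iv) can be sketched as a small state machine per process. This is a minimal sketch under stated assumptions, not the authors' implementation. The class OmegaProcess, its method names, and the driver run_rounds are hypothetical. Time is modelled as synchronous rounds in which a timeout fires after one silent round. The per-process timeout counters are incremented as in step iv) but are not used for timing.

```python
class OmegaProcess:
    """One process of the trusted-process algorithm (hypothetical names)."""

    def __init__(self, pid, n, initial_timeout=1):
        self.pid = pid
        self.n = n
        self.trusted = 1  # step i: everybody initially trusts p1
        self.timeout = {q: initial_timeout for q in range(1, n + 1)}

    def heartbeat_targets(self):
        # Steps i and iii: a process that trusts itself sends to its successors.
        if self.trusted == self.pid:
            return list(range(self.pid + 1, self.n + 1))
        return []

    def on_timeout(self):
        # Step ii: suspect the trusted process and move on to the next one.
        # A process never moves past itself.
        if self.trusted < self.pid:
            self.trusted += 1

    def on_message(self, sender):
        # Step iv: a message from a process preceding the trusted one
        # restores trust in the sender and enlarges its timeout.
        if sender < self.trusted:
            self.trusted = sender
            self.timeout[sender] += 1


def run_rounds(n, crashed, rounds):
    """Simplified synchronous driver: crashed processes never take a step."""
    procs = {p: OmegaProcess(p, n) for p in range(1, n + 1) if p not in crashed}
    for _ in range(rounds):
        inbox = {p: [] for p in procs}
        for p, proc in procs.items():
            for q in proc.heartbeat_targets():
                if q in procs:
                    inbox[q].append(p)
        for p, proc in procs.items():
            for sender in sorted(inbox[p]):
                proc.on_message(sender)
            # One silent round from the trusted process counts as a timeout.
            if proc.trusted != p and proc.trusted not in inbox[p]:
                proc.on_timeout()
    return {p: proc.trusted for p, proc in procs.items()}


# Replaying the events shown in the figures for p2 and p4:
p2, p4 = OmegaProcess(2, 5), OmegaProcess(4, 5)
p4.on_timeout()          # trusted4 = p2
p2.on_timeout()          # trusted2 = p2, so p2 starts sending to p3, p4, p5
p2.on_message(1)         # trusted2 = p1 again, timeout towards p1 grows
p4.on_message(1)         # trusted4 = p1 again, timeout towards p1 grows

print(run_rounds(5, crashed={1, 2}, rounds=5))  # {3: 3, 4: 3, 5: 3}
```

In the crash scenario every correct process ends up trusting p3, the first correct process in the order p1, ..., p5.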


• Lemma. With the previous algorithm, eventually all the correct processes will permanently trust the first correct process in p1, ..., pn

• This property trivially allows us to provide the properties of S:

– Eventual weak accuracy: by not suspecting the trusted process
– Strong completeness: by suspecting all the processes except the trusted process
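
The derivation of the S output from the trusted process can be sketched in a few lines. The function name suspected_by and the example values (5 processes, p1 and p2 crashed, p3 as the eventually trusted process) are assumptions chosen for illustration.

```python
def suspected_by(trusted, n):
    """Suspect every process except the currently trusted one."""
    return {q for q in range(1, n + 1) if q != trusted}


n = 5
crashed = {1, 2}
leader = 3  # first correct process, eventually trusted by all correct processes

suspects = suspected_by(leader, n)
print(sorted(suspects))        # [1, 2, 4, 5]
print(crashed <= suspects)     # True: every crashed process is suspected
print(leader in suspects)      # False: the trusted correct process is not
```

Correct processes other than the leader (here p4 and p5) are also suspected, which eventual weak accuracy permits because it only requires that some correct process is eventually never suspected.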

