Page 1
Computing in the Reliable Array of Independent Nodes
Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck
May 5, 2000
IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems
California Institute of Technology
Presenter: Marc Riedel
Page 2
RAIN Project
Collaboration:
• Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu
• JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov
Page 3
RAIN Platform
Heterogeneous network of nodes and switches
[Diagram: nodes connected through multiple switches and a bus network]
Page 4
RAIN Testbed
www.paradise.caltech.edu
10 Pentium boxes w/ multiple NICs
4 eight-way Myrinet switches
Page 5
Proof of Concept: Video Server
[Diagram: nodes A, B, C, D connected via two switches]
Video client & server on every node.
Page 6
Limited Storage
[Diagram: nodes A, B, C, D connected via two switches]
Insufficient storage to replicate all the data on each node.
Page 7
k-of-n Code
Erasure-correcting code: recover the data from any k of n columns.

data:   a    b    c    d
parity: d+c  d+a  a+b  b+c

Example (k = 2, n = 4): from columns 1 and 3 alone, recover b = a + (a+b) and d = c + (d+c).
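The 2-of-4 code above can be sketched in a few lines of Python, taking "+" as XOR. The column layout follows the slide; the function names are illustrative, not from the RAIN implementation.

```python
# Sketch of the 2-of-4 erasure code from the slide ("+" is XOR).
# Column i stores one data symbol and one parity symbol:
#   col 0: (a, d+c)   col 1: (b, d+a)   col 2: (c, a+b)   col 3: (d, b+c)

DATA = ["a", "b", "c", "d"]
PARITY = [("d", "c"), ("d", "a"), ("a", "b"), ("b", "c")]

def encode(sym):
    """sym: dict mapping 'a'..'d' to ints. Returns the 4 columns."""
    return [(sym[DATA[i]], sym[PARITY[i][0]] ^ sym[PARITY[i][1]])
            for i in range(4)]

def decode(cols):
    """cols: dict {column index: (data, parity)} holding any 2 columns."""
    known = {DATA[i]: d for i, (d, _) in cols.items()}
    while len(known) < 4:           # propagate through surviving parities
        for i, (_, p) in cols.items():
            x, y = PARITY[i]
            if x in known and y not in known:
                known[y] = known[x] ^ p
            elif y in known and x not in known:
                known[x] = known[y] ^ p
    return known

cols = encode({"a": 1, "b": 2, "c": 3, "d": 4})
print(decode({0: cols[0], 2: cols[2]}))   # recovers all of a, b, c, d
```

Any 2 of the 4 columns suffice: each surviving parity ties two symbols together, so the loop keeps deriving unknowns until all four data symbols are back.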
Page 8
Encoding
[Diagram: each node stores one data symbol and one parity symbol]
Encode video using 2-of-4 code.
Page 9
Decoding
[Diagram: client retrieves encoded data from the nodes]
Retrieve data and decode.
Page 10
Node Failure
[Diagram: nodes A, B, C, D all operational]
Page 11
Node Failure
[Diagram: node D fails]
Page 12
Node Failure
Dynamically switch to another node.
[Diagram: the client switches to a surviving node]
Page 13
Link Failure
[Diagram: nodes A, B, C, D with all links up]
Page 14
Link Failure
[Diagram: a network link to node D fails]
Page 15
Link Failure
Dynamically switch to another network path.
[Diagram: traffic rerouted over another path]
Page 16
Switch Failure
[Diagram: nodes A, B, C, D and two switches]
Page 17
Switch Failure
[Diagram: a switch fails]
Page 18
Switch Failure
Dynamically switch to another network path.
[Diagram: traffic rerouted through the surviving switch]
Page 19
Node Recovery
[Diagram: node D comes back up]
Page 20
Node Recovery
[Diagram: load rebalanced across all nodes]
Continuous reconfiguration (e.g., load-balancing).
Page 21
Features
High availability:
• tolerates multiple node/link/switch failures
• no single point of failure
Efficient use of resources:
• multiple data paths
• redundant storage
• graceful degradation
Dynamic scalability/reconfigurability
[Stamp: “Certified Buzz-Word Compliant”]
Page 22
RAIN Project: Goals
Efficient, reliable distributed computing and storage systems.
Key building blocks: networks, communication, storage, applications.
Page 23
Topics
Today’s Talk:
• Fault-Tolerant Interconnect Topologies
• Connectivity
• Group Membership
• Distributed Storage
Page 24
Interconnect Topologies
[Diagram: computing/storage nodes (N) attached to a network]
Goal: lose at most a constant number of nodes for a given amount of network loss.
Page 25
Resistance to Partitions
[Diagram: a network failure threatens to partition the nodes]
Large partitions are problematic for distributed services/computation.
Page 26
Resistance to Partitions
[Diagram: the network splits into two large partitions]
Large partitions are problematic for distributed services/computation.
Page 27
Related Work
Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks:
• Hayes et al., Bruck et al., Boesch et al.
Bus-based networks resistant to partitioning:
• Ku and Hayes, 1997, “Connective Fault-Tolerance in Multiple-Bus Systems”
Page 28
A Ring of Switches
[Diagram: ring of switches (S), each attached to compute nodes (N)]
A naïve solution: degree-2 compute nodes, degree-4 switches.
Page 29
A Ring of Switches
[Diagram: ring of switches (S), each attached to compute nodes (N)]
A naïve solution: degree-2 compute nodes, degree-4 switches.
Page 30
A Ring of Switches
A naïve solution: degree-2 compute nodes, degree-4 switches.
Easily partitioned: failing just two switches splits the ring into disconnected halves.
[Diagram: the ring split in two by switch failures]
Page 31
Resistance to Partitioning
[Diagram: 8 switches in a ring, with each compute node attached to a pair of switches across a diagonal]
Nodes on diagonals; degree-2 compute nodes, degree-4 switches.
Page 32
Resistance to Partitioning
[Diagram: 8 switches in a ring, with each compute node attached to a pair of switches across a diagonal]
Nodes on diagonals; degree-2 compute nodes, degree-4 switches.
Page 33
Resistance to Partitioning
Nodes on diagonals; degree-2 compute nodes, degree-4 switches:
• tolerates any 3 switch failures (optimal)
• generalizes to arbitrary node/switch degrees
[Diagram: the network remains connected after 3 switch failures]
Details: IPPS ’98 paper, www.paradise.caltech.edu
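The 3-failure tolerance claim can be checked by brute force. The sketch below assumes one plausible reading of the slide's "nodes on diagonals" construction: 8 switches in a ring, with compute node i attached to switches i and (i+3) mod 8, giving every switch degree 4 (two ring links, two node links). The offset 3 is an assumption, not taken from the paper.

```python
# Brute-force connectivity check for an assumed diagonal ring construction:
# 8 switches in a ring; compute node i bridges switches i and (i+3) % 8.
from itertools import combinations

N = 8
RING = {(s, (s + 1) % N) for s in range(N)}     # switch-to-switch ring edges
CHORDS = {(i, (i + 3) % N) for i in range(N)}   # links bridged by the nodes

def connected_after(dead):
    """True if the surviving switches form a single component after the
    switches in `dead` fail (a node bridges its two switches iff both live)."""
    alive = [s for s in range(N) if s not in dead]
    adj = {s: set() for s in alive}
    for a, b in RING | CHORDS:
        if a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = {alive[0]}, [alive[0]]
    while stack:                                 # depth-first search
        for t in adj[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return len(seen) == len(alive)

# any 3 switch failures leave the surviving network connected...
print(all(connected_after(set(d)) for d in combinations(range(N), 3)))
# ...while some 4 failures partition it, so 3 is the best possible
print(any(not connected_after(set(d)) for d in combinations(range(N), 4)))
```

The effective switch graph here is the circulant graph C8(1, 3), which is 4-regular and 4-connected, so removing any 3 vertices cannot disconnect it; removing the 4 neighbors of one switch does.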
Page 34
Resistance to Partitioning
[Diagram: two drawings of the same network; the layouts are isomorphic]
Details: IPPS ’98 paper, www.paradise.caltech.edu
Page 35
Point-to-Point Connectivity
Is the path from A to B up or down?
[Diagram: nodes A and B at opposite ends of a path through the network]
Page 36
Connectivity
Link is seen as up or down by each node.
[Diagram: Node A and Node B each see the link as up (U) or down (D)]
Bi-directional communication.
Each node sends out pings. A node may time out, deciding the link is down.
Page 37
Consistent History
[Diagram: nodes A and B each log a history of U/D state transitions over time; the two histories record a consistent sequence of channel states]
Page 38
The Slack
[Diagram: A’s state transitions run ahead of B’s: A is 1 ahead, then 2 ahead; now A waits for B to transition]
Slack n = 2: at most 2 unacknowledged transitions before a node waits.
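The bounded-slack rule above can be sketched as a small state machine: a node may run at most `slack` transitions ahead of what its peer has mirrored, then it must wait. Class and method names are illustrative, not from the RAIN implementation.

```python
# Sketch of the bounded-slack idea: at most `slack` unacknowledged
# Up/Down transitions before a node must wait for its peer.

class SlackChannel:
    def __init__(self, slack=2):
        self.slack = slack
        self.state = "U"            # local view of the link: Up/Down
        self.my_transitions = 0     # transitions this node has made
        self.peer_acked = 0         # transitions the peer has mirrored

    def can_transition(self):
        # allowed only while fewer than `slack` transitions are outstanding
        return self.my_transitions - self.peer_acked < self.slack

    def transition(self):
        if not self.can_transition():
            return False            # must wait for the peer to catch up
        self.state = "D" if self.state == "U" else "U"
        self.my_transitions += 1
        return True

    def ack(self, count):
        # peer reports how many of our transitions it has mirrored
        self.peer_acked = max(self.peer_acked, count)

a = SlackChannel(slack=2)
print(a.transition(), a.transition(), a.transition())  # True True False
a.ack(1)
print(a.transition())                                  # True
```

With slack 2, the third transition is refused until the peer acknowledges at least one of the outstanding transitions, matching the "A is 2 ahead, now A waits" step on the slide.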
Page 39
Consistent History
Consistency in error reporting: if A sees a channel error, B sees a channel error.
Birman et al.: “Reliability Through Consistency”
[Diagram: Node A and Node B with consistent {U,D} views of the channel]
Details: paper IPPS’99, www.paradise.caltech.edu
Page 40
Group Membership
[Diagram: nodes A, B, C, D, each holding the membership view ABCD]
Consistent global view given local, point-to-point connectivity information:
• link/node failures
• dynamic reconfiguration
Page 41
Related Work
Systems: Totem, Isis/Horus, Transis
Theory: Chandra et al., impossibility of group membership in an asynchronous environment
Page 42
Group Membership
Token-ring based group membership protocol
[Diagram: nodes A, B, C, D in a ring]
Page 43
Group Membership
[Diagram: nodes A, B, C, D in a token ring]
Token carries:
• group membership list
• sequence number
Token: “1: ABCD”
Page 44
Group Membership
[Diagram: token “1: ABCD” at node A]
Page 45
Group Membership
[Diagram: token “2: ABCD” at node B; the sequence number increments at each hop]
Page 46
Group Membership
[Diagram: token “3: ABCD” at node C]
Page 47
Group Membership
[Diagram: token “4: ABCD” at node D]
Page 48
Group Membership
[Diagram: token “5: ABCD” back at node A]
Page 49
Group Membership
[Diagram: token circulating among A, B, C, D]
Node or link fails:
Page 50
Group Membership
[Diagram: node B fails]
Node or link fails:
Page 51
Group Membership
[Diagram: the token cannot reach node B]
Node or link fails:
Page 52
Group Membership
[Diagram: the token still cannot reach node B]
Node or link fails:
Page 53
Group Membership
Node or link fails:
[Diagram: token “5: ACD” bypasses node B]
If a node is inaccessible, it is excluded and bypassed.
Page 54
Group Membership
Node or link fails:
[Diagram: token “6: ACD” continues around the reduced ring]
If a node is inaccessible, it is excluded and bypassed.
Page 55
Group Membership
Node or link fails:
[Diagram: the token continues around the reduced ring]
If a node is inaccessible, it is excluded and bypassed.
Page 56
Group Membership
Node or link fails:
[Diagram: the token continues around the reduced ring]
If a node is inaccessible, it is excluded and bypassed.
Page 57
Group Membership
[Diagram: one node holds the token]
Node with token fails:
Page 58
Group Membership
[Diagram: the node holding the token fails]
Node with token fails:
Page 59
Group Membership
[Diagram: the token is lost]
Node with token fails:
Page 60
Group Membership
Node with token fails:
[Diagram: the remaining nodes detect the lost token]
If the token is lost, it is regenerated.
Page 61
Group Membership
Node with token fails:
[Diagram: surviving nodes regenerate the token]
If the token is lost, it is regenerated.
Page 62
Group Membership
Node with token fails:
[Diagram: two regenerated tokens circulate: “5: ACD” and “6: AD”]
If the token is lost, it is regenerated.
Page 63
Group Membership
Node with token fails:
[Diagram: competing tokens “5: ACD” and “6: AD”]
If the token is lost, it is regenerated.
Highest sequence number prevails.
Page 64
Group Membership
Node with token fails:
[Diagram: the token with the highest sequence number survives]
If the token is lost, it is regenerated.
Highest sequence number prevails.
Page 65
Group Membership
Node recovers:
[Diagram: a failed node comes back up]
Page 66
Group Membership
Node recovers:
[Diagram: the recovering node rejoins the ring]
Recovering nodes are added.
Page 67
Group Membership
Node recovers:
[Diagram: token “7: ADC” includes the recovered node]
Recovering nodes are added.
Page 68
Group Membership
Node recovers:
[Diagram: token “8: ADC”]
Recovering nodes are added.
Page 69
Group Membership
Node recovers:
[Diagram: token “9: ADC”]
Recovering nodes are added.
Page 70
Group Membership
Node recovers:
[Diagram: the token continues circulating]
Recovering nodes are added.
Page 71
Group Membership
[Diagram: the token ring with all reachable nodes in the group]
• Unicast messages
• Dynamic reconfiguration
• Mean time-to-failure > convergence time
Features:
Details: publication forthcoming.
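The protocol walked through on the preceding slides can be modeled in a few lines: the token carries (sequence number, membership list), unreachable nodes are excluded as the token passes, and when regenerated tokens meet, the highest sequence number prevails. Timeouts and message loss are abstracted away; the function names are illustrative, not from the RAIN implementation.

```python
# Toy model of the token-ring membership protocol from the slides.

def circulate(seq, members, alive, hops):
    """Advance the token `hops` live hops around the ring.
    `alive` maps node name -> bool; dead nodes are dropped from the list."""
    members = list(members)
    pos = 0
    for _ in range(hops):
        # skip (and exclude) any dead node the token would visit next
        while not alive[members[pos % len(members)]]:
            dead = members[pos % len(members)]
            members = [m for m in members if m != dead]
        seq += 1                                  # one successful hop
        pos = (pos + 1) % len(members)
    return seq, members

def prevail(t1, t2):
    """When two regenerated tokens meet, the highest sequence number wins."""
    return t1 if t1[0] >= t2[0] else t2

alive = {"A": True, "B": True, "C": True, "D": True}
token = circulate(0, ["A", "B", "C", "D"], alive, 4)
print(token)              # (4, ['A', 'B', 'C', 'D'])
alive["B"] = False        # node B fails
token = circulate(token[0], token[1], alive, 3)
print(token)              # (7, ['A', 'C', 'D'])
```

This reproduces the animation: four clean hops carry the token to sequence number 4 with view ABCD; after B fails, the next pass excludes it and continues as “… ACD”, just as on the slides.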
Page 72
Distributed Storage
[Diagram: the bit stream 1 0 1 0 0 1 0 0 1 0 0 0 to be stored across four disks]
Page 73
Distributed Storage
[Diagram: the bits distributed across the four disks]
Focus: reliability and performance.
Page 74
Array Codes
data:       a    b    c    d
redundancy: d+c  d+a  a+b  b+c
(the “B-code”)
Ideally suited for distributed storage; low encoding/decoding complexity.
Page 75
Array Codes
data:       a    b    c    d
redundancy: d+c  d+a  a+b  b+c
Recover the data from any k of n columns, e.g. b = a + (a+b), d = c + (d+c).
Ideally suited for distributed storage; low encoding/decoding complexity.
Page 76
Array Codes
data:       a    b    c    d
redundancy: d+c  d+a  a+b  b+c
Ideally suited for distributed storage; low encoding/decoding complexity.
B-Code and X-Code:
• optimally redundant
• optimal encoding/decoding complexity
Details: IEEE Trans. Info Theory, www.paradise.caltech.edu
Page 77
Summary
Fault-Tolerant Interconnect Topologies: [ring-with-diagonals network]
Connectivity: [nodes A and B exchanging {U,D} channel states]
Group Membership: [token ring with token “1: ABCD”]
Distributed Storage: [array code with data a, b, c, d and parities d+c, d+a, a+b, b+c]
Page 78
Proof-of-Concept Applications
• RAINVideo: high-availability video server
• RAINCheck: distributed checkpoint rollback/recovery system
• SNOW: Stable Network of Webservers
Page 79
Rainfinity
www.rainfinity.com
Start-up based on RAIN technology.
Business Plan: clustered solutions for Internet data centers, focusing on:
• availability
• scalability
• performance
Page 80
Rainfinity
www.rainfinity.com
Start-up based on RAIN technology.
Company:
• Founded Sept. 1998
• Released first product April 1999
• Received $15 million funding in Dec. 1999
• Now over 50 employees
Page 81
Future Research
• Development of APIs
• Fault-tolerant distributed filesystem
• Fault-tolerant MPI/PVM implementation
Page 82
End of Talk
Material that was cut...
Page 83
Erasure Correcting Codes
Strategy: encode data with an erasure-correcting code.
[Diagram: k data symbols encoded into n symbols]
Page 84
Erasure Correcting Codes
[Diagram: up to m of the n encoded coordinates are lost]
Strategy: encode data with an erasure-correcting code.
Page 85
Erasure Correcting Codes
[Diagram: the k data symbols reconstructed from the surviving coordinates]
Strategy: encode data with an erasure-correcting code.
Page 86
Erasure Correcting Codes
A code is optimally redundant (MDS) if n = k + m. Example: Reed-Solomon codes.
[Diagram: k data symbols encoded into n coordinates; the data is reconstructed after losing up to m of them]
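A minimal MDS instance makes the n = k + m bound concrete: k = 2 data symbols plus m = 1 XOR parity symbol gives n = 3 coordinates, and any single erasure is recoverable. This simple parity code is my own illustration; Reed-Solomon generalizes it to arbitrary k and m.

```python
# Minimal MDS example: k = 2, m = 1, n = k + m = 3 (parity is XOR).

def encode(x, y):
    return [x, y, x ^ y]          # n = 3 coordinates

def recover(coords):
    """coords: list of (index, value) for any k = 2 surviving coordinates."""
    d = dict(coords)
    if 0 in d and 1 in d:
        return d[0], d[1]         # both data coordinates survived
    if 0 in d:
        return d[0], d[0] ^ d[2]  # lost y: y = x ^ (x ^ y)
    return d[1] ^ d[2], d[1]      # lost x: x = y ^ (x ^ y)

code = encode(5, 9)
print(recover([(1, code[1]), (2, code[2])]))  # (5, 9)
```

Losing any one of the three coordinates (m = 1) still leaves k = 2 survivors, which is exactly enough to reconstruct the data, so the code meets the MDS bound.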
Page 87
RAIN: Distributed Store
[Diagram: columns (a, d+c), (b, d+a), (c, a+b), (d, b+c) stored across four disks]
• Encode data with an (n, k) array code
• Store one symbol per node
Page 88
RAIN: Distributed Retrieve
[Diagram: four disks holding columns (a, d+c), (b, d+a), (c, a+b), (d, b+c)]
• Retrieve encoded data from any k nodes
• Reconstruct data a, b, c, d
Page 89
RAIN: Distributed Retrieve
[Diagram: data a, b, c, d reconstructed from the disks]
• Reliability (similar to RAID systems)
Page 90
RAIN: Distributed Retrieve
[Diagram: data a, b, c, d reconstructed from the disks]
• Reliability (similar to RAID systems)
Page 91
RAIN: Distributed Retrieve
[Diagram: data retrieved from a subset of the disks]
• Reliability (similar to RAID systems)
• Performance: load-balancing
Page 92
RAIN: Distributed Retrieve
[Diagram: data retrieved from a subset of the disks]
• Reliability (similar to RAID systems)
• Performance: load-balancing
Page 93
RAIN: Distributed Retrieve
[Diagram: one disk is busy; the client retrieves from the other disks instead]
• Reliability (similar to RAID systems)
• Performance: load-balancing