A New Diskless Checkpointing Approach for Multiple Processor Failures
IEEE Transactions on Dependable and Secure Computing
Ge-Ming Chiu, Member, IEEE Computer Society, and Jane-Ferng Chiu
Presented by Linda Maria Pulickal, S7 CSE
Aug 30, 2014
INTRODUCTION
Checkpoint: a snapshot of the current application state.
Used to restart execution after a failure.
Very important in large-scale distributed computing.
What is Diskless Checkpointing?
Checkpoints are stored in the main memory of peer processors.
No secondary storage is needed, which saves time.
Advantages of the Diskless Approach
No disk latency, so less performance degradation.
Usable when stable storage is unavailable, e.g. in mobile computing systems.
Effective at large scale (10,000-100,000 processors).
Diskless checkpointing
Neighbor-based: each processor saves its checkpoints in their entirety in the memory of peer processors.
Parity-based: uses a dedicated checkpoint processor to store the parity of the checkpoints taken by all the application processors, computed with XOR operations.
Reed-Solomon coding-based: encodes checkpoints of multiple processors using Reed-Solomon erasure coding techniques.
Problems with Existing Techniques
Extra dedicated processors are needed to store checkpoint data.
Extra processors are hard to find, e.g. in mobile computing systems.
Adding processors increases the failure probability.
Memory overhead.
System Model
A collection of n processors (or nodes), P0, P1, P2, ..., Pn-1, interconnected by a wired or wireless network.
GOALS
1. A diskless checkpointing scheme that tolerates up to k simultaneous failures.
2. Reduced memory overhead.
Basic Operation of the Proposed Scheme
Important terms:
1. Checkpoint Storage Nodes
2. Checkpoint Coverage Nodes
Checkpoint Storage - CSi
Checkpoint Coverage - CCi
[Figure: checkpoint storage and checkpoint coverage nodes of P5; labels P1, P2, P3, P4, P5 and S5, S7, S8, S9, S10.]
Steps:
Each Pi sends its checkpoint to at least k other processors (CSi).
-- so at least one node of CSi remains alive for each failed processor.
Pi also stores a copy of its state in a distinct section of its memory.
-- to help other failed processors decode their previous checkpoints.
Each Pi computes the parity of the checkpoints received from CCi using XOR.
It stores only the parity result in memory.
Advantage: the parity needs memory space equal only to the size of the largest checkpoint.
The conceptual framework of the diskless checkpointing approach.
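The parity step can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `xor_parity`, the sample byte strings, and the choice to zero-pad shorter checkpoints are all assumptions made for the example.

```python
def xor_parity(checkpoints):
    """XOR byte strings together, treating shorter ones as zero-padded."""
    size = max(len(c) for c in checkpoints)
    parity = bytearray(size)  # starts as all zeros
    for c in checkpoints:
        for i, b in enumerate(c):
            parity[i] ^= b
    return bytes(parity)

# Hypothetical checkpoints received by one node from its coverage set CCi.
coverage = [b"state-of-P1", b"P2", b"P5-state"]
parity = xor_parity(coverage)

# The stored parity is only as large as the largest checkpoint.
print(len(parity) == max(len(c) for c in coverage))
```

Whatever the number of covered processors, the node keeps a single buffer, which is where the memory saving over storing every checkpoint in full comes from.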
Recovery
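P5 wants to recover.
Node P6 is used.
[Figure: P6 holds S6, the parity of P1, P2 and P5.]
P6 state:
S6 = P1 + P2 + P5
P5 = S6 - P1 - P2
Here + denotes XOR, so "subtracting" a checkpoint means XOR-ing it in again.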
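The recovery equations can be checked with a short Python sketch. The checkpoint contents below are made-up, equal-length byte strings, and `xor_bytes` is a helper name chosen for this example.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical equal-length checkpoints of P1, P2 and P5.
p1 = b"\x01\x02\x03"
p2 = b"\x10\x20\x30"
p5 = b"\xaa\xbb\xcc"

# P6 keeps only the parity S6 = P1 + P2 + P5 (+ is XOR).
s6 = xor_bytes(xor_bytes(p1, p2), p5)

# P5 fails: XOR the surviving checkpoints of P1 and P2 back out of S6.
recovered_p5 = xor_bytes(xor_bytes(s6, p1), p2)
print(recovered_p5 == p5)  # True
```

Recovery works only if P6 is alive and both P1 and P2 can still supply their checkpoints.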
Safe Recovery Criterion
For any failed processor Pi, at least one node in CSi is alive and has all of its checkpoint coverage nodes other than Pi intact.
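As a rough illustration, the criterion can be tested for a given failure pattern as follows. The 5-node layout, in which each Pi stores its checkpoint on the next two nodes, is a hypothetical example, and the function names are not from the paper.

```python
def coverage_sets(cs):
    """Invert the storage sets: cc[j] = processors whose checkpoints node j covers."""
    cc = {}
    for i, stores in cs.items():
        for j in stores:
            cc.setdefault(j, set()).add(i)
    return cc

def safely_recoverable(cs, failed):
    """True if every failed Pi has a live node in CSi whose other covered nodes are intact."""
    cc = coverage_sets(cs)
    for i in failed:
        if not any(j not in failed and not ((cc[j] - {i}) & failed) for j in cs[i]):
            return False
    return True

# Hypothetical layout: Pi stores its checkpoint on P(i+1) and P(i+2), mod 5.
cs = {i: {(i + 1) % 5, (i + 2) % 5} for i in range(5)}

print(safely_recoverable(cs, {0, 2}))  # True
print(safely_recoverable(cs, {0, 1}))  # False: the only live node in CS0 also covers P1
```

The second call fails because P1 is dead and P2, the only live node in CS0, holds a parity that mixes P0 with the equally lost P1.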
DETERMINING THE CHECKPOINT STORAGE NODE SET
Fundamentals of CSi
The cardinality of CSi must be at least k.
The cardinality of CCi is k to ensure good load balance.
[Figure: example assignment of parities S0 to S4 over processors P0 to P4.]
S0 = P3 + P4
S1 = P2 + P3
S2 = P0 + P1
S3 = P2 + P4
S4 = P0 + P1
Not a good design. Why? CS0 ∩ CS1 = {P2, P4}, which has more than 1 element.
[Figure: second example assignment of parities S0 to S4 over processors P0 to P4.]
S0 = P3 + P4
S1 = P0 + P4
S2 = P0 + P1
S3 = P1 + P2
S4 = P2 + P3
Not a good design. Why? P1 ∈ CS0, yet CS0 ∩ CS1 = {P2} is not empty.
Theorems
For all Pi and Pr,
(1) |CSi ∩ CSr| ≤ 1, i ≠ r
For each Pi,
(2) CSi ∩ CSr = ∅, for any Pr ∈ CSi.
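Both conditions can be checked mechanically. The Python sketch below derives each CSi from a parity assignment and tests the two example designs; the names `storage_sets`, `check_conditions`, `design_a` and `design_b` are labels invented for this illustration.

```python
def storage_sets(parity):
    """parity[j] lists the processors XOR-ed into Sj; return CSi for every Pi."""
    cs = {i: set() for i in parity}
    for j, covered in parity.items():
        for i in covered:
            cs[i].add(j)
    return cs

def check_conditions(cs):
    """Return (condition 1 holds, condition 2 holds)."""
    nodes = list(cs)
    cond1 = all(len(cs[i] & cs[r]) <= 1 for i in nodes for r in nodes if i != r)
    cond2 = all(not (cs[i] & cs[r]) for i in nodes for r in cs[i])
    return cond1, cond2

# First example: S0 = P3 + P4, S1 = P2 + P3, S2 = P0 + P1, S3 = P2 + P4, S4 = P0 + P1
design_a = {0: [3, 4], 1: [2, 3], 2: [0, 1], 3: [2, 4], 4: [0, 1]}
# Second example: S0 = P3 + P4, S1 = P0 + P4, S2 = P0 + P1, S3 = P1 + P2, S4 = P2 + P3
design_b = {0: [3, 4], 1: [0, 4], 2: [0, 1], 3: [1, 2], 4: [2, 3]}

print(check_conditions(storage_sets(design_a)))  # (False, False)
print(check_conditions(storage_sets(design_b)))  # (True, False)
```

The first design already breaks condition (1), while the second passes (1) but breaks (2), matching the two "not a good design" remarks.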
Design of CSi's
Cyclic design concept: each CSi is derived from CS0 by a cyclic shift of the processor indices.
So we only need to focus on the design of CS0.
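The slide's formula for deriving CSi did not survive, so the sketch below assumes the usual cyclic construction: every index in CS0 is shifted by i, modulo n. The function name and the sample values (CS0 = {P7, P8, P11, P13}, n = 20) are chosen for illustration only.

```python
def cyclic_storage_sets(cs0, n):
    """Assumed cyclic rule: CSi = {(j + i) mod n for each j in CS0}."""
    return {i: {(j + i) % n for j in cs0} for i in range(n)}

cs = cyclic_storage_sets({7, 8, 11, 13}, n=20)
print(sorted(cs[1]))   # [8, 9, 12, 14]
print(sorted(cs[13]))  # [0, 1, 4, 6]
```

Under this assumption every processor sees the same pattern relative to its own index, which is why designing CS0 alone is enough.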
PSR Sequence: d0, d1, d2, ..., dr-1 is PSR if no l, m, p and q with 0 ≤ l ≤ m < p ≤ q ≤ r - 1 satisfy
d_l + ... + d_m = d_p + ... + d_q
E.g. 2, 1, 5, 3 is not PSR (2 + 1 = 3); 1, 3, 5, 2 is PSR.
PSR ensures that no 2 processors share more than 1 checkpoint storage node.
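A brute-force PSR test is easy to write. The sketch assumes the PSR condition forbids two disjoint runs of consecutive elements from having equal sums, a reading that agrees with both examples above; `is_psr` is a name chosen for this illustration.

```python
from itertools import combinations

def is_psr(d):
    """True if no two disjoint runs of consecutive elements have equal sums."""
    r = len(d)
    # All runs d[l..m], listed in increasing order of l.
    runs = [(l, m, sum(d[l:m + 1])) for l in range(r) for m in range(l, r)]
    for (l, m, s1), (p, q, s2) in combinations(runs, 2):
        # The first run starts no later than the second; m < p makes them disjoint.
        if m < p and s1 == s2:
            return False
    return True

print(is_psr([2, 1, 5, 3]))  # False, since 2 + 1 == 3
print(is_psr([1, 3, 5, 2]))  # True
```

The check enumerates all pairs of runs, which is fine for the short sequences of length k - 1 used here.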
Design of CS0
Steps for k = 4:
Construct a PSR sequence of 3 (i.e. k - 1) positive integers.
Select the sequence with the minimum sum, D. E.g. d0 = 1, d1 = 3, d2 = 2 and D = 6.
The first element of CS0 is PD+1 = P7.
Add d0, d1, d2 as successive increments, starting from P7.
CS0 = {P7, P8, P11, P13}
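The construction for k = 4 can be reproduced with a few lines of Python. `build_cs0` is a hypothetical helper name; it takes the chosen PSR sequence as given and does not search for the minimum-sum sequence.

```python
def build_cs0(d):
    """Start at index D + 1 (D = sum of d), then add d0, d1, ... as running increments."""
    index = sum(d) + 1
    cs0 = [index]
    for step in d:
        index += step
        cs0.append(index)
    return cs0

print(build_cs0([1, 3, 2]))  # [7, 8, 11, 13]
```

The result has k = 4 elements, one more than the length of the PSR sequence.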
Requirements
The total number of processors in the system must be ≥ 3D + 2.
This ensures Theorem 2 (CSi ∩ CSr = ∅ for any Pr ∈ CSi).
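The effect of the 3D + 2 bound can be checked numerically for the k = 4 example. The sketch assumes each CSr is CS0 shifted by r modulo n (the cyclic design), so testing CS0 alone is enough; `condition2_holds` is a name chosen for this illustration.

```python
def condition2_holds(cs0, n):
    """Check that CS0 shares no node with CSr for every Pr in CS0 (cyclic shifts mod n)."""
    return all(not (cs0 & {(j + r) % n for j in cs0}) for r in cs0)

cs0 = {7, 8, 11, 13}  # built from d = 1, 3, 2, so D = 6 and 3D + 2 = 20

print(condition2_holds(cs0, 20))  # True
print(condition2_holds(cs0, 19))  # False: 13 + 13 wraps around to 7, which is in CS0
```

With one processor fewer than 3D + 2, the shifted set wraps around and collides with CS0 in this example.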
Performance Analysis
?
Thank You ….