Page 1:

Fast Failure Recovery in Distributed Graph Processing Systems

Presented by HaeJoon Lee

Yanyan Shen, Beng Chin Ooi, Bogdan Marius Tudor (National University of Singapore)
Wei Lu (Renmin University)
Gang Chen (Zhejiang University)
H.V. Jagadish (University of Michigan)

Big Data Final Seminar
VLDB 2014

Page 2:

Outline

1. Background
2. Motivation
3. Partition Based Recovery
4. Implementation
5. Evaluation
6. Conclusion

Page 3:

1 Background


Distributed Graph Processing System (DGPS)
- The set of vertices and edges is divided into partitions.
- The partitions are distributed among the compute nodes.

Bulk Synchronous Parallel (BSP): the computation model used in DGPS.
- Each worker first executes an input phase; the workers then process the graph iteratively in supersteps separated by a global barrier (see the sketch below).
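
As a rough illustration, here is a minimal, self-contained sketch of the BSP superstep loop. All names (bsp_run, compute, inbox/outbox) are hypothetical and not the API of Giraph or any particular system:

# Minimal BSP sketch: every superstep computes all vertices, then a
# global barrier delivers the messages produced for the next superstep.
def bsp_run(vertices, compute, max_supersteps):
    # vertices: {vid: state}
    # compute(vid, state, msgs) -> (new_state, [(dst_vid, msg), ...])
    inbox = {vid: [] for vid in vertices}
    for _ in range(max_supersteps):
        outbox = {vid: [] for vid in vertices}
        for vid in list(vertices):
            new_state, outgoing = compute(vid, vertices[vid], inbox[vid])
            vertices[vid] = new_state
            for dst, msg in outgoing:
                outbox[dst].append(msg)
        inbox = outbox  # global barrier: messages become visible next superstep
    return vertices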

Page 4:

1 Background


Why do you think we need fast recovery?

Scaling the number of compute nodes has two effects:
- It increases the number of node failures during job execution.
- System progress stops during recovery, so many nodes can become idle.

For these reasons, we need an efficient failure recovery mechanism.

Page 5:

2 Motivation


Checkpoint-Based Recovery (CBR) flow:
- Requires all nodes to write their status to stable storage as a checkpoint.
- On failure, uses the healthy nodes to load the status from the last checkpoint.
- Re-executes all of the missing workload.

However, CBR incurs high recovery latency:
- It re-executes the missing workload over the whole graph, on the failed and even the healthy nodes (a toy sketch follows).
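
A toy model of this flow, assuming every node rolls back to the last checkpoint and the whole cluster replays every missing superstep; all names are illustrative:

# Toy CBR model: ALL nodes, healthy or failed, reload the last
# checkpoint and replay every superstep up to the failure point.
def cbr_recover(checkpoint, failed_step, replay):
    # checkpoint: (step, {node: state}); replay(state, step) -> state
    ckpt_step, states = checkpoint
    states = dict(states)  # every node reloads its state from storage
    for step in range(ckpt_step + 1, failed_step + 1):
        states = {node: replay(s, step) for node, s in states.items()}
    return states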

Page 6:

2 Motivation


The cascading failure problem:
- Definition: failures can occur at any time during execution, including while the system is still recovering from an earlier failure (a cascading failure).
- Coping with this by checkpointing frequently incurs long execution time.

This motivates Fast Failure Recovery, i.e., Partition-Based Recovery (PBR).

Page 7:

Outline

1. Background
2. Motivation
3. Partition Based Recovery
4. Implementation
5. Evaluation
6. Conclusion

Page 8:

3 Partition Based Recovery


Execution flow:
- Restrict recovery to the subgraph held by the failed nodes, using locally logged messages.
- Divide that subgraph into partitions.
- Distribute these partitions among the compute nodes.
- Reload these partitions from the last checkpoint and rebalance them.

What is local message logging in PBR?
- PBR requires every node to log its outgoing messages at the end of each superstep.
- During recovery, every healthy node forwards its logged messages to the vertices in the failed partitions (see the sketch below).
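
To contrast with the CBR toy model above, here is an equally rough sketch of PBR under the same assumptions: only the failed partitions reload the checkpoint, while healthy partitions contribute their logged messages. All names are illustrative, not the paper's implementation:

# Toy PBR model: only FAILED partitions reload the checkpoint and
# recompute; healthy partitions just supply their logged messages.
def pbr_recover(checkpoint, failed_step, failed_parts,
                healthy_states, logged_msgs, replay):
    # checkpoint: (step, {partition: state})
    # logged_msgs: {step: {failed_partition: [msgs logged by healthy nodes]}}
    # replay(state, step, msgs) -> state
    ckpt_step, ckpt_states = checkpoint
    states = {p: ckpt_states[p] for p in failed_parts}
    for step in range(ckpt_step + 1, failed_step + 1):
        incoming = logged_msgs.get(step, {})
        states = {p: replay(s, step, incoming.get(p, []))
                  for p, s in states.items()}
    return {**healthy_states, **states}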

Page 9:

3 CBR vs PBR


Checkpoint-Based Recovery
(figure: partitions A-F spread over nodes N1 and N2, with checkpointed copies A'-F' in each node's storage; scenario: node N2 fails)

Each node's storage holds a checkpoint. If N2 fails, every node must reload from its checkpoint and recompute, so CBR incurs HIGH computation cost and communication cost.

Page 10:

3 CBR vs PBR


Partition-Based Recovery
(figure: partitions A-F on nodes N1 and N2; scenario: node N2 fails)

If N2 fails, only its partitions are recovered, and they are redistributed among the surviving nodes so recovery runs in parallel.

Page 11:

3 Details of PBR


(figure: failed partitions A-F being reassigned across nodes N1 and N2)

1. Partition reassignment
- Start from a randomly generated assignment of the failed partitions.
- In each iteration, compute the cost of the generated assignment.
- Track the minimal cost seen so far.
- After checking the generated assignments, keep the one with minimal cost as the (near-)optimal partition assignment; a search sketch follows below.
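
A minimal sketch of that search, assuming random candidate generation and a caller-supplied cost function; the paper's actual cost model (which weighs recomputation and communication) is not reproduced here:

import random

# Illustrative reassignment search: generate random assignments of the
# failed partitions to nodes and keep the cheapest one found.
def reassign(failed_parts, nodes, cost, iterations=1000):
    # cost(assignment) -> float, where assignment: {partition: node}
    best_assign, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = {p: random.choice(nodes) for p in failed_parts}
        c = cost(candidate)
        if c < best_cost:
            best_assign, best_cost = candidate, c
    return best_assign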

Page 12:

3 Details of PBR


2. Recomputing the missing workload
(figure: partitions A-F across nodes N1 and N2 at supersteps 11 and 12; legend: locally logged messages, vertices recomputed from the checkpoint, failed vs. healthy partitions)

Scenario: node N2 fails in superstep 12, and the latest checkpoint was taken at superstep 11.
- The failed partitions (A,B) and (C,D) load their state from the checkpoint at superstep 11.
- The healthy partitions forward their locally logged messages to the vertices in the failed partitions.

A worked instance of this scenario follows.
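
For concreteness, here is the scenario run through the pbr_recover toy model from the PBR execution-flow section above; every value is illustrative:

# Checkpoint at superstep 11; N2's partitions (A,B) and (C,D) fail in
# superstep 12; the healthy partition group (E,F) supplies logged messages.
checkpoint = (11, {"A,B": "state@11", "C,D": "state@11"})
logged = {12: {"A,B": ["logged msg from E,F"],
               "C,D": ["logged msg from E,F"]}}

def replay(state, step, msgs):
    return f"{state} -> recomputed@{step} using {len(msgs)} logged msg(s)"

print(pbr_recover(checkpoint, 12, ["A,B", "C,D"],
                  {"E,F": "state@12"}, logged, replay))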

Page 13:

3 Details of PBR


3. Rebalancing
- Rebalance the partition-to-node configuration if it differs across nodes.

How does PBR handle cascading failures?
- Unlike CBR, PBR treats a cascading failure as a normal failure by simply re-executing these three steps (a control-flow sketch follows).
- In practice, failures do not occur very frequently, so this is affordable.
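
A minimal control-flow sketch of that policy, assuming hypothetical hooks for failure detection and the three PBR steps:

# Cascading failures handled as ordinary failures: if a new failure is
# detected while recovering, simply run the three PBR steps again.
# `detect_new_failure` and `run_pbr_steps` are hypothetical hooks.
def recover_until_stable(detect_new_failure, run_pbr_steps):
    while True:
        run_pbr_steps()        # 1. reassign, 2. recompute, 3. rebalance
        if not detect_new_failure():
            return             # recovery finished with no new failure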

Page 14:

4 PBR Architecture on Giraph


Master - computes the partition assignment ('Assign Partitions') as the recovery plan and saves it to ZooKeeper.

ZooKeeper - a centralized service for maintaining configuration information and naming, and for providing distributed synchronization.

Slaves - fetch the partition assignment from ZooKeeper.

If (slaves are at a checkpointing superstep): they write a checkpoint and then perform the computation.
Else if (slaves are restarting after a failure): they load their partitions and then perform the computation (see the sketch below).
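
A self-contained sketch of the slave-side control flow described above. Every helper here is a hypothetical stand-in (stubbed so the example runs), not Giraph's or ZooKeeper's actual API:

# Hypothetical stubs standing in for real Giraph/ZooKeeper calls.
def fetch_partitions_from_zookeeper(): return ["P1", "P2"]             # stub
def load_partitions(parts): return {p: "from checkpoint" for p in parts}  # stub
def write_checkpoint(state): pass                                      # stub
def compute_superstep(state, step): return state                       # stub

def slave_superstep(state, step, restarting, checkpoint_interval=10):
    partitions = fetch_partitions_from_zookeeper()  # the recovery plan
    if restarting:
        state = load_partitions(partitions)   # reload after a failure
    elif step % checkpoint_interval == 0:
        write_checkpoint(state)               # checkpoint, then compute
    return compute_superstep(state, step)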

Page 15:

Outline

1. Background
2. Motivation
3. Partition Based Recovery
4. Implementation
5. Evaluation
6. Conclusion

Page 16:

5 Experimental Setup: CBR vs PBR

Benchmarks - K-means, semi-clustering, and PageRank.
- All tasks run for 20 supersteps.
- A checkpoint is taken at the beginning of superstep 11.

Cluster - 72 compute nodes.
- Each node: Intel X3430 2.4 GHz, 8 GB memory, 2 x 500 GB HDDs.
- Giraph with PBR runs as a MapReduce job on Hadoop.

Dataset - (table not preserved in this transcript; the PageRank experiments use the Friendster graph)


Page 17:

5 Evaluation: K-means, CBR vs PBR


< Checkpoint at the beginning of superstep 11 >

Recovery time: PBR outperforms CBR by a factor of 12.4 to 25.7. The recovery times of both approaches increase linearly.

Overall execution time: PBR takes almost the same time as CBR.
- In K-means, vertices send no outgoing messages to other vertices, so there is little to log.
- Checkpointing time is negligible compared to computing the new cluster memberships.

Page 18:

5 Evaluation: K-means, CBR vs PBR


These experiments verify the effectiveness of PBR, which parallelizes the recovery computation and eliminates unnecessary recovery cost.

< Checkpoint at the beginning of superstep 11 >

PBR outperforms CBR by a factor of 6.8 to 23.9.
- CBR's cost barely changes no matter how many nodes fail, because all nodes must reload the checkpoint and redo the whole computation.

PBR reduces recovery time by a factor of 23.8 to 26.8 compared with CBR.

Page 19:

5 Evaluation: PageRank, CBR vs PBR


< Checkpoint at the beginning of superstep 11 >

Checkpointing / overall execution time: PBR takes slightly more time than CBR.
- Friendster has a power-law link distribution, so message volume is high.
- Each superstep involves logging a large number of outgoing messages via disk I/O.

Page 20:

5 Evaluation: PageRank, CBR vs PBR


These experiments verify the effectiveness of PBR, which parallelizes the recovery computation and eliminates unnecessary recovery cost.

< Checkpoint at the beginning of superstep 11 >

Page 21:

6 Conclusion


Partition-based recovery is proposed as a novel recovery mechanism that parallelizes failure recovery processing.

The system distributes the recovery task across multiple compute nodes so that recovery can be executed concurrently.

It is implemented on the widely used Giraph system and observed to outperform the existing checkpoint-based recovery scheme by up to 30 times.

Page 22:

Thanks

Page 23:

6 Backup: Semi-Clustering


Page 24:

6 PBR Architecture on Giraph


Master - computes the partition assignment ('Assign Partitions') as the recovery plan and saves it to ZooKeeper.

Slaves fetch the partition assignment from ZooKeeper.
- If they are at a checkpointing superstep, they write a checkpoint and then perform the computation.
- If they are restarting after a failure, they load their partitions and then perform the computation.

Page 25:

6 Backup: Communication Cost of PageRank
