Fault Tolerant MPI Applications with ULFM BoF

Using ULFM for Implementing Fault Tolerant Applications: Fault Tolerant PDE Applications
Md Mohsin Ali¹ ([email protected]) and Peter E. Strazdins¹

Resilient X10 over MPI
Sara S. Hamouda¹ ([email protected]), Benjamin Herta², Josh Milthorpe¹,², David Grove², and Olivier Tardieu²

¹ Australian National University   ² IBM T. J. Watson Research Center
Using ULFM for implementing Fault Tolerant Applications
• Algorithm-Based Fault Tolerance based on the Sparse Grid Combination Technique (SGCT)
• Designed and implemented general recovery routines for any application (both modes are sketched at the MPI level below):
– Non-shrinking recovery (on the same or spare nodes)
– Shrinking recovery (under way …)
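As a rough MPI-level illustration of these two recovery modes, here is a minimal sketch using the OMPI_-prefixed ULFM calls that appear later in these slides (newer ULFM releases use the MPIX_ prefix instead); the structure and names are ours, not the actual recovery routines:

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM fault tolerance extensions */

/* Shrinking recovery: continue on the surviving processes only.
 * Non-shrinking recovery would additionally replace the failed
 * processes, e.g. by spawning new ones with MPI_Comm_spawn or by
 * enlisting pre-allocated spare ranks, before rebuilding the
 * communicator. */
void shrinking_recovery(MPI_Comm *comm)
{
    MPI_Comm shrunken;
    OMPI_Comm_revoke(*comm);             /* make the failure visible to all ranks */
    OMPI_Comm_shrink(*comm, &shrunken);  /* new communicator without failed ranks */
    MPI_Comm_free(comm);                 /* a revoked communicator may still be freed */
    *comm = shrunken;
}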
Our Success with ULFM MPI
Sparse Grid Combination Technique
• The SGCT is a cost-effective method for solving time-evolving PDEs, especially for high-dimensional problems (the classical two-dimensional combination formula is given below)
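For reference (a standard result, not shown on the slide): in two dimensions, the classical combination technique approximates the sparse grid solution at level n by combining solutions u_{i,j} computed independently on coarse component grids,

u_n^{c}(x,y) \;=\; \sum_{i+j=n} u_{i,j}(x,y) \;-\; \sum_{i+j=n-1} u_{i,j}(x,y)

Each u_{i,j} lives on a grid of roughly 2^i × 2^j points, so the component solves are cheap and partially redundant; that redundancy is what the fault-tolerant variant exploits when a component solution is lost.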
Fault Tolerant SGCT
Algorithm: FT-SGCT Application
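The algorithm figure itself is not reproduced in this transcript. Based on the bullets above and publication 1 below, the recovery loop has roughly the following shape; every helper name here is a hypothetical stand-in, not the actual FT-SGCT code:

#include <mpi.h>

/* Hypothetical helpers standing in for the real FT-SGCT routines. */
void advance_component_grid(void *grid);              /* local PDE time steps */
void checkpoint(void *grid);                          /* save last good state */
int  combine_solutions(void *grid, MPI_Comm comm);    /* returns an MPI error code */
void recover_communicator(MPI_Comm *comm);            /* shrinking or non-shrinking */
void update_combination_coefficients(MPI_Comm comm);  /* exclude lost component grids */
void restore_from_checkpoint(void *grid);

void ft_sgct_solve(void *grid, MPI_Comm comm, int num_combinations)
{
    for (int k = 0; k < num_combinations; k++) {
        advance_component_grid(grid);
        checkpoint(grid);
        if (combine_solutions(grid, comm) != MPI_SUCCESS) {
            recover_communicator(&comm);           /* rebuild the process set */
            update_combination_coefficients(comm); /* compensate for lost grids */
            restore_from_checkpoint(grid);
        }
    }
}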
Publications
1. Ali, M. M.; Southern, J.; Strazdins, P. E.; and Harding, B., 2014. Application level fault recovery: Using Fault-Tolerant Open MPI in a PDE solver. In Proceedings of the IEEE 28th International Parallel & Distributed Processing Symposium Workshops (IPDPSW 2014), 1169–1178. Phoenix, USA. doi:10.1109/IPDPSW.2014.132.
2. Ali, M. M.; Strazdins, P. E.; Harding, B.; Hegland, M.; and Larson, J. W., 2015. A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique. In Proceedings of the 2015 International Conference on High Performance Computing & Simulation (HPCS 2015), 499–507. Amsterdam, The Netherlands. (Outstanding paper award.)
3. Ali, M. M.; Strazdins, P. E.; Harding, B.; and Hegland, M. Complex scientific applications made fault-tolerant with the sparse grid combination technique. International Journal of High Performance Computing Applications (IJHPCA). (Submitted for review.)
Resilient X10 over ULFM
Sara S. Hamouda¹, Benjamin Herta², Josh Milthorpe¹,², David Grove², and Olivier Tardieu²
¹ Australian National University   ² IBM T. J. Watson Research Center
• X10 is an Asynchronous Partitioned Global Address Space (APGAS) language
X10 Over ULFM
• Team APIs

val team = new Team(places);
finish for (place in places) at (place) async {
    val src = new Rail[Int](SIZE, (i:Long) => i as Int);
    val dst = new Rail[Int](SIZE);
    team.allreduce(src, 0, dst, 0, SIZE, Team.ADD);
}

Underlying MPI calls:
MPI_Comm_create(MPI_COMM_WORLD, grp, &comm);
MPI_Iallreduce( … );
X10 Over ULFM
• Team APIs

val team = new Team(places);
finish for (place in places) at (place) async {
    val src = new Rail[Int](SIZE, (i:Long) => i as Int);
    val dst = new Rail[Int](SIZE);
    team.allreduce(src, 0, dst, 0, SIZE, Team.ADD);
}

Underlying MPI calls with ULFM:
MPI_Iallreduce( … );
OMPI_Comm_shrink(MPI_COMM_WORLD, &shrunken);
MPI_Comm_create(shrunken, grp, &comm);

Non-blocking collectives are not supported in the current ULFM implementation.
X10 Over ULFM
• Team APIs – moving to blocking collectives

val team = new Team(places);
finish for (place in places) at (place) async {
    val src = new Rail[Int](SIZE, (i:Long) => i as Int);
    val dst = new Rail[Int](SIZE);
    team.allreduce(src, 0, dst, 0, SIZE, Team.ADD);
}

Underlying MPI calls with ULFM:
MPI_Allreduce( … );   ← blocking collective
x10_emu_barrier();
OMPI_Comm_shrink(MPI_COMM_WORLD, &shrunken);
MPI_Comm_create(shrunken, grp, &comm);
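Putting this slide together, the runtime-level pattern is roughly the following; this is our reconstruction of the idea in C, not X10 runtime source, using the OMPI_ prefix of ULFM 1.x as above:

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM fault tolerance extensions */

/* Blocking allreduce with ULFM-style recovery: if the collective
 * reports a failure, shrink the team communicator so the surviving
 * places can continue; the caller decides whether to retry. */
int resilient_allreduce(const int *src, int *dst, int count, MPI_Comm *team)
{
    int rc = MPI_Allreduce(src, dst, count, MPI_INT, MPI_SUM, *team);
    if (rc != MPI_SUCCESS) {
        MPI_Comm shrunken;
        OMPI_Comm_revoke(*team);             /* propagate failure knowledge */
        OMPI_Comm_shrink(*team, &shrunken);  /* survivors-only communicator */
        MPI_Comm_free(team);
        *team = shrunken;
    }
    return rc;
}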
X10 Over ULFM
[Chart: execution time in seconds (0–16) of the LULESH proxy application, comparing X10 over Sockets (IP over InfiniBand) against X10 over ULFM (InfiniBand), in three configurations: Non-Resilient, Resilient with no failure, and Resilient with a failure (3 checkpoints + 1 restore).]
• LULESH proxy application
The performance improvement from using ULFM v1.0 for the LULESH proxy application, run on 64 processes over 16 nodes with a problem size of 20³ per process. The cluster is an AMD64 Linux cluster; each node has 16 GB RAM and two quad-core AMD Opteron 2356 processors.
Our ULFM Experience

• Good points:
– Sufficient functions are available to implement different recovery schemes
– Many example implementations and tutorials are available
– The functions are designed in a way that makes it easy to find application bugs
➢ Flexibility of the minimalistic fault tolerance approach provided by ULFM
➢ Prompt support from the ULFM team

Our ULFM Experience

• Improvement points:
– The Two-Phase Commit agreement algorithm does not scale well to large core counts (the application-level agreement call is sketched below)
  • Log Two-Phase Commit is scalable, but unstable
– Parallel I/O and non-blocking collectives are not supported
– Performance varies according to the identity of the failed process
– Bugs and hanging issues
➢ ULFM is based on an old Open MPI 1.7 version, in which multi-threading is not well tested
➢ Portability and continuity concerns
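For context, the agreement operation behind the scalability concern above is exposed to applications as a single collective call. A minimal sketch, assuming the OMPI_ prefix of ULFM 1.x (newer releases use MPIX_Comm_agree):

#include <mpi.h>
#include <mpi-ext.h>

/* Fault-tolerant agreement: all surviving processes agree on the
 * bitwise AND of their flags, even if more processes fail during
 * the call. Internally this is where the (Log) Two-Phase Commit
 * protocols mentioned above come in. */
int step_succeeded_everywhere(MPI_Comm comm, int local_ok)
{
    int flag = local_ok ? 1 : 0;
    OMPI_Comm_agree(comm, &flag);   /* collective over all live processes */
    return flag;                    /* 1 only if every live process passed 1 */
}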
• Resilient X10 applications can now run over ULFM and achieve better performance, thanks to MPI's optimized communication routines and its support for high-speed network transports (e.g. InfiniBand verbs).