An Overview of Berkeley Lab’s Linux Checkpoint/Restart (BLCR)
Paul Hargrove with Jason Duell and Eric Roman
January 13th, 2004
(Based on slides by Jason Duell)
http://ftg.lbl.gov/checkpoint
checkpoint@lbl.gov



Linux checkpoint/restart

Outline

Project goals

System design

Extension interface

Current status

Future work


Uses of Checkpoint/Restart

Gang scheduling
● No queue drain for maintenance, policy change
● Higher utilization and/or more flexible scheduling

Process migration
● Save job if node failure imminent
● Pack jobs for optimal network performance

Periodic backup
● Not our main focus
● Application can always do more efficiently
● But may be useful for systems with long jobs, fast I/O, and/or high node failure rates


Implementation Strategies

Application-based checkpointing
● Efficient: save only needed data as step completes
● Good for fault tolerance, bad for preemption
● Requires per-application effort by programmer

Library-based checkpointing
● Portable across operating systems
● Transparent to application (but may require relink, etc.)
● Can’t (generally) restore all resources (e.g., process IDs)
● Can’t checkpoint shell scripts

Kernel-based checkpointing
● Not portable, and harder to implement
● Can save/restore all resources
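
To make the application-based strategy concrete, here is a minimal sketch in Python of a job that saves only the data it needs as each step completes and resumes from that point when rerun. It is an illustration, not part of BLCR; the function name `run_steps`, the JSON checkpoint file, and the `stop_after` knob (which simulates the job being killed) are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_steps(total_steps, ckpt_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    `stop_after` is a hypothetical knob that simulates the job being
    killed after that many steps in this invocation.
    """
    # Resume from the last completed step if a checkpoint file exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    done_now = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_now >= stop_after:
            return None  # simulated failure; the checkpoint stays on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_now += 1
        # Save only the data needed to resume: two integers.
        # Write to a temp file and rename so a crash never leaves a torn file.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state["total"]
```

Calling `run_steps(10, path, stop_after=4)` and then `run_steps(10, path)` finishes the sum from step 4 onward. Note that state is only saved at step boundaries chosen by the programmer, which is presumably why this strategy suits fault tolerance better than on-demand preemption.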


Design Goals

Target: parallel scientific applications
● MPI is a must
● But allow support for other programs/models, too
● Esoteric features (ptrace, Unix domain sockets) have lower implementation priority

Implementation: Linux kernel module
● Lower barrier to adoption than kernel patch
● Allows upgrades and bug fixes without reboot


Design Goals II

Provide ‘toolkit’ for distributed C/R
● We provide single-node checkpoint/restart
● We don’t support distributed operating system features
  • No built-in support for TCP sockets, bproc namespaces, etc.
● We provide hooks to allow parallel runtimes/libraries to implement distributed checkpoint/restart
  • So the MPI library needs to know about checkpointing, but user applications don’t
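
The division of labor above can be illustrated with a toy model in Python: the parallel runtime first drains messages that are still in flight, and only then does each rank take its own single-node snapshot. This is one plausible coordination scheme, sketched under assumptions; it is not taken from the slides and is not the LAM/MPI algorithm. The `Rank` class, the `deque` standing in for the network, and `coordinated_checkpoint` are hypothetical names.

```python
from collections import deque


class Rank:
    """A toy process with send/receive counters; stands in for one MPI rank."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.sent = {}       # peer id -> number of messages sent to that peer
        self.received = {}   # peer id -> number of messages received from it
        self.delivered = []  # payloads the application has consumed

    def send(self, network, dest, payload):
        network.append((self.rank_id, dest, payload))
        self.sent[dest] = self.sent.get(dest, 0) + 1

    def receive(self, src, payload):
        self.received[src] = self.received.get(src, 0) + 1
        self.delivered.append(payload)


def coordinated_checkpoint(ranks, network):
    """Drain in-flight messages, then snapshot every rank locally."""
    # Step 1: deliver everything still in the network, so that no message
    # exists only "on the wire" where a single-node checkpoint cannot see it.
    while network:
        src, dest, payload = network.popleft()
        ranks[dest].receive(src, payload)
    # Step 2: check that send and receive counts agree on every channel.
    for r in ranks.values():
        for dest, count in r.sent.items():
            assert ranks[dest].received.get(r.rank_id, 0) == count
    # Step 3: each rank takes its own single-node snapshot.
    return {rid: list(r.delivered) for rid, r in ranks.items()}
```

The user application never appears in this code: only the runtime layer (here, `coordinated_checkpoint`) knows that a checkpoint is happening.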


Extension Interface

Callback functions
● Registered at startup (or as needed)
● Run at checkpoint time, then resume at restart/continue
● Handle parallel coordination and/or unsupported objects

Two types of callbacks
● Signal handler context
  • Run with same PID (LinuxThreads); no thread-safety needed
  • But callback limited to calling signal-safe functions (small subset of POSIX)
● Separate thread context
  • Can call any function
  • But code needs to be thread-safe, and runs with a separate PID (LinuxThreads)

Critical sections
● Use to protect uncheckpointable sections of code
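
The callback and critical-section ideas can be modeled in a few lines of Python. This is a conceptual sketch only and does not use the BLCR API: the `Checkpointer` class and its method names are hypothetical. It assumes that a callback has a pre-checkpoint half and a post-checkpoint half (modeled here as the code before and after a generator's `yield`), and that a checkpoint requested inside a critical section is deferred until the section ends.

```python
class Checkpointer:
    """Toy registry of checkpoint callbacks (not the BLCR API)."""

    def __init__(self):
        self._callbacks = []
        self._cs_depth = 0      # nesting depth of critical sections
        self._pending = False   # a request arrived inside a critical section
        self.log = []

    def register(self, callback):
        # `callback` is a generator function: code before its `yield` runs at
        # checkpoint time; the value sent back in is "continue" or "restart".
        self._callbacks.append(callback)

    def enter_cs(self):
        self._cs_depth += 1

    def leave_cs(self):
        self._cs_depth -= 1
        if self._cs_depth == 0 and self._pending:
            self._pending = False
            self.request_checkpoint()

    def request_checkpoint(self, outcome="continue"):
        if self._cs_depth > 0:
            self._pending = True  # defer until the critical section ends
            self.log.append("deferred")
            return False
        running = []
        for cb in self._callbacks:
            gen = cb(self.log)
            next(gen)             # run the pre-checkpoint half
            running.append(gen)
        self.log.append("state saved")
        for gen in running:
            try:
                gen.send(outcome)  # run the post-checkpoint half
            except StopIteration:
                pass
        return True


def network_callback(log):
    """Example callback: quiesce before the checkpoint, recover after it."""
    log.append("quiesce network")
    outcome = yield
    log.append("reconnect" if outcome == "restart" else "resume traffic")
```

Registering `network_callback` and calling `request_checkpoint()` between `enter_cs()` and `leave_cs()` logs a deferral first, and the callback runs only once the critical section is left.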


Current Status

Support LAM/MPI jobs
● Both TCP and Myrinet supported
● Infrastructure in place for InfiniBand, Quadrics
● Process migration: currently must restart whole job

Simple semantics for open files
● Reopen and seek to original position
● Must be regular files (pipe support coming soon)
● Files must exist in same location on filesystem

Single- and multi-threaded processes
● Checkpoint of ‘mpirun’ checkpoints whole MPI job
● Will support process groups, sessions in future
● Restore original PID
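
The "reopen and seek" semantics for open files can be sketched in user-space Python as follows. This only illustrates the idea; BLCR does this work inside the kernel module, and the helper names `save_file_state` and `restore_file` are assumptions made for the example.

```python
import os
import stat
import tempfile


def save_file_state(f):
    """Record what is needed to reopen `f` later: path, mode, offset."""
    if not stat.S_ISREG(os.fstat(f.fileno()).st_mode):
        raise ValueError("only regular files are supported")
    return {"path": os.path.abspath(f.name), "mode": f.mode, "offset": f.tell()}


def restore_file(state):
    """Reopen the file at its original path and seek to the saved offset."""
    mode = state["mode"]
    # Reopening with "w" would truncate the file, so substitute a
    # non-truncating read/write mode.
    if mode.startswith("w"):
        mode = "r+" + mode[1:].replace("+", "")
    f = open(state["path"], mode)  # FileNotFoundError if the file has moved
    f.seek(state["offset"])
    return f
```

The two restrictions on the slide show up directly: a pipe fails the regular-file check at save time, and a file that no longer exists at the same path fails at restore time.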


Current status II

Work with wide variety of 2.4 kernels
● kernel.org versions 2.4.3 onwards
● Red Hat: 7.2 through 9
● SuSE: 7.2 through 9.0
● autoconf feature probing, so support of custom patched kernels likely to be automatic
● We’ll maintain 2.4 support once 2.6 comes out

Support both new and old pthreads
● I.e., old “LinuxThreads”, plus new 2.6 pthreads (backported to 2.4 by Red Hat)


Future Work

Support for sessions & process groups
● Including pipes, mmaps, etc., shared within group
● Full restoration of parent/child tree, with original PIDs

More semantics for files
● Allow checksum of file, with restart error if it has changed
● Allow saving contents of file (restore either clobbers, or opens anonymously)
● Support files that are not open at checkpoint time, but are specified as being part of the checkpoint

Laundry list of other resources to support
● Page 4 of “Design and Implementation” paper
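
The checksum idea listed under "More semantics for files" could look roughly like the Python sketch below: record a digest at checkpoint time and refuse to restart if the file no longer matches. The choice of SHA-256 and the helper names `record_file` and `verify_file` are assumptions for illustration; the slides do not say which checksum or interface is planned.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def record_file(path):
    """Checkpoint-time half: remember the path and its digest."""
    return {"path": path, "sha256": file_digest(path)}


def verify_file(record):
    """Restart-time half: raise if the file changed since the checkpoint."""
    if file_digest(record["path"]) != record["sha256"]:
        raise RuntimeError("file changed since checkpoint: " + record["path"])
```

A restart would call `verify_file` on every recorded file before reopening any of them, so that a mismatch aborts the restart before the process resumes with inconsistent data.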


Future Work II

Integration with parallel job systems
● Funded to work within suite from DOE Scalable Systems Software SciDAC. Work is in progress.
● Possibility of OpenPBS, PBSPro support
● Interested in others (LSF, SGE, SLURM, etc.)

More MPI implementations
● MPICH 2 support anticipated
● Vendor support (Quadrics)?
● LAM/MPI support for partial/live migration


Conclusion

http://ftg.lbl.gov/checkpoint

Papers (available from website):
● “Design and Implementation of BLCR”: high-level system design, including description of user API
● “Requirements for Linux Checkpoint/Restart”: exhaustive list of Unix features we will support (or not)
● “A Survey of Checkpoint/Restart Implementations”: focusing on open source versions that run on Linux
● “The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing”: implementation with LAM/MPI