Linux-CR: Transparent Application Checkpoint-Restart in Linux
Oren Laadan, Columbia University
Linux Kernel Summit, November 2010


Application C/R

◆ Application Checkpoint/Restart

a mechanism to save the state of running application(s) so that they can later resume execution from that point

[Figure: checkpoint saves the original process hierarchy into a checkpoint image; restart rebuilds the restored hierarchy from that image]

What is it good for?

◆ Application roll back to the past
◆ Application suspend and resume
◆ Application migration

Application rollback

◆ Fault tolerance
  ◆ long running applications
  ◆ cloud, HPC, at work, at home
◆ Effective debugging
  ◆ Super-core-dump
    ● more details, multiple tasks
  ◆ re-run from checkpoint
    ● trace, profile, and instrument
◆ Fast application start-up
  ◆ from default/previous state (ccache...)
  ◆ improve desktop boot time
◆ Software testing
  ◆ repeat from specific point(s)
  ◆ distribute on multiple hosts
◆ Generic time-machine
  ◆ revive old server/desktop state
  ◆ retry a move in a game

Application suspend/resume

◆ Improved OOM handling
  ◆ suspend applications, don't kill
  ◆ smart “swap” on embedded
◆ Better system utilization
  ◆ suspend application to reduce load
◆ Suspend/resume a user's session
  ◆ mobile desktop on USB key
  ◆ linux based VPS/VDI systems

Application Migration

◆ Load balancing / resource sharing
  ◆ HPC (e.g. BlueWaters project)
  ◆ cloud environments
  ◆ linux-based VPS/VDI
◆ Zero-downtime maintenance
  ◆ live migration of applications
◆ High availability
  ◆ primary/backup in lock-step
  ◆ frequent incremental checkpoints
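
The "frequent incremental checkpoints" idea can be illustrated with a small toy model. This is only a sketch under stated assumptions: memory is modelled as a dict of page number to bytes, and the function names and delta layout are hypothetical, not the linux-cr image format.

```python
"""Toy model of incremental checkpoints (illustrative only)."""

def full_checkpoint(pages):
    # Save every page: the baseline that later deltas build on.
    return dict(pages)

def incremental_checkpoint(pages, previous):
    # Save only pages added or modified since `previous`,
    # plus the list of pages that no longer exist.
    changed = {n: data for n, data in pages.items() if previous.get(n) != data}
    removed = [n for n in previous if n not in pages]
    return {"changed": changed, "removed": removed}

def restore(base, deltas):
    # Replay the deltas in order on top of the full checkpoint.
    pages = dict(base)
    for delta in deltas:
        pages.update(delta["changed"])
        for n in delta["removed"]:
            pages.pop(n, None)
    return pages

if __name__ == "__main__":
    memory = {0: b"code", 1: b"heap-v1", 2: b"stack-v1"}
    base = full_checkpoint(memory)

    memory[1] = b"heap-v2"            # the application keeps running and dirties a page
    d1 = incremental_checkpoint(memory, base)

    del memory[2]                     # one page goes away, another appears
    memory[3] = b"mmap"
    d2 = incremental_checkpoint(memory, restore(base, [d1]))

    assert restore(base, [d1, d2]) == memory
    print(len(d1["changed"]), "page(s) in first delta")
```

Each delta stays small when few pages change between checkpoints, which is what makes frequent checkpoints to a backup host plausible in this model.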

Application vs Virtual-Machine

              Application C/R          Virtual Machine
granularity   specific applications    operating system as a whole unit
saved state   application state only   entire operating system state
overhead      none                     visible
flexibility   application awareness    operating system is black box
deployment    linux only               same arch family

Some examples

◆ HPC environments
  ◆ can extend linux-cr → distributed-cr
◆ Cloud deployments
  ◆ using linux containers
◆ Light-weight clusters of ARMs
  ◆ combine LXC and linux-cr

Some concrete examples

◆ BlueWaters
  ◆ NCSA's most powerful supercomputer
  ◆ checkpointing based on linux-cr
◆ OpenVZ
  ◆ VPS hosting with migration capabilities
◆ Canonical / Ubuntu
  ◆ add LXC & linux-cr in UEC cluster stack

Who and Who

◆ Who is doing it?
  ◆ Matt Helsley, Dan Smith, Serge Hallyn, Nathan Lynch, Sukadev Bhattiprolu, me
◆ Who is interested?
  ◆ IBM, Canonical, OpenVZ, HPC industry, Kerrighed, Google (?), ...
◆ Who else does/did?
  ◆ OS: AIX, OpenVZ, IRIX, Cray...
  ◆ Systems: Moab, BLCR/Beowulf, Condor...

Linux-C/R design goals

◆ Transparency
  ◆ applications oblivious to operation
  ◆ allow notification of checkpoint or restart
  ◆ allow application awareness
◆ Reliability
  ◆ checkpoint succeeds → restart succeeds
  ◆ report reasons when not checkpoint-able
  ◆ checkpoint is non-intrusive
◆ Security/safety
  ◆ ptrace capabilities to checkpoint
  ◆ reuse kernel code to reconstruct state
◆ Performance
  ◆ zero impact on performance
  ◆ reasonable code footprint
◆ Maintainability
  ◆ see the Maintainability slide
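
The reliability goal of reporting why something cannot be checkpointed could look roughly like the sketch below. It is a toy model, not linux-cr code: the `Task` class, the resource labels, and the `UNSUPPORTED` set are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field

# Resource kinds treated as unsupported in this toy model (hypothetical labels).
UNSUPPORTED = {"ptraced", "proc_file", "inotify"}

@dataclass
class Task:
    pid: int
    resources: list = field(default_factory=list)

def checkpoint_blockers(tasks):
    """Return every (pid, resource) pair that prevents a checkpoint."""
    return [(t.pid, r) for t in tasks for r in t.resources if r in UNSUPPORTED]

def try_checkpoint(tasks):
    blockers = checkpoint_blockers(tasks)
    if blockers:
        # Refuse up front and say why, rather than writing an image
        # that could not be restarted later.
        reasons = [f"pid {pid}: unsupported resource '{r}'" for pid, r in blockers]
        return None, reasons
    image = {t.pid: list(t.resources) for t in tasks}
    return image, []

if __name__ == "__main__":
    tasks = [Task(1, ["pipe"]), Task(2, ["unix_socket", "ptraced"])]
    image, reasons = try_checkpoint(tasks)
    for line in reasons:
        print(line)
```

Collecting all blockers before failing, instead of stopping at the first one, is a design choice of this sketch; it gives the user the complete list in one pass.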

Maintainability

◆ Placement of C/R code
  ◆ generic code in kernel/checkpoint/...
  ◆ most c/r code with or near subsystem code so subsystem maintainers see it
  ◆ c/r is well documented
◆ Extensive test-suite
  ◆ test large list (>120) of scenarios
  ◆ test before/during/after behavior
◆ Positive experience so far (2.6.27 → today)
  ◆ can ignore most kernel changes
  ◆ mainly need to add features
  ◆ minor changes to prior c/r code
    ● e.g. splice/pipe, syscalls #s, mm helpers
◆ Impact on developers
  ◆ understand what may affect c/r code
  ◆ at least notify c/r people when needed
  ◆ awareness will grow with exposure

Design Summary

◆ Save/restore state in-kernel
◆ Checkpoint container/subtree/self
◆ Image holds “user-visible” state
◆ Userspace image conversion
◆ Detailed error reporting

Checkpoint

(1) Freeze process hierarchy

(2) Save global data

(3) Save process hierarchy

(4) Save state of all tasks

(?) Filesystem snapshot

(5) Thaw/kill process hierarchy

in-kernel
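
The numbered checkpoint steps above can be sketched on a toy process tree. This is a userspace illustration only: the `Task` class, the JSON image layout, and the `kill` flag are hypothetical, the optional filesystem snapshot step is left out, and none of this reflects the actual in-kernel implementation.

```python
import json

class Task:
    def __init__(self, pid, state, children=()):
        self.pid, self.state, self.children = pid, state, list(children)
        self.frozen = False

def walk(task):
    # Pre-order traversal: a parent is always visited before its children.
    yield task
    for child in task.children:
        yield from walk(child)

def checkpoint(root, global_data, kill=False):
    tasks = list(walk(root))
    for t in tasks:                       # (1) freeze process hierarchy
        t.frozen = True
    try:
        image = {
            "global": dict(global_data),  # (2) save global data
            "hierarchy": [                # (3) save process hierarchy
                {"pid": t.pid, "children": [c.pid for c in t.children]}
                for t in tasks
            ],
            # (4) save state of all tasks (state must be JSON-serializable here)
            "tasks": {str(t.pid): t.state for t in tasks},
        }
        return json.dumps(image)
    finally:
        if not kill:                      # (5) thaw; kill=True leaves tasks frozen
            for t in tasks:               #     as a stand-in for killing them
                t.frozen = False

if __name__ == "__main__":
    root = Task(1, {"pc": 10}, [Task(2, {"pc": 20}), Task(3, {"pc": 30})])
    print(checkpoint(root, {"arch": "x86-64"}))
```

Freezing first means the state read in steps (2) to (4) cannot change underneath the checkpoint, which is why thawing is deferred to the very end.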

Restart

(1) Create container

(?) Restore (stage) filesystem

(3) Create process hierarchy

(4) Restore state of all tasks

(5) Resume execution

in-kernel
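
The restart steps above can be sketched the same way, as a toy userspace model. The JSON image layout, the `Task` class, and the dict used as a "container" are hypothetical; the optional filesystem staging step is omitted.

```python
import json

class Task:
    def __init__(self, pid):
        self.pid, self.state, self.children, self.running = pid, None, [], False

def restart(image_text):
    image = json.loads(image_text)
    container = {"global": image["global"], "tasks": {}}   # (1) create container
    tasks = container["tasks"]
    for entry in image["hierarchy"]:                        # (3) create process hierarchy:
        tasks[entry["pid"]] = Task(entry["pid"])            #     first all tasks,
    for entry in image["hierarchy"]:                        #     then parent/child links
        tasks[entry["pid"]].children = [tasks[c] for c in entry["children"]]
    for pid, state in image["tasks"].items():               # (4) restore state of all tasks
        tasks[int(pid)].state = state
    for t in tasks.values():                                # (5) resume execution
        t.running = True
    return container

if __name__ == "__main__":
    image = json.dumps({
        "global": {"arch": "x86-64"},
        "hierarchy": [{"pid": 1, "children": [2]}, {"pid": 2, "children": []}],
        "tasks": {"1": {"pc": 10}, "2": {"pc": 20}},
    })
    container = restart(image)
    print(sorted(container["tasks"]))
```

Tasks are only marked running after every task has its state back, mirroring the slide's ordering in which execution resumes as the last step.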

Current State

◆ Supported subsystems:
  ◆ tasks (threads, signals, credentials, etc)
  ◆ namespaces (all but mounts-ns)
  ◆ sysvipc (shm, msg, sem)
  ◆ files, dirs (regular, fifos/pipes, epoll, event, simple devices)
  ◆ sockets (unix, ipv4, ipv6)
  ◆ security (smack, selinux labels)

Current State

◆ What's missing
  ◆ [reviewed] file locks, leases, owner
  ◆ [reviewed] unlinked files/dirs
  ◆ [wip] fanotify/inotify/dnotify
  ◆ [wip] mounts, mount-ns
  ◆ /proc filesystem
  ◆ ptraced tasks
  ◆ more devices

Current State

◆ Supported architectures:
  ◆ x86-32
  ◆ x86-64
  ◆ s390x
  ◆ PowerPC
  ◆ ARM

Current State

◆ Code:
  ◆ ~23K lines in total
    ● ~1200 lines documentation
    ● ~600 lines per arch (x5)
    ● ~8K lines in kernel/checkpoint/* (base)
    ● ~7K lines “near-place” files (*/checkpoint.c)
    ● ~2K lines “in-place” save/restore
    ● ~2K lines for LSM
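
As a quick sanity check, the approximate per-component figures above do add up to the stated total of about 23K lines. The numbers are the rounded values from the slide; the dictionary keys are just labels.

```python
# Rounded line counts as listed on the slide.
components = {
    "documentation": 1200,
    "per-arch code (5 archs x ~600)": 5 * 600,
    "kernel/checkpoint/* (base)": 8000,
    "near-place files (*/checkpoint.c)": 7000,
    "in-place save/restore": 2000,
    "LSM": 2000,
}

total = sum(components.values())
print(total)  # 23200, i.e. roughly 23K
```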

Discussion ...

◆ Concrete path to mainline (mm/next?)
◆ Exposure to subsystem maintainers?
◆ Image format tied to kernel version (userspace conversion tools)
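
One way a userspace conversion tool for a version-tied image format might work is a chain of single-step converters. The sketch below is purely illustrative: the integer version numbers, the field names, and the changes each converter makes are invented for the example and say nothing about the real linux-cr image format.

```python
# Each converter upgrades an image by exactly one (hypothetical) format version.
def v1_to_v2(image):
    out = dict(image, version=2)
    out["creds"] = {"uid": out.pop("uid")}   # pretend v2 groups credentials
    return out

def v2_to_v3(image):
    out = dict(image, version=3)
    out.setdefault("namespaces", [])         # pretend v3 adds a new section
    return out

CONVERTERS = {1: v1_to_v2, 2: v2_to_v3}

def convert(image, target):
    # Apply one-step converters until the image reaches the target version.
    while image["version"] < target:
        step = CONVERTERS.get(image["version"])
        if step is None:
            raise ValueError(f"no converter from version {image['version']}")
        image = step(image)
    if image["version"] != target:
        raise ValueError("downgrade not supported in this sketch")
    return image

if __name__ == "__main__":
    old = {"version": 1, "uid": 1000}
    print(convert(old, 3))
```

Keeping each converter to a single version step means a new format only needs one new function, at the cost of running several steps for very old images.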

Many thanks to those who reviewed, tested, and provided suggestions !

● Web page: http://www.linux-cr.org/
● Git tree(s): git://www.linux-cr.org/git/