Problems of Linux Containers
(with solutions!)
Kir Kolyshkin <[email protected]>
6 June 2015, ContainerDays Boston
openvz.org || criu.org || odin.com
Problem: Effective virtualization
● Virtualization is partitioning
● Historical way: $M mainframes
● Modern way: virtual machines
● Problem: performance overhead
● Partial solution: hardware support (Intel VT, AMD-V)
Solution: isolation
● Run many userspace instances on top of a single (Linux) kernel
● By default, all processes see each other:
  – files, process information, network, shared memory, users, etc.
● Make them unsee it!
One historical way to unsee
chroot()
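The chroot() approach can be sketched in a few lines. This is an illustrative Python demo, not from the talk: a forked child chroot()s into an empty temp directory and verifies it can only see the (empty) jail. The helper name `chroot_demo` is mine, and the chroot() call itself requires root (CAP_SYS_CHROOT), so the sketch degrades gracefully without it.

```python
import os
import tempfile

def chroot_demo():
    """Fork a child that chroot()s into an empty directory.

    Returns "jailed" if the child could only see the empty jail,
    or "needs-root" when run without CAP_SYS_CHROOT.
    """
    jail = tempfile.mkdtemp()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.chroot(jail)
            os.chdir("/")
            # Inside the jail, "/" is the (empty) temp directory.
            os._exit(0 if os.listdir("/") == [] else 1)
        except PermissionError:
            os._exit(2)  # chroot() needs CAP_SYS_CHROOT
    _, status = os.waitpid(pid, 0)
    code = os.waitstatus_to_exitcode(status)
    return {0: "jailed", 2: "needs-root"}.get(code, "unexpected")

print(chroot_demo())
```

Note that chroot() only restricts the filesystem view; processes, network, and users remain fully visible, which is why namespaces were needed.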
Namespaces
● Implemented in the Linux kernel:
  – PID (process tree)
  – net (network devices, addresses, routing, etc.)
  – IPC (shared memory, semaphores, message queues)
  – UTS (hostname, domain name)
  – mnt (filesystem mounts)
  – user (UIDs/GIDs)
● clone() with CLONE_NEW* flags
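Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns; two processes share a namespace exactly when the inode numbers in those links match. A small sketch, assuming a Linux /proc (the `namespaces` helper is an illustrative name, not a real API):

```python
import os

def namespaces(pid="self"):
    """Map namespace type -> identifier (e.g. 'pid:[4026531836]')
    by reading the symlinks in /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

for name, ident in namespaces().items():
    print(f"{name:>12} -> {ident}")
```

A child created with clone() and a CLONE_NEW* flag (or a process calling unshare()) gets a fresh inode for the corresponding entry.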
Problem: Shared resources
● All containers share the same set of resources (CPU, RAM, disk, various in-kernel things ...)
● Need fair distribution of “goods” so everyone gets their share
● Need DoS prevention
● Need prioritization and SLAs
Solution: OpenVZ resource controls
● OpenVZ:
  – user beancounters (control ~20 parameters)
  – hierarchical CPU scheduler
  – per-container disk quota
  – per-container I/O priority and I/O bandwidth limit
● Dynamic control: can “resize” at runtime
Solution 2: VSwap
● Only two primary parameters: RAM and swap
  – others still exist, but are optional
● Swap is virtual; no actual I/O is performed
● Slows the container down to emulate real swap
● Only when an actual global RAM shortage occurs does virtual swap go into real swap
● Currently only available in the OpenVZ kernel
Solution: cgroups + controllers
● Cgroups are a mechanism to control resources for hierarchical groups of processes
● Cgroups are nothing without controllers:
  – blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio
● Cgroups are orthogonal to namespaces
● Still a work in progress: the kmem controller was only recently added
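A process's cgroup memberships are visible in /proc/<pid>/cgroup, one "hierarchy-ID:controllers:path" record per hierarchy (on cgroup v2 there is a single "0::/path" line). A small illustrative parser, assuming a Linux /proc (the helper name `cgroup_memberships` is mine):

```python
def cgroup_memberships(pid="self"):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path)
    tuples; controllers is empty on the cgroup v2 unified hierarchy."""
    rows = []
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            hier_id, controllers, path = line.rstrip("\n").split(":", 2)
            rows.append((hier_id,
                         controllers.split(",") if controllers else [],
                         path))
    return rows

for hier_id, controllers, path in cgroup_memberships():
    print(hier_id, controllers or ["(v2: unified)"], path)
```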
Solution 3: vcmmd
● 4th generation of OpenVZ resource management
● A user-space daemon using kernel controls
● Monitors usage, tweaks limits
● Adds a “time” dimension
● More flexible limits, e.g. burstable
Problem: fast live migration
● We can already live migrate a running OpenVZ container from one server to another without shutting it down
● We want to do it fast even for huge containers
  – huge disk: use shared storage
  – huge RAM: ???
Live migration process (assuming shared storage)
● 1. Freeze the container
● 2. Dump its complete state to a dump file
● 3. Copy the dump file to the destination server
● 4. Undump back to RAM, recreate everything
● 5. Unfreeze
● Problem: a huge dump file takes a long time* to dump, copy, and undump
* seconds
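The five steps can be sketched as a toy model. Everything here is illustrative (the dict "state", the `migrate` function, the hostnames), not the real OpenVZ tooling; the point is only the freeze → dump → copy → undump → unfreeze ordering:

```python
import copy

def migrate(container, dst_host):
    """Toy pipeline: 'state' is a dict, the 'dump file' a deep copy,
    and 'copying to the destination' a variable assignment."""
    container["frozen"] = True                 # 1. freeze
    dump = copy.deepcopy(container["state"])   # 2. dump full state
    received = copy.deepcopy(dump)             # 3. copy to destination
    restored = {"state": received,             # 4. undump / recreate
                "host": dst_host, "frozen": True}
    restored["frozen"] = False                 # 5. unfreeze
    return restored

src = {"state": {"pids": [1, 42], "ram_pages": 4096},
       "host": "A", "frozen": False}
dst = migrate(src, "B")
print(dst["host"], dst["frozen"], dst["state"] == src["state"])
```

Since step 3 copies the entire dump while the container is frozen, the frozen time grows with container size, which motivates the two optimizations that follow.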
Solution 1: network swap
● 1. Dump the minimal memory, lock the rest
● 2. Restore the minimal memory, mark the rest as swapped out
● 3. Set up network swap from the source
● 4. Unfreeze; missing RAM will be “swapped in”
● 5. Migrate the rest of RAM and kill it on the source
Solution 1: network swap
● 1. Dump the minimal memory, lock the rest
● 2. Copy, undump what we have, mark the rest as swapped out
● 3. Set up network swap served from the source
● 4. Unfreeze; missing RAM will be “swapped in”
● 5. Migrate the rest of RAM and kill it on the source
● PROBLEM: no way to roll back
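The lazy "swap in" idea can be modeled in a few lines of Python. `RemotePages` and its methods are illustrative names of mine, with a dict standing in for container RAM: hot pages are copied up front, the rest stay on the source and are fetched on first access, then drained in the background:

```python
class RemotePages:
    """Toy model of network swap during migration."""

    def __init__(self, source_pages, hot):
        self.source = dict(source_pages)  # still lives on the source host
        # steps 1-2: copy only the hot set; the rest is "swapped out"
        self.local = {k: self.source.pop(k) for k in hot}
        self.faults = 0

    def read(self, page_no):
        if page_no not in self.local:     # step 4: "swap in" on access
            self.local[page_no] = self.source.pop(page_no)
            self.faults += 1
        return self.local[page_no]

    def drain(self):                      # step 5: migrate the rest
        self.local.update(self.source)
        self.source.clear()

mem = RemotePages({n: f"data{n}" for n in range(8)}, hot=[0, 1])
print(mem.read(5), mem.faults)   # page 5 fetched over the "network"
mem.drain()
print(len(mem.source))           # nothing left on the source
```

The rollback problem is visible even in the toy: once `drain()` starts killing pages on the source, neither side holds a complete copy if the destination dies.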
Solution 2: Iterative RAM migration
● 1. Ask the kernel to track modified pages
● 2. Copy all memory to the destination system
● 3. Ask the kernel for the list of modified pages
● 4. Copy those pages
● 5. GOTO 3 until satisfied
● 6. Freeze and do the migration as usual, but with a much smaller set of pages
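The pre-copy loop above, as a toy Python sketch. `precopy_migrate` and the `dirty_rounds` input are illustrative; in reality the dirty set comes from the kernel's modified-page tracking, and "satisfied" is a policy decision (here: the dirty set shrank below a threshold):

```python
def precopy_migrate(src_mem, dirty_rounds, threshold=2):
    """Iteratively re-copy dirty pages until the dirty set is small,
    then 'freeze' and copy the final remainder."""
    dst_mem = dict(src_mem)                  # 2. initial full copy
    rounds = 0
    dirty = set()
    for dirty in dirty_rounds:               # 3-5. iterate
        rounds += 1
        if len(dirty) <= threshold:
            break                            # satisfied: dirty set is small
        for page in dirty:
            dst_mem[page] = src_mem[page]    # 4. re-copy dirty pages
    for page in dirty:                       # 6. freeze; copy the remainder
        dst_mem[page] = src_mem[page]
    return dst_mem, rounds

src = {n: f"v{n}" for n in range(10)}
dst, rounds = precopy_migrate(src, dirty_rounds=[{1, 2, 3, 4}, {2, 3, 1}, {3}])
print(dst == src, rounds)   # True 3
```

The frozen time now depends only on the final dirty set, not on total RAM, which is exactly the goal.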
Problem: upstreaming
● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop first, then merge upstream (i.e. into the vanilla Linux kernel)
● Problem?
Problem: upstreaming
● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop first, then merge upstream (i.e. into the vanilla Linux kernel)
● Problem: grizzly bears (er, upstream developers) do not accept massive patchsets appearing out of nowhere
Solution 1: rewrite from scratch
● User Beancounters -> cgroups + controllers
● PID namespace: 2 rewrites until accepted
● Network namespace: rewritten
● It works!
● 1500+ patches ended up in the vanilla kernel
● OpenVZ made it into the top 10 kernel contributors
Solution 2: circumvent the system!
● We tried hard to merge checkpoint/restore
● Other people tried hard too; no luck
● Can't make it into the kernel? Let's riot (er, implement it in userspace)!
● With minimal kernel intervention where required
● The kernel already exports most of the information, so let's just add the missing bits and pieces
CRIU
● Checkpoint/Restore [mostly] In Userspace
● About 3 years old; tools at version 1.6
● Users: Google, Samsung, Huawei, ...
● LXC & Docker: integrated!
● Already in the upstream 3.x kernel (CONFIG_CHECKPOINT_RESTORE)
● Live migration: P.Haul, http://criu.org/P.Haul
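For reference, a typical criu invocation looks like the following. The flags are real criu options (-t/--tree, -D/--images-dir, --shell-job, --leave-stopped), while the Python wrapper functions are just an illustrative convenience of mine:

```python
import shutil

def criu_dump_cmd(pid, images_dir):
    """Build a `criu dump` command line for the process tree rooted
    at pid; --leave-stopped keeps tasks frozen, useful for migration."""
    return ["criu", "dump", "-t", str(pid), "-D", images_dir,
            "--shell-job", "--leave-stopped"]

def criu_restore_cmd(images_dir):
    """Build the matching `criu restore` command line."""
    return ["criu", "restore", "-D", images_dir, "--shell-job"]

print(" ".join(criu_dump_cmd(1234, "/tmp/ct-images")))
# Actually running these requires root and the criu tool installed:
print("criu available:", bool(shutil.which("criu")))
```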
CRIU Linux kernel patches, per version
Total: 176 (+11 this year)
Problem: common file system
● A container is just a directory on the host that we chroot() into
● File system journal (metadata updates) is a bottleneck
● Lots of small-file I/O on container backup/migration (sometimes rsync hangs or OOMs!)
● No sub-tree disk quota support upstream
● No sub-tree snapshots
● Live migration: rsync of changed inodes
● File system type and properties are fixed, the same for all containers
Solution 1: LVM
● Works only on top of a block device
● Hard to manage (e.g. how to migrate a huge volume?)
● No thin provisioning
Solution 2: loop device (filesystem within a file)
● VFS operations lead to double page caching
  – (already fixed in recent kernels)
● No thin provisioning
● Limited feature set
Solution 3: ZFS + zvol
● PRO: features
  – zvol, thin provisioning, dedup, zfs send/receive
● CONTRA:
  – licensing is problematic
  – Linux port issues (people report cache OOMs)
  – was not available in 2008
Solution 4: ploop
● Basic idea: same as the loop block device, just better
● Modular design:
  – various image formats (qcow2 in progress)
  – various I/O backends (ext4, vfs O_DIRECT, nfs)
● Feature-rich:
  – online resize (grow and shrink, ballooning)
  – instant live snapshots
  – write tracker to facilitate faster live migration
Any problems (er, questions)?
● [email protected]
● Twitter: @kolyshkin @_openvz_ @__criu__