Problems of Linux Containers
(with solutions!)
Kir Kolyshkin <[email protected]>
6 June 2015, ContainerDays Boston
openvz.org || criu.org || odin.com
Problem: Effective virtualization
● Virtualization is partitioning
● Historical way: $M mainframes
● Modern way: virtual machines
● Problem: performance overhead
● Partial solution: hardware support (Intel VT, AMD-V)
Solution: isolation
● Run many userspace instances on top of a single (Linux) kernel
● By default, all processes see each other:
  – files, process information, network, shared memory, users, etc.
● Make them unsee it!
One historical way to unsee
chroot()
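The chroot() approach can be sketched in a few lines. This is an illustrative Python demo, not from the talk: a forked child chroot()s into an empty temp directory and verifies it can only see the (empty) jail. The helper name `chroot_demo` is mine, and the chroot() call itself requires root (CAP_SYS_CHROOT), so the sketch degrades gracefully without it.

```python
import os
import tempfile

def chroot_demo():
    """Fork a child that chroot()s into an empty directory.

    Returns "jailed" if the child could only see the empty jail,
    or "needs-root" when run without CAP_SYS_CHROOT.
    """
    jail = tempfile.mkdtemp()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.chroot(jail)
            os.chdir("/")
            # Inside the jail, "/" is the (empty) temp directory.
            os._exit(0 if os.listdir("/") == [] else 1)
        except PermissionError:
            os._exit(2)  # chroot() needs CAP_SYS_CHROOT
    _, status = os.waitpid(pid, 0)
    code = os.waitstatus_to_exitcode(status)
    return {0: "jailed", 2: "needs-root"}.get(code, "unexpected")

print(chroot_demo())
```

Note that chroot() only restricts the filesystem view; processes, network, and users remain fully visible, which is why namespaces were needed.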
Namespaces
● Implemented in the Linux kernel:
  – PID (process tree)
  – net (network devices, addresses, routing, etc.)
  – IPC (shared memory, semaphores, message queues)
  – UTS (hostname, domain name)
  – mnt (filesystem mounts)
  – user (UIDs/GIDs)
● clone() with CLONE_NEW* flags
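Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns; two processes share a namespace exactly when the inode numbers in those links match. A small sketch, assuming a Linux /proc (the `namespaces` helper is an illustrative name, not a real API):

```python
import os

def namespaces(pid="self"):
    """Map namespace type -> identifier (e.g. 'pid:[4026531836]')
    by reading the symlinks in /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

for name, ident in namespaces().items():
    print(f"{name:>12} -> {ident}")
```

A child created with clone() and a CLONE_NEW* flag (or a process calling unshare()) gets a fresh inode for the corresponding entry.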
Problem: Shared resources
● All containers share the same set of resources (CPU, RAM, disk, various in-kernel things ...)
● Need fair distribution of “goods” so everyone gets their share
● Need DoS prevention
● Need prioritization and SLAs
Solution: OpenVZ resource controls
● OpenVZ:
  – user beancounters (control ~20 parameters)
  – hierarchical CPU scheduler
  – per-container disk quota
  – per-container I/O priority and I/O bandwidth limit
● Dynamic control: can “resize” at runtime
Solution 2: VSwap
● Only two primary parameters: RAM and swap
  – others still exist, but are optional
● Swap is virtual; no actual I/O is performed
● Slows the container down to emulate real swap
● Only when an actual global RAM shortage occurs does virtual swap go into real swap
● Currently only available in the OpenVZ kernel
Solution: cgroups + controllers
● Cgroups are a mechanism to control resources for hierarchical groups of processes
● Cgroups are nothing without controllers:
  – blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio
● Cgroups are orthogonal to namespaces
● Still a work in progress: the kmem controller was only recently added
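A process's cgroup memberships are visible in /proc/<pid>/cgroup, one "hierarchy-ID:controllers:path" record per hierarchy (on cgroup v2 there is a single "0::/path" line). A small illustrative parser, assuming a Linux /proc (the helper name `cgroup_memberships` is mine):

```python
def cgroup_memberships(pid="self"):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path)
    tuples; controllers is empty on the cgroup v2 unified hierarchy."""
    rows = []
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            hier_id, controllers, path = line.rstrip("\n").split(":", 2)
            rows.append((hier_id,
                         controllers.split(",") if controllers else [],
                         path))
    return rows

for hier_id, controllers, path in cgroup_memberships():
    print(hier_id, controllers or ["(v2: unified)"], path)
```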
Solution 3: vcmmd
● 4th generation of OpenVZ resource management
● A user-space daemon using kernel controls
● Monitors usage, tweaks limits
● Adds a “time” dimension
● More flexible limits, e.g. burstable
Problem: fast live migration
● We can already live migrate a running OpenVZ container from one server to another without shutting it down
● We want to do it fast even for huge containers
  – huge disk: use shared storage
  – huge RAM: ???
Live migration process (assuming shared storage)
● 1. Freeze the container
● 2. Dump its complete state to a dump file
● 3. Copy the dump file to the destination server
● 4. Undump back to RAM, recreate everything
● 5. Unfreeze
● Problem: a huge dump file takes a long time* to dump, copy, and undump
* seconds
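The five steps can be sketched as a toy model. Everything here is illustrative (the dict "state", the `migrate` function, the hostnames), not the real OpenVZ tooling; the point is only the freeze → dump → copy → undump → unfreeze ordering:

```python
import copy

def migrate(container, dst_host):
    """Toy pipeline: 'state' is a dict, the 'dump file' a deep copy,
    and 'copying to the destination' a variable assignment."""
    container["frozen"] = True                 # 1. freeze
    dump = copy.deepcopy(container["state"])   # 2. dump full state
    received = copy.deepcopy(dump)             # 3. copy to destination
    restored = {"state": received,             # 4. undump / recreate
                "host": dst_host, "frozen": True}
    restored["frozen"] = False                 # 5. unfreeze
    return restored

src = {"state": {"pids": [1, 42], "ram_pages": 4096},
       "host": "A", "frozen": False}
dst = migrate(src, "B")
print(dst["host"], dst["frozen"], dst["state"] == src["state"])
```

Since step 3 copies the entire dump while the container is frozen, the frozen time grows with container size, which motivates the two optimizations that follow.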
Solution 1: network swap
● 1. Dump the minimal memory, lock the rest
● 2. Restore the minimal memory, mark the rest as swapped out
● 3. Set up network swap from the source
● 4. Unfreeze; missing RAM will be “swapped in”
● 5. Migrate the rest of RAM and kill it on the source
Solution 1: network swap
● 1. Dump the minimal memory, lock the rest
● 2. Copy, undump what we have, mark the rest as swapped out
● 3. Set up network swap served from the source
● 4. Unfreeze; missing RAM will be “swapped in”
● 5. Migrate the rest of RAM and kill it on the source
● PROBLEM: no way to roll back
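The lazy "swap in" idea can be modeled in a few lines of Python. `RemotePages` and its methods are illustrative names of mine, with a dict standing in for container RAM: hot pages are copied up front, the rest stay on the source and are fetched on first access, then drained in the background:

```python
class RemotePages:
    """Toy model of network swap during migration."""

    def __init__(self, source_pages, hot):
        self.source = dict(source_pages)  # still lives on the source host
        # steps 1-2: copy only the hot set; the rest is "swapped out"
        self.local = {k: self.source.pop(k) for k in hot}
        self.faults = 0

    def read(self, page_no):
        if page_no not in self.local:     # step 4: "swap in" on access
            self.local[page_no] = self.source.pop(page_no)
            self.faults += 1
        return self.local[page_no]

    def drain(self):                      # step 5: migrate the rest
        self.local.update(self.source)
        self.source.clear()

mem = RemotePages({n: f"data{n}" for n in range(8)}, hot=[0, 1])
print(mem.read(5), mem.faults)   # page 5 fetched over the "network"
mem.drain()
print(len(mem.source))           # nothing left on the source
```

The rollback problem is visible even in the toy: once `drain()` starts killing pages on the source, neither side holds a complete copy if the destination dies.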
Solution 2: Iterative RAM migration
● 1. Ask the kernel to track modified pages
● 2. Copy all memory to the destination system
● 3. Ask the kernel for the list of modified pages
● 4. Copy those pages
● 5. GOTO 3 until satisfied
● 6. Freeze and do the migration as usual, but with a much smaller set of pages
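The pre-copy loop above, as a toy Python sketch. `precopy_migrate` and the `dirty_rounds` input are illustrative; in reality the dirty set comes from the kernel's modified-page tracking, and "satisfied" is a policy decision (here: the dirty set shrank below a threshold):

```python
def precopy_migrate(src_mem, dirty_rounds, threshold=2):
    """Iteratively re-copy dirty pages until the dirty set is small,
    then 'freeze' and copy the final remainder."""
    dst_mem = dict(src_mem)                  # 2. initial full copy
    rounds = 0
    dirty = set()
    for dirty in dirty_rounds:               # 3-5. iterate
        rounds += 1
        if len(dirty) <= threshold:
            break                            # satisfied: dirty set is small
        for page in dirty:
            dst_mem[page] = src_mem[page]    # 4. re-copy dirty pages
    for page in dirty:                       # 6. freeze; copy the remainder
        dst_mem[page] = src_mem[page]
    return dst_mem, rounds

src = {n: f"v{n}" for n in range(10)}
dst, rounds = precopy_migrate(src, dirty_rounds=[{1, 2, 3, 4}, {2, 3, 1}, {3}])
print(dst == src, rounds)   # True 3
```

The frozen time now depends only on the final dirty set, not on total RAM, which is exactly the goal.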
Problem: upstreaming
● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop first, then merge upstream (i.e. into the vanilla Linux kernel)
● Problem?
Problem: upstreaming
● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop first, then merge upstream (i.e. into the vanilla Linux kernel)
● Problem: grizzly bears (er, upstream developers) do not accept massive patchsets appearing out of nowhere
Solution 1: rewrite from scratch
● User Beancounters -> cgroups + controllers
● PID namespace: 2 rewrites until accepted
● Network namespace: rewritten
● It works!
● 1500+ patches ended up in the vanilla kernel
● OpenVZ made it into the top 10 kernel contributors
Solution 2: circumvent the system!
● We tried hard to merge checkpoint/restore
● Other people tried hard too; no luck
● Can't make it into the kernel? Let's riot (er, implement it in userspace)!
● With minimal kernel intervention where required
● The kernel already exports most of the information, so let's just add the missing bits and pieces
CRIU
● Checkpoint/Restore [mostly] In Userspace
● About 3 years old; tools at version 1.6
● Users: Google, Samsung, Huawei, ...
● LXC & Docker: integrated!
● Already in the upstream 3.x kernel (CONFIG_CHECKPOINT_RESTORE)
● Live migration: P.Haul, http://criu.org/P.Haul
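For reference, a typical criu invocation looks like the following. The flags are real criu options (-t/--tree, -D/--images-dir, --shell-job, --leave-stopped), while the Python wrapper functions are just an illustrative convenience of mine:

```python
import shutil

def criu_dump_cmd(pid, images_dir):
    """Build a `criu dump` command line for the process tree rooted
    at pid; --leave-stopped keeps tasks frozen, useful for migration."""
    return ["criu", "dump", "-t", str(pid), "-D", images_dir,
            "--shell-job", "--leave-stopped"]

def criu_restore_cmd(images_dir):
    """Build the matching `criu restore` command line."""
    return ["criu", "restore", "-D", images_dir, "--shell-job"]

print(" ".join(criu_dump_cmd(1234, "/tmp/ct-images")))
# Actually running these requires root and the criu tool installed:
print("criu available:", bool(shutil.which("criu")))
```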
CRIU Linux kernel patches, per version
Total: 176 (+11 this year)
Problem: common file system
● A container is just a directory on the host that we chroot() into
● File system journal (metadata updates) is a bottleneck
● Lots of small-file I/O on container backup/migration (sometimes rsync hangs or OOMs!)
● No sub-tree disk quota support upstream
● No sub-tree snapshots
● Live migration: rsync of changed inodes
● File system type and properties are fixed, the same for all containers
Solution 1: LVM
● Works only on top of a block device
● Hard to manage (e.g. how to migrate a huge volume?)
● No thin provisioning
Solution 2: loop device (filesystem within a file)
● VFS operations lead to double page caching
  – (already fixed in recent kernels)
● No thin provisioning
● Limited feature set
Solution 3: ZFS + zvol
● PRO: features
  – zvol, thin provisioning, dedup, zfs send/receive
● CONTRA:
  – licensing is problematic
  – Linux port issues (people report cache OOMs)
  – was not available in 2008
Solution 4: ploop
● Basic idea: same as the loop block device, just better
● Modular design:
  – various image formats (qcow2 in progress)
  – various I/O backends (ext4, vfs O_DIRECT, nfs)
● Feature-rich:
  – online resize (grow and shrink, ballooning)
  – instant live snapshots
  – write tracker to facilitate faster live migration
Any problems (er, questions)?
● [email protected]
● Twitter: @kolyshkin @_openvz_ @__criu__