Lightweight Virtualization in Linux


In the Name of Allah, the Most Gracious, the Most Merciful

Sharif University of Technology
Data and Network Security Lab.

Lightweight Virtualization in Linux

Sadegh Dorri N.
PhD Candidate

Data and Network Security Lab. Seminar, 4 Aban 1393

The Need for Virtualization

Hypervisors are the living proof of operating system's incompetence!

Scheduling a Multi-process “application”
- Nice values, priorities, etc. are hard to manage dynamically

Kernel Memory Management
- Fork bombs
- $ while true; do mkdir x; cd x; done

Abuse should be the application's problem, rather than being everyone's!

The failure of operating systems and how we can fix it: http://lwn.net/Articles/524952/

Agenda

Motivation
- Virtualization architectures
- OS-level virtualization in Linux

A demo

Under the hood
- LXC components
- Related kernel features: cgroups and namespaces

Security considerations

Conclusion

Various Virtualization Architectures

Hardware Virtualization
- VMware, Parallels, QEMU, Bochs, Xen, KVM
- Resources cannot be shared between VMs.

OS-Level Virtualization
- Linux Containers (LXC), Linux-VServer, OpenVZ, Parallels Virtuozzo Containers
- FreeBSD jails
- Solaris Containers/Zones
- IBM AIX6 WPARs (Workload Partitions)

OS-Level Virtualization in Linux

Linux Containers
- Allow a kernel to support more resource-isolation use cases
- Without the overhead and complexity of running multiple kernel and driver instances

Benefits
- Isolation
- Small footprint
- Speed

3) Speed

2) Footprint
- On a typical physical server, with average compute resources, you can easily run:
  ● 10-100 virtual machines
  ● 100-1000 containers
- On disk, containers can be very light: a few MB, even without fancy storage.

1) Isolation

Each container has:
- Its own network interface (and IP address)
  ● can be bridged, routed... just like VMs
- Its own filesystem
  ● Debian host can run Fedora container (& vice-versa)
- Isolation (security)
  ● container A & B can't harm (or even see) each other
- Isolation (resource usage)
  ● soft & hard quotas for RAM, CPU, I/O...
- Possibility of process checkpoint/freeze and migration
  ● Isolation prevents resource name conflicts

Use-Cases: Developers

Continuous Integration
- After each commit, run 100 tests in 100 environments

Continuous Packaging
- Example: Project Builder

Escape dependency hell
- Build (and/or run) in a controlled environment

Put everything in a container
- Even the tiny things

Use-Cases: Hosting Providers

Cheap → Cheaper Hosting (VPS providers)

Give away more free stuff
- "Pay for your production, get your staging for free!"
- Spin up/down on demand, in seconds
- Example: dotCloud

“Google has built their entire datacenter infrastructure around Linux containers, launching more than 2 billion containers per week.”

(Kubernetes: open source Google cloud platform)

Use-Cases: Everyone

Look inside your VMs
- You can see (and kill) individual processes
- You can browse (and change) the filesystem

Do (almost) whatever you did with VMs
- ... but faster

Migration
- Checkpoint then unfreeze: experimental (CRIU)

Solutions in Linux

OpenVZ

Modified Linux kernel
- Also works with an unpatched Linux 3.x kernel (reduced feature set)

Each container is a separate entity with its own:
- Files: system libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
- Users and groups: its own root user, as well as other users and groups
- Process tree: only sees its own processes (incl. init)
- Network: virtual network device with its own IP addresses, iptables, and routing rules
- Devices: can be granted access to real devices
- IPC objects: shared memory, semaphores, messages

LXC (LinuX Containers)

Container:
- Provides an environment like a standard Linux installation, but without the need for a separate kernel
- Single kernel and drivers, multiple different user spaces

A group of processes in Linux in an isolated environment
- From inside: looks like a VM
- From outside: looks like normal processes
- Conceptually somewhere between a chroot on steroids and a full-fledged VM

LXC vs. OpenVZ
- OpenVZ: production-ready and stable; pushing its changes upstream
- LXC: a work in progress; uses standard kernel features

LXC Lifecycle

lxc-create
- Set up a container (root filesystem and config)

lxc-start
- Boot the container (by default, you get a console)

lxc-console
- Attach a console (if you started the container in the background)

lxc-stop
- Shut down the container

lxc-destroy
- Destroy the filesystem created with lxc-create

See also: LXC Web Panel - http://lxc-webpanel.github.io/
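
A minimal example session with these tools might look as follows (a sketch: the container name is arbitrary and the template argument depends on what your distribution ships, e.g. debian, ubuntu or download):

$ sudo lxc-create -n demo -t debian    # build a root filesystem and config from a template
$ sudo lxc-ls                          # list containers known to LXC
$ sudo lxc-start -n demo -d            # boot it in the background
$ sudo lxc-console -n demo             # attach a console (Ctrl-a q to detach)
$ sudo lxc-stop -n demo                # shut it down
$ sudo lxc-destroy -n demo             # remove its root filesystem and config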

Demo...

Under the Hood

LXC Components

Components:
- The liblxc library
- Several language bindings for the API:
  ● Python, Lua, Go, Ruby, Haskell
- A set of standard tools to control the containers
- Container templates

Open source!
https://linuxcontainers.org/

Features Making up LXC

Kernel features used in LXC:
- Isolation:
  ● Kernel namespaces (ipc, uts, mount, pid, network and user)
  ● Chroots (using pivot_root)
- Resource management:
  ● Control groups (cgroups)
- Security:
  ● AppArmor and SELinux profiles
  ● Seccomp policies
  ● Kernel capabilities

Pivot_root and Chroot

Change the root directory to a new path
- pivot_root: switches the complete system and removes all dependencies on the old root directory
- chroot: applies only to a single process
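
A quick illustration of the difference, assuming a minimal root filesystem has been unpacked under /srv/rootfs (a hypothetical path):

$ sudo chroot /srv/rootfs /bin/sh    # only this process (and its children) see the new root
# The old root is still mounted, and a privileged process can escape a plain chroot.
# pivot_root, run inside the container's mount namespace, swaps the root of that whole
# namespace and lets the old root be unmounted, removing the dependency entirely.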

Seccomp

seccomp (SECure COMPuting mode)
- A simple sandboxing mechanism (Linux 2.6.12+, 2005)
- Allows a process to make a one-way transition into a "secure" state
  ● Syscalls limited to exit(), sigreturn(), read() and write() on already-open file descriptors
- Any attempt to make other system calls results in SIGKILL

seccomp-bpf
- An extension to seccomp that allows filtering of system calls using a configurable policy
- Used by OpenSSH and vsftpd, as well as by Google Chrome/Chromium on Chrome OS and Linux to sandbox Flash Player and renderers

Capabilities

In traditional UNIX, processes are either:
- Privileged (EUID is 0): bypass all kernel permission checks
- Unprivileged: subject to full permission checking (EUID, EGID, and supplementary group list)

Since Linux kernel 2.2:
- The superuser privileges are divided into distinct units (a.k.a. capabilities)
- Capabilities can be independently enabled and disabled (per thread)

Examples:
- CAP_CHOWN: make arbitrary changes to file UIDs and GIDs
- CAP_KILL: bypass permission checks for sending signals
- CAP_NET_ADMIN: perform various network-related operations
- CAP_SYS_ADMIN: a catch-all for a large number of administrative operations
- CAP_SYS_BOOT: use reboot() and kexec_load()
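
A quick way to see capabilities from a shell (a sketch: capsh ships with libcap, and the ping example assumes ping relies on CAP_NET_RAW rather than on setuid root or unprivileged ICMP sockets):

$ capsh --print                                              # show the capability sets of the current shell
$ sudo setcap cap_net_raw+ep /bin/ping                       # grant one capability to a binary instead of making it setuid-root
$ sudo capsh --drop=cap_net_raw -- -c 'ping -c 1 127.0.0.1'  # with the capability removed from the bounding set, ping fails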

Linux Security Modules (LSM)

A Linux kernel framework to support different security models
- Avoids favoritism toward any single implementation
- Examples: AppArmor, SELinux, Smack and TOMOYO Linux

Used to implement different MACs (Mandatory Access Control models)

Control Groups

Introduction to CGroups

Cgroups (control groups):
- Allocate resources (CPU, memory, network, or their combinations) among user-defined groups of tasks (processes)
- Think ulimit, but for groups of processes ... and with fine-grained accounting
- Initiated at Google (2006)
- Available in the Fedora 18 and Ubuntu 12.10 kernels (and some previous releases)

Commands:
- cgcreate: creates a new cgroup
- cgset: sets parameters for given cgroup(s)
- cgexec: runs a task in the specified control groups (see the sketch below)
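
A hedged sketch of those commands in use (the cgroup name and limits are arbitrary; the cg* tools come from the libcgroup / cgroup-tools package on most distributions):

$ sudo cgcreate -g memory,cpu:demo                  # create the "demo" cgroup in the memory and cpu hierarchies
$ sudo cgset -r memory.limit_in_bytes=100M demo     # cap its memory
$ sudo cgset -r cpu.shares=512 demo                 # halve its relative CPU weight (default 1024)
$ sudo cgexec -g memory,cpu:demo ./my_workload      # run a task inside it (hypothetical binary)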

CGroups: Implementation

Implemented as a special cgroup file system
- libcgroup is a library that abstracts the control group file system in Linux
- CGroup services: allow persistence across reboots and ease of use

A few simple hooks inserted into the kernel (not performance-critical):
- In the boot phase, in process creation and destruction, in task_struct

procfs entries:
- For each process: /proc/<pid>/cgroup
- System-wide: /proc/cgroups
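
These procfs entries can be inspected directly (the output below is only illustrative; hierarchy IDs and paths vary per system):

$ cat /proc/self/cgroup      # hierarchy-ID:subsystems:cgroup-path for the current process
4:memory:/demo
2:cpu,cpuacct:/demo
$ cat /proc/cgroups          # subsystems, hierarchy IDs, number of cgroups, enabled flag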

CGroup Subsystems

cpu
- controls the CPU scheduler

cpuacct
- generates automatic reports on CPU resources

cpuset
- assigns individual CPUs (cores) and memory nodes

memory
- limits memory use + generates automatic reports on memory resources

freezer
- suspends or resumes tasks in a cgroup

CGroups Subsystems (cont'd)

blkio
- limits on block device I/O (disk, solid state, USB, etc.)

devices
- allows/denies access to devices

net_cls
- differentiates between packets of different cgroups

net_prio
- dynamically sets the priority of network traffic per network interface

Cgroups: Basics

Everything exposed through a virtual filesystem
- /cgroup, /sys/fs/cgroup... YourMountpointMayVary

Create a cgroup:
- mkdir /cgroup/aloha
- Automatically creates these files: tasks, cgroup.procs, etc.

Move process with PID 1234 to the cgroup:
- echo 1234 > /cgroup/aloha/tasks

Limit memory usage:
- echo 10000000 > /cgroup/aloha/memory.limit_in_bytes
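
A short follow-up to verify the settings (a sketch assuming the memory hierarchy is mounted at /sys/fs/cgroup/memory):

$ echo $$ > /sys/fs/cgroup/memory/aloha/tasks               # move the current shell into the cgroup
$ cat /sys/fs/cgroup/memory/aloha/memory.limit_in_bytes     # note: the kernel rounds the limit to a page multiple
$ cat /sys/fs/cgroup/memory/aloha/memory.usage_in_bytes     # memory currently charged to the group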

CPUset Subsystem

Each subsystem adds specific control files for its own needs
- Prefixed by its name

cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_pressure_enabled
cpuset.memory_spread_page
cpuset.memory_spread_slab

CGroup: CPU (and Friends)

Limiting
- Set cpu.shares (defines relative weights)

Accounting
- Check cpuacct.stat (and cpuacct.usage) for the user/system breakdown

Isolate
- Use cpuset.cpus (also cpuset.mems on NUMA systems)

You can't really throttle a group of processes
- But that's OK: context-switching cost << 1/HZ
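
A sketch of relative weights with two groups (cgroup v1 paths; the PIDs are placeholders):

$ mkdir /sys/fs/cgroup/cpu/web /sys/fs/cgroup/cpu/batch
$ echo 2048 > /sys/fs/cgroup/cpu/web/cpu.shares      # twice the default weight of 1024
$ echo 512  > /sys/fs/cgroup/cpu/batch/cpu.shares    # half the default weight
$ echo $WEB_PID   > /sys/fs/cgroup/cpu/web/tasks
$ echo $BATCH_PID > /sys/fs/cgroup/cpu/batch/tasks
# under CPU contention, "web" now gets roughly 4x the CPU time of "batch"; when the CPU is idle, neither is throttled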

CGroup: Memory

Up to 25 control files

Limiting
- memory usage, swap usage
- soft limits and hard limits
- can be nested

Accounting
- cache vs. rss
- active vs. inactive
- file-backed pages vs. anonymous pages
- page-in / page-out

Isolation
- Reserve memory thanks to hard limits
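
The corresponding cgroup v1 control files, as a sketch (the memsw limit requires swap accounting to be enabled at boot; values are arbitrary):

$ echo 512M > /sys/fs/cgroup/memory/aloha/memory.limit_in_bytes         # hard limit on RAM
$ echo 256M > /sys/fs/cgroup/memory/aloha/memory.soft_limit_in_bytes    # soft limit, enforced under global memory pressure
$ echo 768M > /sys/fs/cgroup/memory/aloha/memory.memsw.limit_in_bytes   # RAM + swap
$ grep -E '^(cache|rss|pgpgin|pgpgout)' /sys/fs/cgroup/memory/aloha/memory.stat   # per-group accounting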

CGroup: Block I/O

Limiting & Isolation
- blkio.throttle.{read,write}_{bps,iops}_device
- Drawback: only effective for sync I/O (i.e. "classical" reads; not buffered writes; not mapped files)

Accounting
- Number of I/Os, bytes, service time...
- Drawback: same as above

CGroups aren't perfect for limiting I/O
- Limiting the amount of dirty memory helps a bit
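
A throttling sketch (devices are addressed by major:minor numbers; 8:0 is typically the first SATA/SCSI disk, and direct I/O is used so the limit is actually visible):

$ mkdir /sys/fs/cgroup/blkio/slowio
$ echo "8:0 1048576" > /sys/fs/cgroup/blkio/slowio/blkio.throttle.read_bps_device   # ~1 MB/s reads from sda
$ echo $$ > /sys/fs/cgroup/blkio/slowio/tasks
$ dd if=/dev/sda of=/dev/null bs=1M count=32 iflag=direct    # now crawls at the configured rate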

Namespaces

Linux Namespaces

Namespaces: lightweight process virtualization
- Isolation: enable a process (or several processes) to have a different view of the system than other processes
- Idea dates back to 1992 (Plan 9)

Introduced in Linux 2.4.19 (2002)
- The user namespace was the last one added: a number of Linux filesystems are not user-namespace aware yet!

User-space modification
- No modification is needed (in general)
- Some utilities are made namespace-aware (iproute, util-linux)

Different Kinds of Namespaces

There are currently 6 namespaces in Linux:
- pid (processes)
- net (network interfaces, routing...)
- ipc (System V IPC)
- mnt (mount points, filesystems)
- uts (hostname)
- user (UIDs)

4 other namespaces are not implemented (yet):
- security, security keys, device, and time

All require CAP_SYS_ADMIN
- Except user namespaces (not privileged)
- All the other ones can be created in conjunction with a new user namespace

Namespaces Implementation

Namespaces do not have names
- Each namespace has a unique inode number (Linux 3.8+)
- The inode number is assigned when the namespace is created
- There is an initial, default instance of each namespace type

ls -al /proc/<pid>/ns
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]

Trivial Namespaces

UTS (hostname)
- gethostname(), sethostname()
- struct system_utsname per container

Sys V IPC: shmem, semaphores, msg queues
- Keys must be mutually agreed upon by both client and server processes
- ipc namespace: the uniqueness context of keys
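
A tiny demonstration of the UTS namespace using unshare (root is needed unless a user namespace is created alongside):

$ sudo unshare --uts bash
$ hostname container1       # only changes the hostname inside the new UTS namespace
$ hostname
container1
$ exit
$ hostname                  # the host's hostname is unchanged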

Namespaces: pid

Usually a PID is an arbitrary number

Special cases:
- init (i.e. the child reaper) has a PID of 1
- A process can't change its PID (a problem for process migration)

Namespaces: pid (cont'd)

PID is no longer unique in the kernel
- A process has (can have) a different PID in each ns
- /proc/$PID/* is virtualized

PID namespaces are nested
- Processes in a PID ns can't see/affect processes of the parent ns
- But all PIDs in the ns are visible to the parent ns
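
A sketch with unshare (--fork makes the new shell PID 1 in the new namespace; --mount-proc remounts /proc so ps reflects it):

$ sudo unshare --pid --fork --mount-proc bash
$ echo $$
1
$ ps aux     # only the new bash (PID 1) and ps itself are visible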

PID 1

Each PID namespace has a PID #1
- Its first process

It behaves like the “init” process:
- When a process dies, all its orphaned children are reparented to the process with PID 1
- SIGKILL sent from within the namespace does not kill process 1 (though SIGKILL/SIGSTOP sent from an ancestor namespace are still forcibly delivered)

An important feature for containers

Namespaces: net

Logically another copy of the network stack, with its own separate...
- Network interfaces (and its own lo / 127.0.0.1)
- IP address(es) and sockets
- Routing table(s), iptables rules

Communication between containers:
- UNIX domain sockets (= on the filesystem)
- By creating a pair of virtual network devices (veth) and moving one end into the other namespace (like a pipe); see the sketch below
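
A sketch of the veth approach with the ip tool (run as root; the namespace name and addresses are arbitrary):

$ ip netns add blue                              # a named network namespace
$ ip link add veth0 type veth peer name veth1    # a connected pair of virtual interfaces
$ ip link set veth1 netns blue                   # push one end into the namespace
$ ip addr add 10.0.0.1/24 dev veth0 && ip link set veth0 up
$ ip netns exec blue ip addr add 10.0.0.2/24 dev veth1
$ ip netns exec blue ip link set veth1 up
$ ping -c 1 10.0.0.2                             # the two stacks now talk over the veth "pipe"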

Namespaces: mnt

In a new mount namespace:
- All previous mounts will be visible
- But mounts/unmounts in that mount namespace are invisible to the rest of the system
- Mounts/unmounts in the global namespace are visible in that namespace

A mnt namespace can have its own rootfs

Special filesystems must be remounted, e.g.:
- procfs (to see the processes)
- devpts (to see pseudo-terminals)
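
A short illustration with unshare (run as root; on systems where / is mounted shared you may first need mount --make-rprivate /):

$ sudo unshare --mount bash
$ mount -t tmpfs none /mnt      # visible only inside this mount namespace
$ mount | grep /mnt
none on /mnt type tmpfs (rw,relatime)
$ exit
$ mount | grep /mnt             # nothing: the host never saw the mount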

Namespaces: user

A process will have a distinct set of UIDs, GIDs and capabilities
- UID 42 in container X isn't UID 42 in container Y

UID Namespace (example)

Running from some user account
- id -u → 1000 (effective user ID)
- id -g → 1000 (effective group ID)

Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff

In order to create a user namespace and start a shell, we run the following from that non-root account:
- unshare -U /bin/bash

Example (cont'd)

Now, from the new shell, run:
- id -u → 65534
- id -g → 65534
- These are the default values for the eUID and eGID in the new namespace (no mapping exists yet)
- No difference if unshare is run by the root user

Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff

In fact:
- The namespace had full capabilities, but unshare's exec removed them
- User mappings can be specified in the gid_map and uid_map files of the created process
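
A sketch of the mapping step with a newer unshare (-r writes uid_map/gid_map entries that map the caller to root inside the namespace):

$ unshare -U -r /bin/bash
$ id -u
0
$ cat /proc/self/uid_map    # "0 1000 1": UID 0 inside maps to UID 1000 outside, for a range of 1
         0       1000          1
$ touch /etc/passwd         # still denied: root inside is not root outside
touch: cannot touch '/etc/passwd': Permission denied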

unshare util

Runs a program with some namespace(s) unshared from the parent

Example:
- ./unshare --net bash
- A new network namespace is created and the bash process is started inside that namespace
- Now ifconfig -a will show only the loopback device

After process termination:
- The namespace(s) will be freed

Namespace problems / todos

Missing namespaces:
- tty, fuse, binfmt_misc

Identifying a namespace
- No namespace ID, just process(es)
- Partly solved by inode numbers

Entering existing namespaces
- fd = open("/proc/<pid>/ns/<ns>"); setns(fd, 0);  (see also the nsenter utility)
- Was not possible in older kernels

Security of Containers

Uncertainties, Fears and Doubts

“LXC is not secure. If I want real security I'll use KVM.”
- Dan Berrange, famous LXC hacker, 2011

Still quoted today
- Still true in some cases
- Things have changed a little since 2011

Responses
- Kernel exploits
- Default LXC settings
- Containers needing to do funky stuff

Kernel Exploits

Kernel exploits: containers share the kernel
- A buggy kernel or syscall → game over!
- Unless the container is forbidden from using those syscalls: seccomp-bpf

Default LXC Settings

Default LXC settings
- AppArmor and SELinux profiles restrict some actions by default (intention: stop accidental harm)
- Capabilities must be restricted, too (see the config sketch below)

Full capabilities and permissions inside = sudoers access for the guest user!
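
A hedged sketch of what that restriction looks like in a container's config file (LXC 1.x key names; the capability list and profile name are illustrative):

$ cat >> /var/lib/lxc/demo/config <<'EOF'
# drop capabilities the container should never need
lxc.cap.drop = sys_module sys_time mac_admin mac_override
# confine the container with the stock AppArmor profile
lxc.aa_profile = lxc-container-default
EOF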

Containers Needing Extra Privileges

Network interfaces for VPNs or similar

Multicast, broadcast, packet sniffing

Raw access to devices (disks, GPU, ...)

Mounting stuff (even with FUSE)

More privileges = greater attack surface

Some Solutions

Use LXC for packaging
- Containers are not used on the target host

Use LXC for development and testing
- Insider code
- Prevents accidental harm to the systems

LXC for Web apps and databases
- Shouldn't require extra privileges

Use capabilities

Defense in depth!
- Use multiple security mechanisms
- Both in the container and on the host (no different from the usual practice!)

Other Solutions

One container per machine
- Containers for fast deployment: Docker

One VM per container
- Run untrusted code within a VM within a container

Conclusion

Containers seem to be the future of virtualization
- Already used in production settings, e.g. at Google

A stack of open source solutions is there!
- Linux
- LXC
- Docker, Project Builder, Puppet, etc.
- PaaS

Every technology has its own drawbacks
- Security is a strong concern!

Thank You!

Useful References
