In the Name of God, the Most Gracious, the Most Merciful
Sharif University of Technology
Data and Network Security Lab.
Lightweight Virtualization in Linux
Sadegh Dorri N.
PhD Candidate
Data and Network Security Lab. Seminar, 4 Aban 1393
The Need for Virtualization
Hypervisors are the living proof of operating system's incompetence!
Scheduling a multi-process “application”
- Nice, priority, etc. are hard to manage dynamically
Kernel memory management
- Fork bombs
- $ while true; do mkdir x; cd x; done
Abuse should be the application's problem, rather than being everyone's!
The failure of operating systems and how we can fix it: http://lwn.net/Articles/524952/
Agenda
Motivation
- Virtualization architectures
- OS-level virtualization in Linux
A demo
Under the hood
- LXC components
- Related kernel features: cgroups and namespaces
Security considerations
Conclusion
Various Virtualization Architectures
Hardware Virtualization
VMware, Parallels, QEMU, Bochs, Xen, KVM
Resources cannot be shared between VMs.
OS-Level Virtualization
Linux Containers (LXC), Linux-VServer, OpenVZ, Parallels Virtuozzo Containers
FreeBSD jails
Solaris Containers/Zones
IBM AIX6 WPARs (Workload Partitions)
OS-Level Virtualization in Linux
Linux Containers
- Allow a kernel to support more resource-isolation use-cases
- Without the overhead and complexity of running multiple kernel and driver instances
Benefits
- Isolation
- Small footprint
- Speed
3) Speed
2) Footprint
On a typical physical server, with average compute resources, you can easily run:
- 10-100 virtual machines
- 100-1000 containers
On disk, containers can be very light.
- A few MB, even without fancy storage.
1) Isolation
Each container has:
Its own network interface (and IP address)
- can be bridged, routed... just like VMs
Its own filesystem
- Debian host can run Fedora container (& vice-versa)
Isolation (security)
- container A & B can't harm (or even see) each other
Isolation (resource usage)
- soft & hard quotas for RAM, CPU, I/O...
Possibility of process checkpoint/freeze and migration
- Isolation prevents resource name conflicts
Use-Cases: Developers
Continuous Integration
- After each commit, run 100 tests in 100 environments
Continuous Packaging
- Example: Project Builder
Escape dependency hell
- Build (and/or run) in a controlled environment
Put everything in a container
- Even the tiny things
Use-Cases: Hosting Providers
Cheaper Hosting (VPS providers)
Give away more free stuff
- "Pay for your production, get your staging for free!"
- Spin up/down on demand, in seconds
- Example: dotCloud
“Google has built their entire datacenter infrastructure around Linux containers, launching more than 2 billion containers per week.”
(Kubernetes: open source Google cloud platform)
Use-Cases: Everyone
Look inside your VMs
- You can see (and kill) individual processes
- You can browse (and change) the filesystem
Do (almost) whatever you did with VMs
- ... But faster
Migration
- Checkpoint then unfreeze: experimental (CRIU)
Solutions in Linux
OpenVZ
Modified Linux kernel
- Also works with unpatched Linux 3.x (reduced feature set)
Each container is a separate entity with its own:
- Files: system libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
- Users and groups: its own root user, as well as other users and groups.
- Process tree: only sees its own processes (incl. init)
- Network: virtual network device with own IP addresses, iptables, and routing rules.
- Devices: can be granted access to real devices.
- IPC objects: shared memory, semaphores, messages.
LXC (LinuX Containers)
Container:
- Provides an environment like a standard Linux installation but without the need for a separate kernel.
- Single kernel and drivers, multiple different user spaces
A group of processes in Linux in an isolated environment.
- From inside: looks like a VM
- From outside: looks like normal processes
- Something (conceptually) in the middle between a chroot on steroids and a full-fledged VM
LXC vs. OpenVZ
- OpenVZ: production ready and stable; pushing to the upstream
- LXC: a work-in-progress; uses standard kernel features
LXC Lifecycle
lxc-create
- Set up a container (root filesystem and config)
lxc-start
- Boot the container (by default, you get a console)
lxc-console
- Attach a console (if you started in the background)
lxc-stop
- Shut down the container
lxc-destroy
- Destroy the filesystem created with lxc-create
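The lifecycle above can be sketched as a short script. This is a minimal sketch, not the demo from the talk: the container name "demo" and the "ubuntu" template are illustrative assumptions, and the run helper is a hypothetical wrapper that only prints each command unless DRY_RUN=0 is set (running for real needs root and an LXC installation).

```shell
#!/bin/sh
# Hypothetical helper: print commands by default, execute only if DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Walk one container through the full lifecycle.
# $1 = container name, $2 = template name (both illustrative defaults).
lxc_lifecycle() {
    name=${1:-demo}
    template=${2:-ubuntu}
    run lxc-create  -n "$name" -t "$template"   # set up rootfs and config
    run lxc-start   -n "$name" -d               # boot in the background
    run lxc-console -n "$name"                  # attach a console
    run lxc-stop    -n "$name"                  # shut the container down
    run lxc-destroy -n "$name"                  # remove the rootfs
}

lxc_lifecycle demo ubuntu
```

In dry-run mode the script is safe to run anywhere and simply lists the five commands in order.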
See also: LXC Web Panel - http://lxc-webpanel.github.io/
Demo...
Under the Hood
LXC Components
Components:
- The liblxc library
- Several language bindings for the API:
● Python, Lua, Go, Ruby, Haskell
- A set of standard tools to control the containers
- Container templates
Open source!
https://linuxcontainers.org/
Features Making up LXC
Kernel features used in LXC:
- Isolation:
● Kernel namespaces (ipc, uts, mount, pid, network and user)
● Chroots (using pivot_root)
- Resource management:
● Control groups (cgroups)
- Security:
● AppArmor and SELinux profiles
● Seccomp policies
● Kernel capabilities
Pivot_root and Chroot
Change the root directory to a new path
- Pivot_root: switches the complete system and removes dependencies on the old root dir.
- Chroot: applied on a single process
Seccomp
seccomp (SECure COMPuting mode)
- A simple sandboxing mechanism (Linux 2.6.12+ (2005))
- Allows a process to make a one-way transition into a "secure" state
● Syscalls limited to exit(), sigreturn(), read() and write() to already-open file descriptors.
- Any attempt to make another system call results in SIGKILL.
seccomp-bpf
- An extension to seccomp that allows filtering of system calls using a configurable policy
- Used by OpenSSH and vsftpd as well as Google Chrome/Chromium on Chrome OS and Linux to sandbox Flash player and renderers.
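To see which of the two modes a running process is in, one option is to read the Seccomp field of /proc/<pid>/status. The sketch below assumes a kernel recent enough to expose that field (it is absent on older kernels); the function names are illustrative, not part of any tool.

```shell
#!/bin/sh
# Map the numeric Seccomp field of /proc/<pid>/status to a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;      # original seccomp: exit, sigreturn, read, write
        2) echo "filter" ;;      # seccomp-bpf: policy-based filtering
        *) echo "unknown" ;;
    esac
}

# Print the seccomp mode of a process (default: the current shell).
# Prints "unknown" if the field or /proc itself is not available.
seccomp_mode_of() {
    pid=${1:-$$}
    mode=$(awk '/^Seccomp:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    seccomp_mode_name "${mode:-missing}"
}

seccomp_mode_of
```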
Capabilities
In traditional UNIX, processes are:
- Privileged (EUID is 0): bypass all kernel permission checks.
- Unprivileged: full permission checking (EUID, EGID, and supplementary group list).
Since Linux kernel 2.2:
- The superuser privileges are divided into distinct units (a.k.a. capabilities)
- Capabilities can be independently enabled and disabled (per-thread)
Examples:
- CAP_CHOWN: Make arbitrary changes to file UIDs and GIDs.
- CAP_KILL: Bypass permission checks for sending signals.
- CAP_NET_ADMIN: Perform various network-related operations.
- CAP_SYS_ADMIN
- CAP_SYS_BOOT: Use reboot and kexec_load
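Capability sets appear in /proc/<pid>/status as hexadecimal bitmasks, one bit per capability. As a rough sketch, the script below tests individual bits of such a mask; the bit numbers are those defined in <linux/capability.h>, while has_cap is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if bit number $2 is set in the hexadecimal mask $1 (no 0x prefix).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers from <linux/capability.h>.
CAP_CHOWN=0
CAP_KILL=5
CAP_NET_ADMIN=12
CAP_SYS_ADMIN=21
CAP_SYS_BOOT=22

# Effective capability mask of the current shell; all zeroes if unavailable.
eff=$(awk '/^CapEff:/ {print $2}' "/proc/$$/status" 2>/dev/null)
eff=${eff:-0000000000000000}

for name in CAP_CHOWN CAP_KILL CAP_NET_ADMIN CAP_SYS_ADMIN CAP_SYS_BOOT; do
    eval "bit=\$$name"
    if has_cap "$eff" "$bit"; then
        echo "$name: yes"
    else
        echo "$name: no"
    fi
done
```

Run as an ordinary user, every line should report "no"; run as root outside a container, typically "yes".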
Linux Security Modules (LSM)
A Linux kernel framework to support different security models
- Avoids favoritism toward any single implementation.
- Examples: AppArmor, SELinux, Smack and TOMOYO Linux
Used to implement different MACs (Mandatory Access Control schemes)
Control Groups
Introduction to CGroups
Cgroups (control groups):
- Allocate resources (CPU, memory, network, or their combinations) among user-defined groups of tasks (processes)
- Think ulimit, but for groups of processes ... and with fine-grained accounting.
- Initiated at Google (2006)
- Available in Fedora 18 kernel and Ubuntu 12.10 kernel (also some previous releases).
Commands:
- cgcreate: creates new cgroup
- cgset: sets parameters for given cgroup(s)
- cgexec: runs a task in specified control groups.
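A minimal sketch of how the three commands fit together, assuming the libcgroup tools are installed and the memory and cpu controllers are mounted. The group name "aloha", the limit values and the sleep workload are illustrative; run is a hypothetical wrapper that prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default, execute only if DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a group, set two parameters, then start a task inside it.
cg_demo() {
    group=${1:-aloha}
    run cgcreate -g "memory,cpu:$group"
    run cgset -r memory.limit_in_bytes=10000000 "$group"
    run cgset -r cpu.shares=512 "$group"
    run cgexec -g "memory,cpu:$group" sleep 60
}

cg_demo aloha
```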
CGroups: Implementation
Implemented as a special cgroup file system
- libcgroup is a library that abstracts the control group file system in Linux.
- CGroup services: allow persistence across reboot and ease of use.
A few simple hooks inserted into the kernel (not performance-critical):
- In boot phase, process creation and destroy methods, task_struct
procfs entries:
● For each process: /proc/<pid>/cgroup
● System-wide: /proc/cgroups
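The per-process procfs entry can be read directly. Each line has the form hierarchy-ID:controller-list:cgroup-path; the sketch below splits such lines into their three fields. The function names are illustrative, and the script prints a notice instead of failing where /proc is unavailable.

```shell
#!/bin/sh
# Split one line of /proc/<pid>/cgroup into its three fields.
parse_cgroup_line() {
    id=${1%%:*}          # everything before the first colon
    rest=${1#*:}
    ctrl=${rest%%:*}     # controller list (may be empty)
    path=${rest#*:}      # cgroup path relative to the hierarchy root
    echo "hierarchy=$id controllers=$ctrl path=$path"
}

# Show the cgroup membership of a process (default: the current shell).
show_cgroups() {
    pid=${1:-$$}
    if [ ! -r "/proc/$pid/cgroup" ]; then
        echo "no /proc/$pid/cgroup on this system"
        return 0
    fi
    while IFS= read -r line; do
        parse_cgroup_line "$line"
    done < "/proc/$pid/cgroup"
}

show_cgroups
```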
CGroup Subsystems
cpu
- controls the CPU scheduler
cpuacct
- generates automatic reports on CPU resources
cpuset
- assigns individual CPUs (cores) and memory nodes
memory
- limits memory use + generates automatic reports on memory resources
freezer
- suspends or resumes tasks in a cgroup.
CGroups Subsystems (cont'd)
blkio
- limits on block device I/O (disk, solid state, USB, etc.).
devices
- allows/denies access to devices
net_cls
- differentiates between packets of different cgroups.
net_prio
- dynamically sets the priority of network traffic per network interface.
Cgroups: Basics
Everything exposed through a virtual filesystem
- /cgroup, /sys/fs/cgroup... YourMountpointMayVary
Create a cgroup:
- mkdir /cgroup/aloha
- Automatically creates these files: tasks, cgroup.procs, etc.
Move process with PID 1234 to the cgroup:
- echo 1234 > /cgroup/aloha/tasks
Limit memory usage:
- echo 10000000 > /cgroup/aloha/memory.limit_in_bytes
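The same steps can be wrapped in small helper functions. This is a sketch under assumptions: CGROUP_ROOT defaults to /cgroup as on the slide (your mount point may differ), the helper names are made up for illustration, and on a real cgroup mount the commands need root. Pointing CGROUP_ROOT at a scratch directory only exercises the file operations; it is the kernel that creates and interprets the control files on a real mount.

```shell
#!/bin/sh
# Mount point of the cgroup filesystem (an assumption; varies by system).
CGROUP_ROOT=${CGROUP_ROOT:-/cgroup}

# Create a cgroup: on a real mount the kernel populates its control files.
cg_create() {
    mkdir -p "$CGROUP_ROOT/$1"
}

# Move the process with PID $2 into cgroup $1.
cg_add_task() {
    echo "$2" > "$CGROUP_ROOT/$1/tasks"
}

# Limit the memory usage of cgroup $1 to $2 bytes.
cg_limit_mem() {
    echo "$2" > "$CGROUP_ROOT/$1/memory.limit_in_bytes"
}

# List the tasks currently in cgroup $1.
cg_tasks() {
    cat "$CGROUP_ROOT/$1/tasks"
}

# Example (as root, on a mounted cgroup hierarchy):
#   cg_create aloha && cg_add_task aloha 1234 && cg_limit_mem aloha 10000000
```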
CPUset Subsystem
Each subsystem adds specific control files for its own needs
- Prefixed by its name
cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.memory_pressure_enabled
CGroup: CPU (and Friends)
Limiting
- Set cpu.shares (defines relative weights)
Accounting
- Check cpuacct.stat for user/system breakdown
Isolate
- Use cpuset.cpus (also for NUMA systems)
Can't really throttle a group of processes.
- But that's OK: context-switching << 1/HZ
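Because cpu.shares values are relative weights, what a group actually gets depends on who else is competing. As a small worked example (the share values are illustrative), a group's portion under full contention is its own shares divided by the sum of all competing groups' shares:

```shell
#!/bin/sh
# Percentage of CPU time a group can expect when all groups are busy.
# $1 = this group's cpu.shares, remaining arguments = the other groups' shares.
cpu_share_pct() {
    mine=$1
    shift
    total=$mine
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( 100 * mine / total ))   # integer percentage, rounded down
}

cpu_share_pct 1024 512    # 1024 shares against one group with 512 shares
```

With 1024 against 512 shares the first group gets roughly two thirds of the CPU (the script prints 66); when the other group is idle, the first group can use all of it.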
CGroup: Memory
Up to 25 control files
Limiting
- memory usage, swap usage
- soft limits and hard limits
- can be nested
Accounting
- cache vs. rss
- active vs. inactive
- file-backed pages vs. anonymous pages
- page-in/page-out
Isolation
- Reserve memory thanks to hard limits
CGroup: Block I/O
Limiting & Isolation
- blkio.throttle.{read,write}_{iops,bps}_device
- Drawback: only for sync I/O (i.e.: "classical" reads; not writes; not mapped files)
Accounting
- Number of I/Os, bytes, service time...
- Drawback: same as previously
CGroups aren't perfect to limit I/O
- Limiting the amount of dirty memory helps a bit.
Namespaces
Linux Namespaces
Namespaces: Lightweight process virtualization
- Isolation: enable a process (or several processes) to have different views of the system than other processes.
- Idea dates back to 1992 (Plan 9)
Introduced in Linux 2.4.19 (2002)
- User namespace was the last ns: a number of Linux filesystems are not user-namespace aware, yet!
User space modification
- No modification is needed (in general)
- Some utilities are made namespace-aware (iproute, util-linux)
Different Kinds of Namespaces
There are currently 6 namespaces in Linux:
- pid (processes)
- net (network interfaces, routing...)
- ipc (System V IPC)
- mnt (mount points, filesystems)
- uts (hostname)
- user (UIDs)
4 other namespaces are not implemented (yet):
- Security, security keys, device, and time
All require CAP_SYS_ADMIN
- Except user namespaces (not privileged)
- All the other ones can be created in conjunction with a new user namespace.
Namespaces Implementation
Namespaces do not have names
- Each namespace has a unique inode number (Linux 3.8+)
- The inode number is assigned when the namespace is created.
- There is an initial, default namespace for each namespace type.
ls -al /proc/<pid>/ns
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
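Since the link targets are the only identifier a namespace has, comparing them is a simple way to tell whether two processes share a namespace. A minimal sketch, with illustrative function names:

```shell
#!/bin/sh
# Extract the inode number from a link target such as "net:[4026531956]".
ns_inode() {
    echo "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)\]$/\1/'
}

# Succeed if processes $1 and $2 are in the same namespace of type $3
# (ipc, mnt, net, pid, user or uts). Fails if either link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3" 2>/dev/null)
    b=$(readlink "/proc/$2/ns/$3" 2>/dev/null)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

ns_inode "net:[4026531956]"

# A process trivially shares every namespace with itself.
if same_ns $$ $$ pid; then
    echo "same pid namespace"
fi
```

Reading the links of another user's process may require privileges, in which case same_ns reports failure rather than guessing.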
Trivial Namespaces
UTS (hostname)
- gethostname(), sethostname()
- struct system_utsname per container
Sys V IPC: shmem, semaphores, msg queues
- Keys must be mutually agreed upon by both client and server processes
- ipc namespace: uniqueness context of keys
Namespaces: pid
Usually a PID is an arbitrary number
Special cases:
- Init (i.e. child reaper) has a PID of 1
- Can't change PID (process migration)
Namespaces: pid (cont'd)
PID is no longer unique in kernel
- A process has (can have) different PIDs in each ns
- /proc/$PID/* is virtualized
PID namespaces are nested
- Processes in a PID ns can't see/affect processes of the parent ns
- But all PIDs in the ns are visible to the parent ns.
PID 1
Each PID namespace has a PID #1
- Its first process
Behaves like the “init” process:
- When a process dies, all its orphaned children will now have the process with PID 1 as their parent.
- Sending SIGKILL does not kill process 1, regardless of the namespace from which the command was issued (initial namespace or other pid namespace).
An important feature for containers
Namespaces: net
Logically another copy of the network stack, with its own separate...
- Network interfaces (and its own lo/127.0.0.1)
- IP address(es) and sockets
- routing table(s), iptables rules
Communication between containers:
- UNIX domain sockets (=on the filesystem)
- By creating a pair of network devices (veth) and moving one to another namespace (like a pipe).
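The veth approach can be sketched with iproute2. The namespace name, interface names and 10.0.0.0/24 addresses are illustrative assumptions; run is a hypothetical wrapper that only prints the commands unless DRY_RUN=0 is set, since executing them needs root.

```shell
#!/bin/sh
# Hypothetical helper: print commands by default, execute only if DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Connect the host to a new network namespace through a veth pair.
veth_demo() {
    ns=${1:-ct1}
    run ip netns add "$ns"                                  # new, empty network stack
    run ip link add veth-host type veth peer name veth-ct   # create the pair
    run ip link set veth-ct netns "$ns"                     # move one end inside
    run ip addr add 10.0.0.1/24 dev veth-host
    run ip link set veth-host up
    run ip netns exec "$ns" ip addr add 10.0.0.2/24 dev veth-ct
    run ip netns exec "$ns" ip link set veth-ct up
    run ip netns exec "$ns" ip link set lo up               # loopback starts down
}

veth_demo ct1
```

After a real run, the host should be able to reach 10.0.0.2 through veth-host, while the namespace sees only lo and veth-ct.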
Namespaces: mnt
In a new mount namespace:
- All previous mounts will be visible
- But mounts/unmounts in that mount namespace are invisible to the rest of the system.
- Mounts/unmounts in the global namespace are visible in that namespace.
A mnt namespace can have its own rootfs
Special filesystems must be remounted, e.g.:
- procfs (to see the processes)
- devpts (to see pseudo-terminals)
Namespaces: user
A process will have a distinct set of UIDs, GIDs and capabilities.
- UID42 in container X isn't UID42 in container Y
UID Namespace (example)
Running from some user account
- id -u → 1000 (effective user ID)
- id -g → 1000 (effective group ID)
Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff
In order to create a user namespace and start a shell, we will run from that non-root account:
- unshare -U /bin/bash
Example (cont'd)
Now from the new shell run
- id -u → 65534
- id -g → 65534
- These are the default values for the eUID and eGID in the new namespace.
- No difference if unshare is run by the root user
Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff
In fact:
- The namespace had full capabilities, but unshare removed them.
- User mapping can be specified in gid_map and uid_map of the created process.
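Each line of uid_map (and gid_map) has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. As a rough illustration of how the 65534 above arises, the hypothetical helper below translates an outside UID into the UID seen inside the namespace, falling back to 65534 (the default overflow ID) when the UID is not covered by the mapping.

```shell
#!/bin/sh
# Translate an outside UID to the UID seen inside a user namespace.
# $1 = first inside ID, $2 = first outside ID, $3 = range length, $4 = outside UID
uid_seen_inside() {
    inside=$1
    outside=$2
    length=$3
    uid=$4
    if [ "$uid" -ge "$outside" ] && [ "$uid" -lt $(( outside + length )) ]; then
        echo $(( inside + uid - outside ))
    else
        echo 65534    # unmapped IDs appear as the overflow ID
    fi
}

uid_seen_inside 0 1000 1 1000   # mapping "0 1000 1": UID 1000 becomes root inside
uid_seen_inside 0 1000 1 1001   # UID 1001 is not mapped

# A mapping like the one above would be installed from outside with, e.g.:
#   echo "0 1000 1" > /proc/<pid>/uid_map
```

With no mapping written at all, every ID is unmapped, which is why id reports 65534 right after unshare -U.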
unshare util
Runs a program with some namespace(s) unshared from parent
Example
- ./unshare --net bash
- A new network namespace was generated and the bash process was generated inside that namespace.
- Now ifconfig -a will show only the loopback device
After process termination,
- The namespace(s) will be freed.
Namespace problems / todos
Missing namespaces:
- tty, fuse, binfmt_misc
Identifying a namespace
- No namespace ID, just process(es)
- Partly solved by inode numbers
Entering existing namespaces
- fd=nsfd(NS, PID); setns(fd);
- Was not possible in older kernels
Security of Containers
Uncertainties, Fears and Doubts
“LXC is not secure. If I want real security I'll use KVM.”
- Dan Berrange, famous LXC hacker, 2011
Still quoted today
- Still true in some cases
- Things have changed a little since 2011
Responses
- Kernel exploits
- Default LXC settings
- Containers needing to do funky stuff
Kernel Exploits
Kernel exploits: containers share the kernel
- Buggy kernel and syscalls → Game over!
- Unless the container is forbidden from those syscalls
- seccomp-bpf
Default LXC Settings
Default LXC settings
- AppArmor and SELinux are used to restrict some actions, by default (intention: stop accidental harm)
- Capabilities must be restricted, too.
Full capabilities and permissions inside = sudoers access to the guest user!
Containers Needing Extra Priv.
Network interfaces for VPN or other
Multicast, broadcast, packet sniffing
Raw access to devices (disks, GPU, …)
Mounting stuff (even with FUSE)
More privileges = greater attack surface
Some Solutions
Use LXC for packaging
- Containers are not used on the target host.
Use LXC for development and testing
- Insider code
- Prevents accidental harm to the systems
LXC for Web apps and databases
- Shouldn't require extra privileges
Use capabilities
Defense in depth!
- Use multiple security mechanisms
- Both in the container and in the host (not different from the usual!)
Other Solutions
One container per machine
- Containers for fast deployment: Docker
One VM per container
- Run untrusted code within a VM within a container
Conclusion
Containers seem to be the future of virtualization
- Already used in production settings
- E.g. in Google
A stack of open source solutions is there!
- Linux
- LXC
- Docker, ProjectBuilder, Puppet, etc.
- PaaS
Every technology has its own drawbacks
- Security is a strong concern!
Thank You!
Useful References