PaaS under the hood, episode 1: kernel namespaces

Making things simple is a lot of work. At dotCloud, we package terribly complex things – such as deploying and scaling web applications – into the simplest possible experience for developers. But how does it work behind the scenes? From kernel-level virtualization to monitoring, from high-throughput network routing to distributed locks, from dealing with EBS issues to collecting millions of system metrics per minute… As someone once commented, scaling a PaaS is “like disneyland for systems engineers on crack”. Still with us? Read on!

This is the first installment of a series of posts exploring the architecture and internals of platform-as-a-service in general, and dotCloud in particular. For our first episode, we will introduce namespaces, a specific feature of the Linux kernel used by the dotCloud platform to isolate applications from each other.

Part 1: Namespaces

The first time I was introduced to Linux Containers (LXC), I got the (very wrong) impression that they relied mainly on control groups (or “cgroups” for short). It’s an easy mistake to make: each time you create a new container named e.g. “jose”, a cgroup of the same name appears, e.g. in “/cgroup/jose”. But actually, even though control groups are useful to Linux Containers (as we will see in part 2), the really important infrastructure is provided by namespaces.

Namespaces are the real magic behind containers. There are different kinds of namespaces, as we will see; each kind applies to a specific resource, and each one creates barriers between processes. Those barriers can be at different levels.

The pid namespace

This is probably the most useful for basic isolation.

Each pid namespace has its own process numbering. Different pid namespaces form a hierarchy: the kernel keeps track of which namespace created which other. A “parent” namespace can see its children namespaces, and it can affect them (for instance, with signals); but a child namespace cannot do anything to its parent namespace. As a consequence:

  • each pid namespace has its own “PID 1” init-like process;
  • processes living in a namespace cannot affect processes living in parent or sibling namespaces with system calls like kill or ptrace, since process ids are meaningful only inside a given namespace;
  • if a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes belonging to the namespace;
  • since the numbering is different in each namespace, a process in a child namespace will have multiple PIDs: one in its own namespace, and a different PID in its parent namespace.

The last item means that from the top-level pid namespace, you will be able to see all processes running in all namespaces, but with different PIDs. Of course, a process can have more than 2 PIDs if there are more than two levels of hierarchy in the namespaces.
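
To make this concrete, here is a minimal sketch of that last point (my own illustration, not code from the original platform; it assumes root privileges and a kernel with pid namespace support): a child is started in a new pid namespace with clone(), sees itself as PID 1, while the parent sees the very same process under an ordinary PID.

    /* Minimal sketch: one process, two PIDs.
     * Build with: gcc -o pidns-demo pidns-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child(void *arg)
    {
        /* Inside the new pid namespace, this process is the init-like PID 1. */
        printf("child: my PID in the new namespace is %d\n", getpid());
        return 0;
    }

    int main(void)
    {
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid == -1) {
            perror("clone");
            exit(1);
        }
        /* Seen from the parent namespace, the same process has an ordinary PID. */
        printf("parent: the child's PID in my namespace is %d\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
    }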

The net namespace

With the pid namespace, you can start processes in multiple isolated environments (let’s bite the bullet and call them “containers” once and for all). But if you want to run e.g. a different Apache in each container, you will run into a problem: only one process can listen on port 80/tcp at a time. You could configure your instances of Apache to listen on different ports… or you could use the net namespace.

As its name implies, the net namespace is about networking. Each different net namespace can have different network interfaces. Even lo, the loopback interface supporting 127.0.0.1, will be different in each different net namespace.

It is possible to create pairs of special interfaces, which will appear in two different net namespaces, and allow a net namespace to talk to the outside world.

A typical container will have its own loopback interface (lo), as well as one end of such a special interface, generally named eth0. The other end of the special interface will be in the “original” namespace, and will bear a poetic name like veth42xyz0. It is then possible to put those special interfaces together within an Ethernet bridge (to achieve switching between containers), or route packets between them, etc. (If you are familiar with the Xen networking model, this is probably no news to you!)

Note that each net namespace has its own meaning for INADDR_ANY, a.k.a. 0.0.0.0; so when your Apache process binds to *:80 within its namespace, it will only receive connections directed to the IP addresses and interfaces of its namespace – thus allowing you, at the end of the day, to run multiple Apache instances, with their default configuration listening on port 80.

In case you were wondering: each net namespace has its own routing table, but also its own iptables chains and rules.
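
As an illustration, here is a small sketch (my own, not code from the platform; it assumes root and net namespace support): a process cloned with CLONE_NEWNET lists its network interfaces and finds only a fresh, unconfigured loopback interface.

    /* Minimal sketch: a new net namespace starts out with only "lo".
     * Build with: gcc -o netns-demo netns-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <net/if.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child(void *arg)
    {
        /* Enumerate the interfaces visible from this net namespace. */
        struct if_nameindex *ifs = if_nameindex();
        struct if_nameindex *i;
        if (ifs != NULL) {
            for (i = ifs; i->if_index != 0; i++)
                printf("interface in the new namespace: %s\n", i->if_name);
            if_freenameindex(ifs);
        }
        return 0;
    }

    int main(void)
    {
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWNET | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); exit(1); }
        waitpid(pid, NULL, 0);   /* typically prints only "lo" */
        return 0;
    }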

The ipc namespace

This one probably won’t appeal much to you, unless you took your UNIX 101 a long time ago, when IPC (Inter-Process Communication) was still part of the curriculum!

IPC provides semaphores, message queues, and shared memory segments.

While still supported by virtually all UNIX flavors, those features are considered obsolete by many people, superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs – including PostgreSQL – still use IPC.

What’s the connection with namespaces? Well, each IPC resource is accessed through a unique 32-bit ID. IPC implements permissions on resources, but nonetheless, an application could be surprised if it failed to access a given resource because that resource had already been claimed by another process in a different container.

Enter the ipc namespace: processes within a given ipc namespace cannot access (or even see at all) IPC resources living in other ipc namespaces. And now you can safely run a PostgreSQL instance in each container without fearing IPC key collisions!
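
Here is a hedged sketch of the collision problem and its fix (the key value is a placeholder, and the example assumes root and a kernel with ipc namespace support): after claiming a System V key, the process unshares its ipc namespace and can claim the very same key again, because the two segments now live in different namespaces.

    /* Minimal sketch: the same IPC key can be claimed once per ipc namespace.
     * Build with: gcc -o ipcns-demo ipcns-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <unistd.h>

    #define KEY 0x1234  /* placeholder key, e.g. what a database might pick */

    int main(void)
    {
        int id1, id2;

        /* Claim the key in the original ipc namespace. */
        id1 = shmget(KEY, 4096, IPC_CREAT | IPC_EXCL | 0600);
        if (id1 == -1) { perror("shmget (original namespace)"); exit(1); }

        /* Move this process into a brand new ipc namespace... */
        if (unshare(CLONE_NEWIPC) == -1) { perror("unshare"); exit(1); }

        /* ...where the same key is free again: no collision. */
        id2 = shmget(KEY, 4096, IPC_CREAT | IPC_EXCL | 0600);
        if (id2 == -1) { perror("shmget (new namespace)"); exit(1); }

        printf("same key 0x%x, two segments: id %d (original), id %d (new)\n",
               KEY, id1, id2);
        /* Note: the first segment stays behind in the original namespace;
         * a real program would remove it (shmctl IPC_RMID) before unsharing. */
        return 0;
    }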

The mnt namespace

You might already be familiar with chroot, a mechanism that sandboxes a process (and its children) within a given directory. The mnt namespace takes that concept one step further.

As its name implies, the mnt namespace deals with mountpoints.

Processes living in different mnt namespaces can see different sets of mounted filesystems – and different root directories. If a filesystem is mounted in a mnt namespace, it will be accessible only to those processes within that namespace; it will remain invisible for processes in other namespaces.

At first, it sounds useful, since it allows you to sandbox each container within its own directory, hidden from the other containers.

At a second glance, is it really that useful? After all, if each container is chroot’ed in a different directory, container C1 won’t be able to access or see the filesystem of container C2, right? Well, that’s right, but there are side effects.

With chroot alone, inspecting /proc/mounts in a container will show the mountpoints of all containers. Also, those mountpoints will be relative to the original namespace, which can give some hints about the layout of your system – and maybe confuse applications that rely on the paths found in /proc/mounts.

The mnt namespace makes the situation much cleaner, allowing each container to have its own mountpoints, and to see only those mountpoints, with their paths correctly translated to the actual root of the namespace.
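
A small sketch of the idea (my own illustration; it assumes root, and the MS_PRIVATE remount is only there in case the host uses shared mount propagation): after unshare(CLONE_NEWNS), a tmpfs mounted on /mnt is visible in this namespace only.

    /* Minimal sketch: mounts performed after unshare(CLONE_NEWNS) stay private.
     * Build with: gcc -o mntns-demo mntns-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
        if (unshare(CLONE_NEWNS) == -1) { perror("unshare"); exit(1); }

        /* Make our copy of the mount tree private, so that our mounts do not
         * propagate back to the original namespace. */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
            perror("remount private");

        /* Visible from this mnt namespace only; /proc/mounts elsewhere
         * will not list it. */
        if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) == -1) {
            perror("mount tmpfs");
            exit(1);
        }

        execlp("cat", "cat", "/proc/mounts", (char *)NULL);
        perror("execlp");
        return 1;
    }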

The uts namespace

Finally, the uts namespace deals with one little detail: the hostname that will be “seen” by a group of processes.

Each uts namespace will hold a different hostname, and changing the hostname (through the sethostname system call) will only change it for processes running in the same namespace.
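
A tiny sketch (again my own, with “jose” as a placeholder container name, assuming root): after unshare(CLONE_NEWUTS), sethostname only affects the new namespace.

    /* Minimal sketch: a private hostname for one process (and its children).
     * Build with: gcc -o utsns-demo utsns-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char name[64];

        if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); exit(1); }

        /* Only processes in this uts namespace will see the new hostname. */
        if (sethostname("jose", strlen("jose")) == -1) {
            perror("sethostname");
            exit(1);
        }

        gethostname(name, sizeof(name));
        printf("hostname inside the uts namespace: %s\n", name);
        /* Running `hostname` outside of this process still shows the old name. */
        return 0;
    }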


Creating namespaces

Namespace creation is achieved with the clone system call. This system call supports a number of flags, allowing you to specify “I want the new process to run within its own pid, net, ipc, mnt, and uts namespaces”. When creating a container, this is exactly what happens: a new process, with new namespaces, is created; its network interfaces (including the special pair of interfaces to talk with the outside world) are configured; and it executes an init-like process.
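
Here is what that clone call could look like, as a hedged sketch (the path /sbin/init stands for whatever init-like process the container runs; network and mount setup are deliberately left out):

    /* Minimal sketch: start an init-like process in fresh pid, net, ipc,
     * mnt and uts namespaces.
     * Build with: gcc -o spawn-demo spawn-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int container_init(void *arg)
    {
        /* In a real container manager, network interfaces and mountpoints
         * would be set up before (or right after) this point. */
        execl("/sbin/init", "init", (char *)NULL);
        perror("execl");
        return 1;
    }

    int main(void)
    {
        int flags = CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC |
                    CLONE_NEWNS  | CLONE_NEWUTS | SIGCHLD;
        pid_t pid = clone(container_init, child_stack + sizeof(child_stack),
                          flags, NULL);
        if (pid == -1) { perror("clone"); exit(1); }
        printf("container init started, PID %d in this namespace\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
    }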

When the last process within a namespace exits, the associated resources (IPC, network interfaces…) are automatically reclaimed. If, for some reason, you want those resources to survive after the termination of the last process of the namespace, there is a way. Each namespace is materialized by a special file in /proc/$PID/ns. Using mount --bind on one of those special files, each namespace can be retained for future use.
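
For instance, retaining a container’s net namespace could look like this sketch (the PID 1234 and the target path are placeholders of my own; it assumes root):

    /* Minimal sketch: keep a net namespace alive by bind-mounting its
     * special file.
     * Build with: gcc -o retain-demo retain-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
        int fd;

        /* The target of a bind mount has to exist: create an empty file. */
        fd = open("/var/run/netns-jose", O_CREAT | O_RDONLY, 0644);
        if (fd == -1) { perror("open"); exit(1); }
        close(fd);

        /* As long as this bind mount is in place, the namespace survives,
         * even after its last process has exited. */
        if (mount("/proc/1234/ns/net", "/var/run/netns-jose",
                  NULL, MS_BIND, NULL) == -1) {
            perror("bind mount");
            exit(1);
        }
        return 0;
    }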

Each namespace? Not quite. Up to (and including) kernel 3.4, only ipc, net, and uts namespaces appear here; mnt and pid namespaces do not. This can be a problem in some specific cases, as we will see in the next paragraph.

Attaching to existing namespaces

It is also possible to “enter” a namespace, by attaching a process to an existing namespace. Why would one want to do that? Generally, to run an arbitrary command within the namespace. Here are a few examples:

  • you want to set up network interfaces “from outside”, without relying on scripts inside the container;
  • you want to run an arbitrary command to retrieve some information about the container: you could generally obtain the same information by peeking at the container “from outside”, but sometimes, it might require specially patched tools (e.g. if you want to execute netstat);
  • you want to obtain a shell within the container.

Attaching a process to existing namespaces requires two things:

  • the setns system call (which exists only since kernel 3.0, or with patches for older kernels);
  • the namespace must appear in /proc/$PID/ns.

Wait, we mentioned those special files in the previous paragraph, and we said that only ipc, net, and uts namespaces were listed under /proc/$PID/ns! So how can we attach to existing mnt and pid namespaces? We can’t – unless we use a patched kernel.
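
For the namespaces that do show up under /proc/$PID/ns (e.g. the net namespace), attaching could look like this hedged sketch (the tool name and usage are made up for illustration):

    /* Minimal sketch: join the net namespace of process <pid>, then run a
     * command inside it. Needs setns() (kernel >= 3.0, glibc >= 2.14).
     * Usage: ./nsenter-net <pid> <command> [args...]
     * Build with: gcc -o nsenter-net nsenter-net.c (run as root) */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        char path[64];
        int fd;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <pid> <command> [args...]\n", argv[0]);
            return 1;
        }

        snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);
        fd = open(path, O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        /* Join the target net namespace (0 = don't check the namespace type). */
        if (setns(fd, 0) == -1) { perror("setns"); return 1; }
        close(fd);

        /* From here on, netstat, ip, etc. see the container's interfaces. */
        execvp(argv[2], &argv[2]);
        perror("execvp");
        return 1;
    }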

Combining the necessary patches can be fairly tricky, and explaining how to resolve conflicts between AUFS and GRSEC would almost deserve a blog post of its own. So, if you don’t want to run an overly patched kernel, here are some workarounds.

  • You can run sshd in your containers, and pre-authorize a special SSH key to execute your commands. This is one of the easiest solutions, but if sshd crashes, or is stopped (intentionally or by accident), you’re locked out of the container. Also, if you want to squeeze the memory footprint of your containers as much as possible, you might want to get rid of sshd. If the latter is your main concern, you can run a low-profile SSH server like Dropbear, or start the SSH service from inetd or a similar service.
  • If you want something simpler than SSH (or something different from SSH, to avoid interference with custom sshd configurations), you can open a backdoor. An example would be to run socat TCP-LISTEN:222,fork,reuseaddr EXEC:/bin/bash,stderr from init in your containers (but then make sure that port 222/tcp is correctly firewalled).
  • An even better solution is to embed this “control channel” within your init process. Before changing its root directory, the init process can set up a UNIX socket on a path located outside the container root directory. When it changes its root directory, it retains its open file descriptors – and therefore the control socket. A sketch of this idea follows below.
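
Here is that rough sketch (all paths are placeholders of my own; accepting connections and running commands is left as an exercise):

    /* Minimal sketch: create the control socket before chroot(); the open
     * file descriptor survives the chroot.
     * Build with: gcc -o ctl-demo ctl-demo.c (run as root) */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr;
        int ctl;

        /* Bind the control socket at a path OUTSIDE the container root. */
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/var/run/containers/jose.sock",
                sizeof(addr.sun_path) - 1);

        ctl = socket(AF_UNIX, SOCK_STREAM, 0);
        if (ctl == -1
            || bind(ctl, (struct sockaddr *)&addr, sizeof(addr)) == -1
            || listen(ctl, 5) == -1) {
            perror("control socket");
            exit(1);
        }

        /* Enter the container root; the listening fd remains usable even
         * though its path is no longer reachable from inside. */
        if (chroot("/srv/containers/jose/rootfs") == -1 || chdir("/") == -1) {
            perror("chroot");
            exit(1);
        }

        /* ...accept() connections on ctl and spawn shells or commands... */
        return 0;
    }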

How dotCloud uses namespaces

In its early days, the dotCloud platform used plain LXC (Linux Containers), and therefore made implicit use of namespaces.

Very early, we deployed patched kernels, allowing us to attach arbitrary processes to existing namespaces – because we found it to be the most convenient and reliable way to deploy, control, and orchestrate containers. The platform has evolved, and while the original “containers” are being stripped down (bearing less and less similarity to usual Linux Containers), we still use namespaces to isolate applications from each other.


About Jérôme Petazzoni
Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned the nickname of “master Yoda”. In a previous life he built and operated large-scale Xen hosting back when EC2 was just the name of a plane, supervised the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, and performed various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials and sample applications. He’s also an avid dotCloud power user who has deployed just about anything on dotCloud – look for one of his many custom services on our GitHub repository.
Connect with Jérôme on Twitter! @jpetazzo