Making things simple is a lot of work. At dotCloud, we package terribly complex things – such as deploying and scaling web applications – into the simplest possible experience for developers. But how does it work behind the scenes?
From kernel-level virtualization to monitoring, from high-throughput network routing to distributed locks, from dealing with EBS issues to collecting millions of system metrics per minute… As someone once commented, scaling a PaaS is “like disneyland for systems engineers on crack”. Still with us? Read on!
This is the 2nd installment of a series of posts exploring the architecture and internals of platform-as-a-service in general, and dotCloud in particular. You can find the 1st episode on namespaces here.
For our second episode, we will introduce cgroups. Control groups, or “cgroups”, are a set of mechanisms to measure and limit resource usage for groups of processes.
Conceptually, it works a bit like the ulimit shell command or the setrlimit system call; but instead of manipulating the resource limits of a single process, cgroups let you set them for groups of processes.
The easiest way to manipulate control groups is through the cgroup filesystem.
Assuming that it has been mounted on /cgroup, creating a new group named polkadot is as easy as mkdir /cgroup/polkadot. When you create this (pseudo) directory, it instantly gets populated with many (pseudo) files to manipulate the control group. You can then move one (or many) processes into the control group by writing their PID to the right control file, e.g. echo 4242 > /cgroup/polkadot/tasks.
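Putting those steps together, a minimal session might look like this (assuming the cgroup filesystem is mounted at /cgroup, and using the hypothetical PID 4242 from above):

```shell
# Create a new control group; the pseudo-directory is
# instantly populated with control files
mkdir /cgroup/polkadot
ls /cgroup/polkadot

# Move process 4242 into the group
echo 4242 > /cgroup/polkadot/tasks

# Verify: the tasks file lists every PID in the group
cat /cgroup/polkadot/tasks
```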
When a process is created, it will be in the same group as its parent – so if the init process of a container has been placed in a control group, all the processes of the container will be in the same control group (unless they are moved later, of course).
Destroying a control group is as easy as rmdir /cgroup/polkadot – but the processes within the cgroup have to be moved to other groups first; otherwise rmdir will fail (it is like trying to remove a non-empty directory).
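A sketch of that cleanup, again assuming the /cgroup mount point: evacuate the group by writing each PID back to the root cgroup's tasks file, then remove the directory.

```shell
# Move every remaining process back to the root cgroup
# (PIDs must be written one at a time)
for pid in $(cat /cgroup/polkadot/tasks); do
  echo "$pid" > /cgroup/tasks
done

# The group is now empty, so rmdir succeeds
rmdir /cgroup/polkadot
```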
Technically, control groups are split into many subsystems; each subsystem is responsible for a set of files in /cgroup/polkadot, and the file names are prefixed with the subsystem name.
For instance, the files cpuacct.stat, cpuacct.usage, cpuacct.usage_percpu are the interface for the cpuacct subsystem. The available subsystems will be detailed in the next paragraph.
The subsystems can be used together, or independently. In other words, you can decide that each control group will have limits and counters for all the subsystems, or, alternatively, that each subsystem will have different control groups.
To explain the latter case: it lets you place a process in the polkadot control group for memory control, but in the bluesuedeshoe control group for CPU control (in that case, polkadot and bluesuedeshoe live in completely separate hierarchies).
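To make this concrete, here is a sketch of mounting two subsystems as independent hierarchies (mount points and PID are illustrative):

```shell
# Mount the memory and cpu subsystems as two separate hierarchies
mkdir -p /cgroup/memory /cgroup/cpu
mount -t cgroup -o memory cgroup_memory /cgroup/memory
mount -t cgroup -o cpu    cgroup_cpu    /cgroup/cpu

# The same process can now belong to a different group per subsystem
mkdir /cgroup/memory/polkadot /cgroup/cpu/bluesuedeshoe
echo 4242 > /cgroup/memory/polkadot/tasks
echo 4242 > /cgroup/cpu/bluesuedeshoe/tasks
```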
What Can Be Controlled?
Many many things! We’ll list only the most useful here – or at least, those that we think are the most useful.
Memory

You can limit the amount of RAM and swap space that can be used by a group of processes. It accounts for the memory used by the processes for their private use (their Resident Set Size, or RSS), but also for the memory used for caching purposes.
This is actually quite powerful, because traditional tools (ps, analysis of /proc, etc.) have no way to find out the cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases.
A database will typically use very little memory for its processes (unless you do complex queries, but let’s pretend you don’t!), but can be a huge consumer of cache memory: after all, to perform optimally, your whole database (or at least, your “active set” of data that you refer to the most often) should fit into memory.
Limiting the memory available to the processes inside a cgroup is as easy as echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes (it will be rounded to a page size).
To check the current usage for a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You can gather very detailed (and very useful) information from memory.stat; the data contained in this file could justify a whole blog post by itself!
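The full memory workflow, in one place (paths assume the /cgroup mount point used throughout this post):

```shell
# Cap the polkadot group at roughly 1 GB
# (the kernel rounds the value to a page size)
echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes

# Current usage, in bytes (RSS + cache)
cat /cgroup/polkadot/memory.usage_in_bytes

# Detailed breakdown: rss, cache, swap, active/inactive pages, etc.
cat /cgroup/polkadot/memory.stat
```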
CPU

You might already be familiar with scheduler priorities, and with the nice and renice commands. Once again, control groups let you define the amount of CPU that should be shared by a group of processes, instead of a single one. You can give each cgroup a relative number of CPU shares, and the kernel will make sure that each group of processes gets access to the CPU in proportion to the number of shares you gave it.
Setting the number of shares is as simple as echo 250 > /cgroup/polkadot/cpu.shares. Remember that those shares are just relative numbers: if you multiply everyone's share by 10, the end result will be exactly the same. This control group also gives statistics in cpu.stat.
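A quick illustration of the relative nature of shares, using the two hypothetical groups from earlier:

```shell
# Give polkadot three times the CPU weight of bluesuedeshoe
echo 750 > /cgroup/polkadot/cpu.shares
echo 250 > /cgroup/bluesuedeshoe/cpu.shares
# Under contention, polkadot gets ~75% of the CPU time and
# bluesuedeshoe ~25%; when bluesuedeshoe is idle, polkadot
# is free to use 100%. Writing 7500 and 2500 instead would
# behave identically.
```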
CPU Sets

This is different from the cpu controller. In systems with multiple CPUs (i.e., the vast majority of servers, desktop & laptop computers, and even phones today!), the cpuset control group lets you define which processes can use which CPU.
This can be useful to reserve a full CPU to a given process or group of processes. Those processes will receive a fixed amount of CPU cycles, and they might also run faster because there will be less thrashing at the level of the CPU cache.
On systems with Non Uniform Memory Access (NUMA), the memory is split in multiple banks, and each bank is tied to a specific CPU (or set of CPUs); so binding a process (or group of processes) to a specific CPU (or a specific group of CPUs) can also reduce the overhead happening when a process is scheduled to run on a CPU, but accessing RAM tied to another CPU.
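A sketch of pinning a group to specific CPUs and a NUMA node (assuming the cpuset subsystem is available under the same /cgroup/polkadot path; CPU and node numbers are illustrative):

```shell
# Restrict the group to CPUs 0 and 1
echo 0-1 > /cgroup/polkadot/cpuset.cpus

# Restrict memory allocations to NUMA node 0; cpuset.mems
# must be set before any task can join the group
echo 0 > /cgroup/polkadot/cpuset.mems

# Now processes can be moved in as usual
echo 4242 > /cgroup/polkadot/tasks
```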
Block I/O

The blkio controller gives a lot of information about the disk accesses (technically, block device requests) performed by a group of processes. This is very useful, because I/O resources are much harder to share than CPU or RAM.
A system has a given, known, and fixed amount of RAM. It has a fixed number of CPU cycles every second – and even on systems where the number of CPU cycles can change (tickless systems, or virtual machines), it is not an issue, because the kernel will slice the CPU time in shares of e.g. 1 millisecond, and there is a given, known, and fixed number of milliseconds every second (doh!). I/O bandwidth, however, is quite unpredictable. Or rather, as we will see, it is predictable, but the prediction isn’t very useful.
A hard disk with a 10 ms average seek time can perform about 100 random requests of 4 kB per second; but if the requests are sequential, a typical desktop hard drive can easily sustain an 80 MB/s transfer rate, which means 20,000 requests of 4 kB per second.
The average throughput (measured in IOPS, I/O Operations Per Second) will be somewhere between those two extremes. But as soon as some application performs a task requiring a lot of scattered, random I/O operations, the performance will drop – dramatically. The system does give you some guaranteed performance, but this guaranteed performance is so low that it doesn't help much (that's exactly the problem of AWS EBS, by the way). It's like a highway with an anti-traffic jam system that would guarantee that you can always go above a given speed, except that this speed is 5 mph. Not very helpful, is it?
That’s why SSD storage is becoming increasingly popular. SSD has virtually no seek time, and can therefore sustain random I/O as fast as sequential I/O. The available throughput is therefore predictably good, under any given load.
Actually, there are some workloads that can cause problems; for instance, if you continuously write and rewrite a whole disk, you will find that the performance will drop dramatically. This is because read and write operations are fast, but erase, which must be performed at some point before write, is slow. This won’t be a problem in most situations.
An example use-case that could exhibit the issue would be using SSDs to do catch-up TV for 100 HD channels simultaneously: the disk will sustain the write throughput until it has written every block once; then it will need to erase, and performance will drop below acceptable levels.
To get back to the topic – what’s the purpose of the blkio controller in a PaaS environment like dotCloud?
The blkio controller metrics help detect applications that put an excessive strain on the I/O subsystem. The controller then lets you set limits, which can be expressed in operations and/or bytes per second, with separate limits for read and write operations. This lets you set safeguard limits (to make sure that a single app won't significantly degrade performance for everyone); furthermore, once an I/O-hungry app has been identified, its quota can be adapted to reduce its impact on other apps.
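Both sides of that workflow can be sketched as follows (same /cgroup mount point; "8:0" is the major:minor number of a hypothetical first disk):

```shell
# Metering: per-device I/O performed by the group
cat /cgroup/polkadot/blkio.io_serviced       # operation counts
cat /cgroup/polkadot/blkio.io_service_bytes  # byte counts

# Limiting: throttle reads on device 8:0 to 10 MB/s...
echo "8:0 10485760" > /cgroup/polkadot/blkio.throttle.read_bps_device
# ...and to 100 read operations per second
echo "8:0 100" > /cgroup/polkadot/blkio.throttle.read_iops_device
```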
It’s Not Only For Containers
As we mentioned, cgroups are convenient for containers, since it is very easy to map each container to a cgroup. But there are many other uses for cgroups.
The systemd service manager can put each service in a different cgroup. This makes it possible to keep track of all the subprocesses started by a given service, even when they use the "double-fork" technique to detach from their parent and re-attach to init. It also allows fine-grained tracking and control of the resources used by each service.
It is also possible to run a system-wide daemon to automatically classify processes into cgroups. This can be particularly useful on multi-user systems, to limit (or meter) each user's resources appropriately, or to run specific programs in a special cgroup when you know that those programs are prone to excessive resource use.
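On distributions shipping the libcgroup tools, the cgrulesengd daemon plays this role, driven by /etc/cgrules.conf. A hypothetical rule file (user names and group paths are made up for illustration) might look like:

```shell
# /etc/cgrules.conf — format: <user>[:<command>]  <subsystems>  <cgroup>

# Every process owned by user "alice" is metered by the memory controller
alice       memory       users/alice/

# Any ffmpeg invocation, whoever starts it, runs in a throttled group
*:ffmpeg    cpu,blkio    hogs/
```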
dotCloud & Control Groups
Thanks to cgroups, we can meter very accurately the resource usage of each container, and therefore of each unit of each service of each application.
We will probably give more details about our metrics collection system in a future blog post. Meanwhile, if you want to know, it uses collectd, along with our in-house lxc plugin. Metrics are streamed to a custom storage cluster, and can be queried and streamed by the rest of the platform using our ZeroRPC protocol.
We also use cgroups to allocate resource quotas for each container. For instance, when you use vertical scaling on dotCloud, you are actually setting limits for memory, swap usage, and CPU shares.