Making things simple is a lot of work. At dotCloud, we package terribly complex things – such as deploying and scaling web applications – into the simplest possible experience for developers. But how does it work behind the scenes?
From kernel-level virtualization to monitoring, from high-throughput network routing to distributed locks, from dealing with EBS issues to collecting millions of system metrics per minute. As someone once commented, scaling a PaaS is “like disneyland for systems engineers on crack”.
This is the 5th episode of a series of posts exploring the architecture and internals of platform-as-a-service in general, and dotCloud in particular.
PART 5: Distributed routing
The dotCloud platform is powered by hundreds of servers, some of them running more than one thousand containers. The majority of these containers are HTTP servers, and they handle millions of HTTP requests every day, to power the apps hosted on dotCloud.
All the HTTP traffic is bound to a group of special machines, the “gateways”. The gateways parse HTTP requests, and route them to the appropriate backends. When there are multiple backends for a single service, the gateways also deal with the load balancing and failover. Last but not least, the gateways also forward HTTP logs to be processed by the metrics cluster.
HTTP Routing Layer
This “HTTP routing layer”, as we call it, runs on an elastic number of dedicated machines. When the load is low, 3 machines are enough to deal with the traffic. When spikes or DoS attacks happen, we scale up to 6, 10, or even more machines, to ensure optimal availability and latency.
The following drawing shows a simplified view of what happens:
All HTTP requests are bound to the “HTTP routing layer”, a cluster of identical HTTP load balancers. Each time you create, update (e.g. scale), or delete an application on dotCloud, the configuration of those load balancers has to be updated.
The “master source” for all the configuration is stored within a Riak cluster, working in tandem with a Redis cache. The configuration is modified using basic commands: create an HTTP entry, add/remove a frontend (virtual host), add/remove a backend (container). The commands are passed through a ZeroRPC API. Each update done through the API propagates through the platform; in the next sections, we will see which mechanisms are used.
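To make the shape of those commands concrete, here is a minimal in-memory sketch. It is not the dotCloud API: the class and method names (`RoutingConfig`, `create_entry`, `add_frontend`, and so on) are hypothetical, and the real platform stores this data in Riak and exposes it over ZeroRPC rather than keeping it in a Python dict.

```python
from dataclasses import dataclass, field


@dataclass
class HttpEntry:
    """One routed service: its virtual hosts and the containers behind them."""
    frontends: set = field(default_factory=set)  # virtual host names
    backends: set = field(default_factory=set)   # container addresses


class RoutingConfig:
    """Hypothetical stand-in for the master configuration store."""

    def __init__(self):
        self.entries = {}

    def create_entry(self, name):
        # Creating an entry twice is harmless: the existing one is kept.
        self.entries.setdefault(name, HttpEntry())

    def add_frontend(self, name, vhost):
        self.entries[name].frontends.add(vhost)

    def remove_frontend(self, name, vhost):
        self.entries[name].frontends.discard(vhost)

    def add_backend(self, name, address):
        self.entries[name].backends.add(address)

    def remove_backend(self, name, address):
        self.entries[name].backends.discard(address)
```

Scaling an application up or down then boils down to a few `add_backend` and `remove_backend` calls on the same entry.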
Version 1: Nginx + ZeroRPC
As you probably know, a start-up must be lean, agile, and many other things. It also needs to be pragmatic, and the Right Solution is not always the nicest one, but the one that you can ship on time. That’s why the first iteration of our routing layer had some shortcomings, as we will see. But it functioned properly up to a few tens of thousands of apps.
In this first version, each update sent to the load balancers contained the full configuration. Obviously, as the number of apps grew, the size of the configuration grew as well. Sending differential updates would have been better. But at least, when a load balancer lost a few configuration messages, there was no special case to handle: the next update would contain the full configuration, and provide all the necessary information.
The configuration was transmitted using a compressed, efficient format. Then, each load balancer would transform this abstract configuration into an Nginx configuration file, and tell Nginx to reload this configuration. Nginx being very well designed, it keeps serving requests with the old configuration while loading the new one, meaning that no HTTP request is lost during the configuration update.
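The transformation step can be pictured with the following sketch. The input layout (a dict of entries with `frontends` and `backends` lists) and the function name `render_nginx_config` are assumptions made for illustration; only the Nginx directives themselves (`upstream`, `server`, `server_name`, `proxy_pass`) are standard. Note that the whole file is rebuilt on every call, which is exactly the cost discussed further down.

```python
def render_nginx_config(entries):
    """Render every entry into an upstream block plus a server block.

    `entries` maps an entry name to {"frontends": [...], "backends": [...]}.
    This input shape is hypothetical, not dotCloud's actual format.
    """
    blocks = []
    for name, entry in sorted(entries.items()):
        servers = "\n".join(f"    server {addr};" for addr in entry["backends"])
        hosts = " ".join(entry["frontends"])
        blocks.append(
            f"upstream {name} {{\n{servers}\n}}\n"
            f"server {{\n"
            f"    listen 80;\n"
            f"    server_name {hosts};\n"
            f"    location / {{\n"
            f"        proxy_pass http://{name};\n"
            f"    }}\n"
            f"}}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_nginx_config({
        "myapp": {
            "frontends": ["www.example.com"],
            "backends": ["10.0.0.1:8080", "10.0.0.2:8080"],
        }
    }))
    # After writing the file to disk, a reload would typically be
    # triggered with `nginx -s reload`.
```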
Nginx also handles load balancing and failover nicely. When a backend dies, Nginx detects it, removes it from the pool, and periodically tries it again, to re-add it to the pool once it has fixed itself. This setup had two issues:
- Nginx does not support the WebSocket protocol, which was one of the top requested features by our users at that time
- Nginx has no support for dynamic reconfiguration, meaning that each configuration update requires the whole configuration file to be regenerated and reloaded
At some point, the load balancers started to spend (or rather: waste) a significant fraction of their CPU time reloading Nginx configurations. There was no significant impact on running applications, but it required deploying more and more powerful instances as the number of apps increased.
Nginx was still fast, and efficient, but we had to find a more dynamic alternative.
Version 2: Node.js + Redis + WebSocket = Hipache
We spent some time digging through several kinds of languages and different technologies to solve this issue. We needed the following features:
- add, update, and remove virtual hosts dynamically, with a very low cost
- support the WebSocket protocol
- great flexibility and control over the routed requests: we want to be able to trigger actions, log events, etc., at different steps of the routing
After looking around, we finally decided to implement our own proxy solution, with the following features:
- use multi-core machines by scaling the load to multiple workers
- store HTTP routes in Redis, allowing live configuration updates
- passive health-checking (when a backend is detected as being down, it is removed from the rotation)
- efficient logging of requests
- memory footprint monitoring: if a leak causes the memory usage of a worker to go beyond a given threshold, the worker is gracefully recycled
- independent from other dotCloud technologies (like ZeroRPC), to make the proxy fully re-usable by third parties (the code being, obviously, Open Source)
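To illustrate what storing HTTP routes in Redis can look like, here is a small sketch. The key layout (a `frontend:<hostname>` list whose first element is an identifier, followed by backend URLs) follows the convention documented in the Hipache README as I remember it; treat it as an assumption. `FakeRedis` is a hypothetical in-memory stand-in that mimics the Redis `RPUSH` and `LRANGE` commands so that the example runs without a server, and the random choice is just one simple way to spread the load.

```python
import random


class FakeRedis:
    """Tiny in-memory stand-in for the two Redis list commands used here."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
        return len(self.lists[key])

    def lrange(self, key, start, stop):
        items = self.lists.get(key, [])
        # Redis LRANGE uses an inclusive stop; -1 means "last element".
        stop = len(items) if stop == -1 else stop + 1
        return items[start:stop]


def pick_backend(redis, host, rng=random):
    """Look up a virtual host and return one of its backends, or None."""
    entry = redis.lrange(f"frontend:{host.lower()}", 0, -1)
    if len(entry) < 2:
        return None  # unknown virtual host, or no backend registered yet
    _identifier, *backends = entry
    return rng.choice(backends)


if __name__ == "__main__":
    r = FakeRedis()
    r.rpush("frontend:www.example.com", "myapp")
    r.rpush("frontend:www.example.com", "http://10.0.0.1:8080")
    print(pick_backend(r, "www.example.com"))
```

Because adding or removing a backend is a single list operation, a route change takes effect on the next lookup, with no configuration file to regenerate.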
After several months of engineering and intensive testing, we released the source code of Hipache: our new distributed proxy solution!
Behind the scenes, integrating Hipache into the dotCloud platform was very straightforward, thanks to our service-oriented architecture.
We simply wrote a new adapter which consumed virtual host configurations from the existing ZeroRPC service, and used it to update Hipache’s configuration in Redis. No refactoring or modification of the platform was necessary.
A word about dynamic configuration and latency… Storing the configuration in an external system (like Redis) means that you have to make trade-offs:
- you can look up the configuration at each request, but it requires a round-trip to Redis at each request, which adds latency
- you can cache the configuration locally, but you will have to wait a bit for your changes to take effect, or implement a complex cache-busting mechanism
I implemented a cache mechanism to avoid hitting Redis at each request. But that wasn’t necessary, because we realized that requests done to a local Redis are very, very fast. The difference between direct lookups and cached lookups was less than 0.1ms, which was in fact below the error margin of our measurements.
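For reference, the caching side of that trade-off can be sketched as a small time-to-live wrapper around any lookup function. This is not the cache that was written for Hipache: the class name `CachedLookup`, the `ttl` parameter, and the injectable `clock` are all hypothetical, chosen to keep the example short and testable.

```python
import time


class CachedLookup:
    """Wrap a lookup function with a per-host time-to-live cache."""

    def __init__(self, fetch, ttl=5.0, clock=time.monotonic):
        self.fetch = fetch    # e.g. a function doing the Redis round-trip
        self.ttl = ttl        # seconds during which a cached answer is reused
        self.clock = clock
        self.cache = {}       # host -> (expiry time, cached value)

    def get(self, host):
        now = self.clock()
        hit = self.cache.get(host)
        if hit is not None and hit[0] > now:
            return hit[1]     # fresh enough: no round-trip
        value = self.fetch(host)
        self.cache[host] = (now + self.ttl, value)
        return value
```

The drawback is visible in the code: a configuration change made right after a lookup stays invisible for up to `ttl` seconds, unless something explicitly evicts the entry.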
Version 3: Active health checks
Hipache has a simple embedded health-check system: when a request fails because of some backend issue (TCP errors, HTTP 5xx responses, etc.), the backend is flagged as being dead, and remains in this state for 30 seconds. During these 30 seconds, no request is sent to the backend; then it goes back to normal state (but if it is faulty, it will immediately be re-flagged as dead). This mechanism is simple enough and it works, but it has three caveats:
- if a backend is frozen, we will still send requests to it, until it gets marked as dead
- when a backend is repaired, it can take up to 30 seconds to mark it live again
- a backend which is permanently dead will still receive a few requests every 30 seconds
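The passive mechanism and its caveats are easy to see in a few lines of code. The sketch below is not Hipache's implementation (which is written in Node.js); the names `PassiveHealth`, `report_failure` and `live_backends` are hypothetical. Only the 30-second window comes from the description above.

```python
import time

DEAD_TTL = 30.0  # seconds a failed backend stays out of the rotation


class PassiveHealth:
    """Flag backends as dead when a request fails; forget after DEAD_TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.dead_until = {}  # backend -> time at which it is retried

    def report_failure(self, backend):
        # Called by the proxy when a request to this backend fails.
        self.dead_until[backend] = self.clock() + DEAD_TTL

    def is_alive(self, backend):
        return self.clock() >= self.dead_until.get(backend, float("-inf"))

    def live_backends(self, backends):
        return [b for b in backends if self.is_alive(b)]
```

Nothing happens until a real request fails, and a backend comes back only when its timer expires, whether or not it has actually recovered.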
To address those 3 problems, we implemented active health checks. The health checker permanently monitors the state of the backends, by doing the HTTP equivalent of a simple “ping”. As soon as a backend stops replying correctly to ping requests, it is marked as dead. As soon as it starts replying again, it is marked as live. The HTTP “pings” can be sent every few seconds, meaning that it will be much faster to detect when a backend changes.
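The idea of an HTTP “ping” sent to many backends at once can be sketched as follows. This is only an illustration in Python with a thread pool, not the actual health checker. What counts as “replying correctly” is an assumption here (any response with a status below 500), and the function names `http_ping` and `check_all` are hypothetical.

```python
import http.client
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_ping(url, timeout=2.0):
    """Return True if the backend answers with a non-5xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; a 4xx still proves the backend is up.
        return err.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # refused, timed out, reset, or garbled response


def check_all(backends, probe=http_ping, max_workers=32):
    """Probe every backend concurrently; return {backend: is_alive}."""
    backends = list(backends)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(probe, backends)
        return dict(zip(backends, results))
```

Running `check_all` in a loop every few seconds, and writing the results where the proxy can read them, gives the detection speed described above.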
To implement the active health checker, we considered multiple solutions: Node.js, Python+gevent, Twisted… And finally decided to roll it with the Go language. Why Go? Well, for multiple reasons:
- the health checker is massively concurrent (hundreds, and even thousands of HTTP connections can be “in flight” at a given time)
- Go programs can be compiled and deployed as a single, stand-alone, binary
- we have other tools doing massively concurrent queries, and this was an excellent occasion to do some comparative benchmarks (but don’t fret: those benchmarks aren’t ready yet, and will be the topic of a future blog post!)
The active health checker is completely optional. You don’t need it to run Hipache, and you can plug it on top of an existing Hipache installation without modifying Hipache’s configuration: it will detect and update Hipache’s configuration directly through the Redis used by Hipache itself. In other words, it gets along perfectly fine with Hipache’s embedded passive health-checking system, and running it will just improve the detection of dead backends.
Since this HTTP routing layer is a major part of the dotCloud infrastructure, we’re eager to find ways to make it better all the time.
Recently we did some research and tests to see if there was some way to implement dynamic routing with Nginx. In fact, we aimed for an even higher goal: we wanted to route requests with Nginx, using configuration rules stored in Redis, using the format currently used by Hipache. This would allow us to re-use many components: mainly, the Redis feeder; but also the active health checker, which uses the same configuration format.
Guess what: we found something! Almost one year ago, when we started to think about the design of Hipache and started its implementation, we looked at the Nginx Lua module. Since then, it has improved a lot, and it looked like an ideal candidate.
We started an experimental project which lets Nginx mimic Hipache, by using the same Redis configuration format. Nginx deals with the request proxying, while the routing logic is all in Lua. I used the excellent lua-resty-redis module to talk to Redis from Nginx.
This project, called hipache-nginx, is open source.
Some preliminary benchmarks show that under high load, hipache-nginx can be 10x faster than the original Hipache in Node.js. The benchmarks have to be refined, but it looks like hipache-nginx can deliver the same performance as hipache-nodejs with 10x fewer resources. So, while the code is still experimental, it shows that there is plenty of room for improvement in the current HTTP routing layer. Even if it will probably affect only apps with 10,000-100,000 requests per second, it is still worth investigating!
Previously on “PaaS under the Hood”: