An example graph of the new memory metrics.
A while ago, we published a detailed blog post explaining How to Optimize the Memory Usage of Your Apps. It put a strong emphasis on metrics, because knowing the amount of used and available RAM is not enough when you’re trying to assess whether or not your apps need more memory.
With this in mind, we just released a new version of the dotCloud Dashboard. The new dashboard exposes more detailed memory metrics. You will now see that the memory allocated to your app is split into four parts: Resident Set Size, Active Page Cache, Inactive Page Cache, and Free Memory. Let’s review what they mean for your apps.
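To make the four-way split concrete, here is a minimal sketch of how the areas relate to each other. The class name, field names, and sample figures are hypothetical and are not taken from the dashboard; the only assumption is that the four areas add up to the memory allocated to the app.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class MemoryBreakdown:
    """Hypothetical container for the areas shown on the graph (in bytes)."""
    limit: int           # memory allocated to the app
    rss: int             # Resident Set Size
    active_cache: int    # Active Page Cache
    inactive_cache: int  # Inactive Page Cache

    @property
    def free(self) -> int:
        # Whatever is not RSS or page cache is free memory.
        used = self.rss + self.active_cache + self.inactive_cache
        return max(self.limit - used, 0)

    def shares(self) -> dict:
        """Return each area as a percentage of the allocated memory."""
        parts = {
            "Resident Set Size": self.rss,
            "Active Page Cache": self.active_cache,
            "Inactive Page Cache": self.inactive_cache,
            "Free Memory": self.free,
        }
        return {name: round(100 * value / self.limit, 1)
                for name, value in parts.items()}


# Made-up figures for a 512 MB app.
sample = MemoryBreakdown(limit=512 * MB, rss=256 * MB,
                         active_cache=128 * MB, inactive_cache=64 * MB)
for name, percent in sample.shares().items():
    print(f"{name}: {percent}%")
```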
Resident Set Size
That’s essentially the memory used by processes when they malloc() or do anonymous mmap(). This memory is inelastic: it will amount to exactly what your app has been asking for, no more, no less. If your app asks for more than what is available, it will be restarted. If the memory usage was due to a leak or to the occasional odd request, restarting the app will get it back on track. However, if your app constantly needs more of this kind of memory than what is available, it will constantly be restarted, and it will appear to be unstable.
We detect out-of-memory conditions, and we report them to you: we send e-mail notifications, and we record them to display them on the dashboard. When you receive those notifications, you should take them very seriously, and scale up your app — or audit your code to reduce your memory footprint.
On the new memory graph, the resident set size is drawn in solid dark blue. It’s the baseline of your memory usage, and you should not scale your memory below that amount.
Active and Inactive Page Cache
When your app reads from and writes to disk, data never goes directly into the application buffers. It transits through the system’s buffer cache, or page cache. It stays there for a while, so that if you request the same data again some time later, it will be available immediately, without performing actual disk I/O. Likewise, when you write something, it transits through the same buffer cache; this lets the system perform some optimizations regarding the order in which writes should be committed to disk.
The page cache is elastic: when you run out of memory, the system will happily discard it (since the cached data can be re-read anytime from the disk), or commit it to disk (in the case of cached writes). Conversely, if you have tons of memory, the system will happily retain as much as it can in the cache, which can lead to absurdly high memory usage for seemingly trivial apps. Typical example: a tiny HTTP server, handling requests for 10 MB of content, and using a few GB of page cache. How? Why? Well, because it’s also logging requests, and the log happens to be on disk. And Linux will keep the log in memory as well, if memory is available. Of course, if at some point you need the memory, Linux will free it up instantly. But meanwhile, if you look at your usage graphs, you will see the big memory usage.
On Linux, the page cache is split into two different pools: active and inactive. As the names imply, the active pool contains data that has been accessed recently, while the inactive pool contains data that is accessed less frequently. To make an informed scaling decision, it is important to understand how “active” and “inactive” really work under the hood. The memory is divided into pages, which are blocks of 4 KB. A given page of the buffer cache will start its existence (when it is loaded from the disk) as an active page. When an inactive page is accessed, it gets moved to the active pool. That part is easy! Now, when does an active page get moved to the inactive pool? This doesn’t happen out of “old age” (i.e., a page being left untouched for a while). It happens when the active pool becomes bigger than the inactive pool! When there are more active pages than inactive ones, the kernel scans the active pages, and demotes a few of them to the inactive pool. Some time later, if there are still more active than inactive pages, it will do it again. It will go on until the balance is restored. However, at the same time, your app is running, and accessing memory, potentially moving inactive pages back to the active pool.
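The promotion and demotion rules described above can be illustrated with a toy model. This is only a sketch of the behavior as explained in this post, not the kernel’s actual algorithm; the class name, the batch size, and the choice to demote the least recently accessed pages first are all assumptions made for the example.

```python
from collections import OrderedDict


class ToyPageCache:
    """Toy model of the two pools; NOT the kernel's real implementation."""

    def __init__(self):
        # Keys are page identifiers, ordered from oldest to most recent access.
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def access(self, page):
        if page in self.inactive:
            # An inactive page that gets accessed is promoted.
            del self.inactive[page]
            self.active[page] = None
        elif page in self.active:
            # Already active: just mark it as most recently accessed.
            self.active.move_to_end(page)
        else:
            # As described in the text, a newly loaded page starts as active.
            self.active[page] = None

    def rebalance(self, batch=2):
        """Demote up to `batch` pages, only while active outnumbers inactive."""
        moved = 0
        while moved < batch and len(self.active) > len(self.inactive):
            page, _ = self.active.popitem(last=False)  # oldest active page
            self.inactive[page] = None
            moved += 1
        return moved


cache = ToyPageCache()
for page in range(8):          # load 8 pages from "disk"
    cache.access(page)
while cache.rebalance():       # the scanner runs until balance is restored
    pass
print(len(cache.active), len(cache.inactive))

cache.access(0)                # the app touches a demoted page again
print(len(cache.active), len(cache.inactive))
```

In this model, age alone never demotes a page: `rebalance()` does nothing unless the active pool is larger than the inactive one, and any access to a demoted page pushes the ratio back up.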
What does this mean? The bottom line is the following: you should look at the active:inactive ratio. If this ratio is big (e.g. 200 MB of active memory vs. 20 MB of inactive memory), it means that the system is under heavy pressure. It’s constantly moving pages from active to inactive (to meet the 1:1 ratio), but the activity of your app is constantly moving pages back from inactive to active. In that case, it would be wise to scale vertically, to achieve better I/O performance (since more data will fit in the cache). As you add more memory, the ratio will lower, and get closer to 1:1. A ratio of 1:1 (or even lower) means that the system is at equilibrium: it has moved all it could to inactive memory, and there was no strong pressure to put things back into active memory. You want to get close to this ratio (at least if you need good I/O performance).
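As a rough illustration of this reasoning, the ratio can be computed and labeled as follows. The function names and the `high_ratio` threshold of 4 are assumptions chosen for the example; the text only says that 200:20 is "big" and that 1:1 or lower is the equilibrium.

```python
import math


def active_inactive_ratio(active_mb: float, inactive_mb: float) -> float:
    """Return active/inactive, treating an empty inactive pool as infinite."""
    if inactive_mb <= 0:
        return math.inf if active_mb > 0 else 0.0
    return active_mb / inactive_mb


def describe_pressure(active_mb: float, inactive_mb: float,
                      high_ratio: float = 4.0) -> str:
    """Label the cache pressure; `high_ratio` is an arbitrary assumption."""
    ratio = active_inactive_ratio(active_mb, inactive_mb)
    if ratio <= 1.0:
        return "equilibrium"
    if ratio >= high_ratio:
        return "heavy pressure"
    return "some pressure"


# The example from the text: 200 MB active vs. 20 MB inactive.
print(active_inactive_ratio(200, 20), describe_pressure(200, 20))
print(active_inactive_ratio(100, 100), describe_pressure(100, 100))
```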
On the new dashboard, the active and inactive memory pools are shown in medium-blue and light-blue shades, respectively, to highlight the fact that they are still important, but less so than the (darker) resident set size.
Free Memory
Well, that one at least doesn’t deserve a long, technical explanation! If the metrics show that your app consistently has a leeway of free memory, you can definitely consider scaling down by that amount.
Warning: even if it’s often said that “free RAM is wasted RAM”, be wary of spikes! Take, for instance, a 1 GB Java app, which constantly shows 200 MB of Free Memory. Before scaling down to 800 MB, make sure that it is not experiencing occasional spikes that consume that Free Memory! If you scale down, your app will be out of memory during the spikes, and will most likely crash. Also, remember that the long-term graphs (like the 7-day and 30-day trends) show average values, meaning that short bursts will not show up on those graphs. The metrics sample rate is 1 data point per minute, and that’s about the resolution that you can get on the 1-hour and 6-hour graphs. This means that, unfortunately, short spikes (less than one minute) won’t appear on any graph.
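The effect of averaging is easy to see with a little arithmetic. The sketch below uses made-up numbers (one sample per minute for an hour, with a single one-minute spike) and only illustrates how an average hides a peak; it says nothing about how the dashboard actually aggregates its data. Spikes shorter than the sample interval would not even appear in `samples`.

```python
from statistics import mean

LIMIT_MB = 1024  # memory allocated to the app (1 GB)

# Hypothetical usage: 800 MB for an hour, except one minute at 950 MB.
samples = [800] * 60
samples[30] = 950

average_used = mean(samples)
peak_used = max(samples)

free_by_average = LIMIT_MB - average_used   # what a long-term graph suggests
free_at_peak = LIMIT_MB - peak_used         # what is really left during the spike

print(f"average used: {average_used} MB, apparent free: {free_by_average} MB")
print(f"peak used:    {peak_used} MB, free at peak: {free_at_peak} MB")
```

Going by the average, scaling down by 200 MB looks safe; going by the peak, it would leave the app short of memory during the spike.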
On the new dashboard, the free memory is shown in light grey.
Putting It All Together
This is a lot of new information, but the new dashboard should make it very easy for you to figure out the appropriate vertical scaling for your application.
- For code services, make sure that the Resident Set Size (dark blue) never maxes out the available memory. If it gets close to it, you should add more memory before you receive out-of-memory notifications. Conversely, do not hesitate to cut through the Free Memory and the Inactive Page Cache (grey and light blue areas). The Page Cache will typically be small compared to the Resident Set Size.
- For database services (and static services), the previous rule applies as well, but the Page Cache (both Active and Inactive) will very likely be much bigger, and you will have to pay attention to that, too. As a rule of thumb, compare the Active and Inactive amounts during peak times. If Active is bigger than Inactive, your memory usage is close to being optimal. If they are equivalent (or if Inactive is larger), it means that you can scale down a little bit. This should be an iterative process: scale down, wait for memory usage to stabilize, check again, and repeat until the Active pool starts being larger.
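The two rules of thumb above can be summarized in a small helper. This is a sketch, not a dotCloud tool: the function name, the service kinds, and the 90% `rss_margin` used to decide that the Resident Set Size is "close" to the limit are assumptions made for the example.

```python
def scaling_hint(service_kind: str, limit_mb: float, rss_mb: float,
                 active_mb: float, inactive_mb: float,
                 rss_margin: float = 0.9) -> str:
    """Return a rough scaling hint based on peak-time figures (in MB)."""
    # Both kinds of services: never let RSS approach the allocated memory.
    if rss_mb >= rss_margin * limit_mb:
        return "scale up"

    if service_kind == "code":
        # Free Memory and Inactive Page Cache can be cut through.
        free_mb = max(limit_mb - rss_mb - active_mb - inactive_mb, 0)
        reclaimable = free_mb + inactive_mb
        if reclaimable > 0:
            return f"can scale down by up to {reclaimable:g} MB"
        return "keep"

    # Database and static services: compare the two page cache pools.
    if active_mb > inactive_mb:
        return "keep"
    return "scale down a little, then re-check"


print(scaling_hint("code", 512, 480, 10, 10))
print(scaling_hint("code", 512, 256, 64, 64))
print(scaling_hint("database", 1024, 256, 400, 200))
print(scaling_hint("database", 1024, 256, 300, 300))
```

For database services the hint is deliberately vague: as described above, the process is iterative, so the helper would be called again after each change once memory usage has stabilized.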
We hope that the new dashboard helps you make informed scaling decisions, and cut down significantly on your dotCloud bill!