Let’s Talk about Service Dependencies

(aka: Escape from System D, part IV).

First: anyone who’s been keeping tabs will have noticed that there hasn’t been a lot of progress on Dinit recently; this has been due to multiple factors, one being the hard disk drive in my laptop dying and this impeding my ability to work on the train to and from work, which is when I usually found time to work on Dinit. However, I’ve by no means abandoned the project, will hopefully have a replacement laptop soon, and expect the commits to resume in due course (there have been a small number made recently, in fact).

In this post I wanted to discuss service dependencies and the pros and cons of managing them in slightly different ways. In an earlier post I touched on the basics of service management with dependencies:

if one service needs another, then starting the first should also start the other, and stopping the second should also require the first to stop.

It’s clear that there are two reasons that a service could be running:

  1. It has been explicitly started, or
  2. It has been started because another service which depends on it has been started.

This is all very well, but in the second case, there’s an open question about what to do when the dependent service stops. There are two choices in this regard:

  1. A started service remains running when its dependents stop, even if the service has not itself been explicitly started, or
  2. A started service automatically stops when its dependents stop (unless it has itself been explicitly started).

Which is the better option? The first option is probably simpler to implement (it doesn’t require tracking whether a service was explicitly started, for instance); the second option, though, has the nice properties that (a) it doesn’t keep unneeded services running and (b) explicitly starting and then stopping a service will return the system to the original state (in terms of which services are running). Also, if you want to emulate the concept of run levels (which essentially describe a set of services to run exclusively), you can do so easily enough; switching run level is equivalent to explicitly starting the appropriate run level service and stopping the current one.
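As a rough sketch of the second option, each service can carry an "explicitly started" flag plus a count of running dependents that currently need it. Everything here (the service struct, acquire, release_if_unneeded) is a hypothetical simplification for illustration, not Dinit's actual code; it uses recursion for brevity and models only the release of unneeded services, not a forced stop of a service that still has running dependents.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model (not Dinit's real classes).
struct service {
    std::string name;
    std::vector<service *> dependencies;
    bool explicitly_started = false; // started by direct request
    int required_by = 0;             // running dependents that hold this service
    bool running = false;

    // Bring the service up, acquiring each of its dependencies.
    void acquire() {
        if (running) return;
        running = true;
        for (service *dep : dependencies) {
            dep->required_by++;
            dep->acquire();
        }
    }

    // Stop the service if nothing needs it, then release its dependencies.
    void release_if_unneeded() {
        if (!running || explicitly_started || required_by > 0) return;
        running = false;
        for (service *dep : dependencies) {
            assert(dep->required_by > 0);
            dep->required_by--;
            dep->release_if_unneeded();
        }
    }

    void start() { explicitly_started = true; acquire(); }
    void stop()  { explicitly_started = false; release_if_unneeded(); }
};
```

With this bookkeeping, starting and then stopping a service leaves every service in the state it was in beforehand, while a dependency that was also started explicitly keeps running.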

(Systemd makes a distinction between service units, which describe a process to run, and target units, which group services. However, I’m not sure there’s a real need for this distinction; services can depend on other services anyway, so the main difference is that one has an individual associated process and the other doesn’t. Indeed Systemd’s systemctl isolate command can accept a service unit, although it expects a target unit by default. Dinit on the other hand makes no real distinction between services and targets at this higher level.)

There are some complications, though, which necessarily add complexity to the service model described above. Mainly, we want some flexibility in how dependency termination is handled. The initial “boot” service, for instance, probably shouldn’t stop (and release all its dependencies as a result) if a single dependency (let’s say the sshd server, for example) terminates unexpectedly; similarly, we wouldn’t necessarily want boot to be considered failed if any of a number of certain dependency services failed to start. On the other hand, for other service/dependency combinations, we might want exactly that: if the dependency fails then the dependent also fails, and if the dependency stops then the dependent also stops.

Other problems we need to solve:

  • It may be convenient to have persistent services that remain started after they are started (due to a dependent starting), even when the dependent stops. For instance, if we have a service which mounts the filesystem read/write (from read-only) it’s probably convenient to leave it “running” after it starts, since undoing this is complicated and may be error-prone.
  • Boot failure needs a contingency; it should be possible to configure what happens if some service essential for boot fails (whether it be to start a single-user shell, reboot, power off, or simply stop with an error message).

With all the above in mind, I’ve narrowed down the necessary dependency types as follows:

  • regular – the dependency must start before the dependent starts, and if the dependency stops then the dependent stops.
  • soft – the dependency starts (in parallel) with the dependent, but if it fails or stops this does not affect the dependent. It’s not precisely clear that this dependency type is necessary in its own right, but it forms the basis for the following two dependency types.
  • waits-for – as for soft, but the dependent waits until the dependency starts (or fails) before it starts itself.
  • “milestone” – The dependency must start before the dependent starts, but once the dependent has started, the dependency link becomes soft. This is different from “waits-for” in that if the dependency fails, the dependent will not start.
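The four dependency types listed above differ in three respects: whether the dependent waits for the dependency's start attempt, whether a failed start of the dependency blocks the dependent, and whether a later stop of the dependency stops the dependent. The following is one way to tabulate that; the enum and function names are invented for this sketch and are not taken from Dinit.

```cpp
// Hypothetical names, for illustration only.
enum class dep_type { regular, soft, waits_for, milestone };

// Does the dependent wait until the dependency has started (or failed)
// before starting itself?
constexpr bool waits_for_start(dep_type t) {
    return t != dep_type::soft;
}

// Does a failure of the dependency to start prevent the dependent from starting?
constexpr bool start_failure_propagates(dep_type t) {
    return t == dep_type::regular || t == dep_type::milestone;
}

// Once the dependent is running, does the dependency stopping
// also stop the dependent?
constexpr bool stop_propagates(dep_type t) {
    return t == dep_type::regular;
}
```

Seen this way, "milestone" behaves like "regular" until the dependent has started and like "soft" afterwards.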

This is what I’m currently implementing (up until now, only “regular” and “waits-for” dependencies have been supported by Dinit).

For the boot failure case, Dinit currently starts the service named “single” (i.e. the single-user service); however, some flexibility / configurability might be added at a later date.

For next time

There are a lot of things that I want to write about and implement, and though finding the time has been increasingly difficult lately I’m hoping things will calm down a little over the next few months.

One thing I really need to do is look again, properly, at some of the other supervision/init systems out there. There are two motivations for this: one, determining whether Dinit is really necessary in its own right –  that is, can any of the existing systems do everything that I’m hoping Dinit will be able to, and would it make sense to collaborate with / contribute to one of them? In particular s6 and Nosh are two suites which seem like they are well-designed and capable. (Note that I don’t envisage stopping work on Dinit altogether, and don’t feel like availability of another quality init system is going to be a bad thing).

There’s still a lot more work that needs to be done with Dinit, too. Presently it’s not possible to modify loaded service definitions (including changing dependencies) which is certainly a must-have-for-1.0 feature, but that’s really just the tip of the iceberg. At some point I’d like to create a formal list of what is needed to truly supplant Systemd in the common Linux software ecosystem. Completing the basic Dinit functionality remains a priority for now, however.

Thanks for reading and, as always, constructive comments are welcome.


Safety and Daemons

(aka. Escape from System D, part III).

So Dinit (github) is a service manager and supervisor which can function as an init process. As I’ve previously discussed, an init needs to be exceptionally stable: if it crashes, the whole system will come down with it. A service manager which manages system services, though, also needs to be stable, even if it’s not also running as an init: it’s likely that a service manager failure will cause parts of the system to stop working correctly.

But what do we mean by stable, in this case? Well, obviously, part of what we mean is that it shouldn’t crash, and part of that means we want no bugs. But that’s a narrow interpretation and not a useful one; we don’t really want bugs in any software. A big part of being stable – the kind of stable we want in an init or service manager – is being robust in the face of resource scarcity. File descriptors are one resource we are concerned about; another, and one of the most obvious, is memory. In C, malloc can fail: it returns a null pointer if it cannot allocate a chunk of the requested size – and this possibility is ignored only at some peril. (One class of security vulnerability occurs when a program can be manipulated into attempting allocation of a chunk so large that the allocation will certainly fail, and the program fails to check whether the allocation was successful).

Consider now the xmalloc function, implementations of which abound. One can be found in the GNU project’s libiberty library, for example. xmalloc behaves just like malloc except that it aborts the program when the allocation fails, rather than returning a null pointer. This is “safe” in the sense that it prevents program misbehaviour and potential exploits, although is sometimes less than desirable from an end-user perspective. In a service manager, it would almost certainly be problematic. In an init, it would be disastrous. (Note that in Dinit, it is planned to separate the init process from the service manager process. Currently, however, they are combined).
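To make the contrast concrete, here is a minimal sketch of an xmalloc-style wrapper next to a function that reports allocation failure to its caller instead. This xmalloc is a generic illustration rather than libiberty's exact code, and try_copy_name is a hypothetical helper invented for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// xmalloc-style wrapper: never returns null, aborts the program instead.
void *xmalloc(std::size_t size) {
    if (size == 0) size = 1;  // malloc(0) may legitimately return null
    void *p = std::malloc(size);
    if (p == nullptr) {
        std::fputs("out of memory\n", stderr);
        std::abort();
    }
    return p;
}

// Alternative: hand the failure back to the caller, who can refuse the
// request and keep running. (Hypothetical helper.)
bool try_copy_name(const char *src, std::size_t len, char **out) {
    char *buf = static_cast<char *>(std::malloc(len + 1));
    if (buf == nullptr) return false;
    std::memcpy(buf, src, len);
    buf[len] = '\0';
    *out = buf;
    return true;
}
```

The second style costs a check at every call site, but it is the only one of the two that lets a long-running process survive an out-of-memory condition.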

So, in Dinit, if a memory allocation fails, we want to be able to handle it. But also, importantly, we want to avoid (as much as possible) making critical allocations during normal operation – that is, if we could not proceed when an allocation failed, it would be better if we avoided the need for allocation altogether.

How Dinit plays safe

In general Dinit tries to avoid dynamic memory allocation when it’s not essential; I’ll discuss some details shortly. However, there’s another memory-related resource which can be limited: the stack. Any sort of unbounded recursion potentially exhausts the stack space, and this form of exhaustion is much harder to detect and deal with than regular heap space exhaustion. The simplest way to deal with this is to avoid unbounded recursion, which Dinit mostly does (there is still one case that I know of remaining – during loading of service descriptions – but I hope to eliminate it in due course).

Consider the process of starting a service. If the service has dependencies, those must be started too, and the dependencies of those dependencies must be started, and so on. This would be expressed very naturally via recursion, something like:

void service::start() {
    for (auto dep : dependencies) {
        dep->start(); // recursively start each dependency first
    }
    do_start(); // actually start this service
}

(Note this is very simplified code). However, we don’t want recursion (at least, we don’t want recursion which uses our limited stack). So instead, we could use a queue allocated on the heap:

void service::start() {
    // (throws std::bad_alloc on out-of-memory).
    // start with a queue containing this service,
    // and an empty (heap-allocating) stack:
    std::queue<service *> start_queue;
    start_queue.push(this);
    std::stack<service *> start_stack;

    // for each dependency, add to the queue. Build the stack:
    while (! start_queue.empty()) {
        for (auto dep : start_queue.front()->dependencies) {
            start_queue.push(dep);
        }
        start_stack.push(start_queue.front());
        start_queue.pop();
    }

    // start each service in reverse dependency order:
    while (! start_stack.empty()) {
        start_stack.top()->do_start();
        start_stack.pop();
    }
}

This is considerably more complicated code, but it doesn’t implicitly use our limited stack, and it allows us to catch memory space exhaustion (via the std::bad_alloc exception, which is thrown from the queue and stack allocators as appropriate). It’s an improvement (if not in readability), but we’ve really just traded the use of one limited resource for another.

(Also, we need to be careful that we don’t forget to catch the exception somewhere and handle it appropriately! An uncaught exception in C++ will also terminate the program – so we essentially get xmalloc behaviour by default – and because of this, exceptions are arguably a weakness here; however, they can improve code readability and conciseness compared to continually checking for error status returns, especially in conjunction with the RAII paradigm. We just need to be vigilant in checking that we always do catch them!).

Edit: incidentally, if you’re thinking that memory allocation failure during service start is a sure sign that we won’t be able to launch the service process anyway, you’re probably right. However, consider service stop. It follows basically the same procedure as start, but in reverse, and not being able to stop services in a low-memory environment would clearly be bad.

We can improve further on the above: note that while the service dependency graph is not necessarily a tree, we only need to start each dependency once (the above code doesn’t take this into account, potentially issuing do_start() to the same service multiple times if it is a dependency of multiple other services). Given that a service only need appear in start_queue and start_stack once, we can actually manage those data structures as linked lists where the node is internal to the service (i.e. the node doesn’t need to be allocated separately).

For example, service might be defined as something like:

class service {
    std::string name;
    std::list<service *> dependencies;
    // (other details)
    bool is_in_start_queue = false;
    bool is_in_start_stack = false;
    service * next_in_start_queue = nullptr;
    service * next_in_start_stack = nullptr;
    void start();
    void do_start();
};

Now, although it requires extra code (again) because we can’t use the standard library’s queue or stack, we can manage the two data structures without performing any allocations. This means we can rewrite our example start() in such a way that it cannot fail (though of course in reality starting a service requires various additional steps – such as actually starting a process – for which we can’t absolutely guarantee success; however, we’ve certainly reduced the potential failure cases).
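A minimal sketch of such an allocation-free start(), using the embedded next pointers directly, might look like the following. It is an illustration under simplifying assumptions rather than Dinit's code: members are public, do_start() only records an order (start_counter and start_seq exist purely so the behaviour can be observed), and is_in_start_queue doubles as a "seen" flag so that each service is handled once.

```cpp
#include <cassert>
#include <list>
#include <string>

int start_counter = 0;  // instrumentation for the example only

struct service {
    std::string name;
    std::list<service *> dependencies;
    bool is_in_start_queue = false;
    bool is_in_start_stack = false;
    service *next_in_start_queue = nullptr;
    service *next_in_start_stack = nullptr;
    int start_seq = 0;

    void start();
    void do_start() { start_seq = ++start_counter; }
};

void service::start() {
    assert(!is_in_start_queue);
    // The queue initially holds only this service; links live in the services.
    service *queue_head = this;
    service *queue_tail = this;
    is_in_start_queue = true;
    service *stack_top = nullptr;

    while (queue_head != nullptr) {
        service *front = queue_head;
        // Append each dependency that has not been seen yet.
        for (service *dep : front->dependencies) {
            if (!dep->is_in_start_queue) {
                dep->is_in_start_queue = true;
                queue_tail->next_in_start_queue = dep;
                queue_tail = dep;
            }
        }
        queue_head = front->next_in_start_queue;
        // Push the processed service onto the stack.
        front->is_in_start_stack = true;
        front->next_in_start_stack = stack_top;
        stack_top = front;
    }

    // Pop and start in reverse order, resetting the embedded links.
    while (stack_top != nullptr) {
        service *s = stack_top;
        stack_top = s->next_in_start_stack;
        s->next_in_start_queue = s->next_in_start_stack = nullptr;
        s->is_in_start_queue = s->is_in_start_stack = false;
        s->do_start();
    }
}
```

The walk itself performs no allocation, so it has no out-of-memory failure path. Note that a breadth-first walk like this is not a full topological sort: for some graph shapes a dependency can be issued do_start() after one of its dependents, so the sketch demonstrates only the bookkeeping, not a complete ordering algorithm.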

In fact, in Dinit a service can be part of several different lists (technically, order-preserving sets). I wrote some template classes to avoid duplicating code to deal with the different lists, which you can find in the source repository. Using these templates, we can rewrite the example service class and the start() method, as follows:

class service {
    std::string name;
    std::list<service *> dependencies;
    // (other details)
    lld_node<service> start_queue_node;
    lls_node<service> start_stack_node;
    void start();
    void do_start();

    static auto &get_startq_node(service *s) {
        return s->start_queue_node;
    }
    static auto &get_starts_node(service *s) {
        return s->start_stack_node;
    }
};

void service::start() {
    // start with a queue containing this service, and an empty stack
    // (neither allocates; the list nodes live inside the services):
    dlist<service, service::get_startq_node> start_queue;
    slist<service, service::get_starts_node> start_stack;
    start_queue.append(this); // (append/insert: method names assumed here)

    // for each dependency, add to the queue. Build the stack:
    while (! start_queue.is_empty()) {
        auto front = start_queue.pop_front();
        for (auto dep : front->dependencies) {
            if (! start_queue.is_queued(dep))
                start_queue.append(dep);
        }
        if (! start_stack.is_queued(front)) {
            start_stack.insert(front);
        }
    }

    // start each service in reverse dependency order:
    while (! start_stack.is_empty()) {
        start_stack.pop_front()->do_start();
    }
}

(Note that the templates take two arguments: one is the element type in the list, which is service in this case, and the other is a function to extract the list node from the element. The call to this function will normally be inlined by the compiler, so you end up paying no abstraction penalty).

This is a tiny bit more code, but it’s not too bad, and compared to the previous effort it performs no allocations and avoids issuing do_start() to any service more than once. The actual code in Dinit is somewhat more complicated, but works roughly as outlined here. (Note, I snuck some C++14 into the code above; Dinit itself remains C++11 compatible at this stage).
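For readers curious how a list template can take a node-extraction function as a template argument, here is a stripped-down sketch of a singly-linked intrusive stack in that style. The node layout (a next pointer plus a queued flag) and the insert method name are assumptions made for this example; the real templates in the Dinit repository may differ, and task is a stand-in element type.

```cpp
#include <cassert>

// Node embedded in each element. Layout assumed for this sketch.
template <typename T> struct lls_node {
    T *next = nullptr;
    bool queued = false;
};

// T is the element type; E extracts the embedded node from an element.
template <typename T, lls_node<T> &(*E)(T *)>
class slist {
    T *head = nullptr;
public:
    bool is_empty() const { return head == nullptr; }
    bool is_queued(T *e) const { return E(e).queued; }

    // Push onto the front. No allocation: the node is part of *e.
    void insert(T *e) {
        lls_node<T> &n = E(e);
        n.next = head;
        n.queued = true;
        head = e;
    }

    T *pop_front() {
        assert(head != nullptr);
        T *e = head;
        lls_node<T> &n = E(e);
        head = n.next;
        n.next = nullptr;
        n.queued = false;
        return e;
    }
};

// Stand-in element type for the example.
struct task {
    int id;
    lls_node<task> node;
    explicit task(int i) : id(i) {}
    static lls_node<task> &get_node(task *t) { return t->node; }
};
```

Because the extractor is a compile-time constant, an element can embed several nodes and belong to several lists at once, each list instantiated with a different extractor.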

There’s more to resource safety than memory and stack usage; I may discuss a little bit more in the future. I hope this post has provided some interesting perspective, however. As usual, comments are welcome.


Since last post, I’ve added a “stop timeout” for services – this allows setting a maximum time for a service to stop. If it takes longer than the allowed time, the service process is issued a SIGKILL which (unless something really whack is going on) should cause it to terminate immediately. I’ve set the default to 10 seconds, which seems reasonable, but it can be configured (and disabled) via the service description file.

(I’m not sure if I really want this to be enabled by default, or whether 10 seconds is really enough as a default value – so this decision may be revisited. Opinions welcome).
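The general shape of a stop timeout can be sketched with plain POSIX calls: request termination with SIGTERM, then escalate to SIGKILL if the process has not exited by the deadline. This is an illustrative sketch only; stop_with_timeout is a hypothetical function, it works only for a direct child process, and it polls with waitpid, whereas a real service manager would more plausibly drive this from its event loop and timers.

```cpp
#include <cassert>
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

// Ask child `pid` to stop; send SIGKILL if it is still alive after `timeout`.
// On success returns true and stores the wait status in `status`.
bool stop_with_timeout(pid_t pid, std::chrono::milliseconds timeout, int &status) {
    using clock = std::chrono::steady_clock;
    if (kill(pid, SIGTERM) != 0) return false;
    const auto deadline = clock::now() + timeout;
    bool kill_sent = false;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return true;   // child has exited and been reaped
        if (r < 0) return false;     // not our child, or another error
        if (!kill_sent && clock::now() >= deadline) {
            kill(pid, SIGKILL);      // timeout expired: force termination
            kill_sent = true;
        }
        timespec pause_time;
        pause_time.tv_sec = 0;
        pause_time.tv_nsec = 10 * 1000 * 1000;  // poll every 10ms
        nanosleep(&pause_time, nullptr);
    }
}
```

A call such as stop_with_timeout(pid, std::chrono::seconds(10), status) would correspond to the 10 second default described above.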

Other than that, it’s been bugfixes, cleaning up TODO’s in the code, and minor robustness improvements. I’m aiming for complete service management functionality soon (and in fact Dinit already works well in this capacity, but is missing one or two features that I consider important).

Escape from System D (2)

Episode II: Init versus the service management daemon

I was pleased that my announcement of another in-development init/service manager met with a mostly positive response. I plan to keep making semi-regular posts where I post both general discussion around the issues of service management and progress updates on my own effort, dubbed Dinit.

In this post I will give a little background on init systems and service management generally. I expect a lot of readers will not learn much, since it is already well understood, but it is worth laying out some background for reference in future posts/discussion.

What is “init”?

The init process, traditionally started from /sbin/init on the filesystem, is the first userspace process to launch on the system. As such it is the only process with no parent process. Most (if not all) operating systems give it a process ID of 1, making it easy to identify. There are two special things about the init process:

  1. First, it automatically becomes the new parent of otherwise orphaned processes. In particular processes which “daemonise” themselves by double-forking and letting the intermediate parent die get re-parented to the init process.
  2. If the init process terminates, for any reason, the kernel panics (so the whole system crashes).

The second point is in fact not necessarily true – it just so happens that, at least on Linux, if the init process dies then the system dies with it. I am not sure how the various *BSD systems react, but in general, it is not expected that the init process will terminate. This means that it is very, very important that the init process does not crash. However, the first point above has some implications as well, which we’ll get to shortly.

Notionally, the init system has two jobs: to reap its child processes when they have terminated (this is accomplished using the wait system call or one of its variants; reaping a terminated process ensures that its resources are freed and that it is no longer listed in the process table of the system) and also to start up the system, which it can potentially do just by running another process. An init may also be involved in the system shutdown process as well, though strictly speaking that’s not necessary.
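The reaping half of that job can be sketched in a few lines around waitpid. The function name here is made up for the example; an actual init would typically either block in wait or call something like this whenever it is notified via SIGCHLD.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Reap every child that has already terminated, without blocking.
// Returns the number of children reaped.
int reap_terminated_children() {
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {          // one terminated child collected
            ++reaped;
            continue;
        }
        if (pid < 0 && errno == EINTR) continue;  // interrupted: retry
        // pid == 0: children exist but none has terminated yet;
        // pid < 0 with ECHILD: there are no children at all.
        break;
    }
    return reaped;
}
```

Passing -1 as the first argument makes waitpid report on any child, which for PID 1 includes orphaned processes that have been re-parented to it.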

You might be interested in Rich Felker’s example of a minimal init system, which is part of one of his blog posts (where he also discusses Systemd). It’s less than a screenful of text – small enough that it can be “obviously bug free” – a nice attribute to have for an init, for reasons outlined above.

So what is a “service manager”?

A service manager provides, at the most basic level, a means for stopping and starting individual services. Services quite typically run as a process – consider for example the ssh server daemon, sshd – but sometimes exist in some other form; having the network connection(s) up and operational, for example, could be enacted by means of a service. Typical modern systems have a service manager which is either started from the init process or incorporated in it (Systemd is an example of an init process which incorporates service management functionality, but there are various others which do the same).

Aside from just an interface to starting and stopping services, service managers may provide:

  • process supervision – which normally amounts to the ability to restart a service process if it terminates unexpectedly (in general, this is a mitigation measure against software faults)
  • service dependency management – if one service needs another, then starting the first should also start the other, and stopping the second should also require the first to stop.
  • a logging mechanism for dealing with output from service processes (in general, though, this can be delegated largely to a secondary process).

Since a service manager is naturally somewhat more complex than a standalone init system, it should be obvious that incorporating the two in one process has some inherent risks. If an init system terminates unexpectedly, the whole system will generally crash; not only is this inconvenient for the user, but it also makes analysing the bug that caused the crash more difficult.

Why combine them, then?

The obvious question: if it’s better to keep init as simple as possible, why does it get combined with service management? One reason is so that double-forking processes, which have re-parented to the init process, can be supervised; normal POSIX functions only allow receiving status notifications for direct child processes. (Various *BSDs support watching arbitrary process status via the kqueue system calls, but the interface has flaws – that I will perhaps discuss another time – and anyway, any mechanism to watch a non-immediate-child process by process ID, without co-ordination with the parent process, is prone to a race condition: at least in theory, a process with a given ID can die, and be reaped, and the process ID can be recycled, in between some other process discovering the process ID and setting up a watch for it or even worse sending a termination signal in an attempt to shut down a service).

Now we could just about argue that no service should double-fork, and that this eliminates any need for the service manager to run as the init process (PID 1). However, we can’t actually prevent processes from double-forking; on the other hand, there is a mechanism – at least on Linux – called cgroups, which allows for tracking process origin even through double-fork. Importantly, this can be used to track processes belonging to particular user sessions. One operation that we might naturally want to perform on a cgroup is to terminate it – or rather, terminate all processes in the cgroup – and this, once again, is racy unless we can co-ordinate with the parent process(es) of all processes in the cgroup (and by “co-ordinate” I mean that we want to prevent the parent process from reaping child processes which have terminated, to avoid the race where a process ID is recycled and the wrong process is then terminated, as described above).

(Some other systems might have functionality similar to cgroups – I have FreeBSD jails in mind, though I need to do some research to understand exactly how jails work and their limitations, and in particular if they also suffer the termination race problem described above).

So, for supervising double-forked processes, and for controlling user sessions, having control of the PID 1 (init) process is important for a service manager. However, there’s a hint in what hasn’t been said: while we may need co-operation between the init process and the service manager, it’s not absolutely necessary that they are the same process. One of the ideas I’d like to investigate with Dinit is whether we can keep a very simple init process and a separate, more complex, service manager / supervisor.

Dinit progress

For the most part reactions to my announcement of Dinit were positive. One comment on Reddit wondered how I was going to be able to achieve a “solid-as-a-rock stable” system using a non-memory-safe language (C++) and without having any tests. Of course, this wasn’t quite correct; I have always had tests for Dinit, but they were not automated. One thing that I’ve done since my initial announcement is implement a small number of automated tests (that you can run using “make check”). I plan to write many more tests, but this feels like a good start. I’ll discuss the reasons for using C++ at some point, but it needs to be borne in mind that while C++ is not memory-safe it is still perfectly possible to write stable software in such a language; it just takes a little more effort!

I’ve also done a little refactoring, solved one or two minor bugs, and improved the man pages. My TODO list is slowly getting smaller and I think Dinit is approaching the stage where it can be considered a high-quality service manager, though it is a way off from being a full replacement for Systemd.

Please feel free to comment below and/or check out the source code on the Github repository.

Escape from System D

Episode I: It’s obvious, init?

How is that for a title, right? I know, I know, they want me to call it “systemd” not “System D” or “Systemd” or anything else resembling a legitimate proper noun, but that doesn’t work quite so well for a sci-fi-esque sounding movie title as does the above.

Anyway, to cut straight to the chase: I’m writing an init system. I’m not happy with Systemd’s increasing feature creep, bugs and occasional developer attitude issues, and I know I’m not the only one; however, I do want an init system / service supervision and management system that is more capable than the ancient Sys V init, and which in theory – together with a number of other pieces of software – could effectively provide a fully functional replacement for Systemd, without taking its all-or-nothing approach, and without needlessly sacrificing backwards compatibility with pre-existing tools and workflows, while being simpler both conceptually and in implementation.

Yes, there are probably already other options. I have at least briefly looked at a number of them (see here, though there are probably many that are missing from that list). In general I am not perfectly happy with any of them, which is why I’ve decided to write yet another. There may be a bit of NIH syndrome leading to this decision; that’s ok, I can live with that; partly we write software because there’s nothing else that will do the job, and partly we write it just for the heck of it. This project is always going to be in large part for the latter.

So, what are my main goals? Let’s see:

  • This will be both an init system and a service manager / process supervisor, in this sense similar to Systemd. It boots the system, runs services (and allows them to be controlled), and shuts the system down. It will be able to automatically restart services that fail, when it is sensible to do so.
  • The dependency model is simple, but effective. You should be able to express ordering between services that specifies one service requires another to have started first.
  • Simplicity in general is an explicit goal, as is ease of configuration and use. (Sometimes these conflict).
  • It will be cross-platform. At least, it should run on most POSIX systems, not just Linux.
  • It will be both efficient and maintainable.
  • It will be stable. Solid-as-a-rock stable.

That should probably do for now. I know that these could be considered lofty goals; I’ll discuss more on each point in a number of follow-up posts, and where applicable I’ll discuss differences to Systemd and other init systems and service managers, anything that will make this particular software special, and any ways in which things might be improved generally.

What I do need to say before I finish up, however, is that this software is real: it has a name – “Dinit” – and in fact it already has a good body of source code and documentation, as can be seen on the Github page. What’s more, I’m already using it to boot my own system (although I’m not going to recommend at this stage that anyone else should try to use it for that just yet).

And, oh yeah, it’s written in C++. I know a number of people are going to turn up their noses at the project just because of that; I don’t care. As far as I’m concerned the realistic alternatives at this point in time are probably C and Rust; C++ beats C hands down and Rust just isn’t, in my eyes, quite ready yet, though it may be one day (maybe even soon). I’ll discuss the precise reasons for choosing C++ (beyond “I happen to like it”, which is certainly one reason) in a future post, but ultimately arguing about programming languages isn’t something I really want to get into.

In the next episode of Escape From System D, I will discuss the role of init and service management systems and why they are often combined.


Bus1 is the new Kdbus

For some time there have been one or two developers essentially trying to move part of D-Bus into the kernel, mainly (as far as I understand) for efficiency reasons. This has so far culminated in the “Bus1” patch series – read the announcement from here (more discussion follows):

Bus1 is a local IPC system, which provides a decentralized infrastructure to share objects between local peers. The main building blocks are nodes and handles. Nodes represent objects of a local peer, while handles represent descriptors that point to a node. Nodes can be created and destroyed by any peer, and they will always remain owned by their respective creator. Handles on the other hand, are used to refer to nodes and can be passed around with messages as auxiliary data. Whenever a handle is transferred, the receiver will get its own handle allocated, pointing to the same node as the original handle.

Any peer can send messages directed at one of their handles. This will transfer the message to the owner of the node the handle points to. If a peer does not possess a handle to a given node, it will not be able to send a message to that node. That is, handles provide exclusive access management. Anyone that somehow acquired a handle to a node is privileged to further send this handle to other peers. As such, access management is transitive. Once a peer acquired a handle, it cannot be revoked again. However, a node owner can, at any time, destroy a node. This will effectively unbind all existing handles to that node on any peer, notifying each one of the destruction.

Unlike nodes and handles, peers cannot be addressed directly. In fact, peers are completely disconnected entities. A peer is merely an anchor of a set of nodes and handles, including an incoming message queue for any of those. Whether multiple nodes are all part of the same peer, or part of different peers does not affect the remote view of those. Peers solely exist as management entity and command dispatcher to local processes.

The set of actors on a system is completely decentralized. There is no global component involved that provides a central registry or discovery mechanism. Furthermore, communication between peers only involves those peers, and does not affect any other peer in any way. No global communication lock is taken. However, any communication is still globally ordered, including unicasts, multicasts, and notifications.

Ok, and maybe I’m missing something, but: replace “nodes” with “unix domain sockets” and replace “handles” with “file descriptors” and, er, haven’t we already got that? (Ok, maybe not quite exactly – but is it perhaps good enough?)


Sockets represent objects of a local peer, while descriptors represent descriptors that point to a socket. Sockets can be created and destroyed by any peer, and they will always remain owned by their respective creator.


Descriptors on the other hand, are used to refer to socket connections and can be passed around with messages as auxiliary data. Whenever a descriptor is transferred, the receiver will get its own descriptor allocated, pointing to the same socket connection as the original descriptor.

It’s already possible to pass file descriptors to another process via a socket. Technically passing a file descriptor connected to a socket gives the other peer the same connection to the socket, which is probably not conceptually identical to passing handles, which (if I understand correctly) is more like having another connection to the same socket. But could it be so hard to devise a standard protocol for requesting a file descriptor with a secondary connection, specifically so that it can be passed to another process?
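For reference, passing a descriptor over a Unix domain socket uses sendmsg and recvmsg with an SCM_RIGHTS control message. The sketch below shows the bare mechanism; send_fd and recv_fd are names chosen for this example and error handling is kept minimal.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Send descriptor `fd` over Unix domain socket `sock`. Returns true on success.
bool send_fd(int sock, int fd) {
    char byte = 0;  // one byte of ordinary data accompanies the descriptor
    iovec iov = { &byte, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof(control));

    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != nullptr);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1;
}

// Receive a descriptor sent by send_fd. Returns the new descriptor, or -1.
int recv_fd(int sock) {
    char byte;
    iovec iov = { &byte, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];

    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET
            || cmsg->cmsg_type != SCM_RIGHTS
            || cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    int fd;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number in its own table that refers to the same open file description as the sender's, which is the "same connection" behaviour described above.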

Any peer can send messages directed at one of their file descriptors. This will transfer the message to the owner of the socket the descriptor points to. If a peer does not possess a descriptor to a given socket, it will not be able to send a message to that socket. That is, descriptors provide exclusive access management. Anyone that somehow acquired a descriptor to a socket is privileged to further send this descriptor to other peers. As such, access management is transitive. Once a peer acquired a descriptor, it cannot be revoked again. However, a socket owner can, at any time, close all connections to that socket. This will effectively unbind all existing descriptors to that socket on any peer, notifying each one of the destruction

right? (except that individual connections to a socket can be “revoked” i.e. closed, which is surely an improvement if anything).

Unlike sockets and file descriptors, peers cannot be addressed directly. In fact, peers are completely disconnected entities. A peer is merely an anchor of a set of sockets and file descriptors, including an incoming message queue for any of those. Whether multiple sockets are all part of the same peer, or part of different peers does not affect the remote view of those. Peers solely exist as management entity and command dispatcher to local processes.

I suspect the only difference is that each “peer” has a single receive queue for all its nodes, rather than one per connection.

The set of actors on a system is completely decentralized. There is no global component involved that provides a central registry or discovery mechanism. Furthermore, communication between peers only involves those peers, and does not affect any other peer in any way. No global communication lock is taken. However, any communication is still globally ordered, including unicasts, multicasts, and notifications.

I think this is meant to read as “no, it’s not the D-Bus daemon functionality being subsumed in the kernel”.

But as per all above, is Bus1 really necessary at all? Is multicasting to multiple clients so common that we need a whole new IPC mechanism to make it more efficient? Does global ordering of messages to different services ever actually matter? I’m not really convinced.

Cgroups v2: resource management done even worse the second time around


It’s with some bemusement that I watch various Linux kernel developers flounder in their attempts to produce decent, process-hierarchy based, resource control. The first real attempt at this was Cgroups (for “control groups”; also known as “cgroups” due to the emerging trend of refusing to capitalise proper nouns, see also [Ss]ystemd). The original Cgroups interface is known as Cgroups-v1, and is described in the kernel documentation. The main principles are pretty straightforward: you mount a special “cgroup” file system somewhere, and create directories in it to represent control groups whose resource usage you want to limit; each directory created will automatically contain a bunch of files used to control the group. You can create nested hierarchy such that there are groups within other groups, and the nested groups share the resources of their parent group (and may be further limited). You move a process into a group by writing its PID into one of the group’s control files. A group therefore potentially contains both processes and subgroups.

The two obvious resources you might want to limit are memory and CPU time, and each of these has a “controller”, but there are potentially others (such as I/O bandwidth), and some Cgroup controllers don’t really manage resource utilisation as such (eg the “freezer” controller/subsystem). The Cgroups v1 interface allowed creating multiple hierarchies with different controllers attached to them (the value of this is dubious, but the possibility is there).

Importantly, processes inherit their cgroup membership from their parent process, and cannot move themselves out of (or into) a cgroup unless they have appropriate privileges, which means that a process cannot escape any limitations that have been imposed on it by forking. Compare this with the use of setrlimit, where a process’s use of memory (for example) can be limited using an RLIMIT_AS (address space) limitation, but the process can fork and its children can consume additional memory without drawing from the resources of the original process. With Cgroups on the other hand, a process and all its children draw resources from the containing group.

Cgroups are all very nice, in principle, but the interface was perceived to have some problems, leading to an effort to produce an improved interface – Cgroups v2 – and this is where the fun begins.

Cgroups v2

The second version of the Cgroups interface is also described in kernel documentation; the most interesting part, from the perspective of the issues that will be discussed here, is an appendix titled “R. Issues with v1 and Rationales for v2”. Let’s take a look at these issues:

1. Multiple hierarchies

The argument is that having multiple hierarchies is messy from an interface viewpoint and it also limits and/or complicates the implementation. Cgroups v2 has a “unified hierarchy” where you can enable or disable controllers at the group level (disabling a controller effectively flattens the hierarchy at that point, from the viewpoint of that particular controller only; so there are still multiple hierarchies in a sense, but they are limited to sharing the same overall structure). I think this makes sense.

2. Thread granularity

The v1 interface assigned individual threads to control groups; in v2, processes are assigned. This makes sense because the hierarchy is really a process hierarchy; processes are children of other processes, not of threads. Many resources are arbitrated at the process level (eg threads all share the same memory), and those that aren’t can be dealt with in other ways (it’s already possible to assign different processing priorities to different threads, for example).

There’s another point raised during the discussion of process-vs-thread granularity however:

cgroups were delegated to individual applications so that they can create and manage their own sub-hierarchies and control resource distributions along them. This effectively raised cgroup to the status of a syscall-like API exposed to lay programs.

I’m having a hard time understanding why this is a bad thing. Shouldn’t a process be able to limit the resource allocation to its children? Wouldn’t that be a good thing?

cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details. These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.

I think the argument here is essentially that some controllers exposed things they shouldn’t have. That may well be true, but is hardly an argument for a complete re-design of the whole interface.

3. Processes and child groups with common parent

Cgroup v1 allowed processes and sub-groups to coexist within a parent group; specifically (and this is where the bollocks kicks in):

cgroup v1 allowed threads to be in any cgroups which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.

Cgroups v2 no longer allows a group with controllers enabled (except for the root group) to contain both processes and subgroups, but this is an awkward limitation that achieves nothing (other than perhaps a slight simplification of implementation at the cost of usability). It’s completely untrue that there is “no obvious way to settle it”. Consider the resource distribution models outlined in the same document:

  1. Weights – clearly processes have the default weight (specified as 100). Child groups have an assigned weight (as they already do).
  2. Limits – processes are unlimited (except by limits inherited through ancestor groups). Child groups have the specified limit.
  3. Protections – processes have no protection (except what is inherited). Child groups have the specified resource protection.
  4. Allocations – processes have no allocation (except what is inherited). Child groups have the specified resource allocation.

In all cases above, it is completely clear how to settle competition for resources between processes and child groups.
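
To make the weights case concrete, here is a small model of the rule suggested above: every process directly in a group competes with the default weight of 100, and every child group competes with its assigned weight. This is only a sketch of the proposal in this post, not of what any kernel controller actually does; the names split_by_weight and shares are made up for illustration, and it assumes all competitors are active at once.

```cpp
#include <cassert>
#include <vector>

constexpr unsigned default_weight = 100;   // default weight, as stated above

// Hypothetical result type: fractions of the parent group's resource.
struct shares {
    double per_process;                 // fraction given to each direct process
    std::vector<double> child_groups;   // fraction given to each child group
};

// Split a parent's resource between its direct processes and its child
// groups, in proportion to their weights.
shares split_by_weight(unsigned process_count,
                       const std::vector<unsigned> &child_group_weights)
{
    unsigned long long total =
        static_cast<unsigned long long>(process_count) * default_weight;
    for (unsigned w : child_group_weights) total += w;
    assert(total > 0);   // there must be at least one competitor

    shares result;
    result.per_process =
        process_count ? static_cast<double>(default_weight) / total : 0.0;
    for (unsigned w : child_group_weights) {
        result.child_groups.push_back(static_cast<double>(w) / total);
    }
    return result;
}
```

With two processes and one child group of weight 200, the total weight is 400: each process gets a quarter and the child group gets half. Nothing about mixing the two kinds of entity makes the arithmetic ambiguous.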

(I would suggest that a nice additional feature would be the ability to specify child groups as having the combined weight of their children – an option that is not currently available, but which could potentially be useful if you don’t want to control some particular resource at the level of that child).

The io controller implicitly created a hidden leaf node for each cgroup to host the threads [… which] made the interface messy and significantly complicated the implementation.

Well, that sucks. The IO controller should have been fixed as per above (and if you want to control the resources of the immediate process children as a group, then create a group for them, duh). There’s no need for the creation of hidden leaf nodes; the processes are leaf nodes. The fact that one or two controllers did something stupid / had implementation flaws is not a good argument for prohibiting groups from containing both process nodes and child groups. What’s more:

The root cgroup is exempt from this restriction. Root contains processes and anonymous resource consumption which can’t be associated with any other cgroups and requires special treatment from most controllers. How resource consumption in the root cgroup is governed is up to each controller.

So each controller is required to support contention between processes and groups anyway. That pretty much knocks the “it simplifies the implementation” argument on the head.

This clearly is a problem which needs to be addressed from cgroup core in a uniform way.

No; it’s a problem that needs to be addressed in the controllers in a uniform way. This restriction just makes the interface clunky.

As a basic example, let’s say I have some process that wants to run a child process with resource constraints. The first step then is to create a cgroup (call it B) which is a child of my own cgroup (A). I then enable some controller for B (note that I can’t enable the controller on A, since it contains processes). I then have to create another cgroup (C) inside B, so that I can move my child process into it.

Yes, that’s right: I had to create two cgroups just to control the resource allocation for one process. And this wasn’t even necessary with the old v1 interface. Things actually got worse.

Another example: let’s say I want to run a child process, but make sure it and any children it spawns get the combined timeslice that would have been given to the one process (in other words, I want it to have a regular fair share of processor time, but I don’t want it soaking up more cycles by forking a bunch of children). I can’t actually do this at all with the v2 interface: the closest I can come is [ed: put the child in a nested group as described above and then] put some hard limit on the group’s processor allocation, but that won’t reflect the dynamically changing amount of time that could be allocated to the parent process. (I could put the parent process in a group as well, but then it doesn’t get equal distribution with its siblings). The v1 interface at least made this potentially possible (though the necessary controller was not implemented, it seems).

Other points to note

Cgroups have a mechanism to notify userspace when a group becomes empty; this has vastly improved between v1 and v2, thankfully (for v1 you had to specify a binary to be launched as a notification, urgh).

Systemd uses the properties of Cgroups to realise a form of better session management: whereas even unprivileged processes can “daemonise” to escape their process hierarchy, they cannot so easily escape a control group. Unfortunately, however, there is no way to reliably kill off a group. The approach used by Systemd is to iterate through the process ids in the group and send a signal to each one; however, this is racy – the process may have died in the meantime, and even worse, the process ID may have been recycled and given to a new process (that’s right – Systemd potentially kills off the wrong process). It would be really nice if it were possible to send a signal to an entire group atomically.
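
The iterate-and-signal approach can be sketched as follows. This is not Systemd's code, just an illustration of the pattern with hypothetical function names; it assumes the PID list has already been read as text from the group's cgroup.procs file, which lists one PID per line.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a list of PIDs, one per line (the format of "cgroup.procs").
std::vector<pid_t> read_pids(const std::string &text)
{
    std::vector<pid_t> pids;
    std::istringstream in(text);
    long pid;
    while (in >> pid) pids.push_back(static_cast<pid_t>(pid));
    return pids;
}

// Signal each listed process in turn; returns how many kill() calls succeeded.
//
// RACE: the list is only a snapshot. Between reading it and calling kill(),
// a listed process may have exited (kill fails, which is harmless), or its
// PID may have been reused by an unrelated process (which then receives the
// signal instead). New processes forked in the meantime are missed entirely.
int signal_each(const std::vector<pid_t> &pids, int signo)
{
    assert(signo >= 0);
    int delivered = 0;
    for (pid_t pid : pids) {
        if (kill(pid, signo) == 0) ++delivered;
    }
    return delivered;
}
```

Nothing in this loop can be made atomic from userspace, which is why the operation really belongs in the kernel interface.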


Cgroups v2 gives us a unified hierarchy, improved “empty group” notification, and an annoying interface quirk which makes it in many respects more complicated than the v1 interface. Two steps forward, one backwards, I guess. On the plus side, I think the issue with the v2 interface should be fixable in a 100% backwards-compatible way. Whether that will actually happen or not, time will tell.

POSIX timer APIs are borked

I’m currently working on Dasynq, an event loop library in C++ (which is not yet in a state of being ready for use by external projects, though the functionality it currently exposes does work correctly as far as I know). It has come to the point where I want to add timer functionality, and this has been frustratingly tricky, mostly due to horribly designed APIs.

There are a few basic requirements to set out before I start:

  • There are essentially two types of timer – relative and absolute. I either want the timer to expire some given interval from now, or I want it to expire at some specific (“wall clock”) time. In the latter case, if the system time is changed the timeout should be suitably adjusted. (Example: if I set an alarm for 04:00, and the system time is changed by the user from 03:25 to 04:15, the alarm should expire immediately).
  • I want to be able to be sure that I can use a timer, at some point in the future. That is, I need to be able to allocate timers in advance (without necessarily arming them immediately) or at least to be able to re-set an existing timer to a specified timeout. I should be able to avoid the situation where I need a timer, which I knew I would need in advance, but am unable to create one due to resource limits / exhaustion.
  • I need a reasonable level of resolution. Timers should be usable for everything from running weekly tasks to animation timing.

With the above points in mind, let’s take a look at what POSIX provides.

POSIX timer APIs

First, the most basic timer-like call provided by POSIX is the alarm(…) function. It has second granularity, which rules it out immediately.

Then, there’s setitimer(…). This isn’t a particularly nice interface and delivers timer expiry via a signal. There is only one timer (well, there is one timer for each of several different kinds of clock), which means that to allow for multiple timers to be managed we need to essentially multiplex the single timer; by itself this isn’t such a huge problem, but other limitations of the API make it fundamentally difficult, and the API is pretty broken to begin with in several ways.

The first problem with setitimer is that the interval timers deliver timeout events via signals, which is awkward, especially since setitimer itself is not async-signal-safe, meaning you can’t call it from within the signal handler to set the next desired timeout; if you want to multiplex multiple timeouts over the interval timer interface, you’re forced to turn asynchronous event notifications into synchronous events (which of course is what libraries like Dasynq are all about, so by itself this isn’t a huge problem).

The next problem with setitimer is that it only allows setting a relative timeout. If I want to get a timer notification at an absolute time, I need to get the current clock time (clock_gettime), calculate the time remaining until that time, and then set the timer. Not allowing an absolute timeout means that setting an alarm for a wall-clock time is pretty much impossible – since if the system time is changed by the user, the timer’s timeout interval won’t be adjusted. However, there’s a more subtle issue here: time might elapse between the calculation and setting the timer – the process could be preempted just after calculating the interval, and in unusual cases might not be scheduled again for a significant period of time. By the time it finally arms the timer, the interval is significantly incorrect. The only way to work around this (that I can think of, other than pretending that the problem doesn’t exist) is to check the clock time immediately after setting the interval, to make sure that it’s within a certain window of tolerance of the original measurement – and if not, to re-calculate the interval and reset the timer.
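
The check-and-retry workaround might look something like this. It is a sketch only: arm_itimer_until and diff_ns are hypothetical names, the tolerance is left to the caller, and it does nothing about the clock being changed after the timer has been armed (which setitimer simply cannot handle).

```cpp
#include <sys/time.h>
#include <time.h>
#include <cassert>

// Nanoseconds from `from` to `to` (negative if `to` is earlier).
static long long diff_ns(const timespec &from, const timespec &to)
{
    return (to.tv_sec - from.tv_sec) * 1000000000LL
            + (to.tv_nsec - from.tv_nsec);
}

// Arm ITIMER_REAL to expire at the absolute CLOCK_REALTIME time `deadline`.
// If more than `tolerance_ns` passed between reading the clock and arming
// the timer, the interval is recomputed and the timer is re-armed.
// Returns false if a system call fails.
bool arm_itimer_until(const timespec &deadline, long long tolerance_ns)
{
    assert(tolerance_ns > 0);
    for (;;) {
        timespec before;
        if (clock_gettime(CLOCK_REALTIME, &before) != 0) return false;

        long long remaining = diff_ns(before, deadline);
        // Already due: expire as soon as possible. A zero value would
        // disarm the timer rather than fire it immediately.
        if (remaining < 1000) remaining = 1000;

        itimerval tv = {};   // it_interval stays zero: one-shot
        tv.it_value.tv_sec = remaining / 1000000000LL;
        tv.it_value.tv_usec = (remaining % 1000000000LL) / 1000;
        if (setitimer(ITIMER_REAL, &tv, nullptr) != 0) return false;

        // Second clock read: were we delayed between measuring and arming?
        timespec after;
        if (clock_gettime(CLOCK_REALTIME, &after) != 0) return false;
        long long elapsed = diff_ns(before, after);
        if (elapsed >= 0 && elapsed <= tolerance_ns) return true;
        // Otherwise we were preempted (or the clock was stepped): try again.
    }
}
```

Note the two clock_gettime calls on every successful pass through the loop.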

The safety check just described requires a minimum of two clock_gettime calls (which generally means two calls into the kernel) just for setting a timer. If multiple timers are being managed over the top of a single interval timer, that’s going to mean that two clock_gettime calls are required on each timer expiry.
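
The multiplexing itself is the easy part. One way to do it (a sketch with made-up names, and not necessarily how Dasynq does it) is to keep pending timeouts in a min-heap keyed by deadline, and always arm the single underlying timer for the earliest one:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical multiplexer: many logical timers over one real timer.
class timer_queue {
    using entry = std::pair<std::int64_t, int>;   // (deadline in ns, timer id)
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;

public:
    void add(std::int64_t deadline_ns, int id) { heap.emplace(deadline_ns, id); }

    bool empty() const { return heap.empty(); }

    // Earliest pending deadline: what the single real timer should be set to.
    std::int64_t next_deadline() const
    {
        assert(!heap.empty());
        return heap.top().first;
    }

    // Remove and return the ids of all timers due at or before `now_ns`,
    // earliest first. Called when the real timer fires.
    std::vector<int> pop_expired(std::int64_t now_ns)
    {
        std::vector<int> ids;
        while (!heap.empty() && heap.top().first <= now_ns) {
            ids.push_back(heap.top().second);
            heap.pop();
        }
        return ids;
    }
};
```

After each expiry the caller re-arms the real timer for next_deadline(), which with setitimer means going through the measure, arm, and re-measure dance every time.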

Generalised POSIX timer interface

The real-time POSIX extensions also define timer_create, which appears to solve some of the problems above:

  • It allows the creation of multiple independent timers
  • It allows for specifying absolute (as well as relative) timeouts, and when using the realtime clock the timeout interval will be adjusted appropriately (for absolute timeouts) if the system time is altered

However, notification is still either via a signal, or via a thread (SIGEV_THREAD). The latter is problematic for implementations, because there is usually no way to detect notification failure if a thread cannot be created due to resource limits, and because it requires userspace support; on Linux you need to link with -lrt (and thus also pull in the pthreads library) to use timer_create etc, even if you don’t use SIGEV_THREAD. On OpenBSD the situation is worse – the realtime extensions are generally not supported, and timer_create et al are not available at all.

Even using timers with SIGEV_SIGNAL notifications is less than ideal. Using such timers in different threads requires cooperation between all threads, to either choose different signal numbers for notification or otherwise to have a common signal handler somehow suitably dispatch notifications to the correct thread.

Non-POSIX solutions

On Linux, timerfds (timerfd_create et al) provide an apparently sane solution to the whole messy problem: they support multiple timers, support absolute and relative timeouts, and deliver events by file handle notifications (so they can be used with select/poll/epoll etc). A large number of timers could feasibly be multiplexed over a single timerfd, which is good from a resource management perspective.
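
A minimal, Linux-only sketch of the timerfd approach follows. The helper names are mine, and only the relative case is shown; it separates creating the timer from arming it, since being able to allocate in advance was one of the requirements set out earlier.

```cpp
#include <sys/timerfd.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Allocate a timer up front, without arming it. Returns the fd, or -1.
int make_timer()
{
    return timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
}

// Arm (or re-arm) an existing timer to expire once, `ms` milliseconds from
// now. Passing TFD_TIMER_ABSTIME instead of 0 as the flags would make
// it_value an absolute time on the timer's clock.
bool arm_relative(int fd, long ms)
{
    assert(ms > 0);   // an all-zero it_value would disarm the timer
    itimerspec spec = {};   // it_interval stays zero: one-shot
    spec.it_value.tv_sec = ms / 1000;
    spec.it_value.tv_nsec = (ms % 1000) * 1000000L;
    return timerfd_settime(fd, 0, &spec, nullptr) == 0;
}

// Block until the timer fd is readable, then return the number of
// expirations since the last read (or -1 on error).
long wait_expirations(int fd)
{
    pollfd pfd = { fd, POLLIN, 0 };
    if (poll(&pfd, 1, -1) != 1) return -1;

    std::uint64_t expirations = 0;
    if (read(fd, &expirations, sizeof(expirations))
            != static_cast<ssize_t>(sizeof(expirations))) {
        return -1;
    }
    return static_cast<long>(expirations);
}
```

In a real event loop the descriptor would simply be added to the existing poll or epoll set rather than waited on by itself, and no signal handling is involved anywhere.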

On OpenBSD (and various other BSDs) there is the option of using kqueue timers. This supports multiple timers, and neatly solves the problem of notification; however it only allows relative timeouts, and the timeout/interval cannot be changed, meaning that multiple timers cannot be multiplexed over a single kqueue timer. Even worse, it is not possible to pre-allocate timers; once created, they begin countdown immediately. This makes it impossible to discover resource allocation failure until the point that the timer is actually needed.

In Conclusion

The POSIX timer APIs are awkward and clunky. The setitimer functions support only limited use cases. On the other hand, the generalised interface would be difficult to use in a library (since it either requires signal handling or multi-threading). On Linux, the timerfd interface is an ideal substitute. On other systems the general timer interface can be used, with some caveats and trade-offs, but it is not always available; on systems where it is not, and there is no system-specific replacement, the only option for wall-clock timers is to assume that the system clock does not change (other than by the usual tick) while the system is running.