Forgetting about the problem of memory

There’s a pattern that emerged in software some time ago that bothers me: in a nutshell, it has become acceptable to assume that memory is unlimited. More precisely, it is the notion that it is acceptable for a program to simply crash if memory is exhausted.

It’s easy to guess some reasons why this is the case: for one thing, memory is much more plentiful than it used to be. The first desktop computer I ever owned had 32KB of memory (I later owned a graphing calculator with the same processor and the same amount of memory as that computer). My current desktop PC, on the other hand, literally has more than a million times that amount of memory.

Given such huge amounts of memory to play with, it’s no wonder that programs are written with the assumption that they will always have memory available. After all, you couldn’t possibly ever chew up a whole 32 gigabytes of RAM with just the handful of programs that you’d typically run on a desktop, could you? Surely it’s enough that we can forget about the problem of ever running out of memory. (Many of us have found out the unfortunate truth, the hard way; in this age where simple GUI apps are sometimes bundled together with a whole web browser – and in which web browsers will happily let a web page eat up more and more memory – it certainly is possible to hit that limit).

But, for some time, various languages with garbage-collecting runtimes haven’t even exposed memory allocation failure to the application (this is not universally the case, but it’s common enough). This means that a program written in such a language will generally crash if it can’t allocate memory at some point – hopefully with a suitable error message, but by no means with any certainty of a clean shutdown.

This principle has been extended to various libraries, even for languages like C where checking for allocation failure is (at least in principle) straightforward. Glib (one of the foundation libraries underlying the GTK GUI toolkit) is one example: the g_malloc function that it provides will terminate the calling process if the requested memory can’t be allocated (a g_try_malloc function also exists, but it’s clear that the g_malloc approach to “handling” failure is considered acceptable, and any library or program built on Glib should typically be considered prone to unscheduled termination in the face of an allocation failure).
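
To illustrate the difference concretely, here is a minimal sketch in C (assuming the Glib development headers are available): plain malloc and g_try_malloc report failure to the caller, while g_malloc simply terminates the process, leaving no error path to write at all.

#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

int main(void)
{
    size_t huge = (size_t)-1 / 2;   /* a request that is almost certain to fail */

    /* Plain C: the failure is reported, and the program can decide what to do. */
    void *p = malloc(huge);
    if (p == NULL) {
        fprintf(stderr, "malloc failed; recovering gracefully\n");
    }

    /* Glib: g_try_malloc behaves like malloc, returning NULL on failure... */
    gpointer q = g_try_malloc(huge);
    if (q == NULL) {
        fprintf(stderr, "g_try_malloc failed; recovering gracefully\n");
    }

    /* ...but g_malloc terminates the whole process if the allocation fails,
       so there is no error path to write here at all. */
    gpointer r = g_malloc(huge);
    g_free(r);

    return 0;
}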

Apart from the increased availability of memory, I assume that the other reason for ignoring the possibility of allocation failure is just because it is easier. Proper error handling has traditionally been tedious, and memory allocation operations tend to be prolific; handling allocation failure can mean having to incorporate error paths, and propagate errors, through parts of a program that could otherwise be much simpler. As software gets larger, and more complex, being able to ignore this particular type of failure becomes more attractive.

The various “replacement for C” languages that have been springing up often have “easier error handling” as a feature – although they don’t always extend this to allocation failure; the Rust standard library, for example, generally takes the “panic on allocation failure” approach (I believe there has been work to offer failure-returning functions as an alternative, but even with Rust’s error-handling paradigms it is no doubt going to introduce some complexity into an application to make use of these; also, it might not be clear if Rust libraries will handle allocation failures without a panic, meaning that a developer needs to be very careful if they really want to create an application which can gracefully handle such failure).

Even beyond handling allocation failure in applications, the operating system might not expect (or even allow) applications to handle memory allocation failure. Linux, as it’s typically configured, has overcommit enabled, meaning that it will allow memory allocations to “succeed” when only address space has actually been allocated in the application; the real memory allocation occurs when the application then uses this address space by storing data into it. Since at that point there is no real way for the application to handle allocation failure, applications will be killed off by the kernel when such failure occurs (via the “OOM killer”). Overcommit can be disabled, theoretically, but to my dismay I have discovered recently that this doesn’t play well with cgroups (Linux’s resource control feature for process groups): an application in a cgroup that attempts to allocate more than the hard limit for the cgroup will generally be terminated, rather than have the allocation fail, regardless of the overcommit setting.
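
Here is a small sketch in C of what this means in practice for an application; the size used is only illustrative, and the exact behaviour depends on the overcommit setting and on how much memory and swap the system has.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* With overcommit enabled, this request will often "succeed" even if the
       system cannot actually back it with memory: only address space has
       been reserved at this point. */
    size_t size = (size_t)64 * 1024 * 1024 * 1024;  /* 64 GiB; assumed to exceed RAM + swap */
    char *p = malloc(size);
    if (p == NULL) {
        /* This is the only failure an application can handle gracefully... */
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* ...but the real allocation happens here, page by page, as the memory
       is written. If the system runs out, there is no error return; instead
       the OOM killer terminates some process (quite possibly this one). */
    memset(p, 0, size);

    free(p);
    return 0;
}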

If the kernel doesn’t properly honour allocation requests, and will kill applications without warning when memory becomes exhausted, there’s certainly an argument to be made that there’s not much point for an application to try to be resilient to allocation failure.

But is this really how it should be?

I’m concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want – but prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated? (Linux’s OOM killer uses heuristics to try and terminate the “right” process, but of course that doesn’t necessarily correspond to what the user or system administrator would want).

I’ve discussed desktops, but the problem exists on servers too, perhaps even more so; wouldn’t it be better if critical processes were able to detect and respond to memory scarcity rather than be killed off arbitrarily? Isn’t scaling back, at the application level, better than total failure, at least in some cases?

Linux could be fixed so that the OOM killer was not needed on properly configured systems, even with cgroups; in any case, there are other operating systems that reportedly have better behaviour. That would still leave the applications which don’t handle allocation failure, of course; fixing that would take (as well as a lot of work) a change in developer mindset. The thing is, while the odd application crash due to memory exhaustion probably doesn’t bother some, it certainly bothers me. Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure? Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn’t it be better if this didn’t happen?

I’d like to say no, but the current consensus would seem to be against me.


Addendum:

I tried really hard in the above to be clear how minimal a claim I was making, but there are comments that I’ve seen and discussions I’ve been embroiled in which make it clear this was not understood by at least some readers. To sum up in what is hopefully an unambiguous fashion:

  • I believe some programs – not all, not even most – in some circumstances at least, need to or should be able to reliably handle an allocation failure. This is a claim I did not think would be contentious, and I haven’t been willing to argue it as that wasn’t the intention of the piece (but see below).
  • I’m aware of plenty of arguments (of varying quality) why this doesn’t apply to all programs (or even, why it doesn’t apply to a majority of programs). I haven’t argued, or claimed, that it does.
  • I’m critical of overcommit at the operating system level, because it severely impedes the possibility of handling allocation failure at the application level.
  • I’m also critical of languages and/or libraries which make responding to allocation failure difficult or impossible. But (and admittedly, this exception wasn’t explicit in the article) if they are used for an application where termination on allocation failure is acceptable, then this criticism doesn’t apply.
  • I’m interested in exploring language and API design ideas that could make handling allocation failure easier.

The one paragraph in particular that I think could possibly have caused confusion is this one:

I’m concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want – but prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated?

Probably, there should have been emphasis on the “any” (in “any process on the system”) to make it clear what I was really saying here, and perhaps the “system as a whole is fundamentally less reliable” is unnecessary fluff.

There’s also a question in the concluding paragraph:

Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure?

This was a misstep and very much not the question I wanted to ask; I can see how it’s misleading. The right question was the one that follows it:

Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn’t it be better if this didn’t happen?

Despite those slips, I think if you read the whole article carefully the key thrust should be apparent.

For anyone wanting a case where an application really does need to be able to handle allocation failures, here’s one really good example that I recently stumbled across:

To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even if running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly in my view is not using a language that can put me in a hole. This is not theoretical, that is real scenario that we have to deal with.

The other example is service managers; I am the primary author of one (Dinit), and this is largely what got me thinking about this issue in the first place. A service manager has a system-level role, and if one dies unexpectedly it potentially leaves the whole system in an awkward state (and it’s not in general possible to recover just by restarting the service manager). In the worst case, a program running as PID 1 on Linux which terminates will cause the kernel to panic. (The OOM killer will not target PID 1, but it should still be able to handle regular allocation failure gracefully). However, I’m aware of some service manager projects written in languages that do not allow handling allocation failure, and it concerns me.

Thoughts on password prompts and secure desktop environments

I’ve been thinking a little lately about desktop security – what makes a desktop system (with a graphical interface) secure or insecure? How is desktop security supposed to work, in particular on a unix-y system (Linux or one of the BSDs, for example)?

A quite common occurrence on today’s systems is to be prompted for your password – or perhaps for “an administrator” password – when you try, from the desktop environment, to perform some action that requires extended privileges; probably the most common example would be installing a new package, another is changing system configuration such as network settings. The two cases – asking for your own password or for another one – are actually different in ways that might not initially be obvious. Let’s look at the first case: you have already logged in; your user credentials are supposedly established; why then is your password required? There is an assumption that you are allowed to perform the requested action (otherwise your ability to enter your own password should make no difference). The only reason that I see for prompting for a password, then, is to ensure that:

  1. The user sitting in the seat is still the same user who logged in, i.e. it’s not the case that another individual has taken advantage of you forgetting to log out or lock the screen before you walked away; and
  2. The action is indeed being knowingly requested by the user, and not for instance by some rogue software running in the user’s session. By prompting for a password, the system is alerting the user to the fact that a privileged action has been requested.

Both of these are clearly in the category of mitigation – the password request is designed to limit the damage/further intrusion that can be performed via an already compromised account. But are they really effective? I’m not so sure about this, particularly with current solutions, and they may introduce other problems. In particular, I find the issue of secure password entry problematic. Consider again:

  1. We ask the user to enter their password to perform certain actions
  2. We do this because we assume the account may be compromised

There’s an implicit assumption, then, that the user is able to enter their password and have it checked by some more privileged part of the system, without another process running as the same user being able to see the password (if it could see the password, it could enter it to accomplish the actions we are trying to prevent). This is only likely to be possible if the display system itself (eg the X server) is running as a different user* (though not necessarily as root), if it provides facilities to enable secure input without another process eavesdropping, and if the program requesting the password is likewise running as a separate user – otherwise, there’s little to stop a malicious actor from connecting to the relevant process with a debugger and observing all input. In that case, forcing the user to enter their password is (a) not necessarily going to prevent an attacker from performing the protected actions anyway, and, worse, (b) actually making it easier for an attacker to recover the user’s password by forcing them to enter it in contexts where it can be observed by other processes.

* Running as a different user is necessary since otherwise the process can be attached to via ptrace, eg. by a debugger. I’ll note at this point that more recent versions of Mac OS no longer allow arbitrary programs to ptrace another process; debugger executables must be signed with a certificate which gives them this privilege.

Compare this to the second case, where you must enter a separate password (eg the root password) to perform a certain action. The implicit assumption here is different: your user account doesn’t have permission to perform the action, and the allowance for entering a password is to cover the case where either (a) you actually are an administrator but are currently using an unprivileged account or (b) another, privileged, user is willing to supply their password to allow for a particular action to be invoked from your account on a one-off basis. The assumption that your account may be in the hands of a malicious actor is no longer necessary (although of course it may well still be the case).

So which is better? The first theoretically mitigates compromised user accounts, but if not done properly has little efficacy and in fact leads to potential password leakage, which is arguably an even worse outcome. The second at least has additional utility in that it can grant access to functions not available to the current user, but if used as a substitute for the first (i.e. if used routinely by a user to perform actions for which their account lacks suitable privileges) then it suffers the same problems, and is in fact worse since it potentially leaks an administrator password which isn’t tied to the compromised account.

Note that, given full compromise of an account, it would anyway be fairly trivial to pop up an authentication window in an attempt to trick the user into supplying their password. Full mitigation of this could be achieved by requiring the disciplined use of a SAK (secure attention key), which has seemingly gone out of favour (the Linux SAK support would kill the X server when pressed, which makes it useless in this context anyway). Another possibility for mitigation would be to show the user a consistent secret image or phrase when prompting them for authentication, so they knew that the request came from the system; this would ideally be done in such a way that prevented other programs from grabbing the screen or otherwise recovering the image. Again, with X currently, I believe this may be difficult or impossible, but it could be done in principle with an appropriate X extension or other modification of the X server.

To summarise, prompting the user for a password to perform certain actions only increases security if done carefully and with certain constraints. The user should be able to verify that a password request comes from the system, not an arbitrary process; additionally, no other process running with user privileges should be able to intercept password entry. Without meeting these constraints, prompting for a password accomplishes two things: first, it makes it more complex (though generally not impossible) for a compromised process to issue a command for which the user has privilege but which is behind an ask-password barrier. Secondly, it prevents an opportunistic person, who already has physical access to the machine, from issuing such commands when the real user has left their machine unattended. These are perhaps good things to achieve (I’d argue the second is largely useless), but in this case they come with a cost: inconvenience to the user, who has to enter their password more often than would otherwise be necessary, and potentially making it easier for sophisticated attackers to obtain the user’s password (or worse, that of an administrator).

Given the above, I’m thinking that current Linux desktop systems which prompt for a password to initiate certain actions are actually doing the wrong thing.

Edit: I note that Linux distributions may disallow arbitrary ptrace, and also that ptrace can be disabled via prctl() (though this seems like it would be race-prone). It’s still not clear to me that asking for a password with X is secure; I guess that XGrabKeyboard is supposed to make it so. This still leaves the possibility of displaying a fake password entry dialog, though, and tricking the user into supplying their password that way.
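
As a rough illustration of those two mitigations, here is a minimal sketch in C using Xlib. The exact effect of PR_SET_DUMPABLE and of keyboard grabs varies between systems, so treat this as an outline of the idea rather than a recipe for a secure prompt.

#include <stdio.h>
#include <sys/prctl.h>
#include <X11/Xlib.h>

int main(void)
{
    /* Mark the process non-dumpable; on Linux this also prevents other
       processes owned by the same (non-root) user from attaching via ptrace. */
    if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) != 0) {
        perror("prctl");
    }

    Display *dpy = XOpenDisplay(NULL);
    if (dpy == NULL) {
        fprintf(stderr, "cannot open display\n");
        return 1;
    }

    /* Grab the keyboard for the duration of password entry, so that other X
       clients do not see the key events. Note that this does nothing to stop
       a malicious client from displaying a convincing fake prompt. */
    if (XGrabKeyboard(dpy, DefaultRootWindow(dpy), True,
            GrabModeAsync, GrabModeAsync, CurrentTime) != GrabSuccess) {
        fprintf(stderr, "could not grab keyboard\n");
        XCloseDisplay(dpy);
        return 1;
    }

    /* ... read the password here ... */

    XUngrabKeyboard(dpy, CurrentTime);
    XCloseDisplay(dpy);
    return 0;
}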

Bad utmp implementations in Glibc and FreeBSD

I recently released another version – 0.5.0 – of Dinit, the service manager / init system. There were a number of minor improvements, including to the build system (just running “make” or “gmake” should be enough on any of the systems which have a pre-defined configuration, no need to edit mconfig by hand), but the main features of the release were S6-compatible readiness notification, and support for updating the utmp database.

At this point, I’d expect, there might be one or two readers wondering what this “utmp” database might be. On Linux you can find out easily enough via “man utmp” in the terminal:

The utmp file allows one to discover information about who is currently
using the system. There may be more users currently using the system,
because not all programs use utmp logging.

The OpenBSD man page clarifies:

The utmp file is used by the programs users(1), w(1) and who(1).

In other words, utmp is a record of who is currently logged in to the system (another file, “wtmp”, records all logins and logouts, as well as, potentially, certain system events such as reboots and time updates). This is a hint at the main motivation for having utmp support in Dinit – I wanted the “who” command to correctly report current logins (and I wanted boot time to be correctly recorded in the wtmp file).

However, when I began to implement the support for utmp and wtmp in Dinit, I also started to think about how these databases worked. I knew already that they were simply flat-file databases – i.e. each record is a fixed number of bytes, the size of the “struct utmp” structure. The files are normally readable by unprivileged users, so that utilities such as who(1) don’t need to be setuid/setgid. Updating and reading the database is done (behind the scenes) via normal file system reads and writes, via the getutent(3)/pututline(3) family of functions, their getutxent/pututxline POSIX equivalents, or by the higher-level login(3) and logout(3) functions (found in libutil; in OpenBSD, only the latter are available, the lower-level routines don’t exist).
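
As a concrete illustration, here is roughly what recording a login looks like through the glibc interfaces mentioned above. This is a sketch only: the line and user names are made up, and the caller is assumed to have sufficient privilege to write the utmp and wtmp files.

#include <paths.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>
#include <utmp.h>

/* Record a login for the given tty line, roughly as login(1) or a terminal
   emulator might. */
static void record_login(const char *line, const char *user)
{
    struct utmp ut;
    memset(&ut, 0, sizeof(ut));

    ut.ut_type = USER_PROCESS;
    ut.ut_pid = getpid();
    strncpy(ut.ut_line, line, sizeof(ut.ut_line) - 1);
    strncpy(ut.ut_user, user, sizeof(ut.ut_user) - 1);
    strncpy(ut.ut_id, line, sizeof(ut.ut_id));   /* conventionally derived from the line name */

    struct timeval tv;
    gettimeofday(&tv, NULL);
    ut.ut_tv.tv_sec = tv.tv_sec;
    ut.ut_tv.tv_usec = tv.tv_usec;

    setutent();                 /* open/rewind the utmp database */
    pututline(&ut);             /* insert or replace the entry; any locking happens inside */
    endutent();

    updwtmp(_PATH_WTMP, &ut);   /* append the same record to the wtmp log */
}

int main(void)
{
    record_login("pts/9", "someuser");   /* illustrative values only */
    return 0;
}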

I wondered: If the files consist of fixed-sized records, and are readable by regular users, how is consistency maintained? That is – how can a process ensure that, when it updates the database, it doesn’t conflict with another process also attempting to update the database at the same time? Similarly, how can a process reading an entry from the database be sure that it receives a consistent, full record and not a record which has been partially updated? (after all, POSIX allows that a write(2) call can return without having written all the requested bytes, and I’m not aware of Linux or any of the *BSDs documenting that this cannot happen for regular files). Clearly, some kind of locking is needed; a process that wants to write to or read from the database locks it first, performs its operation, and then unlocks the database. Once again, this happens under the hood, in the implementation of the getutent/pututline functions or their equivalents.

Then I wondered: if a user process is able to lock the utmp file, and this prevents updates, what’s to stop a user process from manually acquiring and then holding such a lock for a long – even practically infinite – duration? This would prevent the database from being updated, and would perhaps even prevent logins/logouts from completing. Unfortunately, the answer is – nothing; and yes, it is possible on different systems to prevent the database from being correctly updated or even to prevent all other users – including root – from logging in to the system.

Specifically:

  • On Linux with Glibc (or, I suppose, any other system with Glibc), updates to the database can be prevented completely, and logins can be delayed by 10 seconds (bug filed);
  • On FreeBSD, updates to the database can be prevented and logins prevented indefinitely (bug filed). Note that on FreeBSD the file is named “utx.active” but is otherwise the same as “utmp” on other systems. A patch was quickly put together after I filed this bug, but progress on it has seemingly stalled.

I haven’t checked all other systems but suspect that various other BSDs could be susceptible to related problems. On the other hand, some systems are immune:

  • Linux with Musl, because Musl doesn’t implement the utmp functions (though it has no-op stubs). I don’t understand why the Musl FAQ claims that you need a setuid program to update the database: it seems perfectly reasonable to simply limit modification to daemons already running as root or in a particular group. (Perhaps it is referring to having terminal emulators create utmp entries, which the Linux “utmp” manpage suggests is something that happens, though this also seems unnecessary to me).
  • OpenBSD structures the utmp file so there is one particular entry per tty device, and so avoids the need for locking (writes to the same tty entry should naturally be serialised, since they are either for login or logout). It performs no locking for reading, which leaves open the possibility of reading a partially written entry, though this is certainly a less severe problem than the ones affecting Glibc/FreeBSD.

The whole thing isn’t an issue for single-user systems, but for multiple-user systems it is more of a concern. On such systems, I’d recommend making /var/run/utmp and /var/run/wtmp (or their equivalents) readable only by the owner and group, or removing them altogether, and forgoing the ability for unprivileged users to run the “who” command. Otherwise, you risk users being able to deny logins or prevent them being recorded, as per above.

As for fixes which still allow unprivileged processes to read the database, I’ve come to the conclusion that the best option is to use locking (on a separate, root-only file) only for write operations, and live with the limitation that it is theoretically possible for a program to read a partially-updated entry; this seems unlikely to ever happen, let alone actually cause a significant problem, in practice. To completely solve the problem, you’d either need atomic read and write support on files, or a secondary mechanism for accessing the database which obviated the concurrency problem (eg access the database via communication with a running daemon which can serialize requests). Or, perhaps Musl is taking the right approach by simply excluding the functionality.
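
In outline, the write-side locking scheme I’m suggesting might look something like the following sketch. The lock file path and permissions are assumptions, and a real implementation inside libc would write the record directly rather than calling back into pututline; the point is only the locking discipline.

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>
#include <utmp.h>

/* Writers serialise on a separate lock file which unprivileged users cannot
   open; readers simply read the database without taking any lock, accepting
   the (unlikely) possibility of seeing a partially written record. */
static int write_utmp_entry(const struct utmp *ut)
{
    int lockfd = open("/run/utmp.write.lock", O_WRONLY | O_CREAT, 0600);
    if (lockfd == -1) {
        perror("open lock file");
        return -1;
    }

    if (flock(lockfd, LOCK_EX) == -1) {
        perror("flock");
        close(lockfd);
        return -1;
    }

    /* Perform the update while holding the lock. */
    setutent();
    pututline(ut);
    endutent();

    flock(lockfd, LOCK_UN);
    close(lockfd);
    return 0;
}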

Bus1 is the new Kdbus

For some time there have been one or two developers essentially trying to move part of D-Bus into the kernel, mainly (as far as I understand) for efficiency reasons. This has so far culminated in the “Bus1” patch series – read the announcement from here (more discussion follows):

Bus1 is a local IPC system, which provides a decentralized infrastructure to share objects between local peers. The main building blocks are nodes and handles. Nodes represent objects of a local peer, while handles represent descriptors that point to a node. Nodes can be created and destroyed by any peer, and they will always remain owned by their respective creator. Handles on the other hand, are used to refer to nodes and can be passed around with messages as auxiliary data. Whenever a handle is transferred, the receiver will get its own handle allocated, pointing to the same node as the original handle.

Any peer can send messages directed at one of their handles. This will transfer the message to the owner of the node the handle points to. If a peer does not posess a handle to a given node, it will not be able to send a message to that node. That is, handles provide exclusive access management. Anyone that somehow acquired a handle to a node is privileged to further send this handle to other peers. As such, access management is transitive. Once a peer acquired a handle, it cannot be revoked again. However, a node owner can, at anytime, destroy a node. This will effectively unbind all existing handles to that node on any peer, notifying each one of the destruction.

Unlike nodes and handles, peers cannot be addressed directly. In fact, peers are completely disconnected entities. A peer is merely an anchor of a set of nodes and handles, including an incoming message queue for any of those. Whether multiple nodes are all part of the same peer, or part of different peers does not affect the remote view of those. Peers solely exist as management entity and command dispatcher to local processes.

The set of actors on a system is completely decentralized. There is no global component involved that provides a central registry or discovery mechanism. Furthermore, communication between peers only involves those peers, and does not affect any other peer in any way. No global communication lock is taken. However, any communication is still globally ordered, including unicasts, multicasts, and notifications.

Ok, and maybe I’m missing something, but: replace “nodes” with “unix domain sockets” and replace “handles” with “file descriptors” and, er, haven’t we already got that? (Ok, maybe not quite exactly – but is it perhaps good enough?)

Eg:

Sockets represent objects of a local peer, while descriptors represent descriptors that point to a socket. Sockets can be created and destroyed by any peer, and they will always remain owned by their respective creator.

right?

Descriptors on the other hand, are used to refer to socket connections and can be passed around with messages as auxiliary data. Whenever a descriptor is transferred, the receiver will get its own descriptor allocated, pointing to the same socket connection as the original descriptor.

It’s already possible to pass file descriptors to another process via a socket. Technically passing a file descriptor connected to a socket gives the other peer the same connection to the socket, which is probably not conceptually identical to passing handles, which (if I understand correctly) is more like having another connection to the same socket. But could it be so hard to devise a standard protocol for requesting a file descriptor with a secondary connection, specifically so that it can be passed to another process?

Any peer can send messages directed at one of their file descriptors. This will transfer the message to the owner of the socket the descriptor points to. If a peer does not posess a descriptor to a given socket, it will not be able to send a message to that socket. That is, descriptors provide exclusive access management. Anyone that somehow acquired a descriptor to a socket is privileged to further send this descriptor to other peers. As such, access management is transitive. Once a peer acquired a descriptor, it cannot be revoked again. However, a socket owner can, at anytime, close all connections to that socket. This will effectively unbind all existing descriptors to that socket on any peer, notifying each one of the destruction

right? (except that individual connections to a socket can be “revoked” i.e. closed, which is surely an improvement if anything).

Unlike sockets and file descriptors, peers cannot be addressed directly. In fact, peers are completely disconnected entities. A peer is merely an anchor of a set of sockets and file descriptors, including an incoming message queue for any of those. Whether multiple sockets are all part of the same peer, or part of different peers does not affect the remote view of those. Peers solely exist as management entity and command dispatcher to local processes.

I suspect the only difference is that each “peer” has a single receive queue for all its nodes, rather than one per connection.

The set of actors on a system is completely decentralized. There is no global component involved that provides a central registry or discovery mechanism. Furthermore, communication between peers only involves those peers, and does not affect any other peer in any way. No global communication lock is taken. However, any communication is still globally ordered, including unicasts, multicasts, and notifications.

I think this is meant to read as “no, it’s not the D-Bus daemon functionality being subsumed in the kernel”.

But as per all above, is Bus1 really necessary at all? Is multicasting to multiple clients so common that we need a whole new IPC mechanism to make it more efficient? Does global ordering of messages to different services ever actually matter? I’m not really convinced.
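
For reference, here’s a minimal sketch of the existing mechanism referred to above – passing an open file descriptor to another process over a unix domain socket, using sendmsg() with SCM_RIGHTS ancillary data:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send 'fd_to_send' over the (already connected) unix domain socket 'sock'.
   The receiver ends up with its own descriptor referring to the same open
   file description. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = 'x';                    /* at least one byte of real data must be sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } cmsgbuf;
    memset(&cmsgbuf, 0, sizeof(cmsgbuf));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf.buf;
    msg.msg_controllen = sizeof(cmsgbuf.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;        /* "the ancillary data is file descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}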

Cgroups v2: resource management done even worse the second time around

Background

Some Linux kernel developers have been attempting to produce decent, process-hierarchy based, resource control. The first real attempt at this was Cgroups (for “control groups”; also known as “cgroups” due to the emerging trend of refusing to capitalise proper nouns, see also [Ss]ystemd). The original Cgroups interface is known as Cgroups-v1, and is described in the kernel documentation. The main principles are pretty straightforward: you mount a special “cgroup” file system somewhere, and create directories in it to represent control groups whose resource usage you want to limit; each directory created will automatically contain a bunch of files used to control the group. You can create a nested hierarchy such that there are groups within other groups, and the nested groups share the resources of their parent group (and may be further limited). You move a process into a group by writing its PID into one of the group’s control files. A group therefore potentially contains both processes and subgroups.

The two obvious resources you might want to limit are memory and CPU time, and each of these has a “controller”, but there are potentially others (such as I/O bandwidth), and some Cgroup controllers don’t really manage resource utilisation as such (eg the “freezer” controller/subsystem). The Cgroups v1 interface allowed creating multiple hierarchies with different controllers attached to them (the value of this is dubious, but the possibility is there).
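
To make the mechanics concrete, here’s a minimal sketch in C using the v1 memory controller. It assumes the memory hierarchy is already mounted at /sys/fs/cgroup/memory (the conventional location) and that the caller has permission to create groups there.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a short string to a cgroup control file. */
static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1) return -1;
    ssize_t r = write(fd, val, strlen(val));
    close(fd);
    return r == (ssize_t)strlen(val) ? 0 : -1;
}

int main(void)
{
    /* Creating a group is just creating a directory in the mounted
       hierarchy; the control files inside appear automatically. */
    mkdir("/sys/fs/cgroup/memory/demo", 0755);

    /* Limit the group to 64 MiB of memory. */
    write_str("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes", "67108864");

    /* Move the current process into the group by writing its PID into the
       group's "tasks" file (in v2 the equivalent file is "cgroup.procs"). */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_str("/sys/fs/cgroup/memory/demo/tasks", pid);

    /* From here on, this process and anything it forks draw from the
       group's 64 MiB allowance. */
    return 0;
}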

Importantly, processes inherit their cgroup membership from their parent process, and cannot move themselves out of (or into) a cgroup unless they have appropriate privileges, which means that a process cannot escape any limitations that have been imposed on it by forking. Compare this with the use of setrlimit, where a process’s use of memory (for example) can be limited using an RLIMIT_AS (address space) limitation, but the process can fork and its children can consume additional memory without drawing from the resources of the original process. With Cgroups on the other hand, a process and all its children draw resources from the containing group.
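
The contrast can be seen in a small sketch (illustrative only): the RLIMIT_AS limit set below is a per-process limit, so each forked child gets its own separate allowance, whereas a cgroup memory limit is charged against the group containing the parent and all of its descendants.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    /* Limit this process to 256 MiB of address space. */
    struct rlimit rl = { .rlim_cur = 256UL * 1024 * 1024,
                         .rlim_max = 256UL * 1024 * 1024 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* The child inherits the same *per-process* limit, but its usage is
       counted separately; by forking repeatedly, a process tree can consume
       far more than 256 MiB in total. */
    if (fork() == 0) {
        void *p = malloc(200UL * 1024 * 1024);   /* charged to the child alone */
        (void)p;
        _exit(0);
    }

    return 0;
}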

Cgroups are all very nice, in principle, but the interface was perceived to have some problems, leading to an effort to produce an improved interface – Cgroups v2 – and this is where the fun begins.

Cgroups v2

The second version of the Cgroups interface is also described in kernel documentation; the most interesting part from the perspective of the issues that will be discussed here is in an appendix, titled “R. Issues with v1 and Rationales for v2”. Let’s take a look at these issues:

1. Multiple hierarchies

The argument is that having multiple hierarchies is messy from an interface viewpoint and it also limits and/or complicates the implementation. Cgroups v2 has a “unified hierarchy” where you can enable or disable controllers at the group level (disabling a controller effectively flattens the hierarchy at that point, from the viewpoint of that particular controller only; so there are still multiple hierarchies in a sense, but they are limited to sharing the same overall structure). I think this makes sense.

2. Thread granularity

The v1 interface assigned individual threads to control groups; in v2, processes are assigned. This makes sense because the hierarchy is really a process hierarchy; processes are children of other processes, not of threads. Many resources are arbitrated at the process level (eg threads all share the same memory), and those that aren’t can be dealt with in other ways (it’s already possible to assign different processing priorities to different threads, for example).

There’s another point raised during the discussion of process-vs-thread granularity however:

cgroups were delegated to individual applications so that they can create and manage their own sub-hierarchies and control resource distributions along them. This effectively raised cgroup to the status of a syscall-like API exposed to lay programs.

I’m having a hard time understanding why this is a bad thing. Shouldn’t a process be able to limit the resource allocation to its children? Wouldn’t that be a good thing?

cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details. These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.

I think the argument here is essentially that some controllers exposed things they shouldn’t have. That may well be true, but is hardly an argument for a complete re-design of the whole interface.

3. Processes and child groups with common parent

Cgroup v1 allowed processes and sub-groups to coexist within a parent group; specifically (and this is where the bollocks kicks in):

cgroup v1 allowed threads to be in any cgroups which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.

Cgroups v2 no longer allows a group with controllers enabled (except for the root group) to contain both processes and subgroups, but this is an awkward limitation that achieves nothing (other than perhaps a slight simplification of implementation at the cost of usability). It’s completely untrue that there is “no obvious way to settle it”. Consider the resource distribution models outlined in the same document:

  1. Weights – clearly processes have the default weight (specified as 100). Child groups have an assigned weight (as they already do).
  2. Limits – processes are unlimited (except by limits inherited through ancestor groups). Child groups have the specified limit.
  3. Protections – processes have no protection (except what is inherited). Child groups have the specified resource protection.
  4. Allocations – processes have no allocation (except what is inherited). Child groups have the specified resource allocation.

In all cases above, it is completely clear how to settle competition for resources between processes and child groups.

(I would suggest that a nice additional feature would be the ability to specify child groups as having the combined weight of their children  – an option that is not currently available, but which could potentially be useful if you don’t want to control some particular resource at the level of that child).

The io controller implicitly created a hidden leaf node for each cgroup to host the threads [… which] made the interface messy and significantly complicated the implementation.

Well, that sucks. The IO controller should have been fixed as per above (and if you want to control the resources of the immediate process children as a group, then create a group for them, duh). There’s no need for the creation of hidden leaf nodes; the processes are leaf nodes. The fact that one or two controllers did something stupid / had implementation flaws is not a good argument for prohibiting groups from containing both process nodes and child groups. What’s more:

The root cgroup is exempt from this restriction. Root contains processes and anonymous resource consumption which can’t be associated with any other cgroups and requires special treatment from most controllers. How resource consumption in the root cgroup is governed is up to each controller.

So each controller is required to support contention between processes and groups anyway. That pretty much knocks the “it simplifies the implementation” argument on the head.

This clearly is a problem which needs to be addressed from cgroup core in a uniform way.

No; it’s a problem that needs to be addressed in the controllers in a uniform way. This restriction just makes the interface clunky.

As a basic example, let’s say I have some process that wants to run a child process with resource constraints. The first step then is to create a cgroup (call it B) which is a child of my own cgroup (A). I then enable some controller for B (note that I can’t enable the controller on A, since it contains processes). I then have to create another cgroup (C) inside B, so that I can move my child process into it.

Yes, that’s right: I had to create two cgroups just to control the resource allocation for one process. And this wasn’t even necessary with the old v1 interface. Things actually got worse.

Another example: let’s say I want to run a child process, but make sure it and any children it spawns get the combined timeslice that would have been given to the one process (in other words, I want it to have a regular fair share of processor time, but I don’t want it soaking up more cycles by forking a bunch of children). I can’t actually do this at all with the v2 interface: the closest I can come is [ed: put the child in a nested group as described above and then] put some hard limit on the group’s processor allocation, but that won’t reflect the dynamically changing amount of time that could be allocated to the parent process. (I could put the parent process in a group as well, but then it doesn’t get equal distribution with its siblings). The v1 interface at least made this potentially possible (though the necessary controller was not implemented, it seems).

Other points to note

Cgroups have a mechanism to notify userspace when a group becomes empty; this has vastly improved between v1 and v2, thankfully (for v1 you had to specify a binary to be launched as a notification, urgh).

Systemd uses the properties of Cgroups to realise a form of better session management: whereas even unprivileged processes can “daemonise” to escape their process hierarchy, they cannot so easily escape a control group. Unfortunately, however, there is no way to reliably kill off a group. The approach used by Systemd is to iterate through the process ids in the group and send a signal to each one; however, this is racy – the process may have died in the meantime, and even worse, the process ID may have been recycled and given to a new process (that’s right – Systemd potentially kills off the wrong process). It would be really nice if it were possible to send a signal to an entire group atomically.

Summary

Cgroups v2 gives us a unified hierarchy, improved “empty group” notification, and an annoying interface quirk which makes it in many respects more complicated than the v1 interface. Two steps forward, one backwards, I guess. On the plus side, I think the issue with the v2 interface should be fixable in a 100% backwards-compatible way. Whether that will actually happen or not, time will tell.

POSIX timer APIs are borked

I’m currently working on Dasynq, an event loop library in C++ (which is not yet in a state of being ready for use by external projects, though the functionality it currently exposes does work correctly as far as I know). It has come to the point where I want to add timer functionality, and this has been frustratingly tricky, mostly due to horribly designed APIs.

There are a few basic requirements to set out before I start:

  • There are essentially two types of timer – relative and absolute. I either want the timer to expire some given interval from now, or I want it to expire at some specific (“wall clock”) time. In the latter case, if the system time is changed the timeout should be suitably adjusted. (Example: if I set an alarm for 04:00, and the system time is changed by the user from 03:25 to 04:15, the alarm should expire immediately).
  • I want to be able to be sure that I can use a timer, at some point in the future. That is, I need to be able to allocate timers in advance (without necessarily arming them immediately) or at least to be able to re-set an existing timer to a specified timeout. I should be able to avoid the situation where I need a timer, which I knew I would need in advance, but am unable to create one due to resource limits / exhaustion.
  • I need a reasonable level of resolution. Timers should be usable for everything from running weekly tasks to animation timing.

With the above points in mind, let’s take a look at what POSIX provides.

POSIX timer APIs

First, the most basic timer-like call provided by POSIX is the alarm(…) function. It has second granularity, which rules it out immediately.

Then, there’s setitimer(…). This isn’t a particularly nice interface and delivers timer expiry via a signal. There is only one timer (well, there is one timer for each of several different kinds of clock), which means that to allow for multiple timers to be managed we need to essentially multiplex the single timer; by itself this isn’t such a huge problem, but other limitations of the API make it fundamentally difficult, and the API is pretty broken to begin with in several ways.

The first problem with setitimer is that the interval timers deliver timeout events via signals, which is awkward, especially since setitimer itself is not async-signal-safe, meaning you can’t call it from within the signal handler to set the next desired timeout; if you want to multiplex multiple timeouts over the interval timer interface, you’re forced to turn asynchronous event notifications into synchronous events (which of course is what libraries like Dasynq are all about, so by itself this isn’t a huge problem).

The next problem with setitimer is that it only allows setting a relative timeout. If I want to get a timer notification at an absolute time, I need to get the current clock time (clock_gettime), calculate the time remaining until that time, and then set the timer. Not allowing an absolute timeout means that setting an alarm for a wall-clock time is pretty much impossible – since if the system time is changed by the user, the timer’s timeout interval won’t be adjusted. [Edit: removed discussion of what I think was actually an imagined problem].

Generalised POSIX timer interface

The real-time POSIX extensions also define timer_create, which appears to solve some of the problems above:

  • It allows the creation of multiple independent timers
  • It allows for specifying absolute (as well as relative) timeouts, and when using the realtime clock the timeout interval will be adjusted appropriately (for absolute timeouts) if the system time is altered

However, notification is still either via a signal, or via a thread (SIGEV_THREAD); the latter is problematic for implementations, because there is usually no way to detect notification failure if a thread cannot be created due to resource limits, and because it requires userspace support; on Linux you need to link with -lrt (and thus also pull in the pthreads library) to use timer_create etc, even if you don’t use SIGEV_THREAD. On OpenBSD the situation is worse – the realtime extensions are generally not supported, and timer_create et al are not available at all.

Even using timers with SIGEV_SIGNAL notifications is less than ideal. Using such timers in different threads requires cooperation between all threads, to either choose different signal numbers for notification or otherwise to have a common signal handler somehow suitably dispatch notifications to the correct thread.
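
For completeness, here’s roughly what using timer_create with an absolute timeout and signal notification looks like (a sketch only; on Linux it needs to be linked with -lrt, as noted above):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t expired = 0;

static void on_timer(int sig)
{
    (void)sig;
    expired = 1;
}

int main(void)
{
    /* Deliver expiry via a real-time signal. */
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_timer;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;

    timer_t timer;
    if (timer_create(CLOCK_REALTIME, &sev, &timer) != 0) {
        perror("timer_create");
        return 1;
    }

    /* Arm the timer with an absolute expiry five seconds from now; because
       TIMER_ABSTIME is used with CLOCK_REALTIME, the expiry is adjusted if
       the system clock is changed. */
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = now.tv_sec + 5;
    timer_settime(timer, TIMER_ABSTIME, &its, NULL);

    while (!expired) {
        pause();
    }
    puts("timer expired");

    timer_delete(timer);
    return 0;
}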

Non-POSIX solutions

On Linux, timerfds (timerfd_create et al) provide an apparently sane solution to the whole messy problem – the interface supports multiple timers and both absolute and relative timeouts, and delivers events via file descriptors (which can be used with select/poll/epoll etc). A large number of timers could feasibly be multiplexed over a single timerfd, which is good from a resource management perspective.
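
A minimal sketch of the timerfd approach: create the descriptor, arm it with an absolute expiry on the realtime clock, and then simply wait for it to become readable alongside any other event sources.

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int tfd = timerfd_create(CLOCK_REALTIME, TFD_CLOEXEC);
    if (tfd == -1) {
        perror("timerfd_create");
        return 1;
    }

    /* Arm the timer with an absolute expiry five seconds from now; with
       TFD_TIMER_ABSTIME and CLOCK_REALTIME, the expiry tracks changes to
       the system clock. */
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);

    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = now.tv_sec + 5;
    timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);

    /* The timer is just another file descriptor, so it multiplexes with
       everything else via poll/epoll/select. */
    struct pollfd pfd = { .fd = tfd, .events = POLLIN };
    if (poll(&pfd, 1, -1) == 1) {
        uint64_t expirations;
        read(tfd, &expirations, sizeof(expirations));   /* count of expiries since last read */
        printf("timer expired %llu time(s)\n", (unsigned long long)expirations);
    }

    close(tfd);
    return 0;
}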

On OpenBSD (and various other BSDs) there is the option of using kqueue timers. This supports multiple timers, and neatly solves the problem of notification; however it only allows relative timeouts, and the timeout/interval cannot be changed, meaning that multiple timers cannot be multiplexed over a single kqueue timer. Even worse, it is not possible to pre-allocate timers; once created, they begin countdown immediately. This makes it impossible to discover resource allocation failure until the point that the timer is actually needed.
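
And the kqueue equivalent, for comparison (a sketch; note that, as described above, the countdown starts as soon as the timer is registered, so there is no way to allocate a timer in advance without arming it):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>

int main(void)
{
    int kq = kqueue();
    if (kq == -1) {
        perror("kqueue");
        return 1;
    }

    /* Register a one-shot timer, identified by an arbitrary ident (1 here),
       firing after 5000 milliseconds. It begins counting down immediately. */
    struct kevent change;
    EV_SET(&change, 1, EVFILT_TIMER, EV_ADD | EV_ONESHOT, 0, 5000, 0);
    if (kevent(kq, &change, 1, NULL, 0, NULL) == -1) {
        perror("kevent (add)");
        return 1;
    }

    /* Wait for the timer (or any other registered event) to fire. */
    struct kevent event;
    if (kevent(kq, NULL, 0, &event, 1, NULL) == 1 && event.filter == EVFILT_TIMER) {
        printf("timer %lu expired\n", (unsigned long)event.ident);
    }

    return 0;
}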

In Conclusion

The POSIX timer APIs are awkward and clunky. The setitimer functions support only limited use cases, while the generalised interface would be difficult to use in a library (since it either requires signal handling or multi-threading). On Linux, the timerfd interface is an ideal substitute. On other systems the general timer interface can be used, with some caveats and trade-offs, but it is not always available; on systems where it is not, and where there is no system-specific replacement, the only option for wall-clock timers is to assume that the system clock does not change (other than by the usual tick) while the system is running.