Forgetting about the problem of memory

There’s a pattern that emerged in software some time ago that bothers me: in a nutshell, it is that it’s become acceptable to assume that memory is unlimited. More precisely, it is the notion that it is acceptable for a program to crash if memory is exhausted.

It’s easy to guess some reasons why this is the case: for one thing, memory is much more prevalent than it used to be. The first desktop computer I ever owned had 32kb of memory (I later owned a graphing calculator with the same processor and same amount of memory as that computer). My current desktop PC, on the other hand, literally has more than a million times that amount of memory.

Given such huge amounts of memory to play with, it’s no wonder that programs are written with the assumption that they will always have memory available. After all, you couldn’t possibly ever chew up a whole 32 gigabytes of RAM with just the handful of programs that you’d typically run on a desktop, could you? Surely it’s enough that we can forget about the problem of ever running out of memory. (Many of us have found out the unfortunate truth, the hard way; in this age where simple GUI apps are sometimes bundled together with a whole web browser – and in which web browsers will happily let a web page eat up more and more memory – it certainly is possible to hit that limit).

But, for some time, various languages with garbage-collecting runtimes haven’t even exposed memory allocation failure to the application running on top of them (this is not universally the case, but it’s common enough). This means that a program written in such a language that can’t allocate memory at some point will generally crash – hopefully with a suitable error message, but by no means with any sort of certainty of a clean shutdown.

This principle has been extended to various libraries even for languages like C, where checking for allocation failure is (at least in principle) straight-forward. Glib (one of the foundation libraries underlying the GTK GUI toolkit) is one example: the g_malloc function that it provides will terminate the calling process if requested memory can’t be allocated (a g_try_malloc function also exists, but it’s clear that the g_malloc approach to “handling” failure is considered acceptable, and any library or program built on Glib should typically be considered prone to unscheduled termination in the face of an allocation failure).
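To illustrate the difference, here’s a minimal sketch (assuming GLib is available; the allocate_buffer helper is hypothetical, purely for illustration):

#include <glib.h>

// Hypothetical helper, for illustration only. g_malloc() terminates the
// process if the allocation can't be satisfied; g_try_malloc() instead
// returns NULL and leaves the decision to the caller.
static gchar *allocate_buffer(gsize size)
{
    gchar *buf = static_cast<gchar *>(g_try_malloc(size));
    if (buf == nullptr) {
        // We get the chance to degrade gracefully here - free caches,
        // refuse the request, report an error to the user - rather than
        // having the process terminated on our behalf.
        return nullptr;
    }
    return buf;
    // (With plain g_malloc(size), this function could never return NULL -
    // failure would already have killed the process.)
}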

Apart from the increased availability of memory, I assume that the other reason for ignoring the possibility of allocation failure is just because it is easier. Proper error handling has traditionally been tedious, and memory allocation operations tend to be prolific; handling allocation failure can mean having to incorporate error paths, and propagate errors, through parts of a program that could otherwise be much simpler. As software gets larger, and more complex, being able to ignore this particular type of failure becomes more attractive.

The various “replacement for C” languages that have been springing up often have “easier error handling” as a feature – although they don’t always extend this to allocation failure; the Rust standard library, for example, generally takes the “panic on allocation failure” approach. (I believe there has been work to offer failure-returning functions as an alternative, but even with Rust’s error-handling paradigms, making use of these will no doubt introduce some complexity into an application; it’s also not clear whether Rust libraries will handle allocation failure without a panic, meaning that a developer needs to be very careful if they really want to create an application which can gracefully handle such failure.)

Even beyond handling allocation failure in applications, the operating system might not expect (or even allow) applications to handle memory allocation failure. Linux, as it’s typically configured, has overcommit enabled, meaning that it will allow memory allocations to “succeed” when only address space has actually been allocated in the application; the real memory allocation occurs when the application then uses this address space by storing data into it. Since at that point there is no real way for the application to handle allocation failure, applications will be killed off by the kernel when such failure occurs (via the “OOM killer”). Overcommit can be disabled, theoretically, but to my dismay I have discovered recently that this doesn’t play well with cgroups (Linux’s resource control feature for process groups): an application in a cgroup that attempts to allocate more than the hard limit for the cgroup will generally be terminated, rather than have the allocation fail, regardless of the overcommit setting.
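As a rough sketch of why overcommit takes the matter out of the application’s hands (assuming a 64-bit Linux system; the exact behaviour depends on the overcommit setting and on available memory and swap):

#include <cstdlib>
#include <cstring>
#include <cstdio>

int main()
{
    // Request an amount of memory that (we assume, for illustration) far
    // exceeds what the machine can actually back with RAM and swap:
    const size_t huge = size_t(1024) * 1024 * 1024 * 1024;  // 1 TiB

    char *p = static_cast<char *>(std::malloc(huge));
    if (p == nullptr) {
        // With overcommit disabled, we could plausibly end up here and
        // respond sensibly.
        std::fputs("allocation refused; scaling back...\n", stderr);
        return 1;
    }

    // With overcommit enabled, malloc() typically "succeeds" - only address
    // space has been reserved. The real commitment happens as pages are
    // touched, and if memory runs out at that point the process is liable
    // to be killed by the OOM killer, with no error it can respond to:
    std::memset(p, 1, huge);

    std::free(p);
    return 0;
}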

If the kernel doesn’t properly honour allocation requests, and will kill applications without warning when memory becomes exhausted, there’s certainly an argument to be made that there’s not much point for an application to try to be resilient to allocation failure.

But is this really how it should be?

I’m concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want – but, prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated? (Linux’s OOM killer uses heuristics to try and terminate the “right” process, but of course that doesn’t necessarily correspond to what the user or system administrator would want).

I’ve discussed desktops, but this is just as much a problem on servers, perhaps more so; wouldn’t it be better if critical processes were able to detect and respond to memory scarcity rather than be killed off arbitrarily? Isn’t scaling back, at the application level, better than total failure, at least in some cases?

Linux could be fixed so that the OOM killer was not needed on properly configured systems, even with cgroups; and in any case there are other operating systems that, reportedly, have better behaviour. That would still leave the applications which don’t handle allocation failure, of course; fixing that would take (as well as a lot of work) a change in developer mindset. The thing is, while the odd application crash due to memory exhaustion probably doesn’t bother some, it certainly bothers me. Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure? Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn’t it be better if this didn’t happen?

I’d like to say no, but the current consensus would seem to be against me.


Addendum:

I tried really hard in the above to be clear how minimal a claim I was making, but there are comments that I’ve seen and discussions I’ve been embroiled in which make it clear this was not understood by at least some readers. To sum up in what is hopefully an unambiguous fashion:

  • I believe some programs – not all, not even most – in some circumstances at least, need to or should be able to reliably handle an allocation failure. This is a claim I did not think would be contentious, and I haven’t been willing to argue it as that wasn’t the intention of the piece (but see below).
  • I’m aware of plenty of arguments (of varying quality) why this doesn’t apply to all programs (or even, why it doesn’t apply to a majority of programs). I haven’t argued, or claimed, that it does.
  • I’m critical of overcommit at the operating system level, because it severely impedes the possibility of handling allocation failure at the application level.
  • I’m also critical of languages and/or libraries which make responding to allocation failure difficult or impossible. But (and admittedly, this exception wasn’t explicit in the article) if used for an application where termination on allocation failure is acceptable, then this criticism doesn’t apply.
  • I’m interested in exploring language and API design ideas that could make handling allocation failure easier.

The one paragraph in particular that I think could possibly have caused confusion is this one:

I’m concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want – but, prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated?

Probably, there should have been emphasis on the “any” (in “any process on the system”) to make it clear what I was really saying here, and perhaps the “system as a whole is fundamentally less reliable” phrase is unnecessary fluff.

There’s also a question in the concluding paragraph:

Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure?

This was a misstep and very much not the question I wanted to ask; I can see how it’s misleading. The right question was the one that follows it:

Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn’t it be better if this didn’t happen?

Despite those slips, I think if you read the whole article carefully the key thrust should be apparent.

For anyone wanting a case where an application really does need to be able to handle allocation failures, I recently stumbled across one really good example:

To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even if running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly in my view is not using a language that can put me in a hole. This is not theoretical, that is real scenario that we have to deal with.

The other example is service managers – I am the primary author of one (Dinit), which is largely what got me thinking about this issue in the first place. A service manager has a system-level role and if one dies unexpectedly it potentially leaves the whole system in an awkward state (and it’s not in general possible to recover just by restarting the service manager). In the worst case, a program running as PID 1 on Linux which terminates will cause the kernel to panic. (The OOM killer will not target PID 1, but it still should be able to handle regular allocation failure gracefully). However, I’m aware of some service manager projects written using languages that will not allow handling allocation failure, and it concerns me.

Hammers and nails, and operator overloads

A response to “Spooky action at a distance” by Drew DeVault.

As Abraham Maslow said in 1966, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

Wikipedia, “Law of the Instrument”

Our familiarity with particular tools, and the ways in which they work, predisposes us in our judgement of others. This is true also with programming languages; one who is familiar with a particular language, but not another, might tend to judge the latter unfavourably based on a perceived lack of functionality or features found in the former. Of course, it might turn out that such a lack is not really important, because there is another way to achieve the same result without that feature; what we should really focus on is exactly that, the end result, not the feature.

Drew DeVault, in his blog post “Spooky action at a distance”, makes the opposite error: he takes a particular feature found in other languages, specifically, operator overloading, and claims that it leads to difficulty in understanding (various aspects of the relevant) code:

The performance characteristics, consequences for debugging, and places to look for bugs are considerably different than the code would suggest on the surface

Yes, in a language with operator overloading, an expression involving an operator may effectively resolve to a function call. DeVault calls this “spooky action” and refers to some (otherwise undefined) “distance” between an operator and its behaviour (hence “at a distance”, from his title).
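That much is true, and it’s worth being concrete about what it means. In the following sketch (using a hypothetical vec2 type, not anything from DeVault’s post), the “+” in the final expression is nothing more than a call to the overloaded function:

// A hypothetical two-component vector type, purely for illustration.
struct vec2 {
    double x, y;
};

// Overloading "+" for vec2: this is an ordinary function definition...
vec2 operator+(const vec2 &a, const vec2 &b)
{
    return { a.x + b.x, a.y + b.y };
}

int main()
{
    vec2 p { 1.0, 2.0 }, q { 3.0, 4.0 };
    // ...and this expression resolves to a call of that function,
    // equivalent to writing operator+(p, q):
    vec2 r = p + q;
    return static_cast<int>(r.x + r.y);  // 10
}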

DeVault’s hammer, then, is called “C”. And if another language offers greater capability for abstraction than C does, that is somehow “spooky”; code written that way is a bent nail, so to speak.

Let’s look at his follow-up example about strings:

Also consider if x and y are strings: maybe “+” means concatenation? Concatenation often means allocation, which is a pretty important side-effect to consider. Are you going to thrash the garbage collector by doing this? Is there a garbage collector, or is this going to leak? Again, using C as an example, this case would be explicit:

I wonder about the point of the question “is there a garbage collector, or is this going to leak?” – does DeVault really think that the presence or absence of a garbage collector can be implicit in a one-line code sample? Presumably, furthermore, he does not really believe that lack of a garbage collector would necessitate a leak, although that’s implied by the unfortunate phrasing. Ironically, the C code he then provides for concatenating strings does leak – there’s no deallocation performed at all (nor is there any checking for allocation failure, potentially causing undefined behaviour when the following lines execute).

Taking C++, we could write the string concatenation example as:

std::string newstring = x + y;

Now look again at the questions DeVault posed. First, does the “+” mean concatenation? It’s true that this is not certain from this one line of code alone, since in fact it depends on the types of x and y, but there is a good chance it does, and in any case we can tell by looking at the surrounding code – which of course we need to do anyway in order to truly understand what this code is doing (and why), regardless of what language it is written in. I’ll add that even if it does turn out to be difficult to determine the types of the operands from inspecting the immediately surrounding code, this is probably an indication of badly written (or badly documented) code*.

Any C++ systems programmer, with only a modest amount of experience, would also almost certainly know that string concatenation may involve heap allocation. There’s no garbage collector (although C++ allows for one, it is optional, and I’m not aware of any implementations that provide one). True, there’s still no check for allocation failure, though here it would throw an exception and most likely lead to (defined) imminent program termination instead of undefined behaviour. (Yes, the C code most likely would also terminate the program immediately if the allocation failed; but technically this is not guaranteed, and a C programmer should know not to assume that undefined behaviour in a C program will actually behave in some particular way, even if they believe they know how their code should be translated by the compiler).
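To be clear about what that looks like at the language level, here’s a minimal sketch (the concat_or_empty helper and its return-an-empty-string policy are mine, purely for illustration):

#include <string>
#include <new>

// Allocation failure during concatenation surfaces as a std::bad_alloc
// exception, which the caller may catch if it wants to do something other
// than terminate.
std::string concat_or_empty(const std::string &x, const std::string &y)
{
    try {
        return x + y;
    }
    catch (const std::bad_alloc &) {
        // Memory exhausted: respond however is appropriate for the
        // application, rather than invoking undefined behaviour.
        return std::string();
    }
}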

So, we reduced the several-line C example to a single line, which is straight-forward to read and understand, and for which we do in fact have ready answers to the questions posed by DeVault (who seems to be taking the tack that the supposed difficulty of answering these questions contributes to a case against operator overloading).

Importantly, there’s also no memory leak, unlike in the C code, since the string destructor will perform any necessary deallocation. Would the destructor call (occurring when the string goes out of scope) also count as “spooky action at a distance”? I guess that it should, according to DeVault’s definition, although that is a bit too fuzzy to be sure. Is this “spooky action” problematic? No, it’s downright helpful. It’s also not really spooky, since as a C++ programmer, we expect it.

It’s true that C’s limitations often force code to be written in such a way that low-level details are exposed, and that this can make it easier to follow control flow, since everything is explicit. In particular, lack of user-defined operator overloading, combined with lack of function overloading, means that types often become explicit when variables are used (the argument to strlen is, presumably, a string). But it’s easy to argue – and I do – that this doesn’t really matter. Abstractions such as operator overloading exist for a reason; in many cases they aid in code comprehension, and they don’t really obscure details (such as allocation) in the way that DeVault suggests they do.

As a counter-example to DeVault’s first point, consider:

x + foo()

This is a very brief line of C code, but now we can’t say whether it performs allocation, nor talk about its performance characteristics and so forth, without looking at other parts of the code.

We got to the heart of the matter earlier on: you don’t need to understand everything about what a line of code does by looking at that line in isolation. In fact, it’s hard to see how a regular function call (in C or any other language) doesn’t in fact also qualify as “spooky action at a distance”, unless you take the stance that, since it is a function call, we know that it goes off somewhere else in the code, whereas for an “x + y” expression we don’t – but then you’re also wielding C as your hammer: the only reason you think that an operator doesn’t involve a call to a function is because you’re used to a language where it doesn’t.


* If at this stage you want to argue “but C++ makes it easy to write bad code”, be aware that you’ve gone off on a tangent; this is not a discussion about the merits or lack thereof of C++ as a whole, we’re just using it as an example here for a discussion on operator overloading.

Escape from System D, episode VII

Summary: Dinit reaches alpha; Alpine linux demo image; Booting FreeBSD

Well, it’s been an awfully long time since I last blogged about Dinit (web page, github), my service-manager / init / wannabe-Systemd-competitor. I’d have to say, I never thought it would take this long to come this far; when I started the project, it didn’t seem such a major undertaking, but as is often the case with hobby projects, life started getting in the way.

In an earlier episode, I said:

Keeping the momentum up has been difficult, and there’s been some longish periods where I haven’t made any commits. In truth, that’s probably to be expected for a solo, non-funded project, but I’m wary that a month of inactivity can easily become three, then six, and then before you know it you’ve actually stopped working on the project (and probably started on something else). I’m determined not to let that happen – Dinit will be completed. I think the key is to choose the right requirements for “completion” so that it can realistically happen; I’ve laid out some “required for 1.0” items in the TODO file in the repository and intend to implement them, but I do have to restrain myself from adding too much. It’s a balance between producing software that you are fully happy with and that feels complete and polished.

This still holds. On the positive side, I have been chipping away at those TODOs; on the other hand I still occasionally find myself adding more TODOs, so it’s a little hard to measure progress.

But, I released a new version just recently, and I’m finally happy to call Dinit “alpha stage” software. Meaning, in this case, that the core functionality is really complete, but various planned supporting functionality is still missing.

I myself have been running Dinit as the init and primary service manager on my home desktop system for many years now, so I’m reasonably confident that it’s solid. When I do find bugs now, they tend to be minor mistakes in service management functions rather than crashes or hangs. The test suite has become quite extensive and proven very useful in finding regressions early.

Alpine VM image

I decided to try creating a VM image that I could distribute to anyone who wanted to see Dinit in action; this would also serve as an experiment to see if I could create a system based on a distribution that was able to boot via Dinit. I wanted it to be small, and one candidate that immediately came to mind was Alpine linux.

Alpine is a Musl libc based system which normally uses a combination of Busybox‘s init and OpenRC service management (historically, Systemd couldn’t be built against Musl; I don’t know if that’s still the case. Dinit has no issues). Alpine’s very compact, so it fits the bill nicely for a base system to use with Dinit.

After a few tweaks to the example service definitions (included in the Dinit source tree), I was able to boot Alpine, including bring up the network, sshd and terminal login sessions, using Dinit! The resulting image is here, if you’d like to try it yourself.

Login screen presented after booting with Dinit
Running “dinitctl list” command on Alpine

(The main thing I had to deal with was that Alpine uses mdev, rather than udev, for device tree management. This meant adapting the services that start udev, and figuring out how to get the kernel modules loaded which were necessary to drive the available hardware – particularly, the ethernet driver! Fortunately I was able to inspect and borrow from the existing Alpine boot scripts).

Booting FreeBSD

A longer-term goal has always been to be able to use Dinit on non-Linux systems, in particular some of the *BSD variants. Flushed with success after booting Alpine, I thought I’d also give BSD a quick try (Dinit has successfully built and run on a number of BSDs for some time, but it hasn’t been usable as the primary init on such systems).

Initially I experimented with OpenBSD, but I quickly gave up (there is no way, that I could determine, to boot an alternative init on OpenBSD, which meant that I had to continuously revert to a backup image in order to be able to boot again every time I got a failure; also, I suspect that the init executable on OpenBSD needs to be statically linked). Moving on to FreeBSD, I found it a little easier – I could choose an init at boot time, so it was easy to switch back-and-forth between dinit and the original init.

However, dinit was crashing very quickly, and it took a bit of debugging to discover why. On Linux, init is started with three file descriptors already open and connected to the console – these are stdin (0), stdout (1) and stderr (2). Then, pretty much the first thing that happens when dinit starts is that it opens an epoll set, which becomes the next file descriptor (3); this actually happens during construction of the global “eventloop” variable. Later, to make sure they are definitely connected to the console, dinit closes file descriptors 0, 1, and 2, and re-opens them by opening the /dev/console device.

Now, on FreeBSD, it turns out that init starts without any file descriptors open at all! The event loop uses kqueue on FreeBSD rather than the Linux-only epoll, but the principle is pretty much the same, and because it is created early it gets assigned the first available file descriptor which in this case happens to be 0 (stdin). Later, Dinit unwittingly closes this so it can re-open it from /dev/console. A bit later still, when it tries to use the kqueue for event polling, disaster strikes!

This could be resolved by initialising the event loop later on, after the stdin/out/err file descriptors were open and connected. Having done that, I was also able to get FreeBSD to the point where it allowed login on a tty! (there are some minor glitches, and in this case I didn’t bother trying to get network and other services running; that can probably wait for a rainy day – but in principle it should be possible!).
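For anyone curious, here’s a stand-alone sketch of the descriptor-numbering trap (this is not Dinit’s actual code; opening /dev/null stands in for creating the epoll/kqueue descriptor):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Stand-in for creating the event loop backend (epoll/kqueue). If the
    // process inherited no open descriptors - as init does on FreeBSD -
    // this would be assigned descriptor 0; run from a shell it will be 3.
    int event_fd = open("/dev/null", O_RDONLY);
    std::printf("event loop descriptor: %d\n", event_fd);

    // Re-connecting stdin/stdout/stderr to the console by closing 0-2 and
    // then re-opening /dev/console would clobber the event loop descriptor:
    //   close(0); close(1); close(2);
    //   open("/dev/console", O_RDWR); ...
    // The fix was simply to ensure descriptors 0-2 are open and connected
    // before the event loop is initialised.

    close(event_fd);
    return 0;
}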

FreeBSD booting with Dinit (minimal services; straight to login!)

Wrap-up

So, Dinit has reached alpha release, and is able to boot Alpine Linux and FreeBSD. This really feels like progress! There’s still some way to go before a 1.0 release, but we’re definitely getting closer. If you’re interested in Dinit, you might want to try out the Alpine-Dinit image, which you can run with QEMU.

Is C++ type-safe? (There’s two right answers)

I recently allowed myself to be embroiled in an online discussion regarding Rust and C++. It started with a comment (from someone else) complaining how Rust advocates have a tendency to hijack C++ discussions, and suggesting that C++ was type-safe; this was responded to by a Rust advocate first saying that C++ wasn’t type-safe (because casts, and unchecked bounds accesses, and unchecked lifetime), and then going on to make an extreme claim about C++’s type system which I won’t repeat here because I don’t want to re-hash that particular argument. Anyway, I weighed in trying to make the point that it was a ridiculous claim, but also made the (usual) mistake of picking at other parts of the comment, in this case regarding the type-safety assertion – which is thorny, because I don’t know if many people really understand properly what “type-safety” is (I think I somewhat messed it up myself in that particular conversation).

So what exactly is “type-safety”? Part of the problem is that it is an overloaded term. The Rust advocate picked some parts of the definition from the wikipedia article and tried to use these to show that C++ is “not type-safe”, but they skipped the fundamental introductory paragraph, which I’ll reproduce here:

In computer science, type safety is the extent to which a programming language discourages or prevents type errors

https://en.wikipedia.org/wiki/Type_safety

I want to come back to that, but for now, also note that it offers this, on what constitutes a type error:

A type error is erroneous or undesirable program behaviour caused by a discrepancy between differing data types for the program’s constants, variables, and methods (functions), e.g., treating an integer (int) as a floating-point number (float).

… which is not hugely helpful, because it doesn’t really say what it means to “treat” a value of one type as another type. It could mean that we supply a value (via an expression) that has a type not matching that required by an operation which is applied to it, though in that case it’s not a great example, since treating an integer as a floating-point number is, in many languages, perfectly possible and unlikely to result in undesirable program behaviour; it could perhaps also be referring to type-punning, the process of re-interpreting a bit pattern which represents a value of one type as representing a value of another type. Again, I want to come back to this, but there’s one more thing that ought to be explored, and that’s the sentence at the end of the paragraph:

The formal type-theoretic definition of type safety is considerably stronger than what is understood by most programmers.

I found quite a good discussion of type-theoretic type safety in this post by Thiago Silva. They discuss two definitions, but the first (from Luca Cardelli) at least boils down to “if undefined behaviour is invoked, a program is not type-safe”. Now, we could extend that to a language, in terms of whether the language allows a non-type-safe program to be executed, and that would make C++ non-type-safe. However, also note that this form of type-safety is a binary: a language either is or is not type-safe. Also note that the definition here allows a type-safe program to raise type errors, in contrast to the introductory statement from wikipedia, and Silva implies that a type error occurs when an operation is attempted on a type to which it doesn’t apply, that is, it is not about type-punning:

In the “untyped languages” group, he notes we can see them equivalently as “unityped” and, since the “universal type” type checks on all operations, these languages are also well-typed. In other words, in theory, there are no forbidden errors (i.e. type errors) on programs written in these languages

Thiago Silva

I.e. with dynamic typing “everything is the same type”, and any operation can be applied to any value (though doing so might provoke an error, depending on what the value represents), so there’s no possibility of type error, because a type error occurs when you apply an operation to a type for which it is not allowed.

The second definition discussed by Silva (i.e. that of Benjamin C. Pierce) is a bit different, but can probably be fundamentally equated with the first (consider “stuck” as meaning “has undefined behaviour” when you read Silva’s post).

This notion of type error as an operation illegal on certain argument type(s) is also supported by a quote from the original wiki page:

A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data.

Vijay Saraswat

So where are we? In formal type-theoretic language, we would say that:

  • type safety is (confusingly!) concerned with whether a program has errors which result in arbitrary (undefined) behaviour, and not so much about type errors
  • in fact, type errors may be raised during execution of a type-safe program.
  • C++ is not type-safe, because it has undefined behaviour

Further, we have a generally-accepted notion of type error:

  • a type error is when an attempt is made to apply an operation to a type of argument to which it does not apply

(which, ok, makes the initial example of a type error on the wikipedia page fantastically bad, but is not inconsistent with the page generally).

Now, let me quote the introductory sentence again, with my own emphasis this time:

In computer science, type safety is the extent to which a programming language discourages or prevents type errors

This seems to be more of a “layman’s definition” of type safety, and together with the notion of type error as outlined above, certainly explains why the top-voted stackoverflow answer for “what is type-safe?” says:

Type safety means that the compiler will validate types while compiling, and throw an error if you try to assign the wrong type to a variable

That is, static type-checking certainly is designed to prevent operations that are illegal according to argument type from being executed, and thus provides a degree of type-safety.

So, we have a formal definition of type-safety, which in fact has very little to do with types within a program and more to do with (the possibility of) undefined behaviour; and we have a layman’s definition, which says that type-safety is about avoiding type errors.

The formal definition explains why you can easily find references asserting that C++ is not type-safe (but that Java, for example, is). The informal definition, on the other hand, clearly allows us to say that C++ has reasonably good type-safety.
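A small sketch shows both senses at once (the out-of-bounds access is left in deliberately, to show that it compiles without complaint):

#include <vector>
#include <string>

int main()
{
    // The "layman's" type-safety: this is a type error, and the compiler
    // rejects it outright (uncomment to see the diagnostic):
    // int i = std::string("hello");

    // The formal sense: this compiles cleanly, but indexing out of bounds
    // is undefined behaviour - which is precisely why C++ is not type-safe
    // in the type-theoretic sense:
    std::vector<int> v(3);
    return v[10];
}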

Clearly, it’s a bit of a mess.

How to resolve this? I guess I’d argue that “memory-safe” is a better understood term than the formal “type-safe”, and since in many cases lack of the latter results from lack of the former we should just use it as the better of the two (or otherwise make specific reference to “undefined behaviour”, which is probably also better understood and less ambiguous). For the layman’s variant we might use terms like “strongly typed” and “statically type-checked”, rather than “type-safe”, depending on where exactly we think the type-safety comes from.

Escape from System D, episode VI: freedom in sight

I don’t write often enough about my init-system-slash-service-manager, Dinit (https://github.com/davmac314/dinit). Lots of things have happened since I began writing it, and this year I’m in a new country with a new job, and time to work on just-for-the-hell-of-it open-source projects is limited. And of course, writing blog posts detracts from time that could be spent writing code.

But the truth is: it’s come a long way.

Dinit has been booting my own system for a long while, and other than a few hiccups on odd occasions it’s been quite reliable. But that’s just my own personal experience, and hardly evidence that it’s really as robust and stable as I’d like to claim it is. On the other hand, it’s now got a pretty good test suite, it’s in the OpenBSD ports tree, and it still occasionally has Fedora RPMs built, so it’s possible there are other users out there (I know of only one other person who definitely uses Dinit on any sort of regular basis, and that’s not as their system init). I’ve run static analysis on Dinit and fixed the odd few problems that were reported. I’ve fuzz-tested the control protocol.

Keeping up motivation is hard, and finding time is even harder, but I still make slow progress. I released another version recently, and it’s got some nice new features that will make using it a better experience.

Ok, compared to Systemd it lacks some features. It doesn’t know anything about Cgroups, the boot manager, filesystem mounts, dynamic users or binary logging. For day-to-day use on my personal desktop system, none of this matters, but then, I’m running a desktop based on Fluxbox and not much else; if I was trying to run Gnome, I’d rather expect that some things might not work quite as intended (on the other hand, maybe I could set up Elogind and it would all work fine… I’ve not tried, yet).

On the plus side, compared to Systemd’s binary at 1.5mb, Dinit weighs in at only 123kb. It’s much smaller, but fundamentally almost as powerful, in my own opinion, as the former. Unlike Systemd, it works just fine with alternative C libraries like Musl, and it even works (though not with full support for running as init, yet) on other operating systems such as FreeBSD and OpenBSD. It should build, in fact, on just about any POSIX-compliant system, and it doesn’t require any dependencies (other than an event loop library which is anyway bundled in the tarball). It’ll happily run in a container, and doesn’t care if it’s not running as PID 1. (I’ll add Cgroups support at some point, though it will always be optional. I’m considering build time options to let it be slimmed down even from the current size). What it needs more than anything is more users.

Sometimes I feel like there’s no hope of avoiding a Systemd monoculture, but occasionally there’s news that shows that other options remain alive and well. Debian is having a vote on whether to continue to support other init systems, and to what extent; we’ll see soon enough what the outcome is. Adélie linux recently announced support for using Laurent Bercot’s S6-RC (an init alternative that’s certainly solid and which deserves respect, though it’s a little minimalist for my own taste). Devuan continues to provide a Systemd-free variant of Debian, as Obarun does for Arch Linux. I’d love to have a distribution decide to give Dinit a try, but of course I have to face the possibility that this will never happen.

I’ll end with a plea/encouragement: if you’re interested in the project at all, please do download the source, build it (it’s easy, I promise!), perhaps configure services and get it to run. And let me know! I’m happy to receive constructive feedback (even if I won’t agree with it, I want to hear it!) and certainly would like to know if you have any problem building or using it, but even if you just take a quick peek at the README and a couple of source files, feel free to drop me a note.

Thoughts on password prompts and secure desktop environments

I’ve been thinking a little lately about desktop security – what makes a desktop system (with a graphical interface) secure or insecure? How is desktop security supposed to work, in particular on a unix-y system (Linux or one of the BSDs, for example)?

A quite common occurrence on today’s systems is to be prompted for your password—or perhaps for “an administrator” password—when you try, from the desktop environment, to perform some action that requires extended privileges; probably the most common example would be installing a new package; another is changing system configuration such as network settings. The two cases of asking for your own password or for another one are actually different in ways that might not initially be obvious. Let’s look at the first case: You have already logged in; your user credentials are supposedly established; why then is your password required? There is an assumption that you are allowed to perform the requested action (otherwise your ability to enter your own password should make no difference). The only reason that I see for prompting for a password, then, is to ensure that:

  1. The user sitting in the seat is still the same user who logged in, i.e. it’s not the case that another individual has taken advantage of you forgetting to log out or lock the screen before you walked away; and
  2. The action is indeed being knowingly requested by the user, and not for instance by some rogue software running in the user’s session. By prompting for a password, the system is alerting the user to the fact that a privileged action has been requested.

Both of these are clearly in the category of mitigation—the password request is designed to limit the damage/further intrusion that can be performed by an already compromised account. But are they really effective? I’m not so sure about this, particularly with current solutions, and they may introduce other problems. In particular I find the issue of secure password entry problematic. Consider again:

  1. We ask the user to enter their password to perform certain actions
  2. We do this because we assume the account may be compromised

There’s an implicit assumption, then, that the user is able to enter their password and have it checked by some more privileged part of the system, without another process running as the same user being able to see the password (if they could see the password, they could enter it to accomplish the actions we are trying to prevent them from performing). This is only likely to be possible if the display system itself (eg the X server) is running as a different user* (though not necessarily as root), if it provides facilities to enable secure input without another process eavesdropping, and if the program requesting the password is likewise running as a separate user—otherwise, there’s little to stop a malicious actor from connecting to the relevant process with a debugger and observing all input. In that case, forcing the user to enter their password is (a) not necessarily going to prevent an attacker from performing the protected actions anyway, and, worse, (b) actually making it easier for an attacker to recover the user’s password by forcing them to enter it in contexts where it can be observed by other processes.

* Running as a different user is necessary since otherwise the process can be attached via ptrace, eg. by a debugger. I’ll note at this point that more recent versions of Mac OS no longer allow arbitrary programs to ptrace another process; debugger executables must be signed with a certificate which gives them this privilege.

Compare this to the second case, where you must enter a separate password (eg the root password) to perform a certain action. The implicit assumption here is different: your user account doesn’t have permission to perform the action, and the allowance for entering a password is to cover the case where either (a) you actually are an administrator but are currently using an unprivileged account or (b) another, privileged, user is willing to supply their password to allow for a particular action to be invoked from your account on a one-off basis. The assumption that your account may be in the hands of a malicious actor is no longer necessary (although of course it may well still be the case).

So which is better? The first theoretically mitigates compromised user accounts, but if not done properly has little efficacy and in fact leads to potential password leakage, which is arguably an even worse outcome. The second at least has additional utility in that it can grant access to functions not available to the current user, but if used as a substitute for the first (i.e. if used routinely by a user to perform actions for which their account lacks suitable privileges) then it suffers the same problems, and is in fact worse since it potentially leaks an administrator password which isn’t tied to the compromised account.

Note that, given full compromise of an account, it would anyway be fairly trivial to pop up an authentication window in an attempt to trick the user into supplying their password. Full mitigation of this could be achieved by requiring the disciplined use of a SaK (secure attention key), which has seemingly gone out of favour (the Linux SaK support would kill the X server when pressed, which makes it useless in this context anyway). Another possibility for mitigation would be to show the user a consistent secret image or phrase when prompting them for authentication, so they knew that the request came from the system; this would ideally be done in such a way that prevented other programs from grabbing the screen or otherwise recovering the image. Again, with X currently, I believe this may be difficult or impossible, but could be done in principle with an appropriate X extension or other modification of the X server.

To summarise, prompting the user for a password to perform certain actions only increases security if done carefully and with certain constraints. The user should be able to verify that a password request comes from the system, not an arbitrary process; additionally, no other process running with user privileges should be able to intercept password entry. Without meeting these constraints, prompting for a password accomplishes two things: First, it makes it more complex (but does not generally make it impossible) for a compromised process to issue a command for which the user has privilege but which is behind an ask-password barrier. Secondly, it prevents an opportunistic person, who already has physical access to the machine, from issuing such commands when the real user has left their machine unattended. These are perhaps good things to achieve (I’d argue the second is largely useless), but in this case they come with a cost: inconvenience to the user, who has to enter their password more often than would otherwise be necessary, and potentially making it easier for sophisticated attackers to obtain the user’s password (or worse, that of an administrator).

Given the above, I’m thinking that current Linux desktop systems which prompt for a password to initiate certain actions are actually doing the wrong thing.

Edit: I note that Linux distributions may disallow arbitrary ptrace, and also that ptrace can be disabled via prctl() (though this seems like it would be race-prone). It’s still not clear to me that asking for a password with X is secure; I guess that XGrabKeyboard is supposed to make it so. This still leaves the possibility of displaying a fake password entry dialog, though, and tricking the user into supplying their password that way.
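As a footnote to the prctl() point: a process that is about to prompt for a password can at least clear its “dumpable” flag on Linux, which prevents other processes running as the same user from attaching to it with ptrace. A minimal sketch (this doesn’t address keyboard grabbing or the fake-dialog problem):

#include <sys/prctl.h>
#include <cstdio>

int main()
{
    // Clearing the dumpable flag means unprivileged processes (even those
    // with the same UID) can no longer attach to this process with ptrace,
    // and it is also excluded from core dumps:
    if (prctl(PR_SET_DUMPABLE, 0) != 0) {
        std::perror("prctl");
        return 1;
    }

    // ... proceed to prompt for and read the password ...
    return 0;
}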