Escape from System D, episode VI: freedom in sight

I don’t write often enough about my init-system-slash-service-manager, Dinit (https://github.com/davmac314/dinit). Lots of things have happened since I began writing it, and this year I’m in a new country with a new job, and time to work on just-for-the-hell-of-it open-source projects is limited. And of course, writing blog posts detracts from time that could be spent writing code.

But the truth is: it’s come a long way.

Dinit has been booting my own system for a long while, and other than a few hiccups on odd occasions it’s been quite reliable. But that’s just my own personal experience and hardly evidence that it’s really as robust and stable as I’d like to claim it is. On the other hand, it’s now got a pretty good test suite, it’s in the OpenBSD ports tree, and it still occasionally has Fedora RPMs built, so it’s possible there are other users out there (I know of only one other person who definitely uses Dinit on any sort of regular basis, and that’s not as their system init). I’ve run static analysis on Dinit and fixed the odd few problems that were reported. I’ve fuzz-tested the control protocol.

Keeping up motivation is hard, and finding time is even harder, but I still make slow progress. I released another version recently, and it’s got some nice new features that will make using it a better experience.

Ok, compared to Systemd it lacks some features. It doesn’t know anything about Cgroups, the boot manager, filesystem mounts, dynamic users or binary logging. For day-to-day use on my personal desktop system, none of this matters, but then, I’m running a desktop based on Fluxbox and not much else; if I was trying to run Gnome, I’d rather expect that some things might not work quite as intended (on the other hand, maybe I could set up Elogind and it would all work fine… I’ve not tried, yet).

On the plus side, compared to Systemd’s binary at 1.5MB, Dinit weighs in at only 123kB. It’s much smaller, yet in my own opinion fundamentally almost as powerful. Unlike Systemd, it works just fine with alternative C libraries like Musl, and it even works (though not yet with full support for running as init) on other operating systems such as FreeBSD and OpenBSD. It should build, in fact, on just about any POSIX-compliant system, and it doesn’t require any dependencies (other than an event loop library, which is bundled in the tarball anyway). It’ll happily run in a container, and doesn’t care if it’s not running as PID 1. (I’ll add Cgroups support at some point, though it will always be optional. I’m considering build-time options to let it be slimmed down even from the current size.) What it needs more than anything is more users.

Sometimes I feel like there’s no hope of avoiding a Systemd monoculture, but occasionally there’s news that shows that other options remain alive and well. Debian is having a vote on whether to continue to support other init systems, and to what extent; we’ll see soon enough what the outcome is. Adélie Linux recently announced support for using Laurent Bercot’s S6-RC (an init alternative that’s certainly solid and which deserves respect, though it’s a little minimalist for my own taste). Devuan continues to provide a Systemd-free variant of Debian, as Obarun does for Arch Linux. I’d love to have a distribution decide to give Dinit a try, but of course I have to face the possibility that this will never happen.

I’ll end with a plea/encouragement: if you’re interested in the project at all, please do download the source, build it (it’s easy, I promise!), perhaps configure services and get it to run. And let me know! I’m happy to receive constructive feedback (even if I won’t agree with it, I want to hear it!) and certainly would like to know if you have any problem building or using it, but even if you just take a quick peek at the README and a couple of source files, feel free to drop me a note.

Thoughts on password prompts and secure desktop environments

I’ve been thinking a little lately about desktop security – what makes a desktop system (with a graphical interface) secure or insecure? How is desktop security supposed to work, in particular on a unix-y system (Linux or one of the BSDs, for example)?

A quite common occurrence on today’s systems is to be prompted for your password—or perhaps for “an administrator” password—when you try, from the desktop environment, to perform some action that requires extended privileges; probably the most common example would be installing a new package, and another is changing system configuration such as network settings. The two cases of asking for your own password or for another one are actually different in ways that might not initially be obvious. Let’s look at the first case: You have already logged in; your user credentials are supposedly established; why, then, is your password required? There is an assumption that you are allowed to perform the requested action (otherwise your ability to enter your own password should make no difference). The only reason that I see for prompting for a password, then, is to ensure that:

  1. The user sitting in the seat is still the same user who logged in, i.e. it’s not the case that another individual has taken advantage of you forgetting to log out or lock the screen before you walked away; and
  2. The action is indeed being knowingly requested by the user, and not for instance by some rogue software running in the user’s session. By prompting for a password, the system is alerting the user to the fact that a privileged action has been requested.

Both of these are clearly in the category of mitigation—the password request is designed to limit the damage/further intrusion that can be performed by an already compromised account. But are they really effective? I’m not so sure about this, particularly with current solutions, and they may introduce other problems. In particular, I find secure password entry itself problematic. Consider again:

  1. We ask the user to enter their password to perform certain actions
  2. We do this because we assume the account may be compromised

There’s an implicit assumption, then, that the user is able to enter their password and have it checked by some more privileged part of the system, without another process running as the same user being able to see the password (if they could see the password, they could enter it to accomplish the actions we are trying to prevent them from performing). This is only likely to be possible if the display system itself (eg the X server) is running as a different user* (though not necessarily as root), if it provides facilities to enable secure input without another process eavesdropping, and if the program requesting the password is likewise running as a separate user—otherwise, there’s little to stop a malicious actor from connecting to the relevant process with a debugger and observing all input. In that case, forcing the user to enter their password is (a) not necessarily going to prevent an attacker from performing the protected actions anyway, and, worse, (b) actually making it easier for an attacker to recover the user’s password by forcing them to enter it in contexts where it can be observed by other processes.

* Running as a different user is necessary since otherwise the process can be attached via ptrace, eg. by a debugger. I’ll note at this point that more recent versions of Mac OS no longer allow arbitrary programs to ptrace another process; debugger executables must be signed with a certificate which gives them this privilege.

Compare this to the second case, where you must enter a separate password (eg the root password) to perform a certain action. The implicit assumption here is different: your user account doesn’t have permission to perform the action, and the allowance for entering a password is to cover the case where either (a) you actually are an administrator but are currently using an unprivileged account or (b) another, privileged, user is willing to supply their password to allow for a particular action to be invoked from your account on a one-off basis. The assumption that your account may be in the hands of a malicious actor is no longer necessary (although of course it may well still be the case).

So which is better? The first theoretically mitigates compromised user accounts, but if not done properly has little efficacy and in fact leads to potential password leakage, which is arguably an even worse outcome. The second at least has additional utility in that it can grant access to functions not available to the current user, but if used as a substitute for the first (i.e. if used routinely by a user to perform actions for which their account lacks suitable privileges) then it suffers the same problems, and is in fact worse since it potentially leaks an administrator password which isn’t tied to the compromised account.

Note that, given full compromise of an account, it would anyway be fairly trivial to pop up an authentication window in an attempt to trick the user into supplying their password. Full mitigation of this could be achieved by requiring the disciplined use of a SAK (secure attention key), which has seemingly gone out of favour (the Linux SAK support would kill the X server when pressed, which makes it useless in this context anyway). Another possibility for mitigation would be to show the user a consistent secret image or phrase when prompting them for authentication, so they know that the request came from the system; this would ideally be done in such a way that prevented other programs from grabbing the screen or otherwise recovering the image. Again, with X currently, I believe this may be difficult or impossible, but it could be done in principle with an appropriate X extension or other modification of the X server.

To summarise, prompting the user for a password to perform certain actions only increases security if done carefully and with certain constraints. The user should be able to verify that a password request comes from the system, not an arbitrary process; additionally, no other process running with user privileges should be able to intercept password entry. Without meeting these constraints, prompting for a password accomplishes two things: first, it makes it more complex (though generally not impossible) for a compromised process to issue a command for which the user has privilege but which is behind an ask-password barrier. Secondly, it prevents an opportunistic person, who already has physical access to the machine, from issuing such commands when the real user has left their machine unattended. These are perhaps good things to achieve (I’d argue the second is largely useless), but in this case they come with a cost: inconvenience to the user, who has to enter their password more often than would otherwise be necessary, and potentially making it easier for sophisticated attackers to obtain the user’s password (or worse, that of an administrator).

Given the above, I’m thinking that current Linux desktop systems which prompt for a password to initiate certain actions are actually doing the wrong thing.

Edit: I note that Linux distributions may disallow arbitrary ptrace, and also that ptrace can be disabled via prctl() (though this seems like it would be race-prone). It’s still not clear to me that asking for a password with X is secure; I guess that XGrabKeyboard is supposed to make it so. This still leaves the possibility of displaying a fake password entry dialog, though, and tricking the user into supplying their password that way.
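As an illustrative sketch of the prctl() mitigation just mentioned: a password-prompting process can mark itself non-dumpable, which on Linux also prevents unprivileged processes of the same user from attaching to it with ptrace. This addresses debugger-based eavesdropping only, and says nothing about screen grabbing or fake dialogs; the helper name below is made up.

#include <sys/prctl.h>

// Sketch: make this process non-dumpable. On Linux, a non-dumpable
// process cannot be ptrace-attached by other unprivileged processes
// running as the same user.
static void harden_against_ptrace(void)
{
    prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
}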

Bad utmp implementations in Glibc and FreeBSD

I recently released another version – 0.5.0 – of Dinit, the service manager / init system. There were a number of minor improvements, including to the build system (just running “make” or “gmake” should be enough on any of the systems which have a pre-defined configuration, no need to edit mconfig by hand), but the main features of the release were S6-compatible readiness notification, and support for updating the utmp database.

At this point, I’d expect, there might be one or two readers wondering what this “utmp” database might be. On Linux you can find out easily enough via “man utmp” in the terminal:

The utmp file allows one to discover information about who is currently
using the system. There may be more users currently using the system,
because not all programs use utmp logging.

The OpenBSD man page clarifies:

The utmp file is used by the programs users(1), w(1) and who(1).

In other words, utmp is a record of who is currently logged in to the system (another file, “wtmp”, records all logins and logouts, as well as, potentially, certain system events such as reboots and time updates). This is a hint at the main motivation for having utmp support in Dinit – I wanted the “who” command to correctly report current logins (and I wanted boot time to be correctly recorded in the wtmp file).

However, when I began to implement the support for utmp and wtmp in Dinit, I also started to think about how these databases worked. I knew already that they were simply flat file databases – i.e. each record is a fixed number of bytes, the size of the “struct utmp” structure. The files are normally readable by unprivileged users, so that utilities such as who(1) don’t need to be setuid/setgid. Updating and reading the database is done (behind the scenes) via normal file system reads and writes, via the getutent(3)/pututline(3) family of functions, their getutxent/pututxline POSIX equivalents, or by the higher-level login(3) and logout(3) functions (found in libutil; in OpenBSD, only the latter are available – the lower-level routines don’t exist).
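For concreteness, here is a minimal sketch of what a tool like who(1) does with the POSIX interface; the libc performs the underlying file reads (and any locking) behind the scenes.

#include <stdio.h>
#include <utmpx.h>

int main(void)
{
    struct utmpx *ut;

    setutxent();                          // rewind to the start of the database
    while ((ut = getutxent()) != NULL) {
        if (ut->ut_type == USER_PROCESS) {     // an active login session
            printf("%.*s\t%.*s\n",
                   (int)sizeof ut->ut_user, ut->ut_user,
                   (int)sizeof ut->ut_line, ut->ut_line);
        }
    }
    endutxent();
    return 0;
}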

I wondered: If the files consist of fixed-sized records, and are readable by regular users, how is consistency maintained? That is – how can a process ensure that, when it updates the database, it doesn’t conflict with another process also attempting to update the database at the same time? Similarly, how can a process reading an entry from the database be sure that it receives a consistent, full record and not a record which has been partially updated? (after all, POSIX allows that a write(2) call can return without having written all the requested bytes, and I’m not aware of Linux or any of the *BSDs documenting that this cannot happen for regular files). Clearly, some kind of locking is needed; a process that wants to write to or read from the database locks it first, performs its operation, and then unlocks the database. Once again, this happens under the hood, in the implementation of the getutent/pututline functions or their equivalents.

Then I wondered: if a user process is able to lock the utmp file, and this prevents updates, what’s to stop a user process from manually acquiring and then holding such a lock for a long – even practically infinite – duration? This would prevent the database from being updated, and would perhaps even prevent logins/logouts from completing. Unfortunately, the answer is – nothing; and yes, it is possible on different systems to prevent the database from being correctly updated or even to prevent all other users – including root – from logging in to the system.

Specifically:

  • On Linux with Glibc (or, I suppose, any other system with Glibc), updates to the database can be prevented completely, and logins can be delayed by 10 seconds (bug filed);
  • On FreeBSD, updates to the database can be prevented and logins prevented indefinitely (bug filed). Note that on FreeBSD the file is named “utx.active” but is otherwise the same as “utmp” on other systems. A patch was quickly put together after I filed this bug, but progress on it has seemingly stalled.
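As a rough demonstration of the underlying problem (a sketch only – the precise effect depends on how the libc implements its locking; glibc, for instance, uses fcntl() record locks with a 10-second alarm), an unprivileged process can take and hold a read lock on the world-readable file, which conflicts with the write lock that an updater needs:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    // Path varies by system: /var/run/utmp on Linux, /var/run/utx.active on FreeBSD.
    int fd = open("/var/run/utmp", O_RDONLY);
    if (fd < 0) return 1;

    struct flock fl = { 0 };
    fl.l_type = F_RDLCK;     // a read lock is enough to conflict with a writer's lock
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            // zero length means "to the end of the file"

    if (fcntl(fd, F_SETLK, &fl) != 0) return 1;
    pause();                 // hold the lock indefinitely
    return 0;
}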

I haven’t checked all other systems but suspect that various other BSDs could be susceptible to related problems. On the other hand, some systems are immune:

  • Linux with Musl, because Musl doesn’t implement the utmp functions (though it has no-op stubs). I don’t understand why the Musl FAQ claims that you need a setuid program to update the database: it seems perfectly reasonable to simply limit modification to daemons already running as root or in a particular group. (Perhaps it is referring to having terminal emulators create utmp entries, which the Linux “utmp” manpage suggests is something that happens, though this also seems unnecessary to me).
  • OpenBSD structures the utmp file so there is one particular entry per tty device, and so avoids the need for locking (writes to the same tty entry should naturally be serialised, since they are either for login or logout). It performs no locking for reading, which leaves open the possibility of reading a partially written entry, though this is certainly a less severe problem than the ones affecting Glibc/FreeBSD.

The whole thing isn’t an issue for single-user systems, but for multiple-user systems it is more of a concern. On such systems, I’d recommend making /var/run/utmp and /var/run/wtmp (or their equivalents) readable only by the owner and group, or removing them altogether, and forgoing the ability for unprivileged users to run the “who” command. Otherwise, you risk users being able to deny logins or prevent them being recorded, as per above.

As for fixes which still allow unprivileged processes to read the database, I’ve come to the conclusion that the best option is to use locking (on a separate, root-only file) only for write operations, and live with the limitation that it is theoretically possible for a program to read a partially-updated entry; this seems unlikely to ever happen, let alone actually cause a significant problem, in practice. To completely solve the problem, you’d either need atomic read and write support on files, or a secondary mechanism for accessing the database which obviated the concurrency problem (eg access the database via communication with a running daemon which can serialize requests). Or, perhaps Musl is taking the right approach by simply excluding the functionality.
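A sketch of how the write side of that might look inside an implementation follows; the lock file path and the helper itself are hypothetical.

#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>
#include <utmpx.h>

// Hypothetical helper: write one fixed-size record at a given offset,
// serialising against other writers via a separate root-only lock file.
// Readers never take the lock, so unprivileged processes cannot block updates.
static int write_utmp_record(int utmpfd, off_t offset, const struct utmpx *rec)
{
    // 0600 and root-owned: ordinary users cannot open the lock file,
    // so they cannot hold the lock and delay or prevent logins.
    int lockfd = open("/run/utmp.lock", O_WRONLY | O_CREAT, 0600);
    if (lockfd < 0) return -1;
    if (flock(lockfd, LOCK_EX) != 0) { close(lockfd); return -1; }

    int res = (pwrite(utmpfd, rec, sizeof *rec, offset) == (ssize_t)sizeof *rec) ? 0 : -1;

    flock(lockfd, LOCK_UN);
    close(lockfd);
    return res;
}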

On the vagaries of init systems

When I started working on Dinit I had only a fairly vague idea of the particulars of various other init systems, being familiar mainly with Sys V init and, to a lesser extent, Systemd and Upstart (the latter of which has more-or-less vanished off the face of the earth). At that stage it was a purely personal project and I didn’t necessarily count on making it public; as time went on I heard lots of complaints about Systemd, which has become the init system of choice for many distributions; I did a little research on some other systems – enough to satisfy myself that Dinit filled a worthwhile niche – and then made an announcement that I was planning to develop it into a(nother) complete init/service manager that could potentially compete with Systemd.

Around that time, I also wrote a short document trying to summarise the differences between a number of extant systems, or at least between them and Dinit, and included this in the documentation of Dinit (as part of the source tree). However, the time has perhaps come to write a more comprehensive treatment examining the differing design choices of various systems; hence, this post. Hopefully I can give an interesting overview of some design decisions that are made in a service manager, highlight specific features of various particular pieces of service management software, and give some incidental background on why I’ve made the choices I have in the design of Dinit (though I’ll try to keep this from being too Dinit-focused).

Recap: supervision system vs service manager vs system manager

The various terms – supervision, service manager, system manager – sometimes get thrown around a little loosely, but for my purposes here it’s better to have a clear distinction between them. Without further ado:

Supervision system: a process or means for supervising service processes, providing a means to start and terminate individual services and perhaps to automatically restart them if they terminate unexpectedly.

Into the category of supervision system falls the likes of daemon-tools, runit and S6. Note that a supervision system need not be made up of just a single process: it might supervise individual service processes using separate supervisor processes, for example. Also, an active “service” might not necessarily correspond to a running process (for example a “network” service could be made active by executing a script which terminates after the network interfaces are configured).

The next category is that of service manager:

Service manager: a process or means for starting or stopping services which have dependencies from and to other services, such that the dependencies of a service must be started before the service itself is started, and the dependents of a service should be stopped before the service itself is stopped.

So, compared to a supervision system, this adds the concept of dependency management. Some might disagree that “service manager” should entail dependency handling, but for our purposes here it’s useful to have a convenient name for such a distinction, so we make the separation – dependency-handling service management versus individual service supervision.

Note that it may be possible to implement a service manager as an additional component on top of a separate supervision system – for example, S6-RC and Anopa both implement service management over the S6 supervision system.

This brings us to the final category:

System manager: a process (or processes) responsible for controlling system startup, shutdown, and other system-level actions.

A system manager typically has to arrange for the bring-up and stopping of services, which it may do by also being – or by delegating to – a supervision system or service manager. A system manager includes an init process which is launched by the kernel as the first userspace process at boot.

It’s worth noting at this point that, while a service manager built on a supervision system typically requires tight coupling with the other system – it needs to know the specific details of how to start and stop services, and to observe changes in service state – a system manager can, in comparison, maintain quite a loose coupling; it only needs to tell the supervision system (or service manager) to start, and to stop, and can leave the handling of individual services to the supervisor’s care.

I should add that different systems use different terminology for what Systemd calls “units”, the basic concept of a thing that can be started and stopped and can have dependencies on other units. In Systemd terminology, a “service” and a “target” are different types of unit. Other systems just stick with “service” for everything, regardless of whether there’s a process or other functionality attached. The distinction isn’t particularly useful here, so I’ll use the terms unit, target, and service more-or-less as synonyms.

Pure supervision as service management

In my definitions above, I outlined the primary distinction between supervision systems and service managers as being a question of dependency management.

However, a system where services technically have interdependencies can work with a supervision system that doesn’t manage dependencies. In the most basic form, it’s possible to rely on the fact that a service will naturally fail if its dependencies are not satisfied; it should then be restarted (ideally with a gradually increasing delay) by the supervisor, until the dependency itself has become available.

It may also be possible to explicitly start any dependencies as part of a service’s startup script (and optionally also stop known dependents as part of a stop script). The runit documentation suggests:

  • before providing the service, check if all services it depends on are available. If not, exit with an error, the supervisor will then try again.
  • optionally when the service is told to become down, take down other services that depend on this one after disabling the service.

Certainly this can work. Although in general checking for dependencies being available prior to starting is prone to a race condition (nothing prevents a dependency from stopping just after the check is made), this seems unlikely to be a common problem in practice. In fact, the combined technique outlined above allows quite a simple supervision system to provide much of the functionality associated with a service manager, provided that the dependencies are correctly encoded in the start/stop scripts.

However, that niggling race condition remains. For services which, for whatever reason, won’t behave as we want them to when dependencies are (or become) unavailable, this could potentially be problematic. Is it a stretch to claim that such services may in fact exist? Maybe it is, though I’m not particularly willing to vouch that various web app frameworks won’t lock themselves up if the DBMS becomes unavailable for a little too long, for example.

There’s also the fact that continuously polling to start services will consume system resources (only very little, if the “check for dependencies first” approach advocated by the runit documentation is followed; perhaps a significant amount if it’s not). It may also make noise in log files: service X can’t start, service X still can’t start, …, and so on. And a polling approach means that, when the dependencies of some service do become available, there may be a little delay before the service itself starts: the supervisor has to decide to try and start it again, and has no cue to do this other than some timer expiring. These by themselves are minor issues, of course.

One advantage of proper dependency-handling service management is that you can usually query the system for dependency information (“what other services will need to be started in order to start service X?”, “what is the total set of dependencies for service X?”, etc).

Laurent Bercot, S6-RC author, gives his own argument for dependency management:

The runit model of separating one-time initialization (stage 1) and daemon management (stage 2) does not always work: some one-time initialization may depend on a daemon being up. Example: udevd on Linux. Such daemons then need to be run in stage 1, unsupervised – which defeats the purpose of having a supervision suite.

This seems a fair point and a good example, though I’m not sure it would be impossible to supervise even udevd in a supervision-only system (even if it might require tweaking the existing systems a little).

I’m certainly in favour of dependency-managing systems (and of course Dinit is such a system), though I’m aware the arguments for it may sound a little wishy-washy, and to some degree it’s a matter of personal preference.

Complexity level of dependency relationships

Different service managers provide different dependency configuration options, with differing levels of complexity.

At the most simple end, S6-RC offers only a single type of dependency: that is, a service can depend on another, and will not start unless the other starts first. However, it appears to be unusual in this regard. Many systems have the concept of a soft dependency – which should be started with a dependent, but for which failure should not cause the dependent to also fail. The “hard” and “soft” dependencies are termed differently in different systems (needs, requires, depends-on vs wants, waits-for).

The benefit of a soft dependency is essentially that you can enable a service but not have its failure prevent your system from booting due to the rollback that results (assuming that the system performs such rollback; discussion of activation model and rollback yet to come).

OpenRC has both a needs and a uses/wants relationship (“uses” vs “wants” in this case have different semantics depending whether the dependency has been enabled in the current runlevel; most other service managers have largely done away with the concept of runlevels).

Nosh has requires and wants relationships, and separately supports start ordering relationships (before/after, indicating that another service’s start/stop should be ordered with respect to this service, even if there is no dependency between them). Nosh dependencies can be specified in both directions (this service requires that service, this service is required-by that service). It also has a conflicts relationship: if one service is started it can force another to stop, and vice versa.

Systemd is a law unto itself, with more dependency types than you can count on one hand; consider it as Nosh++ (though I believe Systemd came first, and Nosh borrowed from it, rather than the other way around). It’s not clear how commonly useful most of the dependency types are, though they were presumably implemented with reasons in mind.

For Dinit, I eventually opted for three dependency types: depends-on (requires), waits-for (wants), and depends-ms (depends as a milestone; the dependency must start for the dependent to start, but once started it effectively becomes a waits-for dependency). The latter, depends-ms, is of somewhat dubious value and may be removed if I cannot find a compelling scenario for it. In my eyes three dependency types (or even better, two) is a nice middle ground giving good functionality with relatively low complexity.
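As a rough illustration only (the service names here are made up, and the exact syntax should be checked against the Dinit documentation), a service description using these dependency types might look something like this:

# hypothetical Dinit-style service description (ini-like key = value lines)
type = process
command = /usr/sbin/mydaemon --foreground

# hard dependency: must be started first; if it fails, this service cannot start
depends-on = network

# soft dependency: started alongside this service, but allowed to fail
waits-for = syslogd

# milestone dependency: must start for this service to start, thereafter treated like waits-for
depends-ms = early-fs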

Systemd documentation mentions the common requirement for a dependent to start only once the dependency has properly started:

It is a common pattern to include a unit name in both the After= and Requires= options, in which case the unit listed will be started before the unit that is configured with these options.

I do not see any compelling reason for having ordering relationships without actual dependency, as both Nosh and Systemd provide for. In comparison, Dinit’s dependencies also imply an ordering, which obviates the need to list a dependency twice in the service description. (edit: a problem caused by separating ordering and dependency is described in this Systemd bug ticket).

Activation model of service managers

Suppose that we have two services – A and B – and that the first depends on the second. When A is started, B will also be started. The question is: what if A is then stopped?

There are two somewhat reasonable answers:

  1. Since the action was to start and stop a single service, the state of all services should return to what it was before either action. B should therefore stop, since it has not been explicitly started (i.e. rollback should occur naturally).
  2. Services should start, or stop, only when required to do so. Since B started when A was started, and has not been required to stop, it should not stop.

I believe that most systems take the 2nd approach, but Dinit takes the first (and tracks which services have been explicitly activated versus which have only started due to being required by a dependent).

I am not certain that either approach is definitely better than the other. The first provides a nice consistency for the scenario described (starting and then stopping a service will generally return the system to the original state), and avoids potentially leaving unneeded services running; the second on the other hand reduces overall service transitions.

Advocating for the first approach, one benefit is that it is simple to emulate runlevels. If you set up each runlevel as a service (target, unit) which depends on the services that should run in that runlevel, then you can “switch runlevels” by starting the new runlevel service and stopping the old one. There is no need to explicitly set any services to stop: if they are not required by the currently active runlevel, they will stop anyway (although additional services can always be activated via an explicit command).

(Compare to Systemd’s approach to runlevels: it implements a separate command, “isolate”, to deactivate services not belonging to the new runlevel).

Also, with the first approach, boot failure is detectable as all services stopping without having received a shutdown command. That is, “boot” is a service with dependencies; if one of the necessary dependencies fails to start, “boot” will also fail, and at that point it releases all other (successfully started) dependencies, so that they then stop. There is no need to have “special” knowledge of the boot service, or to have a special failure case for that particular service. This is arguably just an implementation detail, though.

Now advocating for the second approach: consider the case of repeatedly attempting to start a service which has several dependencies, but which is failing due to a configuration issue: the administrator tries to start the service, and watches as its dependencies start and then stop again since the service itself failed to start. They then attempt to repair the configuration, but do not succeed, and on attempting to start the service again see the dependencies bounce up and then down a second time (let’s hope they get it right the third time…). This would be avoided with the second approach, since the dependencies would simply remain active when the service failed to start.

The problem described above could probably be avoided, even with the first approach, in various ways, but any solution would no doubt add a little more complexity to the system.

I personally still find the first model more natural and compelling – but again, it’s arguably just personal preference.

Special targets

Some systems have special targets with special semantics. Often certain targets are started to perform, or as part of, particular system actions: a shutdown target can be started when the system is to shut down, for example. Systemd has a large list of special targets, including targets that get created by Systemd when certain hardware is detected, and targets to represent mount points, which Systemd has special handling for.

Systemd also adds dependencies automatically to or from special targets. For the basic target:

systemd automatically adds dependency of the type After= for this target unit to all services (except for those with DefaultDependencies=no).

And for the dbus.socket unit:

A special unit for the D-Bus system bus socket. All units with Type=dbus automatically gain a dependency on this unit.

(The dbus unit is for launching the D-Bus daemon, and causes Systemd to connect to the bus after the unit starts. Systemd and D-Bus are somewhat intertwined; D-Bus has the ability to start service providers by communicating with Systemd, and Systemd exposes various services via D-Bus, as well as being able to determine that a service is ready via a D-Bus name becoming available).

Other service managers don’t tend to have as many special targets. Nosh documents a few in its system-control man page, but not as many as Systemd, and it has no special relationship to D-Bus for example. Dinit uses boot as the default service to start, but otherwise does not treat that service specially in any way; other design choices (such as the activation model) made special treatment unnecessary.

Service description/configuration mechanism

A number of supervision/service managers have gone with a “directory-per-service” approach (which I think perhaps was pioneered by daemon-tools? I’m not sure). In the directory you have a script used to run the service, some files which each contain a parameter setting, and perhaps a subdirectory containing links to dependencies. (That’s a broad stroke; many of the systems have subtle differences. S6-RC dependencies are listed one-per-line in a “dependencies” file for example). The benefit of having one-setting-per-file is that it requires no parsing and makes the system simpler. The downside is that it is a little bit more complicated to easily check the whole service configuration (though tooling can help).

Other systems – including the venerable Sys V init, as well as OpenRC – simply have a script per service. In the case of OpenRC, the script (optionally) has a special interpreter, openrc-run, which offers dependency handling functions. Various metadata is extracted from the scripts (and cached in a separate database).

Dinit, and Systemd, both use a single file per service (“.ini” style). I find this more convenient for editing service descriptions generally; the downside is that parsing is required. In the case of Systemd running as system manager, this means parsing in the PID 1 process, which many would frown upon. I’m not convinced this is really a big problem (*); Dinit’s configuration parser is quite simple and has proved robust (in my own use) – though it’s worth noting that Dinit doesn’t demand that it runs as a system manager (PID 1), whereas Systemd does expect this (“Note that it is not supported booting and maintaining a full system with systemd running in --system mode, but PID not 1″).

(* edit: the “not a big problem” I was referring to here was parsing in general, not the parsing in Systemd, which has historically been problematic at times – though even that has, as best as I can tell, been significantly improved and become better tested).

S6-RC is unusual in that it requires the service descriptions to be compiled into a database. OpenRC, as mentioned, also stores service metadata separately to the service script, but only as a cache. In either case, I suppose it is potentially possible for the compiled data and the source to become inconsistent, though I doubt it is much of a problem in practice.

Monolithic vs modular process design

One question around the design of a supervision/service/system manager is, how many processes should make it up? A number of the smaller and simpler systems have gone for the approach of breaking things up into many processes. Taking S6-RC as a case in point, the service manager (S6-RC) is separate to the main supervision process (s6-svscan of S6) which in turn runs supervisor processes (s6-supervise) which, finally, run the service process. Typically the service process is launched via an execline script, which allows calling various chain-loading subprograms to set up environment, UID/GID, etc.

The idea behind breaking things up this way is, essentially, that it allows each component to be small, simple, and “obviously correct”. There are those who argue that this approach fits the “unix philosophy” of “do one thing and do it well”. This is not an entirely bogus argument; by limiting the function of an individual program, it’s somewhat easier to make sure that the program is fundamentally correct.

On the other hand, composing multiple small programs into a more complex system still results in, well, a more complex system. If the functions of a system can easily be decomposed into separate processes, they can most likely be decomposed to individual modules within a single-process program as well. (And, having multiple processes comes with its own disadvantages: certain system-level functionality is only going to be possible to implement by communicating between modules; if the modules are separate processes, that means inter-process communication, and in general that’s going to increase complexity significantly. This might not prove to be a problem for a service manager, though, if the need for such communication is really limited).

The main point that I am trying to make is that breaking functionality into separate processes does not make the overall system any simpler. It may offer an advantage in terms of making it possible to use the individual components separately, but it’s not clear to me that this is really useful. Probably the main real benefit is, potentially, an increase in robustness: if one of your various sub-processes does crash, it won’t necessarily bring down the whole system.

Enter Systemd into the discussion. Systemd insists on incorporating not only service management and supervision into a single process, but system management as well: it wants to run the whole thing as PID 1, a process which, if it crashes, causes the kernel to panic (at least on Linux) and thus really does bring the whole system tumbling down. (Edit: to be fair, Systemd tries hard not to actually crash, but to catch eg SIGSEGV and go into a mode of limited operation which allows the system to function enough that you can sync filesystems before shutting down).

For Dinit, in comparison, I felt no concern about having just service management and supervision all in a single process. And in fact, Dinit does support running as a system manager, within the same process – but it does not require this; Dinit’s quite happy to act as a system-level service manager but have another process be the system manager. Additionally, Dinit is just generally far simpler than Systemd (as should be clear by now).

Some people are always going to prefer breaking things up into processes that are essentially as small as possible: I can understand this to an extent, I just don’t agree that it’s always a worthwhile goal, and I don’t think that Dinit suffers from being less modular than many of the alternatives.

Robustness and failure modes

The decision to write important system-level software in non-memory-safe languages such as C and C++ has been criticised. Yet, such software continues to be written in such languages (although certain other options such as Rust and Go have been gaining traction recently).

One of the systems I haven’t mentioned up to this point is GNU Shepherd; mainly, my concern is that it’s written in Guile, an interpreted (or bytecode-interpreted) language with garbage collection – and I see both the “interpreted” and “garbage collection” parts as undesirable for system-level software (especially for a potential init). Interpreted software will be less efficient (if not in actual speed, since I’ll acknowledge that JITs can do amazing things, at least in memory usage) and garbage collection presents a similar issue. If the software was so complex that we couldn’t make it robust without using a memory-safe language/runtime – and if we weren’t willing to use Rust or another GC-less option for some reason – then perhaps the use of GC would be acceptable, but I don’t believe that’s actually the case; Dinit has so far proven to be robust, and even Systemd, despite early foibles, rarely actually crashes (even if it fails in other ways, as occasional rumbles on the web suggest).

A real concern of GC’d languages generally is, can programs in these languages be made resilient to out-of-memory conditions (are allocations even always explicit)? I haven’t looked closely enough at Shepherd to be able to pass comment, but I would not be surprised if it turned out that memory allocation failure is not something it is designed to handle (I’d be happy to be shown otherwise). Despite the low probability of an out-of-memory situation occurring, I still think it’s something that a service manager – and especially a system manager – needs to be able to deal with.

Conclusion

Well, that ends our tour of concerns. If you got this far – thanks for reading, and I hope it was interesting and informative. There are of course a lot of other aspects of service manager design – and some unique features of particular systems – but this article has gotten quite long already. Please feel free to add constructive comment, correction or discussion.

Wrap on integer overflow is not a good idea

A discussion of undefined behaviour and compiler optimisation, particularly in regards to signed integer overflow.

C (and C++) compilers are becoming notorious for exploiting the notion of undefined behaviour – the idea that certain things a program might do have no behaviour prescribed by the language standard, and that the compiler can assume the program doesn’t do these things when it is generating object code. Quite a few people have been objecting to this, since it can result in the generated code not doing what the programmer intended; the problem is becoming more noticeable over time, as compilers introduce more sophisticated optimisation techniques which are more likely to exploit the notion.

One prominent example is that of signed integer overflow. Most C programmers are developing for machines which use a 2’s complement representation of integers; addition and subtraction, with such a representation, is implemented in exactly the same way as for unsigned arithmetic. If the addition of two positive signed integers overflows – that is, if the result is larger than can be represented – the processor will produce a number that, when interpreted as a 2’s complement signed integer, will appear to be negative. This is called “wrapping” because the value has “wrapped around” from the high end of the numeric range to the low end.

For this reason, you occasionally see C code that looks something like this:

int b = a + 1000;
if (b < a) { // overflow
    puts("input too large!"); return;
}

The “if” statement is designed to detect the overflow condition (in this case from adding 1000 to the value from the variable ‘a’) and report an error. The problem is that, in C, signed integer overflow is one case of undefined behaviour. Compilers, for some time now, have performed an analysis which shows that the condition can never be true: if I add 1000 (or any positive number) to another value, the result cannot be smaller than the original value; if overflow occurred, that is undefined behaviour, and it is the programmer’s responsibility (arguably) to ensure that their program never exhibits such behaviour. Therefore, the compiler may decide that the entire if statement can be removed as an optimisation (it can never be true, it can never have an effect, it may as well not be there).

The problem with this compiler optimisation, in this case, is that it has removed the test that the programmer specifically used in an attempt to detect the overflow situation and handle it. An example of this with a real compiler can be seen here. (Side note: the godbolt.org site on which that example is hosted is great! you can edit the code and see the compiled form with a wide range of compilers. Play with it!). Observe that the overflow check is not removed if the type is changed to an unsigned integer, since unsigned overflow has defined behaviour in C (or rather, more accurately, unsigned arithmetic is defined to wrap and thus the overflow does not actually occur).

So is this wrong? Some have argued that it is, though it’s clear that many compiler vendors feel that it’s legitimate. The main arguments made by proponents of (edit: implementation-defined) wrapping overflow behaviour, if I understand them correctly, boil down to variants of the following:

  • Wrapping on overflow is a useful behaviour.
  • Wrapping is the behaviour expected by programmers.
  • Exploiting undefined behaviour semantics on overflow gives no significant benefit.
  • The C language standard, in regards to undefined behaviour, gives license for implementations “ignoring the situation completely, with unpredictable results”, but this doesn’t allow optimisations to assume that the situation for which undefined behaviour is prescribed will not come about.

Let’s look at these one by one:

Wrapping on overflow is a useful behaviour?

The main utility for a wrapping behaviour is to be able to detect overflow after it occurs. (If there are other uses that could not be handled using unsigned integers instead, I am not immediately able to think of any, and I suspect they are rare.) While this would indeed simplify the problem of avoiding the use of erroneously overflowed results, it certainly doesn’t help in all cases (consider multiplication, or addition of two unknown quantities with unknown sign).

For the trivial case where wrapping behaviour does allow simply detecting overflow after it occurs, it is also straightforward to determine whether overflow would occur, before it actually does so. The example above can be rewritten as follows:

if (a > INT_MAX - 1000) { // would overflow
    puts("input too large!");
    return;
}
int b = a + 1000;

That is, you can perform a check to see whether the result of an addition will exceed the maximum representable value, rather than performing the addition and then trying to determine whether that overflow occurred by checking if the result is mathematically inconsistent. (If the sign of both operands is unknown, the check becomes significantly more complicated, but this is also true when checking for overflow after the operation with wrapping overflow semantics).
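For example, a sign-aware pre-check for addition can be written as follows (a sketch; note also that GCC and Clang provide built-ins such as __builtin_add_overflow, which report whether overflow would occur without ever invoking undefined behaviour):

#include <limits.h>
#include <stdbool.h>

// Returns true if a + b would overflow the range of int, for any
// combination of signs, without performing the overflowing addition.
static bool add_would_overflow(int a, int b)
{
    if (b > 0) return a > INT_MAX - b;   // result would exceed INT_MAX
    if (b < 0) return a < INT_MIN - b;   // result would fall below INT_MIN
    return false;
}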

With this in mind, I’m not really convinced that wrapping overflow is generally useful.

Wrapping is the behaviour expected by programmers?

It’s more difficult to argue against this point, since clearly at least some C programmers have written code which expects wrapping semantics for signed integer overflow. However, I don’t think that this alone is a strong argument for implementing wrapping semantics by default (note that several compilers implement options for wrapping overflow, if it really is desired).

An obvious mitigation for the problem of programmers expecting this particular behaviour is for the compiler to issue a warning when it optimises based on the alternative undefined-behaviour-is-assumed-not-to-occur semantics. Unfortunately as we see in the godbolt.org link above, compilers don’t always do so (Gcc 7.3 does but 8.1 does not, so this appears to be a regression).

Exploiting undefined behaviour semantics on overflow gives no significant benefit?

If true in all cases this would be a compelling argument for having compilers default to wrap-on-overflow, since it is probably better to allow the “detect overflow after it occurs” mechanism described above to work even if it is technically incorrect – if only because that mechanism may be in use in code which is arguably broken.

I suspect that with typical C programs the benefit of this particular optimisation (removing checks for mathematically impossible conditions) is usually negligible, because C programs tend to be written by programmers who are seeking good performance and who tend to hand-optimise their code anyway: that is, if it’s obvious that a particular “if” statement has a condition that can never be true, the programmer would likely have removed the statement themselves. Indeed, a search reveals a few studies where the effectiveness of this optimisation has been questioned, tested, and found to be mostly insignificant for the particular benchmarks under test. However, while in many cases there is no benefit for C, the code generation engines and optimisers in compilers are commonly general and could be used for other languages where the same might not be so generally true; consider C++, where it is somewhat idiomatic in templated code to rely on the optimiser to remove redundant code, rather than doing it manually. There is also the case of languages being transpiled to C and relying on the C compiler to optimise away redundant code.

Also, even without overflow check elimination, it is not necessarily correct to assume that wrapping integers has minimal direct cost even on machines which use 2’s complement representation. The Mips architecture, for example, can perform arithmetic operations only in registers, which are fixed size (32 bit). A “short int” is generally 16 bits and a “char” is 8 bits; if assigned to a register, the underlying width of a variable with one of these types will expand, and forcing it to wrap according to the limit of the declared type would require at least one additional operation and possibly the use of an additional register (to contain an appropriate bitmask). I have to admit that it’s been a while since I’ve had exposure to any Mips code and so I’m a little fuzzy on the precise cost involved, but I’m certain it is non-zero and other RISC architectures may well have similar issues.

The language standard does not allow for signed integer overflow not to wrap, if that’s what the underlying architecture does?

This argument is particularly weak when examined. It essentially states that there is a requirement that “undefined behaviour” actually grants only limited license to the implementation (compiler), by the text of the standard. What the text that proponents latch on to says precisely is the following, as part of the definition of undefined behaviour:

NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, …

The claim is that “ignoring the situation completely” would not allow for assuming that the situation leading to the undefined behaviour – overflowing addition, for example – could not happen, but rather that, if it does happen, the implementation must carry on as if it did not happen but must respect the result it would obtain from asking the processor to perform such an operation (putting it another way: as if the translation from source to machine code was direct and naive).

First, we should observe that this text is in a NOTE and therefore non-normative (it may not prescribe behaviour), according to the ISO directive mentioned in the foreword of the same document:

In accordance with Part 3 of the ISO/IEC Directives, this foreword, the introduction, notes, footnotes, and examples are also for information only.

Given that the “possible undefined behaviour” appears in such a note, it is not prescriptive. Note that the actual definition text for “undefined behavior” reads:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

The important part is the final clause: there are no requirements for undefined behaviour; the list of “possible undefined behaviors” in the note contains merely examples and cannot be definitive. “Imposes no requirements” is unambiguous.

Some extend the argument to say that, regardless of what the text actually says, the intention of the language committee when those words were drafted was that the behaviour should in general match that of the underlying hardware, as closely as possible, assuming a naive translation to machine code. This may be true, though I’ve not seen any evidence (such as historical text) that supports it. Even if it were true, however, it would not necessarily apply to the current incarnation of the text.

Final thoughts

The arguments for wrapping on overflow are mostly flawed. Probably the strongest argument that can be made is a combination: it is occasionally expected by less experienced programmers (who do not understand the nuances of C and of its undefined behaviour), and is not particularly harmful to performance – however, the latter is not true in all cases, and the former is a somewhat dubious reasoning when considered by itself.

Personally, I feel that I would much rather have trap on overflow than wrap. That is, I would rather that a program crash instead of continuing with either undefined behaviour or a potentially incorrect value, either of which could be a security issue. This would certainly have a slight performance impact on most(?) architectures, particularly x86, but on the other hand it would immediately flag overflow bugs rather than allowing them to be exploited or to produce incorrect results further down the line. It could also in theory allow the compiler to safely remove redundant comparisons following a potential overflow, because it ensures that they really can’t happen, though I note that Clang and Gcc both apparently fail to take advantage of this.

Fortunately, both trapping and wrapping options are implemented by the compiler I use most often, which is Gcc. The “-ftrapv” and “-fwrapv” command line arguments can be used to enable each respectively.

There are of course a number of other causes of undefined behaviour; integer overflow is only one. I don’t necessarily think that all of these are useful and I do think that there are plenty of specific cases where the semantics should be defined by the language, or at least classified as implementation-defined. And I’m wary of compiler vendors being too liberal in their interpretation: if the compiler behaves in ways that are counter-intuitive, especially for someone who has read the language specification themselves, there is always the risk of real software bugs resulting; if the opportunities for optimisation that such an interpretation opens up are negligible, it is hardly worthwhile to adopt it. An examination of some issues around this area may be the topic of a future post.

Addendum (24 Aug 2018)

I’ve realised that much of the above could be better written. To briefly summarise, clarify, and add some minor points:

  • I was not trying to argue that undefined behaviour on overflow is preferable to wrapping, but rather that wrapping is not much better than undefined behaviour on overflow in practice. In particular, you can get security issues from wrapping behavior in much the same way as you can with undefined behaviour – and I’d argue that many security issues resulting from unchecked integer overflow, other than those which come from the compiler removing erroneous post-overflow checks, actually come from the fact that the value has wrapped around rather than any other undefined behaviour associated with the overflow.
  • The only real benefit of wrap-on-overflow is that it doesn’t cause post-overflow checks to be removed. While that might eliminate some attack vectors, it leaves open the possibility that some overflows won’t be checked for at all (i.e. the programmer did not include an overflow check) and will be uncaught.
  • If security is not a concern but speed of execution is, undefined behaviour on overflow may allow better optimisation possibilities and provide a performance benefit, at least in some cases. On the other hand if security is a concern, wrap-on-overflow potentially leaves holes open.
  • This means that between trap-on-overflow, wrap-on-overflow, and undefined-behaviour-on-overflow, I see very few cases where wrap-on-overflow should be the preferred choice.
  • Regarding post-overflow checks, I have concerns that leaving them in place could foster the wrong perception (that post-overflow checks work and are guaranteed to work; see the sketch following this list). Trap-on-overflow avoids this problem, and good warnings help to alleviate it.
  • I think that any programmer writing security-sensitive code will ideally have a good grasp of the semantics of, and the potential pitfalls of, the language they are writing in. For C, this means understanding the overflow semantics and nuances of undefined behaviour. It’s unfortunate that some C programmers still don’t seem to have that level of understanding.
  • I have seen a claim that “most C programmers expect wrapping behaviour”, but I’m not aware of any evidence that this is true. (I’ve said “some” above since I’ve seen anecdotal evidence and I doubt this would be disputed anyway).
  • There are two separate issues: one is what the C language standard should require, and another is what the compilers should implement. I’m (somewhat) ok with the language standard specifying undefined behaviour for overflow. This post is arguing for what the behaviour of compilers should be.
  • Trap-on-overflow need not require that every operation is checked for overflow; ideally it would only mandate that the program either behaves in a mathematically consistent manner or aborts, which allows for "temporary overflow" that doesn't generate an incorrect result. This would allow optimising "a + b - b" to "a" (which wrapping also does) and "(a * b) / b" to "a" (which wrapping doesn't).
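To make the point about post-overflow checks concrete, here is a sketch (the function names are made up for the example) contrasting an erroneous check performed after the addition, which the compiler is allowed to delete, with a correct check performed on the operands beforehand:

    #include <climits>

    // Post-overflow check: relies on the sum having wrapped. Since signed
    // overflow is undefined, the compiler may assume the condition can never
    // hold and remove the test entirely.
    bool add_checked_badly(int a, int b, int *result)
    {
        int sum = a + b;    // undefined behaviour if this overflows
        if ((b > 0 && sum < a) || (b < 0 && sum > a)) return false;
        *result = sum;
        return true;
    }

    // Pre-overflow check: tests the operands before adding, so no overflow
    // (and hence no undefined behaviour) ever occurs.
    bool add_checked(int a, int b, int *result)
    {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) return false;
        *result = a + b;
        return true;
    }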

Addendum (21 Nov 2018)

It turns out that the GCC implementation (circa 8.2) of -ftrapv is terrible – it doesn’t detect all overflow cases, and it causes a large performance degradation due to farming out most arithmetic operations to support functions (i.e. it imposes a function call overhead for a simple addition). The Clang/LLVM implementation is much better.

 

Escape from System D, episode V

Well, yes, I’m still working on Dinit, my portable and “lightweight” intended-as-an-alternative to Systemd. The first commit was on August 27, 2015 – just under three years ago – and my first announcement about Dinit on this blog was on June 14 last year. In looking up these dates, I’m surprised myself: I was working on Dinit for two years before I wrote the introductory blog post! It didn’t feel like that long, but it goes to show how long these things can take (when you’re working as a one-man development team in your spare time).

I recently issued a new release – 0.2.0, still considered alpha – with some new features (and bugfixes), and am planning a 0.3.0 release soon, but progress certainly has been slow. On the other hand, things really have come a long way, and I'm looking forward to being able to call the software "beta" rather than "alpha" at some point soon (though I suppose it's an open question whether those terms really mean much anymore). One year in seems like a good time for a retrospective, so here it is; I'll discuss a number of things that occur to me about the experience of developing some non-trivial software as a lone developer.

On software quality

One thing that’s always bothered me about open-source projects, although it’s not universally true, is that the quality isn’t always that great. There are a huge number of half-done software projects out there on Github (for example), but more importantly there are also a large number of 95% done projects – where they are basically working, but have a number of known bugs which have been sitting in the issue tracker for a year or more, and the documentation is mostly-correct but a bit out-of-date and some of the newer features aren’t mentioned at all. Build documentation is often seen as optional; you can always “just run ./configure --help” though of course it’s not entirely clear what all the options do or how they affect the result, and in my experience the chance that a configure script correctly checks for all the required dependencies is pretty low anyway.

Take the source of any major project, even an established one, and do a search for “TODO” and “XXX”, and the results are often a little disturbing. I try to avoid those in Dinit, though to be fair the count is not zero. There are some in Dasynq (the event-loop library which I’ve also released separately), and some in Dinit’s utility programs (dinitctl and shutdown), but at least there are none in the Dinit core daemon code. But keeping it that way means consistently going back over the code and fixing the things that are marked as needing fixing – or just avoiding creating such holes in the first place. By the time I release version 1.0 I’d like to have no TODO comments in any of the Dinit code.

Documentation is another thing that I’ve been very careful about. Whenever I add any feature, no matter how small, I make sure that the documentation gets updated in the same or the very next commit. I’m glad to say that the documentation is in really good shape; I plan to keep it that way.

Also, tests are important. I don’t enjoy writing them, but they are really the only way I can ensure that I don’t cause regressions when I make changes or add new features, and it is satisfying to see all those “PASSED” lines when I run “make check”. I still need to add more tests, though; some parts of the code, particularly the control protocol handling and much of the service description loading, don’t have tests yet.

On autoconf and feature checks and portability

Dinit doesn’t use autoconf and doesn’t have a “configure” script. Basic build settings like compiler and compiler switches are specified in a configuration file which must be hand-edited, though this process isn’t onerous and will generally take all of a whole minute. I wouldn’t be against having a script which would probe and determine those particular settings but I also don’t see a strong need for such a thing.

In terms of system call features, Dinit largely sticks to POSIX, and in the few cases where it doesn’t, it uses an #ifdef (e.g. `#if defined(__FreeBSD__)`). The latter probably isn’t ideal, but the danger of feature checks for system calls is that they can usually only check for the existence of a function with a particular name, not that it does what we need it to do. I’d rather require you to explicitly specify in the build configuration that such-and-such a call is available with the right semantics than just check that it exists and blindly assume that it is what we think it is; checking for specific systems seems like a reasonable compromise, at least during development.
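The pattern looks roughly like the following sketch (illustrative only, not actual Dinit code; closefrom() is a call the BSDs provide but plain POSIX does not):

    #include <unistd.h>
    #include <sys/resource.h>

    // Close all file descriptors >= first, using a platform-specific call
    // where we know it exists, and a portable fallback otherwise.
    static void close_fds_from(int first)
    {
    #if defined(__FreeBSD__) || defined(__OpenBSD__)
        closefrom(first);
    #else
        int limit = 1024;   // fallback if the limit can't be determined
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY) {
            limit = (int) rl.rlim_cur;
        }
        for (int fd = first; fd < limit; fd++) {
            close(fd);
        }
    #endif
    }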

As it is now, if you run a current version of Linux, FreeBSD, OpenBSD or MacOS then you can build by editing a single file, uncommenting the appropriate section, and then running GNU make. I’ve also experimented briefly with building it on Sortix but ran into an issue that prevented me from getting it working.

On contributions (and lack thereof)

I’ve had one very minor contribution, from the one person other than myself who I know actually uses Dinit (he also maintains RPM packages of Dinit for Fedora and CentOS). I do sometimes wish that others would take an interest in the development of Dinit, but I’m not sure if there’s any way I can really make that happen, other than by trying to generate interest via blog posts like this one.

What I really should do, I guess, is clean up the presentation a bit – Dinit’s README is plain text, whereas a markdown version would look a lot more professional, and I really should create a web page for it that’s separate to the Github repository. But whatever I do, I know I can’t be certain that other contributors will step forward, nor even that more than a handful of people will ever use the software that I’m writing.

On burnout (and avoiding it)

Keeping the momentum up has been difficult, and there have been some longish periods where I haven’t made any commits. In truth, that’s probably to be expected for a solo, non-funded project, but I’m wary that a month of inactivity can easily become three, then six, and then before you know it you’ve actually stopped working on the project (and probably started on something else). I’m determined not to let that happen – Dinit will be completed. I think the key is to choose the right requirements for “completion” so that it can realistically happen; I’ve laid out some “required for 1.0” items in the TODO file in the repository and intend to implement them, but I do have to restrain myself from adding too much. It’s a balance between producing software that you are fully happy with (something that feels complete and polished) and actually getting it finished.

On C++

I’ve always thought C++ was superior to C and I stand by that, though there are plenty who disagree. Most of the hate for C++ seems to be about its complexity. It’s true that C++ is a complex language, but that doesn’t mean the code you write in it needs to be difficult to understand. A lot of Dinit is basically “C with classes (and generic containers)”, though I have a few templates in the logging subsystem and particularly in Dasynq. I have to be very careful that the code is exception safe – that is, there’s nowhere that I might generate an exception and fail to catch it, since that would cause the process to terminate (disastrously if it is running as “init”) – but this turns out to be easy enough; most I/O uses POSIX/C interfaces rather than C++ streams, and memory allocation is carefully controlled (it needs to be in any case).
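As a rough illustration of what that looks like in practice (a made-up function, not actual Dinit code): anything that might allocate is handled so that an allocation failure is reported through a non-allocating path rather than escaping as an uncaught exception.

    #include <string>
    #include <vector>
    #include <new>
    #include <cstdio>

    // Attempt to record a service name; on allocation failure, report the
    // problem without allocating and let the caller carry on.
    bool add_service_name(std::vector<std::string> &names, const char *name)
    {
        try {
            names.emplace_back(name);   // may throw std::bad_alloc
            return true;
        }
        catch (std::bad_alloc &) {
            std::fputs("out of memory\n", stderr);
            return false;
        }
    }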

I could have written Dinit in C, but the code would be quite a bit uglier in a number of places, and quite frankly I wouldn’t have enjoyed writing it nearly as much.

Of course there are other languages, but most of the “obvious” choices use garbage collection (I’d rather avoid this since it greatly increases memory use for comparable performance, and it often comes paired with a standard library / runtime that doesn’t allow for catching allocation failures). Rust might seem to be a potential alternative which offers memory safety without imposing garbage collection, but its designers made the unfortunate choice of having memory allocation failure cause termination – which is perhaps ok for some applications, but not in general for system programs, and certainly not for init. Even if it weren’t for that, Rust is still a young language and I feel like it has yet to find its feet properly; I’m worried it will mutate (causing maintenance burden) at a rate faster than the more established languages will. It also supports fewer platforms than C++ does, and I feel like non-Linux OSes are always going to be second-class citizens for Rust. Of course I hope to be proved wrong, but the panic-on-OOM issue still makes Rust a non-starter for this particular project.

On Systemd

Even when I announced Dinit, after working on it for some time, I struggled to explain exactly why I don’t like Systemd. There have been some issues with its developers’ attitudes towards certain bugs, and their habit of changing defaults in ways which break established workflows and cause problems that many saw as unnecessary (the tmux/screen issue, for example), but there have been few specific technical issues that couldn’t be classified as one-off bugs.

I think what really bothers me is just the scope of the thing. Systemd isn’t an init system; it’s a software ecosystem, a whole slew of separate programs which are designed to work together and to manage various different aspects of the system, not just manage services. The problem is, despite the claims of modularity, it’s somewhat difficult to separate out the pieces. Right from the start, building Systemd, you have a number of dependencies and a huge set of components that you may or may not be able to disable; if you do disable certain components, it’s not clear what the ramifications might be, whether you need to replace them, and what you might be able to replace them with. I’d be less bothered if I could download a source bundle just for “Systemd, the init daemon” and compile that separately, and pick and choose the other parts on an individual basis in a similar way, but that’s just not possible – and this is telling; sure, it’s “modular” but clearly the modules are all designed to be used together. In theory you may be able to take the core and a few select pieces, but none of the distributions are doing that, and so it’s not clear that it really is possible.

Also, I think it’s worth saying that while Systemd has a lot of documentation, it’s not necessarily good documentation. For example (from here):

Slices do not contain processes themselves, but the services and slices contained in them do

Is it (a) slices do not contain processes or (b) slices do contain processes?

This is just one example of something that’s clearly incorrect, but I have read much of the Systemd documentation a number of times and still struggled to find the exact information I was looking for on any number of occasions. And if you’re ever looking for details of internals / non-public APIs – good luck.

Regardless of whether Systemd’s technical merits and flaws are real, having another option doesn’t seem like a bad thing; after all, if you don’t want to use it, you don’t have to. I’m writing Dinit because I see it as what Systemd could have been: a good and reliable standalone service manager with dependency management that can function as a system init.

On detractors and trolls

I guess you can’t take on something as important as an init system and not raise some eyebrows, at least. Plenty of comments have been made since I announced Dinit that are less than positive:

(for the record, not trolling, not a newbie – if that is even a bad thing. And it is both stable and crossplatform).

Or this one:

(If you say so, though I can see some irony in accusing someone of hubris and then immediately following up with a tweet essentially claiming that you yourself are the only person in the world who understands how to do multi-process supervision).

Maybe I brought the last one on myself to some degree by saying that I was aware I could be accused of NIH and that I didn’t care – I was trying to head off this sort of criticism before it began, but may have inadvertently had the opposite effect.

Then, there’s the ever-pleasant commentary on hacker news:

>I’m making an init system

Awesome, maybe I won’t have to!

>C++

Whelp, nevermind.

(Dear Sir_Cmpwn of hacker news: I am quietly confident that my real init system written in C++ is better than your vapour-ware init system that is written in nothing).

And of course on Reddit:

> It will be both efficient and maintainable. It will be stable. Solid-as-a-rock stable.

Author does not have any tests whatsoever and uses a memory unsafe language. I don’t see how he wants to achieve the above goals.

(I know that it is difficult to believe, but truly, it is possible to write tests after you have written other code).

Anyway, this is the internet; of course people will say bad (and stupid) things. There were plenty of positive comments too, such as this one from hacker news:

I’m not a detractor, but there are many things systemd can still improve, but it feels we’re kind of stuck. I’m quite happy if we have some competition here.

Yes! Thank you. There were also some really good comments on my blog posts, and some good discussion elsewhere including on lobste.rs. Ultimately I’ve had probably as much positive as negative feedback, and that’s really helped to keep the motivation up.

The worst thing is, I’ve been guilty of trash-talking other projects myself in the past. I’ve only done so when I thought there were genuine technical issues, and usually out of frustration from wanting software to be better, but that’s no excuse; it doesn’t feel good when someone says bad things about software (or other work) that you created. If only one good thing comes from writing Dinit, it’s that I’ve learned to rein in my rants and focus on staying objective when discussing technical issues.

I guess that’s about a wrap – thanks for reading, as ever. Hopefully next time I write about Dinit it’ll be to report on all the great progress I’ve made since now!