Readiness protocol problems with Unix dæmons

One of the notions that some Unix and Linux dæmon management subsystems incorporate is that of service readiness. Such service management subsystems incorporate a notion of distinct "started"/"spawned" and "ready"/"running" service states, and a notion of one service depending from other services being both active and ready. A dæmon may be active simply by dint of there being a process running, but it is not ready until it has completed its initialization, opened the server ends of whatever client-server mechanisms it uses, and is about to enter its main request processing loop. (For some servers — such as, for example, RabbitMQ when it has a lot of persistent messages saved on disc — this difference is a period of time that can be measured in minutes rather than milliseconds.) Of course, only the service program itself can determine exactly when this point is.

Not starting dependent services until a depended-from service is ready, along with early socket opening, addresses the thundering herd problem of parallelized client-server startup. In the thundering herd model, clients are simply started and restarted blindly by the service manager until they "stick"; the clients abend repeatedly with errors caused by an unreachable server until the server is both up and ready. Early socket opening changes the abend into the client blocking, attempting to send a request to or attempting to read the response from the server over the socket, until the server processes socket messages. Waiting for services to notify that they are ready assists with the client-server protocols that are not socket based, or with servers that cannot be persuaded to do early socket opening.
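
As a minimal sketch of early socket opening, assuming a local-domain stream socket at a hypothetical path: the listening socket exists before the server is ready, so a client that starts early has its connection queued instead of abending. The function name and the socket path are illustrative only and are not taken from any service manager.

```python
import os
import socket
import tempfile


def early_open_demo():
    """Show a client connecting before the server has begun accepting."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "service.sock")  # hypothetical path

        # Opened early, before the server is ready to serve anything.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(8)

        # A client started "too early" does not fail: its connection is
        # queued and its request waits in the socket buffer.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"request\n")

        # Later, the server finishes initializing and starts serving.
        connection, _ = listener.accept()
        with connection.makefile("rb") as stream:
            request = stream.readline()

        for sock in (connection, client, listener):
            sock.close()
        return request
```

A client that went on to read a response would simply block at that read until the server got around to answering.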

The notion of service readiness is expressed in upstart via an expect stanza in the service's job file, and in systemd via the Type parameter in the service's service unit; both of which select from a number of readiness protocols.

Implemented readiness protocols
service management subsystem			The service is considered ready …
upstart	systemd	s6	The service is considered ready …
no `expect`	`Type=simple`	default	… as soon as it starts.
`expect fork`	`Type=forking`	N/A	… after the main process `fork()`s a child and then `exit()`s.
`expect stop`	N/A	N/A	… after the main process `raise()`s a `SIGSTOP` signal.
`expect daemon`	N/A	N/A	… after the main process `fork()`s a child, that child `fork()`s a child, and then both parent and first child `exit()`.
N/A	`Type=oneshot`	N/A	… after the main process `exit()`s.
N/A	`Type=dbus`	N/A	… after a named server, given in the service unit, appears on the Desktop Bus.
N/A	`Type=notify`	N/A	… after a `READY=1` text message is sent over the notification message socket.
N/A	N/A	`notification-fd`	… after a newline is sent over the notification file descriptor.
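
For illustration, the dæmon side of the `Type=notify` row might look like the following sketch. It relies on the documented behaviour of sd_notify(), which is not spelled out in the table above: the socket address arrives in the NOTIFY_SOCKET environment variable, the socket is a local-domain datagram socket, and a leading `@` denotes a Linux abstract-namespace address. The function name `notify_ready` is made up for this example.

```python
import os
import socket


def notify_ready(environ=None):
    """Send READY=1 to the notification socket, if one was provided.

    Returns False when no NOTIFY_SOCKET is set, i.e. when the daemon is
    not running under a manager that speaks this protocol.
    """
    if environ is None:
        environ = os.environ
    address = environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # Leading "@" means a Linux abstract-namespace socket address.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", address)
    return True
```

The call belongs at the point where initialization is complete and the main request processing loop is about to begin.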

Protocol mismatch

The upstart Cookbook warns about a few of the problems of readiness protocol mismatch, where one readiness protocol is specified for the service in its service management configuration, but the service program itself actually employs another. There are a number of them, applying not only to upstart but the other service managers as well, and they fall into four broad categories:

- The service manager can think that the service has terminated when it has not. This happens when the service program fork()s a child process and exit()s the main process, and the service manager was not told this, or was not told the number of times that fork-child-exit-parent would happen. To the service manager, the parent process exiting signals that the service has terminated. Various things can result:
  - The service manager tries to auto-restart the server, over and over, leading to multiple instances of the (forked children) service processes. Depending from what interlocking occurs in the service program itself, this can result in either multiple concurrent services or lots of ephemeral processes that start up, fail to lock resources, and then exit. In either case, it results in a lot of logfile traffic, both from the service manager and from the services themselves.
  - The service manager incorrectly marks the service as "stopped" but leaves the forked children running, thereby losing track of the actual state and not knowing that it still has to terminate processes in order to bring the service down.
  - The service manager incorrectly marks the service as "stopped", and to make sure that it doesn't lose track of the actual state terminates all of the forked children, thereby killing a running service.
- The dæmon just hangs. This happens when the service program employs the upstart-specific SIGSTOP protocol and it's not running under upstart. Other service managers don't auto-continue stopped dæmons, because they permit SIGSTOP as a means for administrators to explicitly pause dæmon processes.
- The service manager can think that the service never becomes ready when it has. This happens when the service manager is expecting one form of explicit notification and the service program sends another, or doesn't send one at all. Various things can result:
  - The service manager just waits indefinitely for a notification that never comes. This blocks the start of dependent services.
  - The service manager waits for a (configurable) timeout, and then incorrectly marks the service as "failed", leaving the forked children running and thereby losing track of the actual state and not knowing that it still has to terminate processes in order to bring the service down.
  - The service manager waits for a (configurable) timeout, and then incorrectly marks the service as "failed", and to make sure that it doesn't lose track of the actual state terminates all of the forked children, thereby killing a running service.
- The service manager can think that the service is ready when it is still initializing and will fail before becoming ready. The result is that dependent services are started too early, when there is not yet a dependency to serve them.
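
The hang in the SIGSTOP case follows from how that handshake works. The sketch below is not upstart's code; it assumes only the standard waitpid() behaviour with WUNTRACED, and shows a dæmon stopping itself and a manager that notices the stop and continues it. Take the SIGCONT away, as happens under a manager that does not speak the protocol, and the dæmon stays stopped.

```python
import os
import signal


def expect_stop_handshake():
    """Return (manager_saw_stop, daemon_exited_cleanly)."""
    pid = os.fork()
    if pid == 0:
        # Daemon side: initialization done, so stop ourselves to say "ready".
        os.kill(os.getpid(), signal.SIGSTOP)
        # Execution only gets here once something sends SIGCONT.
        os._exit(0)

    # Manager side: WUNTRACED makes waitpid() report stopped children too.
    _, status = os.waitpid(pid, os.WUNTRACED)
    saw_stop = os.WIFSTOPPED(status) and os.WSTOPSIG(status) == signal.SIGSTOP

    # A manager speaking this protocol now continues the daemon.
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, 0)
    exited_cleanly = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return saw_stop, exited_cleanly
```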

upstart's failure modes generally involve losing track of the actual state, the example in the Cookbook showing the case where it has marked a service as "stopped" when the forked children are actually running and the service is active. systemd's failure modes generally involve terminating the forked children and rendering a successfully activated service inactive. Asking why systemd is terminating a successfully started service, with nothing apparently wrong in the service's own logs, after the default wait-for-readiness timeout of 90 seconds is a common support question that can be found in many discussion and Q&A fora.
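
The losing-track failure can be reproduced in miniature. In the following sketch (an illustration only, with made-up names, and not how either service manager is implemented) a "manager" that watches just the main process sees it exit, whilst a pipe held open by the forked child shows that the service is in fact still running.

```python
import os
import select


def run_forking_daemon():
    """Return (main_process_exited, forked_child_still_running)."""
    alive_r, alive_w = os.pipe()  # stays open for writing while the child lives
    stop_r, stop_w = os.pipe()    # the manager closes stop_w to end the demo

    main_pid = os.fork()
    if main_pid == 0:
        os.close(alive_r)
        os.close(stop_w)
        if os.fork() == 0:
            # Forked child: the actual service. Blocks until told to stop.
            os.read(stop_r, 1)
            os._exit(0)
        os._exit(0)  # main process exits straight away ("daemonization")

    os.close(alive_w)
    os.close(stop_r)

    # A manager told nothing about forking sees only this exit.
    _, status = os.waitpid(main_pid, 0)
    main_exited = os.WIFEXITED(status)

    # No end-of-file on the pipe means that the forked child is still alive.
    readable, _, _ = select.select([alive_r], [], [], 0.2)
    child_still_running = not readable

    os.close(stop_w)     # let the forked child finish
    os.read(alive_r, 1)  # returns b"" once the child has exited
    os.close(alive_r)
    return main_exited, child_still_running
```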

No-one speaks the `forking` protocol.

Practically no-one speaks the forking (single or multiple) readiness protocol in the wild. This is for two reasons.

The first reason is simple: the protocol is highly specific. It doesn't cover just any old pattern of fork()ing child processes. Witness:

- The unnecessary horror that is ossec-control forks great-great-grandchildren processes, which doesn't match the forking readiness protocol and banjaxes systemd. (There's one OSSEC dæmon that needs slightly different treatment.)
- Wrapping Apache Tomcat in extra layers (a tale from the systemd House of Horror) doesn't match the forking readiness protocol and banjaxes systemd.
- Wrapping WSO2 Carbon Server in extra layers (a tale from the systemd House of Horror) doesn't match the forking readiness protocol and banjaxes systemd.

The second reason is more subtle: dæmon programs are not signifying readiness when they fork() child and exit() parent. They are attempting to do something else, which is nothing to do with readiness. The forking readiness protocol is an opportunistic re-use of widespread existing behaviour, but that behaviour isn't actually right for such re-use.

Indeed, it isn't actually right per se. What they are doing is trying to let system administrators start dæmons the 1980s way, where a system administrator could log in to an interactive session and simply start a dæmon by running its program from the interactive shell. The fork()ing is part of a notion known as "dæmonization". It's widely believed, and commonly implemented, which is why it was opportunistically re-used. However, it is a widely held fallacy, common because a lot of books and other documents that just repeated the received wisdom of the 1980s perpetuated the fallacy, and because AT&T Unix System 5 rc was based upon this fallacy. It does not and cannot work safely, cleanly, and securely in systems since the 1980s and system administrators have three decades' worth of war stories that they tell about its failing. Dæmons simply should not vainly try to "dæmonize". That is something that the upstart Cookbook has been recommending since 2011, something that people using daemontools family toolsets have been recommending since the late 1990s, and something that IBM has been recommending since the early 1990s (when AIX's System Resource Controller came along).

Opportunistically re-using this ill-founded behaviour as a readiness protocol conflicts with its actual intent and implementation. If one wants to "dæmonize", so that a system administrator gets the shell prompt back, one does it early so that the system administrator doesn't wait around whilst the dæmon gets on with things asynchronously. And one finds that most programs that (singly or multiply) fork() child and exit() parent do so long before they have finished initialization and the service is actually ready to serve clients. Indeed, in many programs this is actually done first, before any initialization. This is because doing otherwise would lead to partially initialized resources that would then need to be cleaned up in the parent process; and to problems where linked-in software libraries might have done things as part of their initialization like spawning internal threads that the main program isn't even aware of, and that won't be carried into the forked child process, thus confusing the software library whose thread it is and leading to deadlocks, faults, and failures. It's actually problematic program design to fork() after all initialization when the program is finally ready.

At the time of the Debian systemd packaging hoo-hah several people opined that for best results dæmon programs should be altered to employ one of the protocols that actually is a readiness protocol at base, rather than relying upon this faulty reinterpretation of the "dæmonization" mechanism. This led to an analysis of the various other mechanisms.

Several incompatible protocols with low adoption

There is a wide choice of non-forking readiness protocols, some proposed, some implemented in service managers.

Scott James Remnant (2008-08-12). upstart 0.5.0 released. upstart-devel mailing list.
Lennart Poettering (2010). sd_notify(). systemd manual pages. Freedesktop.org.
James Hunt (2012-05-01). Upstart service readiness. Ubuntu blueprints. Canonical.
Ian Jackson (2013-12-28). init system dæmon readiness protocol. Debian Bug #733452.
Laurent Bercot (2015). Service startup notifications. s6. skarnet.org.

None of these are compatible with one another, the two closest being Ian Jackson's (unimplemented) proposal and Laurent Bercot's proposal that is implemented in s6. The only translation layer currently in existence is Laurent Bercot's sdnotify-wrapper which translates from the s6 protocol to the systemd text message (Type=notify) protocol.

There is also a fairly low adoption rate in the wild, in actual services that are supposed to be speaking these protocols, for even the implemented ones.

The most widely adopted is the Desktop Bus service readiness protocol, implemented in systemd as Type=dbus and proposed by James Hunt (alongside several other protocols, notice) for upstart. The upstart people rejected the idea on the grounds that "there are none that implement this correctly" and "because services don't actually do this in a non-racy manner". Sadly, just like in the case of the upstart people's critique of System 5 rc the specifics of these vague generalizations, explaining where the claimed races are, are once again left entirely to the reader. Unlike in that case, here there are not numerous better critiques from other people to refer to instead, with no-one else making the same claim about Desktop Bus readiness notification.

Adoption limited by deliberate crippling of servers that nominally have adopted the protocols

In some cases, deliberate crippling has had the result of limiting adoption.

From 2010 until 2014, the systemd notification protocol was provided by the systemd authors as (in Lennart Poettering's words from 2011) "drop-in .c sources which projects should simply copy into their source tree" that are "liberally licensed, [and] should compile fine on [even] the most exotic Unixes". However, on non-Linux platforms the code simply compiled to empty functions. The motivation for this on the parts of the systemd authors is clear: to avoid the charges levelled by detractors that "extra systemd code is now in my programs when this isn't even a systemd operating system", enabling them to point out that the "extra systemd code" is a function that returns zero and does nothing else.

Choosing to design based upon such charges has, however, led to the situation where a systemd-compatible system, that speaks the systemd notification protocol on non-systemd Linux operating systems, is a non-starter as an idea. (There's nothing inherent in a client sending a text message down a socket that limits it to systemd Linux operating systems.) The server programs that supposedly can speak the protocol, because their developers have done the recommended thing and used the systemd-author-supplied library, actually do not; and readiness notification the systemd way fails to work.

Ironically, servers that have rolled their own client code for the systemd protocol, such as Pierre-Yves Ritschard's notify_systemd() in collectd and Cameron T Norman's notify_socket(), enable adoption of the protocol where the systemd-author-supplied libraries do not.

The people who rolled their own code also haven't suffered from the systemd authors revoking their MIT copyright licence. In 2014, with much less fanfare than the original announcement of a liberally licensed library for use on even "exotic Unixes", systemd author Kay Sievers changed the MIT copyright licence to a GNU one. (Strictly speaking, that was illegal, as Kay Sievers was not the copyright owner and licensor; Lennart Poettering was.) The code had also changed to no longer be "drop-in .c sources". It is now, rather, a shared library that isn't available for servers to link to outwith Linux systemd operating systems at all.

Security of the service manager

A lot is made of the relative simplicity of implementing the various protocols in the programs for the managed services. Not much is thought about the problems of the manager-side implementation, or of general IPC security good practices.

The service manager is a trusted program that runs with superuser privileges and no security restrictions. It does so because the task of spawning a service involves applying security restrictions and switching to unprivileged accounts, passing through various kinds of one way doors, in a multiplicity of combinations peculiar to individual services (or groups of services). A readiness notification protocol is a client-server mechanism where the service programs are the clients, and the service manager is the server. Moreover, if the clients were trusted, the security restrictions under which they run wouldn't exist in the first place. All of the generally accepted wisdom about client-server interactions between servers that are not security restricted and clients that are untrusted thus applies.

not trusting client-supplied data

One such piece of wisdom is not trusting client-supplied data. Clients are potentially compromised or malicious, and can supply erroneous or outright incorrectly structured data. They can send requests to servers that aren't expecting them; they can send large amounts of data for overflowing buffers; they can attempt to hijack existing client sessions; they can do all sorts of things. Service managers have to take this into account, and it affects protocol design.

Avoiding requests from clients other than the intended ones influences the design of the systemd, s6, and Ian Jackson readiness protocols.

In the systemd case, anyone can connect to the notification sockets. They are, after all, designed to be reachable even by dæmons running in chroot environments under the aegises of unprivileged user accounts. systemd requires administrators to determine and to configure which client requests, coming down the notification socket, are not simply read and then discarded by the service manager. In the default setup for services configured to use Type=notify, only the "main" process of a service is recognized as a legitimate client, and only that process can hand over the "mainpid" baton to another process.
The s6 and Jackson protocols instead adopt the approach that the dæmons are not responsible for opening the client ends of the sockets. The service manager does that, arranging for the dæmon to inherit an open file descriptor, and is the only thing that does, and even can, open client sockets. The number of the descriptor is "known" to the service program and is specified to the service manager in the service configuration.
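
A minimal sketch of that inherited-descriptor arrangement follows. The descriptor number 3 and the function names are assumptions made for the example; in a real setup the number is whatever the service configuration says it is. The point to notice is that the dæmon never opens anything: it just writes to a descriptor that the manager arranged for it to have.

```python
import os

NOTIFICATION_FD = 3  # hypothetical; agreed through the service configuration


def daemon_main():
    # ... initialization would happen here ...
    os.write(NOTIFICATION_FD, b"\n")  # "I am ready"
    os.close(NOTIFICATION_FD)


def spawn_and_wait_ready():
    """Manager side: create the pipe, hand the write end to the daemon."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_end)
        if write_end != NOTIFICATION_FD:
            # Move the write end onto the agreed descriptor number.
            os.dup2(write_end, NOTIFICATION_FD)
            os.close(write_end)
        daemon_main()
        os._exit(0)

    os.close(write_end)  # only the daemon now holds the client end
    data = os.read(read_end, 1)
    os.close(read_end)
    os.waitpid(pid, 0)
    return data == b"\n"
```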

guarding against I/O malfeasance

Restricting the clients doesn't address the whole concern, however. A second piece of wisdom is that servers must guard against the vagaries of the inter-process communication mechanism and the simple acts of client-server I/O. Servers must be careful that clients cannot trigger things like SIGPIPE in the server, at the read end of any pipes that it uses. They must be careful to ensure that clients cannot employ tarpit attacks, where they provide requests very slowly or pause mid-request, to block the service manager or to starve other clients of I/O channels and resources. Even just reading client messages simply in order to discard them requires that a server be written carefully to avoid buffer overflow attacks against the read buffer.

This is part of the motivation for the s6 and Jackson protocols not letting clients that are "strangers" open the client end of the socket at all. It narrows the field of unrecognized clients, whose requests have to be read, from the entire system to just those processes who inherited the open file descriptor. In addition, the s6 protocol is a stream protocol not a packetized one, and a service manager speaking it can read client requests byte by byte, with no potential for buffer overflows from doing so.
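
The byte-by-byte reading, combined with a deadline to blunt tarpit behaviour, can be sketched as follows. This is not s6's implementation, and ignoring every byte other than the newline is an assumption of the example. A real service manager would fold this into its event loop rather than block in one function.

```python
import os
import select
import time


def wait_for_newline(fd, timeout):
    """Return True if a newline arrives on fd within timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # too slow: treat as never having become ready
        readable, _, _ = select.select([fd], [], [], remaining)
        if not readable:
            return False  # deadline passed with nothing to read
        byte = os.read(fd, 1)  # one byte at a time; nothing to overflow
        if byte == b"":
            return False  # end of file: closed without notifying
        if byte == b"\n":
            return True
        # Any other byte is simply dropped (an assumption of this sketch).
```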

avoiding parsing

There's yet more for the service manager to guard against: Even the recognized clients could be compromised. After all, they are themselves servers (the whole point of readiness protocols being to notify the system when a server is fully initialized and ready to serve) who themselves have to not trust client-supplied data.

Which brings us to a third piece of wisdom: avoid parsing, famously one of the qmail security principles. A readiness notification protocol is, by its very nature, a program-to-program protocol. Marshalling it into and out of a human-readable format, particularly an only loosely structured one, is pointless, inefficient, and the introduction of extra risk. The risk is that a compromised or malicious client can take advantage of quirks or outright bugs in that marshalling system, with improper quoting, malformed numbers and strings, unexpected zeroes, negative numbers, large numbers, long strings, incorrect data types, injected commands in strings, unexpected character encodings, NULs, and other such things. That risk, moreover, centres inside the code of a privileged service manager process, which is at the very least running with unrestricted security access, and sometimes running as the distinguished process #1 whose abend can have dire consequences (because the kernel abends the whole system in sympathy).

Ironically, a year and a bit after this Frequently Given Answer was first published, this risk became a stark reality in systemd. An invalid message sent by a faulty client broke parsing that was being done inside a single privileged process that was both service manager and system manager. A simple zero-length message to the notification socket caused the service manager either to crash completely (CVE-2016-7795) or to cease processing readiness notifications (CVE-2016-7796). Lennart Poettering had removed an error check in the notification message parsing code that filtered out zero-length messages. This left the flow of control to fall through to a later point where an assertion that the message length was non-zero then triggered. That assertion had been added a year earlier by Lennart Poettering with the assumption that this check for zero was in place, as it had been at the time.

Ironically, the faulty client that needed to be guarded against was systemd's own systemd-notify tool, which generated a zero-length message when given a zero-length command-line argument. According to its doco, a zero-length command-line argument should have resulted in a notification message of length 1, containing a single terminating LF byte.
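
To make the shape of the problem concrete, here is a sketch of the kind of guard involved. It is not systemd's code, which is written in C and is not reproduced here. It assumes only the documented message format of newline-separated KEY=VALUE assignments; the size cap and the function name are invented for the example.

```python
MAX_MESSAGE = 4096  # hypothetical upper bound on a notification datagram


def parse_notification(datagram):
    """Parse newline-separated KEY=VALUE fields, tolerating bad input.

    Empty and oversized datagrams are rejected up front instead of being
    allowed to reach later code that assumes a non-zero length.
    """
    if not datagram or len(datagram) > MAX_MESSAGE:
        return {}
    fields = {}
    for line in datagram.split(b"\n"):
        key, separator, value = line.partition(b"=")
        if not separator or not key:
            continue  # not a KEY=VALUE assignment; skip it
        fields[key] = value
    return fields
```

Even this much code has several places where a mistake could creep in, which is rather the point of the avoid-parsing principle.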

using a synchronous protocol when pulling client credentials

The systemd protocol demonstrates in practice the problems of an asynchronous protocol, where the client does not wait for the notification to be delivered to the server but just sends the message blindly and carries on executing, when part of the protocol involves pulling kernel-maintained information about the client-end process. This is a result of the systemd design painting things into a corner (and also, incidentally, applies to systemd's journalling mechanism as well as to its readiness notification protocol).

The systemd journal is a single point of failure where all log data in the system are funnelled through a single central process. By design, it requires that the systemd journal dæmon be wary of performing synchronous operations, where it blocks waiting for the operation to execute, that might cause log data to be generated. (The daemontools family way of logging, in contrast, is to have a distributed system with separate log services for each service, and separate log services for those services and the service and system managers. Laurent Bercot refers to this as a "logging chain". It is a distributed desynchronized system with no centralized synchronized bottleneck through which all log data have to flow.)

The systemd journal dæmon uses the systemd readiness notification protocol. As such, that protocol is forced to be asynchronous in part because of deadlocks in the systemd journal dæmon sending notifications like the one that is described in systemd bug #1505. In theory, this could mean a special notification protocol used just by the systemd journal, as it is the only dæmon where a readiness notification causes the service manager to generate log data that comes back to the dæmon and thus the only dæmon that needs a special case, with all other clients using a regular synchronous form of the protocol. But what is actually implemented is that every protocol client uses the same asynchronous protocol.

This includes systemd-notify, the command-line tool for invoking the protocol from within shell scripts. As was observed back in 2011, however, systemd-notify directly exposes the design flaw in an asynchronous readiness notification protocol.

It sends the readiness notification message to the service manager, and simply exits, without waiting for the service manager to receive and process the notification. The service manager, in the meantime, is trying to look up information about the process sending the message, in order to determine what service unit it belongs to and thus whether it is authorized to send notification messages. It does this by reading /proc/client-process-ID/cgroup, and parsing the control group membership information that the kernel makes available there, to match the control group up with the services that it knows about. Unluckily, by the time that it gets to reading /proc/client-process-ID/cgroup and parsing the control group membership information, the systemd-notify program has quite often had the chance to run to completion and exit, which causes the kernel to throw away the control group information. The service manager cannot find a relevant control group, and throws the readiness notification away as not coming from a "proper" source, placing the infamous "Cannot find unit for notify message" error in the systemd log.
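
The disappearing information is easy to observe on a Linux system with /proc mounted. The following sketch compresses the race into one program: a short-lived child stands in for systemd-notify, and the lookup happens only after that child has exited and been reaped. The function names are invented for the example, and the real service manager is not the parent of the notifying process, so this illustrates the lookup failing and is not a reproduction of the actual code path.

```python
import os


def cgroup_of(pid):
    """Return the raw contents of /proc/<pid>/cgroup, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup", "rb") as stream:
            return stream.read()
    except OSError:
        return None  # the process is gone, so nothing is left to match


def lookup_after_exit():
    """Look up the control group of a client that has already exited."""
    pid = os.fork()
    if pid == 0:
        # The client "sends its message" and exits at once, without waiting.
        os._exit(0)
    os.waitpid(pid, 0)  # the client has exited and been reaped
    return cgroup_of(pid)  # the manager gets around to looking too late
```

With a synchronous protocol the client would still be alive, waiting for an acknowledgement, at the moment that the lookup is made.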

The result is — as recorded in systemd bug #2737, systemd bug #2739, FreeDesktop bug #75680, RedHat bug #820448, and on the systemd mailing list several times down the years since 2011 — that notifications sent by systemd-notify are unreliable.

© Copyright 2015,2016 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.