How to add job control to a Unix system
Daniel J. Bernstein
Draft 3, 1991-08-03

Abstract

We describe in detail the steps necessary to add BSD-style job control to any Unix system. In place of the BSD and POSIX rules for controlling ttys, sessions, and process groups, we propose a very simple yet secure mechanism for manipulating process groups alone. This mechanism can also be added to existing BSD systems to provide an alternate, easier-to-use programming interface.

Introduction

Sections 2 through 6 describe selected portions of BSD 4.2 and 4.3 job control. Omitted is any mention of controlling ttys, POSIX sessions, getpgrp() and setpgrp(), TIOCGPGRP and TIOCSPGRP, tcgetpgrp() and tcsetpgrp(), TIOCNOTTY, setsid(), TIOCSCTTY, setpgid(), open() with or without O_NOCTTY, and the relations between all of those and the rest of the job control system, because it turns out that none of that is necessary to provide job control. The attitude in these sections is that of someone faced with a System V variant or a new Unix system (e.g., MINIX) with no job control facilities in the first place, perhaps without even the concept of a controlling tty; the important question is how little work is necessary to add job control features.

Section 7 describes my new, secure, extremely simple job control programming interface [1]. (The interface was inspired by a comment from Chris Torek. It was modified slightly in response to criticism by John Carr. It is dedicated to Marc Teitelbaum.) The interface is enough to let programmers implement a job control shell or any other job control-cognizant applications. It solves all the problems that POSIX sessions were meant to solve, but it is much, much simpler, and can be added to a system with a minimum of effort. It can even be added to a BSD system, as discussed in section 8 — it does not interfere with the old job control model in any way. This will give programmers a choice between the older, more complicated interface and this new, easy-to-use interface.

Section 9 lists several job control programming techniques. Finally, section 10, again from the point of view of a system without job control, lists some common macros and similar cpp-level extensions which make job control programs easier to port.

New kernel structures

Each process "is a member of" a process group. In other words, there's a p_pgrp integer inside struct proc. init starts out in process group 0. Process groups remain the same across fork() and exec().

Each tty has a "foreground" process group. In other words, there's a t_pgrp integer inside struct tty. (Systems where ttys are implemented differently, e.g., via streams, will have to store this information somewhere else.) A tty opened for the first time has t_pgrp set to 0. Each tty also has one extra keyboard character, the suspend character, with a default of 26 (^Z).

There is a new process state (i.e., p_stat value): SSTOP. (ps usually reports this state as T.) When a process is in this state, it gets no CPU time. All signals are blocked until it leaves the state. Note that systems with some form of process tracing (e.g., ptrace(2)) already have SSTOP.
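As a rough sketch of the bookkeeping this section asks for, assuming a toy kernel rather than any real one: the names jc_proc, jc_tty, jc_tty_first_open and jc_fork are hypothetical, and only p_pgrp, t_pgrp, p_stat, SSTOP and the suspend character come from the text above.

```c
#include <stddef.h>

/* Hypothetical process states; only the stopped state is new. */
enum jc_stat { JC_SRUN = 1, JC_SSLEEP, JC_SSTOP };

struct jc_proc {
    int p_pid;
    int p_pgrp;               /* process group; init starts in group 0 */
    enum jc_stat p_stat;      /* JC_SSTOP means "gets no CPU time" */
    struct jc_proc *p_parent;
};

struct jc_tty {
    int t_pgrp;               /* foreground process group */
    char t_suspc;             /* suspend character */
};

/* A tty opened for the first time: group 0, suspend character ^Z. */
void jc_tty_first_open(struct jc_tty *tp)
{
    tp->t_pgrp = 0;
    tp->t_suspc = 26;
}

/* The process group is inherited across fork(). */
void jc_fork(struct jc_proc *parent, struct jc_proc *child, int pid)
{
    child->p_pid = pid;
    child->p_pgrp = parent->p_pgrp;
    child->p_stat = JC_SRUN;
    child->p_parent = parent;
}
```

Nothing here needs to change at exec(), which is the point: the group simply stays where it is.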

Signals

There are five new signals declared in <signal.h>: SIGSTOP (17), SIGTSTP (18), SIGCONT (19), SIGTTIN (21), SIGTTOU (22). (The numbers in parentheses are the standard BSD values.) Any code which works with bit masks representing signals must be prepared to work with 32-bit masks.
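To see why 16-bit masks are not enough, here is a small sketch. It assumes the usual BSD convention that signal n occupies bit n-1 of the mask; jc_sigmask and jc_fits_in_16_bits are hypothetical helper names.

```c
#include <stdint.h>

/* The BSD signal numbers quoted above. */
#define JC_SIGSTOP 17
#define JC_SIGTSTP 18
#define JC_SIGCONT 19
#define JC_SIGTTIN 21
#define JC_SIGTTOU 22

/* Mask bit for signal number sig (signals are numbered from 1). */
uint32_t jc_sigmask(int sig)
{
    return (uint32_t) 1 << (sig - 1);
}

/* Nonzero if the bit for sig would survive in a 16-bit mask. */
int jc_fits_in_16_bits(int sig)
{
    return jc_sigmask(sig) <= 0xFFFFu;
}
```

Even SIGSTOP, the lowest of the five, lands on bit 16, so every one of the new signals is lost by code that keeps its masks in a short.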

The default action of a process receiving SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU is to stop, i.e., enter the SSTOP state and, as detailed below, to generate a SIGCHLD. SIGSTOP cannot be blocked, caught, or ignored. ("Blocking" refers to any mechanism by which the receipt of a signal is deferred. BSD provides sigblock() and sigsetmask() to manipulate a bit mask of blocked signals. On systems without a similar mechanism, SIGSTOP obviously can't be blocked in the first place. What's important is that SIGSTOP always take effect immediately.)

Any process which receives SIGCONT will continue, i.e., leave the SSTOP state; this is in addition to any signal handler installed. (Obviously the process cannot execute a signal handler if it's in the SSTOP state, receiving no CPU time!) SIGCONT cannot be blocked. A process is always able to send SIGCONT to any of its children, regardless of permission checks. (BSD actually lets you send SIGCONT to any descendant. Some popular BSD variants do not obey this rule.)

When a process enters the SSTOP state, it generates a SIGCHLD (aka SIGCLD) to its parent. There are several conflicting sets of semantics for SIGCHLD/SIGCLD (e.g., what happens when it's ignored? when are zombies created?) on various systems, none of which have any relevance to job control.
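The rules in this section can be sketched as a single posting routine in a toy kernel. This is only an illustration under several assumptions: jc_proc, jc_post and the nondefault field are hypothetical; blocking is folded into "nondefault" for brevity; SIGCHLD is taken to be 20, its usual BSD value; and a process which is already stopped does not generate a second SIGCHLD, which is my choice rather than anything the rules require.

```c
#include <stdint.h>
#include <stddef.h>

enum { JC_SIGSTOP = 17, JC_SIGTSTP = 18, JC_SIGCONT = 19,
       JC_SIGCHLD = 20, JC_SIGTTIN = 21, JC_SIGTTOU = 22 };

struct jc_proc {
    int stopped;             /* nonzero while in the SSTOP state */
    uint32_t pending;        /* signals posted but not yet acted on */
    uint32_t nondefault;     /* signals that are caught or ignored */
    struct jc_proc *parent;
};

uint32_t jc_sigmask(int sig)
{
    return (uint32_t) 1 << (sig - 1);
}

int jc_is_stop_signal(int sig)
{
    return sig == JC_SIGSTOP || sig == JC_SIGTSTP
        || sig == JC_SIGTTIN || sig == JC_SIGTTOU;
}

void jc_post(struct jc_proc *p, int sig)
{
    uint32_t bit = jc_sigmask(sig);

    if (sig == JC_SIGCONT) {
        p->stopped = 0;                  /* always continues */
        if (p->nondefault & bit)
            p->pending |= bit;           /* handler runs once it has CPU */
        return;
    }
    /* SIGSTOP stops the process no matter what disposition was set. */
    if (jc_is_stop_signal(sig)
        && (sig == JC_SIGSTOP || !(p->nondefault & bit))) {
        if (!p->stopped) {
            p->stopped = 1;
            if (p->parent != NULL)
                p->parent->pending |= jc_sigmask(JC_SIGCHLD);
        }
        return;
    }
    p->pending |= bit;                   /* everything else: just queue it */
}
```

Signals posted while the process is stopped simply accumulate in pending, which matches the rule that nothing is delivered until the process leaves SSTOP.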

Waiting

When the parent, either upon receiving a SIGCHLD or at any other time, does a wait(), it will not see any stopped children — i.e., job control doesn't change the semantics of wait(). (Process tracing does, but that is also irrelevant to job control.) There is a new system call, wait2, which lets the parent see stopped children:

#include <sys/wait.h>

int wait2(status,options)
int *status;
int options;

(In fact, BSD has a wait3() call instead of wait2(); the above call is the same as wait3(status,options,(struct rusage *) 0). See section 10 for further details.) options is a bit field. You have to define two bits, WNOHANG and WUNTRACED, in <sys/wait.h>, for use as options.

Normally wait2() acts like wait(): it blocks waiting for a child to die and then returns the dead pid, or returns -1 immediately if there are no live children. If options includes WNOHANG, wait2() will return 0 immediately instead of blocking. If options includes WUNTRACED, wait2() will return the pid of a stopped child as well as the pid of a dead child. (By far the most common options value is WNOHANG | WUNTRACED.)
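Here is one way the selection logic might look over a toy table of children. The names jc_child, jc_wait2_pick and JC_WOULD_BLOCK are hypothetical; the option values 1 and 2 are the BSD ones and are an assumption for any other system. I have also assumed that a given stop is reported only once.

```c
#define JC_WNOHANG     1
#define JC_WUNTRACED   2
#define JC_WOULD_BLOCK (-2)   /* sketch only: the real call would sleep here */

struct jc_child {
    int pid;
    int dead;       /* child has died and not yet been waited for */
    int stopped;    /* child is in the SSTOP state */
    int reported;   /* this stop has already been returned by wait2() */
};

/*
 * Decide what wait2() should return.  A real kernel would also remove
 * a dead child from the table and fill in *status.
 */
int jc_wait2_pick(struct jc_child *kids, int n, int options)
{
    int i;

    if (n == 0)
        return -1;                         /* no children at all */
    for (i = 0; i < n; i++) {
        if (kids[i].dead)
            return kids[i].pid;
        if ((options & JC_WUNTRACED) && kids[i].stopped && !kids[i].reported) {
            kids[i].reported = 1;
            return kids[i].pid;
        }
    }
    return (options & JC_WNOHANG) ? 0 : JC_WOULD_BLOCK;
}
```

A shell's SIGCHLD handler would typically call the real thing in a loop with WNOHANG | WUNTRACED until it returns 0 or -1.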

As usual, when wait2() returns a pid, status says what's happened to that pid. This is a bit more complicated than before because status also has to tell the parent what happened if the child stopped. Here's the whole story: If the low 7 bits are all set, the child has in fact stopped. If none of those bits are set, the child has exited normally. Otherwise the child has been terminated by a signal, and those 7 bits say which signal it was. (If the 8th bit is set in that case, the child dumped core.) If the child has stopped, the 8th bit is 0, and the 8 bits after that say which signal (SIGTTOU, for instance) stopped the process. If the child has exited, the 8th bit is 0, and the 8 bits after that give its exit code mod 256.
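From the kernel's side, the layout just described might be packed as follows. This is a sketch with hypothetical helper names (jc_status_exited and friends); it assumes an int of more than 16 bits.

```c
/* Normal exit: low 8 bits zero, next 8 bits hold the exit code mod 256. */
int jc_status_exited(int code)
{
    return (code & 0377) << 8;
}

/* Killed by a signal: low 7 bits hold the signal, 8th bit flags a core dump. */
int jc_status_signaled(int sig, int dumped_core)
{
    return (sig & 0177) | (dumped_core ? 0200 : 0);
}

/* Stopped: low 7 bits all set, 8th bit clear, next 8 bits hold the signal. */
int jc_status_stopped(int sig)
{
    return ((sig & 0377) << 8) | 0177;
}
```

For example, a child stopped by SIGTTOU (22) yields 0x167f, and a child killed by signal 11 with a core dump yields 0x8b.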

Terminal-generated signals

When the interrupt character (typically ^C under BSD, DEL under System V) is typed on a terminal in cooked mode, if the terminal's foreground process group is non-zero, every process in that process group is sent SIGINT. Similarly, the quit character (typically ^\ under BSD) generates SIGQUIT. If a terminal is "hung up", it generates SIGHUP. Job control needs one extra terminal-generated signal so that the user can tell the current process to stop: SIGTSTP, generated by the suspend character mentioned above (typically ^Z). Notice that if a user could set his tty's process group arbitrarily, he could send all sorts of signals to any processes in those process groups. So it is important for security that tty process groups be controlled.
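In a tty driver the check might be sketched like this. The structure and function names are hypothetical, the signal numbers are the BSD ones, and SIGHUP is left out because it is not triggered by a typed character.

```c
#define JC_SIGINT  2
#define JC_SIGQUIT 3
#define JC_SIGTSTP 18

struct jc_tty {
    int t_pgrp;      /* foreground process group, 0 if none */
    char t_intrc;    /* interrupt character */
    char t_quitc;    /* quit character */
    char t_suspc;    /* suspend character */
    int cooked;      /* nonzero in cooked mode */
};

/*
 * Returns the signal to send to every process in tp->t_pgrp when c is
 * typed, or 0 if c is ordinary input.
 */
int jc_tty_input_signal(const struct jc_tty *tp, char c)
{
    if (!tp->cooked || tp->t_pgrp == 0)
        return 0;
    if (c == tp->t_intrc)
        return JC_SIGINT;
    if (c == tp->t_quitc)
        return JC_SIGQUIT;
    if (c == tp->t_suspc)
        return JC_SIGTSTP;
    return 0;
}
```

The only job control addition is the last test; the first two already exist in any tty driver.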

The suspend character is the first user interface aspect of job control mentioned so far. Typically the processes stop (though they can catch SIGTSTP and do something else). A job-control shell then receives the SIGCHLD and, with wait2(), sees that its children have stopped. It can report this to the user and present a new prompt. The user can then start more processes, or, with an fg (foreground) command, tell the shell to send SIGCONT to the children so that they start up again.

Programs can inspect and set the suspend character with two new tty ioctls: TIOCGLTC and TIOCSLTC, both defined in <sys/ioctl.h>. In both cases the argument points to a struct ltchars (defined in the same place), which contains a char t_suspc specifying the suspend character.

As a matter of fact, under BSD there are several other local terminal characters (that's what ltchars stands for), notably t_dsuspc. The delayed suspend character (typically ^Y) is supposed to act like the suspend character but only when a process actually reads it. However, several operating system releases from Sun simply don't do this. They pass dsusp through like any other character. Given that almost nobody ever notices this bug, let alone complains about it, I don't think there's any point in bothering to implement the character.

I/O-generated signals

There's another side to the job control user interface: namely, several processes (or pipelines — in general, "jobs") can read and write the tty at once. The job-control shell places each pipeline into a separate process group, and when any job except the foreground job reads from the tty, it is stopped until the user decides to give it input. This is much more flexible than cutting background processes off from the tty permanently, as non-job-control shells do.

More precisely, if a process reads from a tty, and its process group is not the foreground process group of the tty, then its process group is sent a SIGTTIN signal. As an exception, if that process is blocking or ignoring SIGTTIN, no signal is generated. Instead, the read returns -1 with errno of EIO. "Reading" here includes only read(), not the various tty ioctls which inspect tty structures; while there are some benefits of generating SIGTTIN for the latter, this turns out to be too restrictive for many applications. (There is an ioctl, TIOCSTI, which is also lumped with "reading", but a full discussion of TIOCSTI would be too long for this paper. It's not an important enough ioctl to bother with.)

If a process writes to a tty, and its process group is not the foreground process group of the tty, then its process group is sent a SIGTTOU signal. As an exception, if that process is blocking or ignoring SIGTTOU, no signal is generated and it is allowed to produce output. This time, "write" includes not only write() but also any other operations which affect the tty in any way. (Under BSD there is a tty mode, LTOSTOP, which when disabled turns off SIGTTOU for write() but not for other operations. This is not absolutely necessary, but if you have any free time you should implement stty tostop to turn LTOSTOP on and stty -tostop to turn it off. The internal interface is unimportant as long as the user can select his favorite behavior.)

None of the above apply to operations by a process in process group 0. Process group 0 must never, ever, be sent I/O-generated signals. The simplest course of action here is to let all operations from process group 0 succeed. (What actually happens in this case isn't too important, as long as processes like getty can open a tty and start programs on the tty. Most BSD-derived systems set process group to pid when a process in process group 0 opens a tty; this behavior is not necessary. Note that if a process in process group 0 reads from a tty while a shell is still reading from it, the two read()s will compete for terminal input.)
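The read and write rules, including the exemption for process group 0, can be sketched as two small decision functions. The names are hypothetical, and LTOSTOP is left out to keep the sketch short.

```c
enum jc_io_result {
    JC_IO_PROCEED,        /* let the operation go ahead */
    JC_IO_SIGNAL_GROUP,   /* send SIGTTIN or SIGTTOU to the caller's group */
    JC_IO_FAIL_EIO        /* return -1 with errno EIO */
};

/* p_pgrp: caller's group; t_pgrp: the tty's foreground group. */
enum jc_io_result jc_check_read(int p_pgrp, int t_pgrp,
                                int ttin_blocked_or_ignored)
{
    if (p_pgrp == 0 || p_pgrp == t_pgrp)
        return JC_IO_PROCEED;
    if (ttin_blocked_or_ignored)
        return JC_IO_FAIL_EIO;
    return JC_IO_SIGNAL_GROUP;
}

enum jc_io_result jc_check_write(int p_pgrp, int t_pgrp,
                                 int ttou_blocked_or_ignored)
{
    if (p_pgrp == 0 || p_pgrp == t_pgrp)
        return JC_IO_PROCEED;
    if (ttou_blocked_or_ignored)
        return JC_IO_PROCEED;     /* output is allowed through */
    return JC_IO_SIGNAL_GROUP;
}
```

When the answer is JC_IO_SIGNAL_GROUP the caller stops; once it is continued, the same check has to be made again before the operation is allowed.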

Notice that if a process can join an arbitrary process group, it can cause SIGTTOU and SIGTTIN to be sent to other processes. So it's important for security that processes' process groups be controlled.

Be careful in implementing I/O-generated signals that you test repeatedly for the right process group. The process could easily receive SIGCONT while the tty is in a different group. In that case it should immediately stop the process group again (without even executing a SIGCONT handler!), generate another SIGCHLD, and wait for the next SIGCONT. This can repeat any number of times. Only when the tty is in the right process group should the operation succeed.

A new, secure, simple job control programming interface

The process group calls described in this section are, unlike the job control features described in sections 2 through 6, not part of BSD, though they do not interfere with BSD. There are a total of three calls which manipulate process groups: tcnewpgrp(), settpgrp(), tctpgrp(). Throughout this section, fdtty is a file descriptor pointing to a terminal.

If fdtty has write access, tcnewpgrp(fdtty) should allocate an unused process group and set the terminal's foreground process group to that new process group. This is a write operation and should produce SIGTTOU if this process is not in the foreground (and is not ignoring the signal, etc.). tcnewpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not a terminal, -1 with errno EBADF if fdtty is not open for writing.

If fdtty has read access, settpgrp(fdtty) should set this process's process group to the foreground process group of the terminal. As a special case, settpgrp(-1) sets this process's process group to 0, so that it is exempt from job control. The latter is redundant — a process can just as easily create a process group for itself, fork, and hide the child away inside that group — but convenient. settpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not -1 and not a terminal, -1 with errno EBADF if fdtty is not open for reading.

If fdtty has write access, and pid is the current process or a child of the current process, tctpgrp(fdtty,pid) should set the terminal's foreground process group to the process group of pid. This is a write operation. You may want to allow pid to be any descendant of the current process (under BSD this simplifies the implementation), but this is not necessary for a job control shell, and nobody is going to depend on that behavior. tctpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not a terminal, -1 with errno ESRCH if pid does not exist, -1 with errno EPERM if pid exists but is not a child/descendant, -1 with errno EBADF if fdtty is not open for writing.
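To make the three calls concrete, here is a toy user-space model of their checks. It is only a sketch: jc_file stands in for an open file descriptor, a NULL file models fdtty == -1, a NULL target models a pid which does not exist, the unused group number is handed in by the caller, the SIGTTOU side of the write operations is omitted, and the order in which errors are tested is my own assumption.

```c
#include <errno.h>
#include <stddef.h>

struct jc_tty  { int t_pgrp; };
struct jc_file { struct jc_tty *tty; int readable; int writable; };
struct jc_proc { int pid; int pgrp; struct jc_proc *parent; };

static int jc_fail(int e)
{
    errno = e;
    return -1;
}

int jc_tcnewpgrp(struct jc_file *f, int unused_pgrp)
{
    if (!f->writable) return jc_fail(EBADF);
    if (f->tty == NULL) return jc_fail(ENOTTY);
    f->tty->t_pgrp = unused_pgrp;
    return 0;
}

int jc_settpgrp(struct jc_proc *self, struct jc_file *f)
{
    if (f == NULL) {                 /* settpgrp(-1): join group 0 */
        self->pgrp = 0;
        return 0;
    }
    if (!f->readable) return jc_fail(EBADF);
    if (f->tty == NULL) return jc_fail(ENOTTY);
    self->pgrp = f->tty->t_pgrp;
    return 0;
}

int jc_tctpgrp(struct jc_proc *self, struct jc_file *f, struct jc_proc *target)
{
    if (!f->writable) return jc_fail(EBADF);
    if (f->tty == NULL) return jc_fail(ENOTTY);
    if (target == NULL) return jc_fail(ESRCH);
    if (target != self && target->parent != self) return jc_fail(EPERM);
    f->tty->t_pgrp = target->pgrp;
    return 0;
}
```

Note that no call takes a process group number from the caller: a process can only name a tty it has open or a child it already has.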

To implement tcnewpgrp() you need to set up a table (I recommend a chained hash table) of structures containing process group number and reference count. The reference count is the total number of processes and ttys with that process group. tcnewpgrp() can then search for a process group not in the table. The range of process group numbers is not important; a good choice for BSD systems is 32801-65000. However, it is important that there be more process groups available than the maximum possible number of ttys and pids in use at once.

Whenever a process is created, the reference count for its process group (if that group is not 0) must be incremented; whenever a process dies, the reference count for its process group (if that group is not 0) must be decremented; whenever a process changes process groups (e.g., via settpgrp()), the reference counts for old and new groups must be set appropriately; and whenever a tty changes process groups (e.g., via tcnewpgrp() or tctpgrp()), the reference counts must also be set appropriately. That's it.
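A minimal sketch of such a table follows, assuming malloc() is available (a kernel would use its own allocator) and using the 32801-65000 range suggested above. All names are hypothetical, and the bucket count is arbitrary.

```c
#include <stdlib.h>

#define JC_PGRP_MIN 32801
#define JC_PGRP_MAX 65000
#define JC_NBUCKETS 257

struct jc_pgent { int pgrp; int refs; struct jc_pgent *next; };

static struct jc_pgent *jc_bucket[JC_NBUCKETS];
static int jc_next = JC_PGRP_MIN;

/* Returns the link that points at pgrp's entry, or the chain's NULL tail. */
static struct jc_pgent **jc_find(int pgrp)
{
    struct jc_pgent **pp = &jc_bucket[pgrp % JC_NBUCKETS];
    while (*pp != NULL && (*pp)->pgrp != pgrp)
        pp = &(*pp)->next;
    return pp;
}

int jc_pgrp_refs(int pgrp)
{
    struct jc_pgent *e = *jc_find(pgrp);
    return e != NULL ? e->refs : 0;
}

/* One more process or tty is in pgrp.  Group 0 is never counted. */
int jc_pgrp_ref(int pgrp)
{
    struct jc_pgent **pp, *e;
    if (pgrp == 0) return 0;
    pp = jc_find(pgrp);
    if (*pp != NULL) { (*pp)->refs++; return 0; }
    e = malloc(sizeof *e);
    if (e == NULL) return -1;
    e->pgrp = pgrp; e->refs = 1; e->next = NULL;
    *pp = e;
    return 0;
}

/* One fewer process or tty is in pgrp; the entry goes away at zero. */
void jc_pgrp_unref(int pgrp)
{
    struct jc_pgent **pp, *e;
    if (pgrp == 0) return;
    pp = jc_find(pgrp);
    e = *pp;
    if (e != NULL && --e->refs == 0) { *pp = e->next; free(e); }
}

/* Find a group that is in no p_pgrp or t_pgrp; -1 if the range is full. */
int jc_pgrp_alloc(void)
{
    int tries;
    for (tries = JC_PGRP_MAX - JC_PGRP_MIN + 1; tries > 0; tries--) {
        int pgrp = jc_next;
        jc_next = (jc_next == JC_PGRP_MAX) ? JC_PGRP_MIN : jc_next + 1;
        if (*jc_find(pgrp) == NULL) return pgrp;
    }
    return -1;
}
```

jc_pgrp_alloc() does not reserve the number it returns, so tcnewpgrp() would call jc_pgrp_ref() for the tty right away, and release the tty's reference on its old group with jc_pgrp_unref().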

A different implementation strategy has been suggested by John Carr: the system can simply assign group numbers in increasing order starting from boot time. If, for instance, a process group has 64 bits, and there are at most a billion process group manipulations per second, it will be more than 584 years before the numbers can repeat. Naturally, system administrators should keep a close eye on recently allocated process groups, and be prepared to bring the system down for maintenance as soon as there is any risk of repetition.
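The 584-year figure is easy to check. The sketch below (jc_years_until_wrap is a hypothetical name) assumes a year of 365.25 days and does the arithmetic in floating point, which is accurate enough for this purpose.

```c
/* Years before a counter of the given width can repeat at a given rate. */
double jc_years_until_wrap(int bits, double allocations_per_second)
{
    double groups = 1.0;
    double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;

    while (bits-- > 0)
        groups *= 2.0;               /* 2^bits distinct group numbers */
    return groups / allocations_per_second / seconds_per_year;
}
```

With 64 bits and a billion allocations per second this gives a little over 584 years; with 32 bits at the same rate the numbers would repeat within seconds.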

These three process group manipulation calls do not allow any abuse. To set a terminal to someone else's (nonzero) process group with tctpgrp(), an attacker would need a child process already in the group. But to put a process into someone else's (nonzero) process group with settpgrp(), an attacker would already need access to a tty with that group! There's no way to break into this circle. tcnewpgrp() is useless for attacks since it does not let an attacker join an existing group. Hence the system is secure. Together with the basic job control features outlined in sections 2 through 6, this provides a complete, usable job control system.

For comparison, BSD job control involves controlling ttys, and has six interface functions beyond the mechanisms mentioned in sections 2 through 6: open() (of a tty), setpgrp(), getpgrp(), the TIOCGPGRP ioctl, the TIOCSPGRP ioctl, and the TIOCNOTTY ioctl. Controlling terminals affect the entire job control system and make everything harder to program and use.

POSIX job control is even worse: it includes not only the entire complexity of the BSD interface, but it has "sessions" with effects even more pervasive than those of controlling terminals. (For instance, a process can only be stopped if its parent is in the same session but a different process group.)

Implementing the new job control interface in a BSD system

tcnewpgrp() requires kernel changes on any system; current systems do not recognize a range of process groups to be dynamically allocated to ttys. It also allows a style of job-control programming somewhat different from the usual BSD style. However, settpgrp() and tctpgrp() can be implemented as library routines under BSD. Here they are:

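#include <sys/ioctl.h>

/* Join the foreground process group of fdtty, or group 0 if fdtty is -1. */
int settpgrp(fdtty)
int fdtty;
{
 int pgrp = 0;
 if (fdtty != -1)
   if (ioctl(fdtty,TIOCGPGRP,&pgrp) == -1)
     return -1; /* errno has been set by ioctl */
 return setpgrp(0,pgrp); /* pid 0 means the current process */
}

/* Make the process group of pid the foreground process group of fdtty. */
int tctpgrp(fdtty,pid)
int fdtty;
int pid;
{
 int pgrp;
 if ((pgrp = getpgrp(pid)) == -1)
   return -1; /* getpgrp failed; errno has been set */
 return ioctl(fdtty,TIOCSPGRP,&pgrp);
}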

Note that this interface doesn't interact with controlling ttys in any way. Unfortunately, controlling ttys sometimes force their own interactions, and a job control application which manipulates ttys (as opposed to a shell, which merely runs under a single tty) should still be aware of the old controlling tty rules. The same is true in far greater measure under POSIX — you simply cannot ignore sessions, because you will open up rather large security holes if you leave all processes in the same session. Put simply, the POSIX standard forces system code to manipulate sessions for its health.

Programming common operations with the new job control interface

Forking a pipeline in a job-control shell: The shell starts with tcnewpgrp(fdtty), so that the tty is in the new process group before there are even any children. (That's the basic difference between the BSD and POSIX models and this one.) It then forks each process in the pipeline. Each process does settpgrp(fdtty), thus joining the new process group, before it exec()s the appropriate program. Note that to avoid races the shell should block SIGCHLD while it's spawning children.

Handling a stopped child process: When the shell sees that a pipeline has stopped or exited, it does tctpgrp(fdtty,getpid()) to set the tty to its own process group. Note that it has to ignore SIGTTOU during this operation. To resume the pipeline it does tctpgrp(fdtty,pid) where pid is any one of the child processes, then sends SIGCONT to the process group.

Starting a process under a new tty: When, for instance, telnetd or init/getty or another program in process group 0 wants to grab a tty, it opens the tty and forks a child process. The child does tcnewpgrp(fdtty) to give the tty a real process group, then settpgrp(fdtty) to place itself into the foreground.

Changing ttys: Despite what POSIX would have you believe with its session straitjacket rules, people do run programs all the time under a different tty from the shell. The most common example in BSD is probably the script program; other examples are emacs, screen, pty, mtty, atty. Fortunately, exactly the same procedure works as in the previous example.

Dissociating a daemon: Note that dissociating from a tty is a controlling-terminal concept. However, most daemons do want to place themselves into process groups of their own, so that they are not affected by job-control signals. This can be handled in several ways, but by far the easiest is settpgrp(-1) to join process group 0. (Note that under BSD there is no reliable way to dissociate from a controlling tty — the TIOCEXCL ioctl can prevent dissociation. That is not the mark of a clean interface.)

Forcing oneself into the foreground: Most programs which manipulate the tty, usually so that they can run in character mode, don't work correctly with job control. The usual sequence after startup is this: read tty modes; write new tty modes including noecho and cbreak. The problem is that the process could be in the background when it reads the tty modes: a different program, which itself changes the tty modes to something strange, could be in the foreground. This process will read the strange modes, then stop when it tries to set the modes. Later it is restarted and runs without trouble, but when it exits, it will "restore" the tty to those strange modes it started with. To avoid this bug, processes which manipulate the tty should force themselves into the foreground before reading or writing anything. An easy way to do this is tctpgrp(fdtty,getpid()), with the default SIGTTOU handler. Note that the program should also do this upon continuing after a stop; otherwise it might make the same mistake of reading modes before it knows it's in the foreground.

Helpful extensions to the job control system

There are several steps you can take which don't extend the job control interface but which do make job control programs more portable or easier to read.

As noted above, BSD has a wait3() call instead of wait2(). It is called as follows:

#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>

int wait3(status,options,rusage)
int *status;
int options;
struct rusage *rusage;

If rusage is NULL, this is just like wait2(). <sys/time.h> can simply include <time.h>. (Under BSD it defines several system time structures, like struct timeval.) <sys/resource.h> doesn't need to provide any information other than a definition of struct rusage. (Under BSD, if a child exits and the parent provides a non-NULL rusage pointer to wait3(), the structure is filled in with information about the resources used by the child [and its children, and so on]. For instance, ru_nsignals is the number of signals received. This is very open-ended and absolutely irrelevant to job control.) If you are adding job control to a system without it and want to provide the wait3() call, just define struct rusage { int dummy; }. While a job-control shell can make good use of resource information, most uses of wait3() really don't need the third argument. However, there are enough programs which include <sys/time.h> and <sys/resource.h> and use wait3() that it is worthwhile to provide the extra interface.

Another extension is to define a union wait type in <sys/wait.h> with an int w_status member. At some point BSD left the beaten path and decided that wait() should use a union wait instead of an int to return status information. This decision is generally regarded as a mistake if only because it severely hampers portability, but there are quite a few programs which depend on it, and there's no harm in supporting it.

More useful is to define a set of macros which extract information from a wait status. (Under BSD, union wait actually contains structure members which encode the same information. However, the macros are easier to use and support.) Here are the important ones:

#define WIFSTOPPED(s) (((s) & 0177) == 0177)
#define WIFEXITED(s) (!((s) & 0177))
#define WIFSIGNALED(s) (0176 > (unsigned) (((s) & 0177) - 1))
#define WSTOPSIG(s) ((s) >> 8) /* only defined if WIFSTOPPED */
#define WEXITSTATUS(s) ((s) >> 8) /* only defined if WIFEXITED */
#define WTERMSIG(s) ((s) & 0177) /* only defined if WIFSIGNALED */
#define WCOREDUMP(s) ((s) & 0200) /* only defined if WIFSIGNALED */

On some 16-bit machines the >>8 may have to be changed, or (s) may have to be cast to unsigned. These macros are meant to be applied to an int, not a union wait; most compilers will do the right thing anyway, but be careful.
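As a usage sketch, here is how a shell might sort a status into one of the three cases. The macros are repeated under a JC_ prefix only so that the example cannot clash with a host system's own <sys/wait.h>; jc_outcome and jc_classify are hypothetical names.

```c
#define JC_WIFSTOPPED(s)  (((s) & 0177) == 0177)
#define JC_WIFEXITED(s)   (!((s) & 0177))
#define JC_WIFSIGNALED(s) (0176 > (unsigned) (((s) & 0177) - 1))
#define JC_WSTOPSIG(s)    ((s) >> 8)
#define JC_WEXITSTATUS(s) ((s) >> 8)
#define JC_WTERMSIG(s)    ((s) & 0177)
#define JC_WCOREDUMP(s)   ((s) & 0200)

enum jc_kind { JC_STOPPED, JC_EXITED, JC_SIGNALED };

struct jc_outcome {
    enum jc_kind kind;
    int value;    /* stopping signal, exit code, or terminating signal */
    int core;     /* nonzero if a core dump was made (JC_SIGNALED only) */
};

/* Every status satisfies exactly one of the three WIF tests. */
struct jc_outcome jc_classify(int status)
{
    struct jc_outcome o = { JC_EXITED, 0, 0 };

    if (JC_WIFSTOPPED(status)) {
        o.kind = JC_STOPPED;
        o.value = JC_WSTOPSIG(status);
    } else if (JC_WIFEXITED(status)) {
        o.kind = JC_EXITED;
        o.value = JC_WEXITSTATUS(status);
    } else if (JC_WIFSIGNALED(status)) {
        o.kind = JC_SIGNALED;
        o.value = JC_WTERMSIG(status);
        o.core = JC_WCOREDUMP(status) != 0;
    }
    return o;
}
```

A job-control shell would act on JC_STOPPED by marking the job as stopped and printing a new prompt, and on the other two by removing the process from the job.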

Acknowledgments

Thanks to Chris Torek, Christos S. Zoulas, and David J. MacKenzie for their comments. Thanks also to John F. Haugh for a series of questions which pointed out that, somewhere in this paper, I should emphasize that an "unused" process group is one which doesn't appear in any t_pgrp or p_pgrp. (There, I said it.)

References

[1] D. J. Bernstein, "A new, secure, extremely simple job control interface," article <18072.Jul1707.06.4191@kramden.acf.nyu.edu>, comp.unix.wizards, July 1991.

© Copyright 1991 Daniel J. Bernstein. Presumably.