We describe in detail the steps necessary to add BSD-style job control to any Unix system. In place of the BSD and POSIX rules for controlling ttys, sessions, and process groups, we propose a very simple yet secure mechanism for manipulating process groups alone. This mechanism can also be added to existing BSD systems to provide an alternate, easier-to-use programming interface.
Sections 2 through 6 describe selected portions of BSD 4.2 and 4.3 job control. Omitted is any mention of controlling ttys, POSIX sessions, getpgrp() and setpgrp(), TIOCGPGRP and TIOCSPGRP, tcgetpgrp() and tcsetpgrp(), TIOCNOTTY, setsid(), TIOCSCTTY, setpgid(), open() with or without O_NOCTTY, and the relations between all of those and the rest of the job control system, because it turns out that none of that is necessary to provide job control. The attitude in these sections is that of someone faced with a System V variant or a new Unix system (e.g., MINIX) with no job control facilities in the first place, perhaps without even the concept of a controlling tty; the important question is how little work is necessary to add job control features.
Section 7 describes my new, secure, extremely simple job control programming interface [1]. (The interface was inspired by a comment from Chris Torek. It was modified slightly in response to criticism by John Carr. It is dedicated to Marc Teitelbaum.) The interface is enough to let programmers implement a job control shell or any other job control-cognizant applications. It solves all the problems that POSIX sessions were meant to solve, but it is much, much simpler, and can be added to a system with a minimum of effort. It can even be added to a BSD system, as discussed in section 8 — it does not interfere with the old job control model in any way. This will give programmers a choice between the older, more complicated interface and this new, easy-to-use interface.
Section 9 lists several job control programming techniques. Finally, section 10, again from the point of view of a system without job control, lists some common macros and similar cpp-level extensions which make job control programs easier to port.
Each process "is a member of" a process group. In other words, there's a p_pgrp integer inside struct proc. init starts out in process group 0. Process groups remain the same across fork() and exec().
Each tty has a "foreground" process group. In other words, there's a t_pgrp integer inside struct tty. (Systems where ttys are implemented differently, e.g., via streams, will have to store this information somewhere else.) A tty opened for the first time has t_pgrp set to 0. Each tty also has one extra keyboard character, the suspend character, with a default of 26 (^Z).
There is a new process state (i.e., p_stat value): SSTOP. (ps usually reports this state as T.) When a process is in this state, it gets no CPU time. All signals are blocked until it leaves the state. Note that systems with some form of process tracing (e.g., ptrace(2)) already have SSTOP.
There are five new signals declared in <signal.h>: SIGSTOP (17), SIGTSTP (18), SIGCONT (19), SIGTTIN (21), SIGTTOU (22). (The numbers in parentheses are the standard BSD values.) Any code which works with bit masks representing signals must be prepared to work with 32-bit masks.
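To see why 16-bit masks are not enough, here is a minimal sketch of the usual one-bit-per-signal encoding. The names jc_sigmask() and jc_stop_signals() and the JC_ constants are hypothetical, introduced only for illustration; the signal numbers are the BSD values listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Standard BSD numbers for the five job control signals, as listed above. */
enum { JC_SIGSTOP = 17, JC_SIGTSTP = 18, JC_SIGCONT = 19,
       JC_SIGTTIN = 21, JC_SIGTTOU = 22 };

/* One bit per signal: signal n occupies bit n-1 (hypothetical helper). */
uint32_t jc_sigmask(int sig)
{
    assert(sig >= 1 && sig <= 32);
    return (uint32_t)1 << (sig - 1);
}

/* Mask of the four signals whose default action is to stop the process. */
uint32_t jc_stop_signals(void)
{
    return jc_sigmask(JC_SIGSTOP) | jc_sigmask(JC_SIGTSTP)
         | jc_sigmask(JC_SIGTTIN) | jc_sigmask(JC_SIGTTOU);
}
```

Every bit of jc_stop_signals() lies above the low 16 bits, so code that stores signal masks in a 16-bit integer would silently lose all of the stop signals.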
The default action of a process receiving SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU is to stop, i.e., enter the SSTOP state and, as detailed below, to generate a SIGCHLD. SIGSTOP cannot be blocked, caught, or ignored. ("Blocking" refers to any mechanism by which the receipt of a signal is deferred. BSD provides sigblock() and sigsetmask() to manipulate a bit mask of blocked signals. On systems without a similar mechanism, SIGSTOP obviously can't be blocked in the first place. What's important is that SIGSTOP always take effect immediately.)
Any process which receives SIGCONT will continue, i.e., leave the SSTOP state; this is in addition to any signal handler installed. (Obviously the process cannot execute a signal handler if it's in the SSTOP state, receiving no CPU time!) SIGCONT cannot be blocked. A process is always able to send SIGCONT to any of its children, regardless of permission checks. (BSD actually lets you send SIGCONT to any descendant. Some popular BSD variants do not obey this rule.)
When a process enters the SSTOP state, it generates a SIGCHLD (aka SIGCLD) to its parent. There are several conflicting sets of semantics for SIGCHLD/SIGCLD (e.g., what happens when it's ignored? when are zombies created?) on various systems, none of which have any relevance to job control.
When the parent, either upon receiving a SIGCHLD or at any other time, does a wait(), it will not see any stopped children; i.e., job control doesn't change the semantics of wait(). (Process tracing does, but that is also irrelevant to job control.) There is a new system call, wait2(), which lets the parent see stopped children:
    #include <sys/wait.h>

    int wait2(status,options)
    int *status;
    int options;
(In fact, BSD has a wait3() call instead of wait2(); the above call is the same as wait3(status,options,(struct rusage *) 0). See section 10 for further details.) options is a bit field. You have to define two bits, WNOHANG and WUNTRACED, in <sys/wait.h>, for use as options.
Normally wait2() acts like wait(): it blocks waiting for a child to die and then returns the dead pid, or returns -1 immediately if there are no live children. If options includes WNOHANG, wait2() will return 0 immediately instead of blocking when no child has anything to report. If options includes WUNTRACED, wait2() will return the pid of a stopped child as well as the pid of a dead child. (By far the most common options value is WNOHANG | WUNTRACED.)
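On a system that already has POSIX waitpid(), the behavior described above can be sketched in a few lines. This is only an illustration under that assumption: jc_wait2() and reap_children() are hypothetical names, and the report callback is a placeholder for whatever a shell does with each status.

```c
#include <sys/types.h>
#include <sys/wait.h>

/*
 * wait2() as described above, written in terms of POSIX waitpid().
 * Assumption: the host provides waitpid(); a pid argument of -1 means
 * "any child", which matches the wait()-like behavior in the text.
 */
pid_t jc_wait2(int *status, int options)
{
    return waitpid(-1, status, options);
}

/*
 * Hypothetical helper for a shell's SIGCHLD processing: collect every
 * child that has died or stopped, without ever blocking.
 * Returns the number of children reported.
 */
int reap_children(void (*report)(pid_t pid, int status))
{
    int status;
    int count = 0;
    pid_t pid;

    /* 0 means "nothing to report yet"; -1 means "no children at all". */
    while ((pid = jc_wait2(&status, WNOHANG | WUNTRACED)) > 0) {
        if (report)
            report(pid, status);
        count++;
    }
    return count;
}
```

Looping until the call stops returning a pid matters because several children may change state before the parent gets around to handling a single SIGCHLD.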
As usual, when wait2() returns a pid, status says what's happened to that pid. This is a bit more complicated than before because status also has to tell the parent what happened if the child stopped. Here's the whole story: If the low 7 bits are all set, the child has in fact stopped. If none of those bits are set, the child has exited normally. Otherwise the child has been terminated by a signal, and those 7 bits say which signal it was. (If the 8th bit is set in that case, the child dumped core.) If the child has stopped, the 8th bit is 0, and the 8 bits after that say which signal (SIGTTOU, for instance) stopped the process. If the child has exited, the 8th bit is 0, and the 8 bits after that give its exit code mod 256.
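The layout just described can be turned into a small decoder. This is a sketch for illustration; the struct child_status type and decode_status() are hypothetical names, not part of any existing interface.

```c
enum child_event { CHILD_EXITED, CHILD_SIGNALED, CHILD_STOPPED };

struct child_status {
    enum child_event event;
    int code;        /* exit code, terminating signal, or stopping signal */
    int dumped_core; /* meaningful only for CHILD_SIGNALED */
};

/* Decode a wait status word laid out as the text describes. */
struct child_status decode_status(int status)
{
    struct child_status cs;
    int low7 = status & 0177;          /* the low 7 bits */
    int high = (status >> 8) & 0377;   /* the 8 bits after the 8th bit */

    cs.dumped_core = 0;
    if (low7 == 0177) {                /* all set: the child stopped */
        cs.event = CHILD_STOPPED;
        cs.code = high;                /* which signal stopped it */
    } else if (low7 == 0) {            /* none set: normal exit */
        cs.event = CHILD_EXITED;
        cs.code = high;                /* exit code mod 256 */
    } else {                           /* otherwise: killed by signal low7 */
        cs.event = CHILD_SIGNALED;
        cs.code = low7;
        cs.dumped_core = (status & 0200) != 0;   /* the 8th bit */
    }
    return cs;
}
```

For instance, a child stopped by SIGTTOU (22) would be reported with the status word (22 << 8) | 0177, which this decoder classifies as CHILD_STOPPED with code 22.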
When the interrupt character (typically ^C under BSD, DEL under System V) is typed on a terminal in cooked mode, if the terminal's foreground process group is non-zero, every process in that process group is sent SIGINT. Similarly, the quit character (typically ^\ under BSD) generates SIGQUIT. If a terminal is "hung up", it generates SIGHUP. Job control needs one extra signal so that the user can tell the current process to stop: namely, the suspend character mentioned above (typically ^Z), which generates SIGTSTP. Notice that if a user could set his tty's process group arbitrarily, he could send all sorts of signals to any processes in those process groups. So it is important for security that tty process groups be controlled.
The suspend character is the first user interface aspect of job control mentioned so far. Typically the processes stop (though they can catch SIGTSTP and do something else). A job-control shell then receives the SIGCHLD and, with wait2(), sees that its children have stopped. It can report this to the user and present a new prompt. The user can then start more processes, or, with an fg (foreground) command, tell the shell to send SIGCONT to the children so that they start up again.
Programs can inspect and set the suspend character with two new tty ioctls: TIOCGLTC and TIOCSLTC, both defined in <sys/ioctl.h>. In both cases the argument points to a struct ltchars (defined in the same place), which contains a char t_suspc specifying the suspend character.
As a matter of fact, under BSD there are several other local terminal characters (that's what ltchars stands for), notably t_dsuspc. The delayed suspend character (typically ^Y) is supposed to act like the suspend character but only when a process actually reads it. However, several operating system releases from Sun simply don't do this. They pass dsusp through like any other character. Given that almost nobody ever notices this bug, let alone complains about it, I don't think there's any point in bothering to implement the character.
There's another side to the job control user interface: namely, several processes (or pipelines — in general, "jobs") can read and write the tty at once. The job-control shell places each pipeline into a separate process group, and when any job except the foreground job reads from the tty, it is stopped until the user decides to give it input. This is much more flexible than cutting background processes off from the tty permanently, as non-job-control shells do.
More precisely, if a process reads from a tty, and its process group is not the foreground process group of the tty, then its process group is sent a SIGTTIN signal. As an exception, if that process is blocking or ignoring SIGTTIN, no signal is generated. Instead, the read returns -1 with errno of EIO.
"Reading" here includes only read(), not the various tty ioctls which inspect tty structures; while there are some benefits of generating SIGTTIN for the latter, this turns out to be too restrictive for many applications. (There is an ioctl, TIOCSTI, which is also lumped with "reading", but a full discussion of TIOCSTI would be too long for this paper. It's not an important enough ioctl to bother with.)
If a process writes to a tty, and its process group is not the foreground process group of the tty, then its process group is sent a SIGTTOU signal. As an exception, if that process is blocking or ignoring SIGTTOU, no signal is generated and it is allowed to produce output. This time, "write" includes not only write() but also any other operations which affect the tty in any way. (Under BSD there is a tty mode, LTOSTOP, which when disabled turns off SIGTTOU for write() but not for other operations. This is not absolutely necessary, but if you have any free time you should implement stty tostop to turn LTOSTOP on and stty -tostop to turn it off. The internal interface is unimportant as long as the user can select his favorite behavior.)
None of the above apply to operations by a process in process group 0. Process group 0 must never, ever, be sent I/O-generated signals. The simplest course of action here is to let all operations from process group 0 succeed. (What actually happens in this case isn't too important, as long as processes like getty can open a tty and start programs on the tty. Most BSD-derived systems set process group to pid when a process in process group 0 opens a tty; this behavior is not necessary. Note that if a process in process group 0 reads from a tty while a shell is still reading from it, the two read()s will compete for terminal input.)
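The read, write, and process group 0 rules fit into a single decision function. The following is a sketch of that logic only, not kernel code; the enum names and tty_check() are hypothetical. A verdict of TTY_SIGNAL_TTIN or TTY_SIGNAL_TTOU means the signal is sent to the caller's whole process group.

```c
enum tty_op { TTY_READ, TTY_WRITE, TTY_MODIFY };
enum tty_verdict { TTY_PROCEED, TTY_SIGNAL_TTIN, TTY_SIGNAL_TTOU, TTY_FAIL_EIO };

/*
 * Decide what a tty operation by a process should do.
 *   p_pgrp:  the caller's process group
 *   t_pgrp:  the tty's foreground process group
 *   masked:  nonzero if the caller blocks or ignores the relevant signal
 *   tostop:  nonzero if LTOSTOP is enabled (matters for TTY_WRITE only)
 */
enum tty_verdict tty_check(enum tty_op op, int p_pgrp, int t_pgrp,
                           int masked, int tostop)
{
    if (p_pgrp == 0)            /* process group 0 is never signalled */
        return TTY_PROCEED;
    if (p_pgrp == t_pgrp)       /* caller is in the foreground */
        return TTY_PROCEED;

    if (op == TTY_READ)         /* background read */
        return masked ? TTY_FAIL_EIO : TTY_SIGNAL_TTIN;

    if (op == TTY_WRITE && !tostop)
        return TTY_PROCEED;     /* background write() allowed, LTOSTOP off */

    /* background write() with LTOSTOP on, or any other tty change */
    return masked ? TTY_PROCEED : TTY_SIGNAL_TTOU;
}
```

Note the asymmetry: a background reader that masks SIGTTIN gets an error, while a background writer that masks SIGTTOU is simply allowed through.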
Notice that if a process can join an arbitrary process group, it can cause SIGTTOU and SIGTTIN to be sent to other processes. So it's important for security that processes' process groups be controlled.
Be careful in implementing I/O-generated signals that you test repeatedly for the right process group. The process could easily receive SIGCONT while the tty is in a different group. In that case it should immediately stop the process group again (without even executing a SIGCONT handler!), generate another SIGCHLD, and wait for the next SIGCONT. This can repeat any number of times. Only when the tty is in the right process group should the operation succeed.
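The repeated test can be illustrated with a tiny simulation. This is a user-level model under stated assumptions, not kernel code: stops_before_io() is a hypothetical name, and the array stands for the tty's foreground process group as observed at each check (the first check when the operation is attempted, each later one after a SIGCONT).

```c
/*
 * Count how many times a background process stops before its tty
 * operation is finally allowed.  fg_at_check[i] is the tty's foreground
 * process group seen at the i-th check.  Returns the number of stops,
 * or -1 if the operation never succeeds within n checks.
 * Process group 0 is exempt, as described earlier.
 */
int stops_before_io(int p_pgrp, const int *fg_at_check, int n)
{
    int stops = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (p_pgrp == 0 || fg_at_check[i] == p_pgrp)
            return stops;   /* right group: the operation proceeds */
        stops++;            /* wrong group: stop again, SIGCHLD, await SIGCONT */
    }
    return -1;
}
```

With the sequence { 100, 300, 200 }, a process in group 200 stops twice before its operation goes through, which is exactly the "any number of times" case: being continued is not the same as being in the foreground.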
The process group calls described in this section are, unlike the job control features described in sections 2 through 6, not part of BSD, though they do not interfere with BSD. There are a total of three calls which manipulate process groups: tcnewpgrp(), settpgrp(), tctpgrp(). Throughout this section, fdtty is a file descriptor pointing to a terminal.
If fdtty has write access, tcnewpgrp(fdtty) should allocate an unused process group and set the terminal's foreground process group to that new process group. This is a write operation and should produce SIGTTOU if this process is not in the foreground (and is not ignoring the signal, etc.). tcnewpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not a terminal, -1 with errno EBADF if fdtty is not open for writing.
If fdtty has read access, settpgrp(fdtty) should set this process's process group to the foreground process group of the terminal. As a special case, settpgrp(-1) sets this process's process group to 0, so that it is exempt from job control. The latter is redundant (a process can just as easily create a process group for itself, fork, and hide the child away inside that group) but convenient. settpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not -1 and not a terminal, -1 with errno EBADF if fdtty is not open for reading.
If fdtty has write access, and pid is the current process or a child of the current process, tctpgrp(fdtty,pid) should set the terminal's foreground process group to the process group of pid. This is a write operation. You may want to allow pid to be any descendant of the current process (under BSD this simplifies the implementation), but this is not necessary for a job control shell, and nobody is going to depend on that behavior. tctpgrp returns 0 on success, -1 with errno ENOTTY if fdtty is not a terminal, -1 with errno ESRCH if pid does not exist, -1 with errno EPERM if pid exists but is not a child/descendant, -1 with errno EBADF if fdtty is not open for writing.
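To make the rules concrete, here is a user-space model of the three calls acting on toy tty and process tables. It is a sketch under several assumptions, all mine: the model_ names and structures are hypothetical, the order in which errors are checked is not specified by the text, SIGTTOU generation is left out, tctpgrp() is treated as requiring write access (following the first sentence of its description), and "allocate an unused process group" is reduced to a counter.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_NPROC 8

struct model_tty  { int is_tty; int readable; int writable; int t_pgrp; };
struct model_proc { int exists; int parent; int p_pgrp; };

struct model_proc procs[MODEL_NPROC];  /* indexed by pid; pid 0 unused */
int next_pgrp = 32801;                 /* stand-in for "find an unused group" */

int model_tcnewpgrp(struct model_tty *t)
{
    if (!t->writable) { errno = EBADF; return -1; }
    if (!t->is_tty)   { errno = ENOTTY; return -1; }
    t->t_pgrp = next_pgrp++;           /* tty moves to a brand-new group */
    return 0;
}

/* t == NULL plays the role of fdtty == -1: join process group 0. */
int model_settpgrp(int self, struct model_tty *t)
{
    if (t == NULL) { procs[self].p_pgrp = 0; return 0; }
    if (!t->readable) { errno = EBADF; return -1; }
    if (!t->is_tty)   { errno = ENOTTY; return -1; }
    procs[self].p_pgrp = t->t_pgrp;    /* join the tty's foreground group */
    return 0;
}

int model_tctpgrp(int self, struct model_tty *t, int pid)
{
    if (!t->writable) { errno = EBADF; return -1; }
    if (!t->is_tty)   { errno = ENOTTY; return -1; }
    if (pid <= 0 || pid >= MODEL_NPROC || !procs[pid].exists) {
        errno = ESRCH;
        return -1;
    }
    if (pid != self && procs[pid].parent != self) {
        errno = EPERM;                 /* neither self nor a child */
        return -1;
    }
    t->t_pgrp = procs[pid].p_pgrp;
    return 0;
}
```

In this model a process can change a tty's group only to a fresh group, its own group, or a child's group, and can change its own group only to 0 or to the group of a tty it can read.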
To implement tcnewpgrp() you need to set up a table (I recommend a chained hash table) of structures containing process group number and reference count. The reference count is the total number of processes and ttys with that process group. tcnewpgrp() can then search for a process group not in the table. The range of process group numbers is not important; a good choice for BSD systems is 32801-65000. However, it is important that there be more process groups available than the maximum possible number of ttys and pids in use at once.
Whenever a process is created, the reference count for its process group (if that group is not 0) must be incremented; whenever a process dies, the reference count for its process group (if that group is not 0) must be decremented; whenever a process changes process groups (e.g., via settpgrp()), the reference counts for old and new groups must be set appropriately; and whenever a tty changes process groups (e.g., via tcnewpgrp() or tctpgrp()), the reference counts must also be set appropriately. That's it.
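The bookkeeping is small enough to sketch in full. For brevity this sketch uses a directly indexed array over the suggested range rather than the chained hash table recommended above; that substitution, the pgrp_ function names, and the rotating search hint are my assumptions, not part of the proposal.

```c
#include <assert.h>

#define PGRP_MIN   32801           /* range suggested in the text */
#define PGRP_MAX   65000
#define PGRP_COUNT (PGRP_MAX - PGRP_MIN + 1)

/* refs[g - PGRP_MIN] = number of processes and ttys currently in group g. */
static unsigned int refs[PGRP_COUNT];
static int next_hint;              /* slot where the next search starts */

/* Add one reference to group g; group 0 is never counted. */
void pgrp_ref(int g)
{
    if (g == 0) return;
    assert(g >= PGRP_MIN && g <= PGRP_MAX);
    refs[g - PGRP_MIN]++;
}

/* Drop one reference, e.g. when a process in group g dies. */
void pgrp_unref(int g)
{
    if (g == 0) return;
    assert(g >= PGRP_MIN && g <= PGRP_MAX && refs[g - PGRP_MIN] > 0);
    refs[g - PGRP_MIN]--;
}

/* A process or tty moves from one group to another. */
void pgrp_move(int from, int to)
{
    pgrp_ref(to);
    pgrp_unref(from);
}

/* Find a group with no references, as tcnewpgrp() needs; -1 if none. */
int pgrp_find_unused(void)
{
    int i;

    for (i = 0; i < PGRP_COUNT; i++) {
        int slot = (next_hint + i) % PGRP_COUNT;
        if (refs[slot] == 0) {
            next_hint = (slot + 1) % PGRP_COUNT;
            return PGRP_MIN + slot;
        }
    }
    return -1;
}
```

A caller in the position of tcnewpgrp() would follow pgrp_find_unused() with pgrp_move(old_t_pgrp, new_group) for the tty. Starting each search just past the previous result avoids handing out a group number again immediately after it is freed.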
A different implementation strategy has been suggested by John Carr: the system can simply assign group numbers in increasing order starting from boot time. If, for instance, a process group has 64 bits, and there are at most a billion process group manipulations per second, it will be more than 584 years before the numbers can repeat. Naturally, system administrators should keep a close eye on recently allocated process groups, and be prepared to bring the system down for maintenance as soon as there is any risk of repetition.
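The 584-year figure is easy to check. The helper below is a hypothetical illustration (years_until_wrap() is my name, and a year is taken as 365.25 days); it computes how long a counter of a given width lasts at a steady allocation rate.

```c
/* Years until a counter of the given width wraps at a steady allocation rate. */
double years_until_wrap(int bits, double allocations_per_second)
{
    double values = 1.0;
    double seconds_per_year = 365.25 * 24 * 60 * 60;
    int i;

    for (i = 0; i < bits; i++)
        values *= 2.0;          /* builds 2^bits; powers of two are exact in a double */
    return values / allocations_per_second / seconds_per_year;
}
```

With 64 bits and a billion allocations per second the result falls between 584 and 585 years. By contrast, a 16-bit counter at one allocation per second wraps in under a day, which is why the strategy depends on a wide group number.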
These three process group manipulation calls do not allow any abuse. To set a terminal to someone else's (nonzero) process group with tctpgrp(), an attacker would need a child process already in the group. But to put a process into someone else's (nonzero) process group with settpgrp(), an attacker would already need access to a tty with that group! There's no way to break into this circle. tcnewpgrp() is useless for attacks since it does not let an attacker join an existing group. Hence the system is secure. Together with the basic job control features outlined in sections 2 through 6, this provides a complete, usable job control system.
For comparison, BSD job control involves controlling ttys, and has six interface functions beyond the mechanisms mentioned in sections 2 through 6: open() (of a tty), setpgrp(), getpgrp(), the TIOCGPGRP ioctl, the TIOCSPGRP ioctl, and the TIOCNOTTY ioctl. Controlling terminals affect the entire job control system and make everything harder to program and use.
POSIX job control is even worse: it includes not only the entire complexity of the BSD interface, but it has "sessions" with effects even more pervasive than those of controlling terminals. (For instance, a process can only be stopped if its parent is in the same session but a different process group.)
tcnewpgrp() requires kernel changes on any system; current systems do not recognize a range of process groups to be dynamically allocated to ttys. It also allows a style of job-control programming somewhat different from the usual BSD style. However, settpgrp() and tctpgrp() can be implemented as library routines under BSD. Here they are:
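    #include <sys/ioctl.h>

    /* Join the tty's foreground process group, or group 0 if fdtty is -1. */
    int settpgrp(fdtty)
    int fdtty;
    {
      int pgrp = 0;

      if (fdtty != -1)
        if (ioctl(fdtty,TIOCGPGRP,&pgrp) == -1)
          return -1;
      return setpgrp(0,pgrp);
    }

    /* Set the tty's foreground process group to the process group of pid. */
    int tctpgrp(fdtty,pid)
    int fdtty;
    int pid;
    {
      int pgrp;

      if ((pgrp = getpgrp(pid)) == -1)
        return -1;
      return ioctl(fdtty,TIOCSPGRP,&pgrp);
    }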
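For readers on a system that offers only the POSIX calls, rough counterparts can be sketched with getpgid(), setpgid(), tcgetpgrp(), and tcsetpgrp(). This is an approximation, not an equivalent: the _posix names are hypothetical, POSIX has no process group 0 (so the -1 case falls back to a private group), and the wrappers inherit all of POSIX's session and controlling-terminal restrictions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700          /* for getpgid() on strict compilers */
#endif
#include <sys/types.h>
#include <unistd.h>

/* tctpgrp() in POSIX terms: put the tty in the process group of pid. */
int tctpgrp_posix(int fdtty, pid_t pid)
{
    pid_t pgrp = getpgid(pid);

    if (pgrp == -1)
        return -1;                 /* errno set by getpgid(), e.g. ESRCH */
    return tcsetpgrp(fdtty, pgrp);
}

/* settpgrp() in POSIX terms: join the tty's foreground process group. */
int settpgrp_posix(int fdtty)
{
    pid_t pgrp;

    if (fdtty == -1)
        return setpgid(0, 0);      /* no group 0 in POSIX: use a private group */
    pgrp = tcgetpgrp(fdtty);
    if (pgrp == -1)
        return -1;                 /* e.g. ENOTTY if fdtty is not a terminal */
    return setpgid(0, pgrp);
}
```

Both wrappers fail with -1 on a descriptor that is not a terminal, just as the originals do.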
Note that this interface doesn't interact with controlling ttys in any way. Unfortunately, controlling ttys sometimes force their own interactions, and a job control application which manipulates ttys (as opposed to a shell, which merely runs under a single tty) should still be aware of the old controlling tty rules. The same is true in far greater measure under POSIX — you simply cannot ignore sessions, because you will open up rather large security holes if you leave all processes in the same session. Put simply, the POSIX standard forces system code to manipulate sessions for its health.
Forking a pipeline in a job-control shell: The shell starts with tcnewpgrp(fdtty), so that the tty is in the new process group before there are even any children. (That's the basic difference between the BSD and POSIX models and this one.) It then forks each process in the pipeline. Each process does settpgrp(fdtty), thus joining the new process group, before it exec()s the appropriate program. Note that to avoid races the shell should block SIGCHLD while it's spawning children.
Handling a stopped child process: When the shell sees that a pipeline has stopped or exited, it does tctpgrp(fdtty,getpid()) to set the tty to its own process group. Note that it has to ignore SIGTTOU during this operation. To resume the pipeline it does tctpgrp(fdtty,pid) where pid is any one of the child processes, then sends SIGCONT to the process group.
Starting a process under a new tty: When, for instance, telnetd or init/getty or another program in process group 0 wants to grab a tty, it opens the tty and forks a child process. The child does tcnewpgrp(fdtty) to give the tty a real process group, then settpgrp(fdtty) to place itself into the foreground.
Changing ttys: Despite what POSIX would have you believe with its session straitjacket rules, people do run programs all the time under a different tty from the shell. The most common example in BSD is probably the script program; other examples are emacs, screen, pty, mtty, atty. Fortunately, exactly the same procedure works as in the previous example.
Dissociating a daemon: Note that dissociating from a tty is a controlling-terminal concept. However, most daemons do want to place themselves into process groups of their own, so that they are not affected by job-control signals. This can be handled in several ways, but by far the easiest is settpgrp(-1) to join process group 0. (Note that under BSD there is no reliable way to dissociate from a controlling tty: the TIOCEXCL ioctl can prevent dissociation. That is not the mark of a clean interface.)
Forcing oneself into the foreground: Most programs which manipulate the tty, usually so that they can run in character mode, don't work correctly with job control. The usual sequence after startup is this: read tty modes; write new tty modes including noecho and cbreak. The problem is that the process could be in the background when it reads the tty modes, while a different program, which itself changes the tty modes to something strange, is in the foreground. This process will read the strange modes, then stop when it tries to set the modes. Later it is restarted and runs without trouble, but when it exits, it will "restore" the tty to those strange modes it started with. To avoid this bug, processes which manipulate the tty should force themselves into the foreground before reading or writing anything. An easy way to do this is tctpgrp(fdtty,getpid()), with SIGTTOU left at its default action. Note that the program should also do this upon continuing after a stop; otherwise it might make the same mistake of reading modes before it knows it's in the foreground.
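The ordering can be sketched with the POSIX calls, since tcsetpgrp(fdtty, getpgrp()) plays the same role there as tctpgrp(fdtty,getpid()) does in the interface proposed here. This is an illustration under that assumption; get_modes_in_foreground() is a hypothetical name, and the caller is assumed to have left SIGTTOU at its default action.

```c
#include <errno.h>
#include <termios.h>
#include <unistd.h>

/*
 * Read the tty modes only once this process is known to be in the
 * foreground.  With SIGTTOU at its default action, tcsetpgrp() from a
 * background process stops the process group; the call is retried after
 * the process is continued, so tcgetattr() never sees modes that belong
 * to some other foreground program.
 * Returns 0 on success, -1 on error (e.g. fdtty is not a terminal).
 */
int get_modes_in_foreground(int fdtty, struct termios *modes)
{
    while (tcsetpgrp(fdtty, getpgrp()) == -1) {
        if (errno != EINTR)
            return -1;          /* a real error, not an interrupted call */
    }
    return tcgetattr(fdtty, modes);
}
```

A program would call this once at startup and again from whatever code runs after it is continued, before it re-reads or re-applies any modes.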
There are several steps you can take which don't extend the job control interface but which do make job control programs more portable or easier to read.
As noted above, BSD has a wait3() call instead of wait2(). It is called as follows:
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int wait3(status,options,rusage)
    int *status;
    int options;
    struct rusage *rusage;
If rusage is NULL, this is just like wait2(). <sys/time.h> can simply include <time.h>. (Under BSD it defines several system time structures, like struct timeval.) <sys/resource.h> doesn't need to provide any information other than a definition of struct rusage. (Under BSD, if a child exits and the parent provides a non-NULL rusage pointer to wait3(), the structure is filled in with information about the resources used by the child [and its children, and so on]. For instance, ru_nsignals is the number of signals received. This is very open-ended and absolutely irrelevant to job control.) If you are adding job control to a system without it and want to provide the wait3() call, just define struct rusage { int dummy; }. While a job-control shell can make good use of resource information, most uses of wait3() really don't need the third argument. However, there are enough programs which include <sys/time.h> and <sys/resource.h> and use wait3() that it is worthwhile to provide the extra interface.
Another extension is to define a union wait type in <sys/wait.h> with an int w_status member. At some point BSD left the beaten path and decided that wait() should use a union wait instead of an int to return status information. This decision is generally regarded as a mistake if only because it severely hampers portability, but there are quite a few programs which depend on it, and there's no harm in supporting it.
More useful is to define a set of macros which extract information from a wait status. (Under BSD, union wait actually contains structure members which encode the same information. However, the macros are easier to use and support.) Here are the important ones:
    #define WIFSTOPPED(s)  (((s) & 0177) == 0177)
    #define WIFEXITED(s)   (!((s) & 0177))
    #define WIFSIGNALED(s) (0176 > (unsigned) (((s) & 0177) - 1))
    #define WSTOPSIG(s)    ((s) >> 8)    /* only defined if WIFSTOPPED */
    #define WEXITSTATUS(s) ((s) >> 8)    /* only defined if WIFEXITED */
    #define WTERMSIG(s)    ((s) & 0177)  /* only defined if WIFSIGNALED */
    #define WCOREDUMP(s)   ((s) & 0200)  /* only defined if WIFSIGNALED */
On some 16-bit machines the >>8 may have to be changed, or (s) may have to be cast to unsigned. These macros are meant to be applied to an int, not a union wait; most compilers will do the right thing anyway, but be careful.
Thanks to Chris Torek, Christos S. Zoulas, and David J. MacKenzie for their comments. Thanks also to John F. Haugh for a series of questions which pointed out that, somewhere in this paper, I should emphasize that an "unused" process group is one which doesn't appear in any t_pgrp or p_pgrp. (There, I said it.)
© Copyright 1991 Daniel J. Bernstein. Presumably.