Limiting services

It is a common requirement to place limits on what services can do.

Running under the aegises of unprivileged user accounts

One way to limit services is to use the operating system's mechanisms that limit what unprivileged users can do. One "drops privileges" from running as the superuser to running as an unprivileged user. One then …

Sometimes the main dæmon program itself drops privileges, internally. Usually it does this as part of an overall sequence of setting up a changed root and then dropping privileges. (This is because it requires a lot more setup, and a lot more files and directories exposed, to use chain-loading to set up a changed root. Chain loading involves the overlay of new process images that must be visible, along with the dynamic loader and any dynamically loaded shared objects, in the changed root environment. All of that needs to be set up with hard links, bind mounts, and whatnot.) In such cases, one can use envuidgid to look up the user ID and group ID of the unprivileged user account that the program should switch to, and place them in the environment for it to read.
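The hand-off described above can be sketched roughly in Python. This is an illustration only, not the toolset's implementation. It assumes the daemontools convention that envuidgid publishes the IDs in the UID and GID environment variables. The function names and the `system` parameter (which exists only so that the ordering can be exercised without superuser privileges) are hypothetical.

```python
import os
import pwd


def uidgid_environment(account, environ=None):
    """Return a copy of the environment with UID and GID set for `account`.

    Roughly what an envuidgid-style tool does before chaining to the daemon.
    """
    entry = pwd.getpwnam(account)  # raises KeyError if there is no such account
    env = dict(os.environ if environ is None else environ)
    env["UID"] = str(entry.pw_uid)
    env["GID"] = str(entry.pw_gid)
    return env


def drop_privileges(environ, system=os):
    """Daemon side: read the IDs from the environment and switch to them.

    The order matters: supplementary groups and the group ID must be changed
    while the process is still the superuser, so the user ID is changed last.
    """
    uid = int(environ["UID"])
    gid = int(environ["GID"])
    system.setgroups([gid])  # reduce the supplementary groups to just this one
    system.setgid(gid)
    system.setuid(uid)
```

A dæmon that follows this pattern would normally set up its changed root first and call something like `drop_privileges(os.environ)` immediately afterwards.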

Other times privileges are dropped by a sequence of chain-loading tools leading up to the execution of the main dæmon program itself. Usually this is the case where there is no changed root involved. In such cases, one can use setuidgid or setuidgid-fromenv to drop privileges.

This is heavily used in logging services.

Resource limits

Operating systems provide one or two mechanisms for setting resource limits for services.

Using the original Unix resource limit mechanism

The original Unix resource limit mechanism is controllable in a run program by using the softlimit and the hardlimit utilities, which have the conventional daemontools style of interface (complete with a unified "memory" setting that sets several limits as one), or the ulimit utility, which has an interface similar to the built-in command of the same name in POSIX-conformant shells. This resource limit mechanism has some well-known shortcomings, which one may or may not hit, depending on exactly what one's dæmon does. A dæmon that never spawns child processes will not, for example, run into the well-known problem that some of these Unix resource limits are per-process.

For example: The authors of MongoDB recommend several resource settings for when running MongoDB under a service manager. The run program for the MongoDB service bundle implements them as follows:

hardlimit -o 64000 -p 64000
softlimit -o hard -p hard
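The soft and hard halves of a limit, which these two commands set separately, can be seen from Python's standard resource module. The following is a minimal sketch of the underlying system interface, not of how the utilities themselves are written. The function name `set_soft_limit` is hypothetical, and its acceptance of the word "hard" merely mimics the command line above.

```python
import resource


def set_soft_limit(which, value):
    """Set only the soft limit for `which`, leaving the hard limit alone.

    `value` may be a number or the string "hard", meaning "as high as the
    hard limit allows". Numbers above the hard limit are clamped to it,
    because an unprivileged process cannot raise a soft limit past it.
    """
    soft, hard = resource.getrlimit(which)
    unlimited = resource.RLIM_INFINITY
    if value == "hard":
        value = hard
    elif hard != unlimited and (value == unlimited or value > hard):
        value = hard
    resource.setrlimit(which, (value, hard))
    return resource.getrlimit(which)
```

For instance, `set_soft_limit(resource.RLIMIT_NOFILE, "hard")` raises the open file descriptors soft limit to whatever the hard limit currently is. The new limits apply to the calling process and are inherited by whatever it subsequently executes, which is what makes the chain-loading approach work.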

Using the Linux control groups

The Linux "control groups" mechanism is an enhanced version of the original Unix mechanism, intended to overcome some of its limitations with respect to limits constraining multiple processes. It is used from run programs with the move-to-control-group, the set-control-group-knob, and the delegate-control-group-to utilities.

The basic principles of operation are these:

An example of this is the user-services@username service, whose start program sets up a control group for the service, changes to it, and allows the named user to make further sub-groups:

move-to-control-group ../"user-services@".service
move-to-control-group "user-services@username".service
foreground delegate-control-group-to username ;

Its run program changes to the same control group and then drops privileges:

move-to-control-group ../"user-services@".service
move-to-control-group "user-services@username".service
setsid
setuidgid --supplementary username

Notice that this is an instance of a service that is generated (by the external formats conversion mechanism individually for each user) from a template. It employs a convention of a two-level set of control groups, one for all services generated from the template and one for each individual instance.
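At the kernel level, changing to a control group amounts to creating a directory in the control groups filesystem and writing a process ID to its cgroup.procs file. The following Python sketch shows how the two relative moves above compose into the two-level layout. It is an assumption-laden illustration, not the tool's source: the function name and parameters are hypothetical, and `root` stands in for wherever the unified hierarchy is mounted (conventionally /sys/fs/cgroup).

```python
import os
import posixpath


def move_to_control_group(root, current, group, pid):
    """Create `group` (resolved relative to `current`) and move `pid` into it.

    Returns the new control group path, relative to the hierarchy root, so
    that a subsequent move can be resolved against it.
    """
    target = posixpath.normpath(posixpath.join(current, group))
    directory = os.path.join(root, target.lstrip("/"))
    os.makedirs(directory, exist_ok=True)
    # On a real control groups filesystem the kernel supplies cgroup.procs.
    with open(os.path.join(directory, "cgroup.procs"), "w") as procs:
        procs.write(f"{pid}\n")
    return target
```

Starting from a service manager in /service-manager.slice/me.slice, moving to "../user-services@.service" and then to "user-services@jim.service" ends up in /service-manager.slice/user-services@.service/user-services@jim.service.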

An example of a service that twiddles control group knobs is the dbus service, whose start program limits the number of processes that can run in the control group:

foreground set-control-group-knob ../cgroup.subtree_control "+pids" ;
move-to-control-group ../dbus.service
oom-kill-protect -- -800
foreground set-control-group-knob --percent-of /proc/sys/kernel/threads-max --infinity-is-max pids.max 20 ;

Its run program only needs to change to the same control group before dropping privileges (which is actually done by the main dæmon program itself):

move-to-control-group ../dbus.service
oom-kill-protect -- -800

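This uses set-control-group-knob for two things: to enable the "pids" controller for child groups, by writing "+pids" to the parent group's cgroup.subtree_control; and to set the group's pids.max knob to 20 percent of the value in /proc/sys/kernel/threads-max.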

A full description of what control group knobs there are and what limits they effect is beyond the scope of this Guide. See the documentation that accompanies the kernel, in particular Documentation/cgroup-v2.txt.
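As a rough illustration of what twiddling a knob involves, each knob is a file in the control group's directory, and setting it is a plain write. The Python sketch below also shows a percentage calculation of the kind used for pids.max in the dbus example. The function names are hypothetical, and the use of integer division for rounding is an assumption rather than a statement of what set-control-group-knob actually does.

```python
import os


def percent_of(base_file, percent):
    """Return `percent` per cent of the integer stored in `base_file`."""
    with open(base_file) as base:
        value = int(base.read().strip())
    return value * percent // 100  # assumption: round down


def set_knob(group_directory, knob, value):
    """Write `value` to the knob file `knob` in the given control group."""
    with open(os.path.join(group_directory, knob), "w") as knob_file:
        knob_file.write(f"{value}\n")
```

With these, the equivalent of the last line of the dbus start program would be roughly `set_knob(group, "pids.max", percent_of("/proc/sys/kernel/threads-max", 20))`, where `group` is the directory of the dbus.service control group.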

There is a widely circulated notion that a central "control groups manager" is required for Linux control groups. This is simply untrue. It results from a control group "manager" (which merely did some rules matching in order to slap control groups onto processes that did not do control groups themselves) and a rejected proposal from systemd being presented on the World Wide Web for many years as if they were accomplished fact. Control groups do not require a central "manager", and were designed to be used in a distributed fashion with no central controller at all. The distributed operation here, where individual services create and configure control groups separately from the system manager and service manager, which also create and configure other control groups, is a demonstration of that.

An example of what this results in

Here is a (slightly shortened) view of what the (unified) control groups tree looks like, as printed by systemd-cgls /, on a system that uses the native system manager, per-user manager, and service manager. The instances of /sbin/init are the system manager (PID 1), its logging service (PID 204), and the system-wide service manager (PID 205).

/:
├━me.slice
│ └━1 /sbin/init
├━service-manager.slice
│ ├━ttylogin@.service
│ │ ├━ttylogin@vc3-tty.service
│ │ │ └━935 login
│ │ │ └━27326 systemd-cgls /
│ │ └━ttylogin@vc2-tty.service
│ │   └━941 login
│ ├━tinydns.service
│ │ └━926 tinydns
│ ├━dnscache.service
│ │ └━927 dnscache
│ ├━NetworkManager.service
│ │ ├━1020 NetworkManager --no-daemon
│ │ └━1636 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp-helper -p…
│ ├━dbus.service
│ │ └━846 dbus-daemon --config-file ./system-wide.conf --nofork --nopidfile -…
│ ├━udev-log.service
│ │ └━245 cyclog udev/
│ ├━me.slice
│ │ └━205 /sbin/init
│ ├━user-services@.service
│ │ └━user-services@jim.service
│ │   ├━me.slice
│ │   │ └━27299 per-user-manager
│ │   ├━service-manager.slice
│ │   │ └━me.slice
│ │   │   ├━27302 service-manager
│ │   │   ├━simple-servers-log.service
│ │   │   │ └━27309 cyclog jim/simple-servers/
│ │   │   └━urxvt.service
│ │   │     ├━27312 urxvtd
│ │   │     └━27313 urxvtd
│ │   └━per-user-manager-log.slice
│ │     └━27301 cyclog --max-file-size 262144 --max-total-size 1048576 .
│ ├━klogd.service
│ │ └━847 klog-read
│ ├━udev.service
│ │ └━250 udevd --debug
│ └━cyclog@.service
│   ├━cyclog@dnscache.service
│   │ └━725 cyclog dnscache/
│   ├━cyclog@NetworkManager.service
│   │ └━713 cyclog NetworkManager/
│   ├━cyclog@terminal-emulator@vc2.service
│   │ └━724 cyclog terminal-emulator@vc2/
│   ├━cyclog@local-syslog-read.service
│   │ └━738 cyclog local-syslog-read/
│   ├━cyclog@tinydns.service
│   │ └━720 cyclog tinydns/
│   ├━cyclog@dbus.service
│   │ └━735 cyclog dbus/
│   ├━cyclog@terminal-emulator@vc3.service
│   │ └━716 cyclog terminal-emulator@vc3/
│   ├━cyclog@ttylogin@vc2-tty.service
│   │ └━759 cyclog ttylogin@vc2-tty/
│   ├━cyclog@ttylogin@vc3-tty.service
│   │ └━760 cyclog ttylogin@vc3-tty/
│   └━cyclog@klogd.service
│     └━711 cyclog klogd/
└━system-manager-log.slice
  └━204 /sbin/init

Other toolsets and other settings

The nosh toolset is not the only toolset with chain loading tools for affecting dæmon process state. Other toolsets include various useful chain loading tools relating to resource usage control, such as:

rtprio (BSD) and chrt (Linux)
Change scheduling priority.
numactl (Linux)
Change NUMA settings.

Mounts and namespaces

Linux has a system of namespaces which can be used to limit what a service sees of the rest of the system. (See the Linux kernel documentation for details of what the namespaces are.)

Manipulating Linux namespaces is the province of the unshare, set-mount-object, make-private-fs, and make-read-only-fs commands, used in chains in run programs. With them a process detaches from one or more shared namespaces, and then manipulates its (now) private namespaces to show a different view of the system.

For example, one can set up a "no hardware devices" view of the world, where only the "API" devices (for shared memory, pseudo-terminals, file descriptors, randomness, and suchlike) are available, with the following chain:

unshare --mount
set-mount-object --recursive slave /
make-private-fs --devices
set-mount-object --recursive shared /