| ## systemd cgroup driver |
| |
| By default, runc creates cgroups and sets cgroup limits on its own (this mode |
| is known as fs cgroup driver). When `--systemd-cgroup` global option is given |
| (as in e.g. `runc --systemd-cgroup run ...`), runc switches to systemd cgroup |
| driver. This document describes its features and peculiarities. |
| |
| ### systemd unit name and placement |
| |
| When creating a container, runc requests systemd (over dbus) to create |
| a transient unit for the container, and place it into a specified slice. |
| |
| The name of the unit and the containing slice is derived from the container |
| runtime spec in the following way: |
| |
| 1. If `Linux.CgroupsPath` is set, it is expected to be in the form |
| `[slice]:[prefix]:[name]`. |
| |
| Here `slice` is a systemd slice under which the container is placed. |
| If empty, it defaults to `system.slice`, except when cgroup v2 is |
| used and rootless container is created, in which case it defaults |
| to `user.slice`. |
| |
| Note that `slice` can contain dashes to denote a sub-slice |
| (e.g. `user-1000.slice` is a correct notation, meaning a subslice |
| of `user.slice`), but it must not contain slashes (e.g. |
| `user.slice/user-1000.slice` is invalid). |
| |
| A `slice` of `-` represents a root slice. |
| |
| Next, `prefix` and `name` are used to compose the unit name, which |
| is `<prefix>-<name>.scope`, unless `name` has `.slice` suffix, in |
| which case `prefix` is ignored and the `name` is used as is. |
| |
| 2. If `Linux.CgroupsPath` is not set or empty, it works the same way as if it |
| would be set to `:runc:<container-id>`. See the description above to see |
| what it transforms to. |
| |
| As described above, a unit being created can either be a scope or a slice. |
| For a scope, runc specifies its parent slice via a _Slice=_ systemd property, |
| and also sets _Delegate=true_. For a slice, runc specifies a weak dependency on |
| the parent slice via a _Wants=_ property. |
| |
| ### Resource limits |
| |
| runc always enables accounting for all controllers, regardless of any limits |
| being set. This means it unconditionally sets the following properties for the |
| systemd unit being created: |
| |
| * _CPUAccounting=true_ |
| * _IOAccounting=true_ (_BlockIOAccounting_ for cgroup v1) |
| * _MemoryAccounting=true_ |
| * _TasksAccounting=true_ |
| |
| The resource limits of the systemd unit are set by runc by translating the |
| runtime spec resources to systemd unit properties. |
| |
| Such translation is by no means complete, as there are some cgroup properties |
| that can not be set via systemd. Therefore, runc systemd cgroup driver is |
| backed by fs driver (in other words, cgroup limits are first set via systemd |
| unit properties, and when by writing to cgroupfs files). |
| |
| The set of runtime spec resources which is translated by runc to systemd unit |
| properties depends on kernel cgroup version being used (v1 or v2), and on the |
| systemd version being run. If an older systemd version (which does not support |
| some resources) is used, runc do not set those resources. |
| |
| The following tables summarize which properties are translated. |
| |
| #### cgroup v1 |
| |
| | runtime spec resource | systemd property name | min systemd version | |
| |-----------------------|-----------------------|---------------------| |
| | memory.limit | MemoryLimit | | |
| | cpu.shares | CPUShares | | |
| | blockIO.weight | BlockIOWeight | | |
| | pids.limit | TasksMax | | |
| | cpu.cpus | AllowedCPUs | v244 | |
| | cpu.mems | AllowedMemoryNodes | v244 | |
| |
| #### cgroup v2 |
| |
| | runtime spec resource | systemd property name | min systemd version | |
| |-------------------------|-----------------------|---------------------| |
| | memory.limit | MemoryMax | | |
| | memory.reservation | MemoryLow | | |
| | memory.swap | MemorySwapMax | | |
| | cpu.shares | CPUWeight | | |
| | pids.limit | TasksMax | | |
| | cpu.cpus | AllowedCPUs | v244 | |
| | cpu.mems | AllowedMemoryNodes | v244 | |
| | unified.cpu.max | CPUQuota, CPUQuotaPeriodSec | v242 | |
| | unified.cpu.weight | CPUWeight | | |
| | unified.cpu.idle | CPUWeight | v252 | |
| | unified.cpuset.cpus | AllowedCPUs | v244 | |
| | unified.cpuset.mems | AllowedMemoryNodes | v244 | |
| | unified.memory.high | MemoryHigh | | |
| | unified.memory.low | MemoryLow | | |
| | unified.memory.min | MemoryMin | | |
| | unified.memory.max | MemoryMax | | |
| | unified.memory.swap.max | MemorySwapMax | | |
| | unified.pids.max | TasksMax | | |
| |
| For documentation on systemd unit resource properties, see |
| `systemd.resource-control(5)` man page. |
| |
| ### Auxiliary properties |
| |
| Auxiliary properties of a systemd unit (as shown by `systemctl show |
| <unit-name>` after the container is created) can be set (or overwritten) by |
| adding annotations to the container runtime spec (`config.json`). |
| |
| For example: |
| |
| ```json |
| "annotations": { |
| "org.systemd.property.TimeoutStopUSec": "uint64 123456789", |
| "org.systemd.property.CollectMode":"'inactive-or-failed'" |
| }, |
| ``` |
| |
| The above will set the following properties: |
| |
| * `TimeoutStopSec` to 2 minutes and 3 seconds; |
| * `CollectMode` to "inactive-or-failed". |
| |
| The values must be in the gvariant text format, as described in |
| [gvariant documentation](https://docs.gtk.org/glib/gvariant-text-format.html). |
| |
| To find out which type systemd expects for a particular parameter, please |
| consult systemd sources. |