## memfd-bind ##
> **NOTE**: Since runc 1.2.0, runc uses a private overlayfs mount to
> protect the runc binary (if you are on Linux 5.1 or later). This protection
> is far more lightweight than `memfd-bind`, and for most users it should
> obviate the need for `memfd-bind` entirely. Rootless containers will still
> make a memfd copy (unless you are running `runc` itself inside a user
> namespace, à la [`rootlesskit`][rootlesskit], and are on Linux 5.11 or later),
> but `memfd-bind` is not particularly useful for rootless container users anyway
> (see [Caveats](#Caveats) for more details).

`runc` sometimes has to make a binary copy of itself when constructing a
container process in order to defend against certain container runtime attacks
such as CVE-2019-5736.

This cloned binary only exists until the container process starts (this means
for `runc run` and `runc exec`, it only exists for a few hundred milliseconds
-- for `runc create` it exists until `runc start` is called). However, because
the clone is done using a memfd (or by creating files in directories that are
likely to be a `tmpfs`), this can lead to temporary increases in *host* memory
usage. Unless you are running on a cgroupv1 system with the memory controller
enabled and the (deprecated) `memory.move_charge_at_immigrate` setting turned
on, there is no effect on the container's memory.

However, for certain configurations this can still be undesirable. This daemon
allows you to create a sealed memfd copy of the `runc` binary, which will cause
`runc` to skip all binary copying, resulting in no additional memory usage for
each container process (instead there is a single in-memory copy of the
binary). It should be noted that (strictly speaking) this is slightly less
secure if you are concerned about Dirty COW-like 0-day kernel vulnerabilities,
but for most users the security benefit is identical.

The provided `memfd-bind@.service` file can be used to get systemd to manage
this daemon. You can supply the path like so:
```bash
systemctl start memfd-bind@$(systemd-escape -p /usr/bin/runc)
```
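As a rough sketch of how the unit name in the command above is derived, the helper below prints the full instance name for a given path. The function name `memfd_bind_unit` is hypothetical and not part of runc; the sketch assumes `systemd-escape` is installed and that the template unit is installed as `memfd-bind@.service`.

```bash
# Hypothetical helper (not shipped with runc): print the instance name of the
# memfd-bind@.service template for a given runc path.
memfd_bind_unit() {
    # Default to /usr/bin/runc, the path used in the example above.
    local runc_path="${1:-/usr/bin/runc}"
    # systemd-escape -p turns a filesystem path into a valid instance name.
    printf 'memfd-bind@%s.service\n' "$(systemd-escape -p "$runc_path")"
}

# Example usage (requires systemd; commented out so the sketch has no side effects):
#   unit="$(memfd_bind_unit /usr/bin/runc)"
#   systemctl start "$unit"
#   systemctl is-active "$unit"
```

Quoting the path and the resulting unit name keeps the commands safe if `runc` is installed somewhere with unusual characters in its path.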
Thus, there are several ways of protecting against CVE-2019-5736. The two
described here are, in increasing order of memory usage:
* `memfd-bind` only creates a single in-memory copy of the `runc` binary (about
10MB), regardless of how many containers are running.
* The classic method of making a copy of the entire `runc` binary during
container process setup takes up about 10MB per process spawned inside the
container by runc (both pid1 and `runc exec`).

[rootlesskit]: https://github.com/rootless-containers/rootlesskit
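To make the comparison above concrete, here is a back-of-the-envelope sketch of the extra host memory each method might need. It only multiplies out the approximate 10MB figure quoted above; the function name, the method labels, and the idea of counting "concurrent spawns" are assumptions made for illustration, and the real numbers depend on your `runc` build.

```bash
# Rough estimate only. ~10MB is the approximate binary size quoted above.
RUNC_COPY_MB=10

# estimate_overhead_mb METHOD CONCURRENT_SPAWNS
#   METHOD is "memfd-bind" or "classic" (labels chosen for this sketch).
#   CONCURRENT_SPAWNS is how many container processes (pid1 or `runc exec`)
#   are being set up at the same moment, since each classic copy only lives
#   until its container process starts.
estimate_overhead_mb() {
    case "$1" in
        memfd-bind) echo "$RUNC_COPY_MB" ;;             # one shared copy
        classic)    echo $(( RUNC_COPY_MB * $2 )) ;;    # one copy per spawn
        *)          echo "unknown method: $1" >&2; return 1 ;;
    esac
}

estimate_overhead_mb memfd-bind 50   # prints 10
estimate_overhead_mb classic 50      # prints 500
```

Because the classic copies are short-lived, the second figure is a transient peak rather than steady-state usage.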
### Caveats ###
There are several downsides to using `memfd-bind` on the `runc` binary:
* The `memfd-bind` process needs to continue to run indefinitely in order for
the memfd reference to stay alive. If the process is forcefully killed, the
bind-mount on top of the `runc` binary will become stale and nobody will be
able to execute it (you can use `memfd-bind --cleanup` to clean up the stale
mount).
* Only root can execute the cloned binary due to permission restrictions on
accessing other processes' files. More specifically, only users with ptrace
privileges over the `memfd-bind` daemon can access the file (but in practice
this is usually only root).
* When updating `runc`, the daemon needs to be stopped before the update (so
the package manager can access the underlying file) and then restarted after
the update.
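The first and last caveats can be sketched as a pair of shell helpers: one that checks whether something is still mounted on top of the `runc` path, and one that wraps an update between a stop and a start of the daemon. All function names here are hypothetical, the unit name assumes `runc` lives at `/usr/bin/runc`, and the `apt-get` command is only a stand-in for whatever package manager you use.

```bash
# Sketch only; these helpers are not part of runc.

# Return 0 if the given path is currently a mount point (for example the
# bind-mount that memfd-bind places on the runc binary), judging by
# /proc/self/mountinfo. Field 5 of each line is the mount point.
is_mountpoint() {
    awk -v p="$1" '$5 == p { found = 1 } END { exit !found }' /proc/self/mountinfo
}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# update_runc UNIT UPDATE_COMMAND...
# Stop the daemon, run the caller-supplied update command, then restart the
# daemon even if the update failed. Returns the update command's exit status.
update_runc() {
    local unit="$1"; shift
    local rc=0
    run systemctl stop "$unit" || return 1
    run "$@" || rc=$?
    run systemctl start "$unit" || return 1
    return "$rc"
}

# Example: print the three commands without running them.
DRY_RUN=1 update_runc memfd-bind@usr-bin-runc.service apt-get install --only-upgrade runc
```

If `is_mountpoint /usr/bin/runc` succeeds while the daemon is not running, the mount is likely the stale one described above, and `memfd-bind --cleanup` is the tool for removing it.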