 # Support for user namespaces

 Kubernetes has supported running pods with user namespaces since v1.25. This document explains
 containerd's support for this feature.

 ## What are user namespaces?

 A user namespace isolates the user running inside the container from the one in the host.

 A process running as root in a container can run as a different (non-root) user in the host; in
 other words, the process has full privileges for operations inside the user namespace, but is
 unprivileged for operations outside the namespace.

 You can use this feature to reduce the damage a compromised container can do to the host or other
 pods in the same node. Several security vulnerabilities rated either HIGH or CRITICAL are not
 exploitable when user namespaces are active. User namespaces are expected to mitigate some future
 vulnerabilities too.

 See [the Kubernetes documentation][kube-intro] for a high-level introduction to
 user namespaces.

 [kube-intro]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#introduction

 ## Stack requirements

 The Kubernetes implementation was redesigned in 1.27, so the requirements differ between Kubernetes
 versions before 1.27 and versions 1.27 and later.

 Please note that if you try to use user namespaces with containerd 1.6 or older, the `hostUsers:
 false` setting in your pod.spec will be **silently ignored**.

 ### Kubernetes 1.25 and 1.26

  * Containerd 1.7
  * You can use runc or crun as the OCI runtime:
    * runc 1.1 or greater
    * crun 1.4.3 or greater

 You can also use containerd 2.0 or above. In that case, all the [requirements for Kubernetes 1.27
 and greater](#Kubernetes-127-and-greater) apply, including file-systems supporting idmap mounts,
 except for the Linux kernel version. You can use these Linux versions:

  * Linux 5.15: you will suffer from [the containerd 1.7 storage and latency
    limitations](#Limitations), as it doesn't support idmap mounts for overlayfs.
  * Linux 5.19 or greater (recommended): it doesn't suffer from any of the containerd 1.7
    limitations, as overlayfs started supporting idmap mounts on this kernel version.

 ### Kubernetes 1.27 and greater

  * Linux 6.3 or greater
  * Containerd 2.0 or greater
  * You can use runc or crun as the OCI runtime:
    * runc 1.2 or greater
    * crun 1.9 or greater

 Furthermore, all the file-systems used by the volumes in the pod need kernel-support for idmap
 mounts. Some popular file-systems that support idmap mounts in Linux 6.3 are: `btrfs`, `ext4`, `xfs`,
 `fat`, `tmpfs`, `overlayfs`.

 The kubelet is in charge of populating some files in the containers (like configmaps, secrets, etc.).
 The file-system used in that path needs to support idmap mounts too. See [the Kubernetes
 documentation][kube-req] for more information.


 [kube-req]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#before-you-begin
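
A quick way to compare the running kernel against the versions listed above is sketched below. The `version_ge` helper is a hypothetical name, not part of any tool, and it assumes a `sort` that supports `-V` (as GNU coreutils does). The containerd, runc and crun versions still need to be checked separately.

```sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), which is an assumption about your system.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Kernel release without the distro suffix, e.g. "6.5.0-14-generic" -> "6.5.0".
kernel_version=$(uname -r | cut -d- -f1)

if version_ge "$kernel_version" 6.3; then
    echo "kernel $kernel_version: meets the Kubernetes 1.27+ requirement (6.3)"
elif version_ge "$kernel_version" 5.19; then
    echo "kernel $kernel_version: overlayfs supports idmap mounts (5.19)"
else
    echo "kernel $kernel_version: no idmap mounts for overlayfs"
fi

# The other components report their own versions:
#   containerd --version
#   runc --version
#   crun --version
```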

 ## Creating a Kubernetes pod with user namespaces

 First check your containerd, Linux and Kubernetes versions. If those are okay, then no special
 configuration is needed on containerd. You can just follow the steps on the [Kubernetes
 website][kube-example].

 [kube-example]: https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/
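
For orientation, a minimal sketch of such a pod follows. The file name, pod name, image and command are placeholders chosen for illustration; `hostUsers: false` is the only field that matters here. The Kubernetes page linked above remains the authoritative walkthrough.

```sh
# Write a minimal pod manifest; `hostUsers: false` is the field that requests
# a user namespace. The pod name, image and command are placeholders.
cat > userns-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
EOF

# Then, against a cluster that meets the requirements above:
#   kubectl apply -f userns-pod.yaml
#   kubectl exec userns -- cat /proc/self/uid_map
```

If the pod really runs in a user namespace, `uid_map` shows a mapping other than the identity mapping `0 0 4294967295` of the host namespace. This is a useful check because older containerd versions silently ignore the setting.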

 ## Limitations

 You can check the limitations Kubernetes has [here][kube-limitations]. Different Kubernetes
 versions have different limitations, so be sure to check the site for the Kubernetes version you
 are using.

 Different containerd versions have different limitations too. Those are highlighted in this section.

 [kube-limitations]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#limitations

 ### containerd 1.7

 One limitation present in containerd 1.7 is that it needs to change the ownership of every file and
 directory inside the container image, during Pod startup. This means it has a storage overhead, as
 **the size of the container image is duplicated each time a pod is created**, and can significantly
 impact the container startup latency, as doing such a copy takes time too.

 You can mitigate this limitation by switching `/sys/module/overlay/parameters/metacopy` to `Y`. This
 will significantly reduce the storage and performance overhead, as only the inode for each file of
 the container image will be duplicated, but not the content of the file. This means it will use less
 storage and it will be faster. However, it is not a panacea.

 If you change the metacopy param, make sure to do it in a way that is persistent across reboots. You
 should also be aware that this setting will be used for all containers, not just containers with
 user namespaces enabled. This will affect all the snapshots that you take manually (if you happen to
 do that). In that case, make sure to use the same value of `/sys/module/overlay/parameters/metacopy`
 when creating and restoring the snapshot.
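
The sketch below shows one way to inspect the parameter. `metacopy_status` is a hypothetical helper name. The commented commands for changing and persisting the value are assumptions about a typical setup where overlay is loaded as a module; the modprobe file name is made up for illustration.

```sh
# Hypothetical helper: print the current overlayfs metacopy setting.
# The path defaults to the parameter named above; it is an argument only so
# the function can be pointed at another file.
metacopy_status() {
    param=${1:-/sys/module/overlay/parameters/metacopy}
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "unavailable"
    fi
}

metacopy_status

# To enable it for the running kernel (as root):
#   echo Y > /sys/module/overlay/parameters/metacopy
# To keep it across reboots, one common approach is a modprobe option file
# (file name is an assumption; check how your distribution loads overlay):
#   echo 'options overlay metacopy=Y' > /etc/modprobe.d/overlay-metacopy.conf
```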

 ### containerd 2.0 and above

 The storage and latency limitations of containerd 1.7 are not present in containerd 2.0 and above
 if you use the overlay snapshotter (the default). It uses no extra storage and adds no extra
 startup latency.

 This is achieved by using the kernel feature idmap mounts with the container rootfs (the container
 image). This allows an overlay file-system to expose the image with different UID/GID without copying
 the files nor the inodes, just using a bind-mount.

 By default, containerd refuses to create a container with user namespaces if overlayfs is the
 snapshotter and the running kernel doesn't support idmap mounts for overlayfs. This ensures that,
 before falling back to the expensive chown (in terms of storage and pod startup latency), you
 understand the implications and explicitly opt in. See the containerd 1.7 limitations for an
 explanation of those costs.

 If your kernel doesn't support idmap mounts for the overlayfs snapshotter, you will see an error
 like:

 ```
 failed to create containerd container: snapshotter "overlayfs" doesn't support idmap mounts on this host, configure `slow_chown` to allow a slower and expensive fallback
 ```

 Linux supports idmap mounts on an overlayfs since version 5.19.

 You can opt in to the slow chown by adding the `slow_chown` field to the overlayfs snapshotter
 section of your containerd config, like this:

 ```
   [plugins."io.containerd.snapshotter.v1.overlayfs"]
     slow_chown = true
 ```
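
After editing the config, a plain text check like the following sketch can confirm the field is present. `has_slow_chown` is a hypothetical helper, and `/etc/containerd/config.toml` is assumed to be the config location (the usual default; yours may differ). It matches text only and does not verify which TOML section the key sits in.

```sh
# Hypothetical helper: succeed when a config file sets slow_chown = true.
# This is a plain text match, not a TOML parser.
has_slow_chown() {
    grep -Eq '^[[:space:]]*slow_chown[[:space:]]*=[[:space:]]*true' "$1"
}

config=/etc/containerd/config.toml   # assumed default location
if [ -r "$config" ] && has_slow_chown "$config"; then
    echo "slow_chown is enabled in $config"
else
    echo "slow_chown is not enabled in $config"
fi

# containerd reads its config at startup, so restart it after editing, e.g.:
#   sudo systemctl restart containerd
```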

 Note that only overlayfs users need to opt in to the slow chown, as it is the only snapshotter for
 which containerd provides a better option (only the overlayfs snapshotter supports idmap mounts in
 containerd). If you use another snapshotter, you will fall back to the expensive chown without
 needing to opt in.

 To double-check that your container is using idmap mounts for the container image, create a pod
 with user namespaces, exec into it and run:

 ```
 mount | grep overlay
 ```

 You should see a reference to the idmap mount in the `lowerdir` parameter. In this example,
 `idmapped` appears in the path:

 ```
 overlay on / type overlay (rw,relatime,lowerdir=/tmp/ovl-idmapped823885363/0,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1018/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1018/work)
 ```
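
The same check can be scripted. The sketch below assumes the `lowerdir` path contains the string `idmapped`, as in the sample output above; that naming is an observation from this example rather than a documented guarantee. `rootfs_uses_idmap` is a hypothetical helper name.

```sh
# Hypothetical helper: read `mount` output on stdin and succeed when the
# overlay root mount has "idmapped" somewhere in its lowerdir option.
rootfs_uses_idmap() {
    grep -E '^overlay on / ' | grep -Eq 'lowerdir=[^,)]*idmapped'
}

# Run inside the container:
if mount | rootfs_uses_idmap; then
    echo "rootfs is served through an idmap mount"
else
    echo "no idmapped lowerdir found for /"
fi
```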

 ## Creating a container with user namespaces with `ctr`

 You can also create a container with user namespaces using `ctr`. Be warned that this is a
 lower-level workflow.

 Create a directory where we will work:

 ```sh
 mkdir -p /tmp/userns-test
 cd /tmp/userns-test
 ```

 Please note that we will need +x permissions on all components in the path to the rootfs (like
 `/tmp` and `/tmp/userns-test`). So, it's recommended to do this inside `/tmp`, as that will have
 the right permissions.

 Create an OCI bundle:
 ```sh
 # create the rootfs directory
 mkdir rootfs

 # export busybox via Docker into the rootfs directory
 docker export $(docker create busybox) | tar -C rootfs -xvf -

 # adjust the permissions
 sudo chown -R 65536:65536 rootfs/
 ```

 Copy [this config.json](./config.json) to `/tmp/userns-test`. Note that the `root.path` field in
 the config.json points to the rootfs we just created. This **needs to be an absolute path**.

 Then create and start the container with:

 ```
 sudo ctr c create --config config.json userns-test
 sudo ctr t start userns-test
 ```

 This will open a shell inside the container. To verify that you are inside a user namespace, run:

 ```
 root@runc:/# cat /proc/self/uid_map
          0      65536      65536
 ```

 Your output should match the above exactly.
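
Each `uid_map` line has three fields: the first UID inside the namespace, the first UID outside it, and the length of the range. As a worked illustration, the sketch below translates a UID inside the container to the host UID. `map_uid` is a hypothetical helper, not part of containerd or `ctr`.

```sh
# Hypothetical helper: translate a UID inside the namespace to the host UID.
# Reads uid_map-formatted lines ("inside-start outside-start length") on stdin
# and exits non-zero when the UID is not covered by any range.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit !found }
    '
}

# With the mapping shown above, UID 1000 in the container is 66536 on the host.
printf '0 65536 65536\n' | map_uid 1000
```

Inside the container, `map_uid 0 < /proc/self/uid_map` performs the same lookup for root against the live mapping.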