Design doc: debugd

Warning: This document is old & has moved. Please update any links:
https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/debugd/docs/design.md

Objective

Expose system debugging information over DBus to allow better sandboxing of the user session and more detailed diagnostic availability through Chrome.

Background

Currently, our debugging and diagnostic tools (specifically those implemented in crosh and in chrome://system) work by shelling out to run native code. This exposes a lot of surface area via crosh (and Chrome, to a lesser extent) and forces us to allow those contexts to execute programs and read files, which they otherwise have no need to do. Another concern is that some of these diagnostics (for example, crosh's ‘ping’ command) rely on executing setuid binaries. Removing the ability to use setuid altogether from the user session and from crosh removes a lot of attack surface that is otherwise exposed in the linker and kernel.

Overview

Safely expose system debugging information over DBus. Contexts that would otherwise need very broad access to exec() and setuid binaries can then be restricted to communicating over DBus.

Detailed Design

The debug daemon will be implemented as a single daemon, running as an unprivileged user, communicating over DBus. It will accept commands over DBus and either compute the information itself or run a helper program, then hand the result back over DBus. The debug daemon does not cache results for repeated requests. The debug daemon will run under strict seccomp system-call filtering rules, which will reduce the kernel ABI exposed to debugd and its helpers.

The debug daemon will present its functionality as a single object at a fixed path /org/chromium/debugd implementing the interface described in /dbus_bindings/org.chromium.debugd.xml. All the debugd methods can be synchronous, since debugd is used only to fetch debugging info. We don't need to worry about concurrent users: it is unlikely that the user will run two debug commands from two different crosh instances at once, and even if they do, the commands will be queued. We do need to be concerned about making chrome://system slower. An example method might be:

CellularStatus : () -> a{sv}

“CellularStatus takes nothing and returns a map from string to variant.”

The implementation is documented in /doc/implementation. In general, the debug daemon blocks inside DBus, waiting for incoming messages; when it receives a message, it looks up the incoming message name in a method table and calls the associated function. The function gathers information and replies to the DBus message as needed.
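The lookup-and-call step described above can be sketched as a plain method table. This is only an illustration, not debugd's code: it leaves out the DBus transport entirely, and the handler body, the returned keys, and the `dispatch` name are hypothetical.

```python
from typing import Any, Callable, Dict

# A handler takes no arguments and returns a string-to-variant map (a{sv}).
Handler = Callable[[], Dict[str, Any]]


def cellular_status() -> Dict[str, Any]:
    """Hypothetical handler; the keys and values are placeholders."""
    return {"carrier": "example", "signal_strength": 0}


# Method table: incoming message name -> function that gathers the data.
METHOD_TABLE: Dict[str, Handler] = {
    "CellularStatus": cellular_status,
}


def dispatch(method_name: str) -> Dict[str, Any]:
    """Look up the incoming method name and call the associated function."""
    handler = METHOD_TABLE.get(method_name)
    if handler is None:
        # A real daemon would send a DBus error reply here instead.
        return {"error": "unknown method: " + method_name}
    return handler()
```

Adding a method under this scheme means writing one function and adding one table entry; the dispatch loop itself does not change.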

The debug daemon also has a list of helpers, fixed at compile time. When debugd starts up, it creates a new tmpfs, visible only to it and its descendants, and mounts it at /mnt/debugd. Each of the helper programs is then launched, and can spool information into the tmpfs as desired, presumably for collection by some method inside debugd. Some helpers are launched as needed instead of running persistently. Helper sources live in /src/helpers.

Files stored in the tmpfs can be written as JSON. Doing so makes it easier to write helpers, since a utility function is available for “reply to this DBus message with this JSON structure”. Protocol buffers are unsuitable for this because they are not self-describing; we would need to compile separate protobuf deserializers for each method into debugd and choose which one to use for each file.
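As a rough sketch of the spooling idea, a helper could write a JSON file into the spool directory and debugd could read it back without knowing its schema. The function names and the file name used in the example are hypothetical, and the spool directory is a parameter here rather than the real /mnt/debugd mount.

```python
import json
import os
import tempfile


def spool_json(spool_dir, name, data):
    """Helper side: write to a temp file, then rename into place,
    so a reader never observes a partially written file."""
    final_path = os.path.join(spool_dir, name)
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp_path, final_path)
    return final_path


def read_spooled(spool_dir, name):
    """Daemon side: JSON is self-describing, so one generic loader
    works for every helper's output."""
    with open(os.path.join(spool_dir, name)) as f:
        return json.load(f)
```

The single generic `read_spooled` is the point of the comparison with protocol buffers: no per-method deserializer has to be chosen for each file.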

Returning Complex Data Structures

Some methods have to return data structures that are not simple (for example, the ‘GetModemStatus’ method). For these methods, we have three choices for moving the complex data structure across DBus:

1. Transport them in DBus' wire format directly.
   - Pro: No conversions needed in debugd
   - Pro: Everyone talking to us implicitly speaks it
   - Con: Chrome needs to turn DBus wire format into its internal Value type for use/display
2. Transport them as protocol buffers.
   - Pro: Typesafe on the wire
   - Con: Need to convert DBus to protobuf in debugd
   - Con: Need a C/C++ helper for crosh to print these
   - Con: Chrome needs to turn these into its internal Value type
3. Transport them as JSON.
   - Pro: Chrome can serialize/deserialize natively.
   - Pro: Human-readable; can be shown directly to user by crosh
   - Pro: Parseable from JavaScript; can manipulate it from an extension.
   - Con: Typesafe only at endpoints
   - Con: Need to convert DBus to JSON in debugd

We use JSON; although it makes more work for debugd, it makes it easier for Chrome and crosh to use debugd.
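The conversion work that falls on debugd might look roughly like the following. This is a sketch under assumptions: the input is modeled as ordinary Python dicts, lists, and bytes rather than real DBus types, the function names are hypothetical, and encoding byte arrays as lists of integers is just one possible choice.

```python
import json


def to_jsonable(value):
    """Recursively turn a nested map/array structure into JSON-safe types."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, (bytes, bytearray)):
        # JSON has no byte-string type; emit a list of integers instead.
        return list(value)
    return value


def encode_reply(value):
    """Serialize with sorted keys so the output is stable across runs."""
    return json.dumps(to_jsonable(value), sort_keys=True)
```

For example, `encode_reply({"b": (1, 2), "a": b"\x01"})` yields `{"a": [1], "b": [1, 2]}`, which crosh could print as-is. The receiving end gets no type checking beyond what it does itself.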

Security Considerations

This daemon will have its own attack surface which we need to take care of. Argument sanitization is of paramount importance, although using execve() instead of /bin/sh to run commands will remove an entire class of attacks that crosh currently has.
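To make the execve() point concrete, here is a minimal sketch of validating an argument and passing it as a separate argv element, so that no shell ever parses it. The function names, the hostname pattern, and the count limit are assumptions for illustration; the pattern is deliberately strict and would reject IPv6 literals, for instance.

```python
import re
import subprocess

# Letters, digits, dots, and hyphens only; must start and end alphanumeric,
# so a value can never look like an option such as "-f".
_HOST_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]{0,251}[A-Za-z0-9])?")


def build_ping_argv(host, count=3):
    """Return an argv list for ping, or raise ValueError on bad input."""
    if type(count) is not int or not 1 <= count <= 10:
        raise ValueError("count out of range")
    if not isinstance(host, str) or not _HOST_RE.fullmatch(host):
        raise ValueError("invalid host")
    return ["/bin/ping", "-c", str(count), host]


def run_helper(argv, timeout=10):
    """Run argv directly (shell=False is the default), so metacharacters
    such as ';' or '$(...)' are never interpreted by /bin/sh."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```

Even with the shell out of the picture, validation still matters: without it, a value beginning with `-` would be handed to the helper as an option.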

There are some security mitigations we can apply to debugd itself:

- We can drop to a different uid/gid.
  - If we use a dedicated gid for debugd, we can take a lot of files that are currently world-readable and instead make them root:debugd 0640.
- We can chroot and put ourselves in a bare vfs namespace.
  - If we do this, we have to bring the things we need into our namespace with us, although we can make their mounts read-only.
  - This doesn't really buy us anything over seccomp-filter if our policy is appropriately tight, but eventually we might need to allow writes for some debug tools, which would make this a good line of defense.
- We can seccomp-sandbox ourselves with syscall filtering, since we should only need to do a fairly restrictive set of things.
  - This will probably involve a lot of effort. Tracking down which syscalls various helper programs use and keeping the filter policy up-to-date will take time.
  - The decrease in kernel and platform (filesystem permissions, etc.) attack surface gained is worth it.
- We can set rlimits, if we feel so inclined.
  - The particular gain we might get here is that we can restrict the number of outstanding helper programs we can have running at a time, which might avoid systemwide denial-of-service attacks.
  - On the other hand, it opens us up to much easier denial of service against the debug daemon. The debug daemon would have to kill helper programs that ran past a certain time limit, but perhaps it has to do this already.
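The time-limit idea from the rlimits discussion can be sketched as follows: the parent kills a helper that outlives a wall-clock deadline, and the child additionally gets a CPU-time rlimit before it execs. This is an assumption-laden illustration, not debugd's mechanism. The function names and the 5-second cap are hypothetical, and it caps CPU time rather than the number of outstanding helpers mentioned above.

```python
import resource
import subprocess


def _cap_cpu_seconds(seconds=5):
    """Runs in the child between fork() and exec(): lower RLIMIT_CPU,
    never exceeding the hard limit the process already has."""
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    if hard != resource.RLIM_INFINITY:
        seconds = min(seconds, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds))


def run_with_deadline(argv, seconds):
    """Return (returncode, stdout), or (None, "") if the helper was killed
    for running past the deadline."""
    try:
        done = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=seconds,           # run() kills the child on timeout
            preexec_fn=_cap_cpu_seconds,
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout
```

The wall-clock deadline catches helpers that block, for example on the network. The CPU rlimit catches helpers that spin, even if the parent is too busy to enforce the deadline promptly.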

There are some mitigations we can't apply yet:

- We can't enable SECURE_NOROOT, since some of our helper programs (e.g. /bin/ping) are setuid. Fixing this is going to require some fairly major legwork.
- We can't use a pid namespace, because this destroys the crash reporter on 2.6.38. There's a patch floating around to fix this that we'd need to apply.
- We can't use a network namespace, because some of our tools (ping, traceroute) need access to the real network.

Testing Plan

We can broadly divide debugd's functionality into two classes for testing purposes: functions that generate new information (like ping or traceroute), and functions that return already-generated information (like reading information out of sysfs).

Functions that generate new information are often sensitive to the surrounding hardware/network environment. For example, pinging an outside host relies on working networking. We can sometimes test these functions by relying only on things we know exist in any sane test environment (like pinging 127.0.0.1 and making sure we get properly-formatted output), but some of them (3G status, for example) rely on hardware state, and for these we need a human to ensure the output lines up with hardware.

Functions that return already-generated information can be tested by using minijail's chroot-and-bind functionality to fake the already-generated information, then testing debugd's returns against the known fake data.
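The fake-data approach can be illustrated without minijail by making the filesystem root a parameter. Note that this swaps in a different technique: a path prefix stands in for a real chroot-and-bind so the sketch can run unprivileged. The function names, the sysfs path, and the fake contents in the example are all hypothetical.

```python
import os
import tempfile


def read_sysfs_value(path, root="/"):
    """Read a file of already-generated information, relative to `root`."""
    with open(os.path.join(root, path.lstrip("/"))) as f:
        return f.read().strip()


def make_fake_root(files):
    """Build a throwaway directory tree holding known fake data.
    `files` maps absolute-looking paths to file contents."""
    root = tempfile.mkdtemp()
    for path, content in files.items():
        full = os.path.join(root, path.lstrip("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w") as f:
            f.write(content)
    return root
```

A test then builds a fake root with, say, `{"/sys/class/net/eth0/address": "00:11:22:33:44:55\n"}` and checks that the reader returns exactly that value. With a real chroot the code under test would not need the `root` parameter at all, which is the advantage of the minijail approach.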

ellyjones: add more detail here