 # Crash Collectors

 For each major class of crash reports, we define a dedicated *collector*.
 This is a simple way to encapsulate all related logic in a single module.

 When we run [crash_reporter], depending on its mode, it simply iterates through
 all registered collectors.

 The [crash_collector.cc] code isn't a real collector; it's the base class that
 holds common logic for all collectors.
 Similarly, [user_collector_base.cc] isn't a real collector; it's the base class
 that holds common logic for all user-related collectors.

 The [core_collector] program is just a utility tool and not a collector in the
 same sense as the others.
 It probably should have used a different naming convention.

 [TOC]

 # Basic Operations

 Each collector is designed to generate and queue crash reports.
 They get uploaded periodically by [crash_sender].

 ## Computing Crash Severity

 As part of the crash report, each collector computes the severity of the crash.

 Crash severity is organized into 4 categories:

 *   **Fatal:** Crashes that significantly disrupt user experience (e.g.
     session termination, device reboot, Android system server crash).
 *   **Error:** Crashes that disrupt user flow in tangible ways where there
     are mitigations or workarounds (e.g. tab crash, video blackout, Android
     system app crash).
 *   **Warning:** Crashes with little to no user impact.
 *   **Info:** Not actual crashes, but diagnostic data uploaded through the
     crash reporting pipeline from devices in the field.

 The logic to compute the severity is dependent on the product the crash occurred
 on: Platform, ARC, UI, and Lacros. The computed severity and product are logged
 to UMA to help correlate crashes with metrics like user satisfaction (from HaTS).
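
As a rough illustration of how a severity and product pair might be modeled, here is a minimal sketch. The enum members mirror the categories and products named above; the rule table, the program names in it, and the `compute_severity` function are hypothetical and do not reflect the actual per-product logic.

```python
from enum import Enum


class CrashSeverity(Enum):
    """The four severity categories listed above."""
    FATAL = "Fatal"
    ERROR = "Error"
    WARNING = "Warning"
    INFO = "Info"


class Product(Enum):
    """Products that the severity logic distinguishes."""
    PLATFORM = "Platform"
    ARC = "ARC"
    UI = "UI"
    LACROS = "Lacros"


# Hypothetical rule table: (product, program name) -> severity.
# The real per-product rules are not described in this document.
_EXAMPLE_RULES = {
    (Product.PLATFORM, "kernel"): CrashSeverity.FATAL,
    (Product.UI, "example-tab"): CrashSeverity.ERROR,
}


def compute_severity(product, exec_name):
    """Return the (severity, product) pair that would be logged together.

    Severity is None when the illustrative table has no matching rule.
    """
    return _EXAMPLE_RULES.get((product, exec_name)), product
```

Returning the product alongside the severity reflects that both values are logged together.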

 # Boot Collectors

 These are the collectors that run once at boot.
 They are triggered via the [crash-boot-collect.conf] init service.
 They do not, by design, block the boot of the system.
 They are run in the background as a non-critical service.

 ## bert_collector

 This collects Boot Error Record Table ([BERT]) failures.

 The dump collected might be referred to as `bertdump`.

 *   Unhandled firmware errors that occurred in the previous boot are stored in
     the boot error region.
 *   The Kernel ACPI sysfs interface generates the BERT table at
     `/sys/firmware/acpi/tables/BERT` and BERT data at
     `/sys/firmware/acpi/tables/data/BERT`.
 *   During boot, if a BERT report exists, we read it and create a report.

 ## ec_collector

 This collects [EC] (ChromeOS Embedded Controller) failures.

 The program name is `embedded-controller` and might be referred to as `eccrash`.

 *   The kernel driver [cros_ec_debugfs.c] sets up a debugfs path at
     `/sys/kernel/debug/cros_ec/`.
 *   The driver probes the [EC] to see if it has any panic logs.
 *   If the logs exist, the `/sys/kernel/debug/cros_ec/panicinfo` is created.
 *   During boot, if that file exists, we read it and create a report.

 [cros_ec_debugfs.c]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-4.14/drivers/platform/chrome/cros_ec_debugfs.c

 ## ephemeral_crash_collector

 This is a meta crash collector: it collects already collected ephemeral crashes
 into persistent storage. This is useful for handling crash reports in
 situations where we may not have access to persistent storage (e.g. early boot).

 ## gsc_collector

 This collects Google Security Chip (GSC) failures.

 *   Uses `gsctool` to query the GSC Flash Logs for any crashes.
 *   During boot, if `gsctool` Flash Log output contains a crash signature, we
     create a report.

 ## kernel_collector

 This collects kernel (and BIOS) crashes that caused the system to reboot.

 It is built on top of [pstore] and doesn't support any other data source.
 We currently support the `ramoops` and `efi` backend drivers.

 The program name is `kernel` and might be referred to as `kcrash`.

 *   The BIOS/AP firmware maintains some dedicated space to hold a snippet of the
     kernel log.
     It makes sure not to clear it during reboot in case there's valid data.
     *   For `ramoops`, CrOS firmware (e.g. coreboot) dedicates a chunk of RAM.
     *   For `efi`, the EFI firmware provides data in its own NVRAM space.
 *   While the kernel is running normally, a circular buffer is used to hold the
     most recent portion of the kernel log buffer (i.e. what `printk` writes to
     and what `dmesg` reads from).
 *   When the kernel reboots unexpectedly (e.g. due to a panic, oops, or BUG()),
     that error message is saved by [pstore] to the persistent location.
 *   If the watchdog reset the system, we won't have an explicit panic message,
     but we will have the last snippet of the kernel log buffer.
 *   During the next boot, the firmware makes sure that space is not reset.
 *   While the kernel boots, the [pstore] driver will check that common space to
     see if there are any valid records.
     All valid records are made available via files in `/sys/fs/pstore/`.
 *   During userspace boot, those paths are checked and reports are created.
 *   For panics the kernel handled, we'll read the logs from `dmesg-ramoops-*`
     & `dmesg-efi-*`, and generate a report for each one.
 *   Stack traces created by the kernel are analyzed to create a stack for the
     server, as well as generate a hash/fingerprint to correlate other reports.
 *   For watchdog resets, we'll first query the eventlog (from [elogtool]) to see
     if the reset was actually due to that.
     Normally we'd query the watchdog driver directly, but not all platforms are
     able to support that properly via the kernel driver.
     We'll create a simpler report using the last snippet of the kernel log from
     `console-ramoops-*` and hope the events just before the reset are enough to
     triage the problem.
 *   As records are processed, they get removed from the pstore area.
 *   On systems with coreboot BIOS, we also collect the BIOS log. This may be
     helpful to debug crashes when the kernel interacted with runtime firmware.
 *   coreboot maintains a ring buffer (the "CBMEM console") of log messages in a
     memory area that is considered reserved by the kernel. The buffer is never
     erased unless the memory loses power (i.e. if the system fully shuts down),
     and is usually large enough to hold messages from several prior boots.
 *   The collector will search the BIOS log for "banner" strings printed by
     coreboot on boot to determine where a reboot occurred. It will only collect
     log lines from the boot prior to the current one (i.e. the one that the
     crash occurred in).
 *   On arm64 systems, we also attempt to collect runtime firmware (BIOS)
     crashes. This is done by the kernel collector since runtime firmware mostly
     does things when requested by the kernel, and errors in runtime firmware are
     usually triggered by how the kernel calls it. Because the BIOS generally
     doesn't log very much after boot, we need both the kernel and BIOS logs to
     understand a runtime firmware crash. Since the kernel collector
     already has the logic to collect both of these, it makes sense for it to
     handle BIOS crash collection as well.
 *   On arm64, the runtime firmware is a piece of code called BL31 from the Arm
     Trusted Firmware project. BL31 logs crashes by dumping all CPU registers and
     knows how to append to the coreboot CBMEM console. Since we do not have the
     infrastructure to generate a full stack trace in firmware, we file these
     crash reports with a poor man's crash signature that just encodes the
     address of the program counter where the crash occurred.
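
The banner-based extraction of the previous boot's BIOS log can be sketched as follows. This is a simplified illustration: it assumes exactly one banner line per boot, and the banner pattern is passed in as a parameter because the real banner text is not given here. The function name `previous_boot_lines` is hypothetical.

```python
import re


def previous_boot_lines(lines, banner_re):
    """Return the log lines belonging to the boot before the current one.

    Assumes (for illustration only) that each boot starts with exactly one
    line matching banner_re. The last banner marks the current boot, so the
    previous boot spans from the second-to-last banner up to the last one.
    """
    starts = [i for i, line in enumerate(lines) if banner_re.search(line)]
    if len(starts) < 2:
        # Only the current boot (or nothing) is in the buffer.
        return []
    return lines[starts[-2]:starts[-1]]


# Example with a made-up banner string.
example_log = [
    "BANNER boot A", "a1",
    "BANNER boot B", "b1", "b2",
    "BANNER boot C", "c1",
]
example_result = previous_boot_lines(example_log, re.compile(r"^BANNER"))
```

Here `example_result` holds the lines of "boot B", since "boot C" is treated as the current boot.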

 ## unclean_shutdown_collector

 Collects unclean shutdown events.

 *   On every boot, crash_reporter is run.
     It creates a file (`/var/lib/crash_reporter/pending_clean_shutdown`) to
     indicate that the system hasn't gone through a clean shutdown.
 *   Upon clean shutdown ([chromeos_shutdown]), crash_reporter is run with the
     `--clean_shutdown` flag.
     The stateful partition file is removed to indicate the system has gone
     through a clean shutdown.
 *   If during boot, the file already exists before crash_reporter attempts to
     create it, this signifies that the system hadn't shut down cleanly.
     A signal is enqueued for metrics_daemon to emit user metrics about this
     unclean shutdown.
 *   No crash reports are otherwise generated for unclean shutdowns since it's
     not clear how we'd triage this in the first place (i.e. what to report).
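
The marker-file protocol above can be sketched in a few lines. This is only an illustration of the logic, not the collector's code; the function names are hypothetical, and the marker path is passed in so the sketch can be tried against a temporary directory.

```python
from pathlib import Path

# Path named in the text above; shown for reference only.
DEFAULT_MARKER = Path("/var/lib/crash_reporter/pending_clean_shutdown")


def boot_check(marker: Path) -> bool:
    """Run at boot. Returns True if the previous shutdown was unclean.

    If the marker already exists, the last session never removed it, so the
    shutdown was unclean. Either way, (re)create it for this session.
    """
    was_unclean = marker.exists()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return was_unclean


def clean_shutdown(marker: Path) -> None:
    """Run at clean shutdown: remove the marker (requires Python 3.8+)."""
    marker.unlink(missing_ok=True)
```

A boot that follows `clean_shutdown()` sees no marker and reports a clean shutdown; two consecutive `boot_check()` calls without a `clean_shutdown()` in between report an unclean one.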

 # Runtime Collectors

 Here are the collectors that are triggered on demand while the OS is running.
 They are invoked either by the kernel or by other programs.

 ## arc_java_collector

 Collects Java crashes from programs inside the [ARC++] container or [ARCVM].

 ## arcpp_cxx_collector

 Collects crashes from Android NDK programs inside the [ARC++] container.
 It does not handle crashes from [ARC++] support daemons that run outside of the
 container as those are collected like any other userland crash via the main
 [user_collector].

 [arcpp_cxx_collector] shares a lot of code with [user_collector] so it can overlay
 [ARC++]-specific processing details.

 ## arcvm_kernel_collector

 Collects crashes of the Android Linux kernel in [ARCVM].

 When the ARCVM Linux kernel crashes, it dumps logs to
 `/sys/fs/pstore/dmesg-ramoops-0` in ARCVM.  It's a [pstore] file, so the
 backend exists on ChromeOS as `/home/root/<hash>/crosvm/*.pstore`.
 [arcvm_kernel_collector] receives the content of this file from
 ArcCrashCollector and the ARC bridge via Mojo (or possibly reads the
 ring buffer in the pstore file directly) and processes it.

 ## arcvm_cxx_collector

 Collects crashes of machine-code binaries (i.e. non-Java crashes, it's mainly
 crashes of C++ programs) in [ARCVM].

 When a machine-code binary crashes, the Linux kernel detects the crash and invokes
 `arc-native-crash-dispatcher` via `/proc/sys/kernel/core_pattern`.
 `arc-native-crash-dispatcher` calls `arc-native-crash-collector32` or
 `arc-native-crash-collector64`, which dumps crash files in
 `/data/vendor/arc_native_crash_reports` in ARCVM. A Java daemon,
 `ArcCrashCollector`, in ARCVM monitors this directory and, when new files
 appear, sends them to the ARC bridge of the Chrome browser via Mojo. Dump files
 are passed as FDs. Finally, the ARC bridge invokes `crash_reporter` with the FDs.

 ## chrome_collector

 Collects Chrome browser crashes.
 The browser will hand us the minidump directly, so we only attach system
 metadata and queue it.

 crash_reporter will be called by the kernel for Chrome crashes like any other
 [user_collector] crash, but we actually ignore these invocations.
 Chrome is supposed to catch the crash in its parent process and handle it
 itself; it links in [Google Breakpad] or [crashpad] directly to do so.
 This is because Chrome is better suited to know what memory regions to ignore
 (e.g. large heaps or file memory maps or graphics buffers), as well as what
 metadata to attach (e.g. the last URL visited, whether the process was a
 renderer, browser, plugin, or other kind of process, `chrome://flags`, etc...).
 Otherwise Chrome coredumps can easily consume 3GB+ of memory!

 This does mean the system may miss crashes if Chrome's handling itself is buggy.

 *** aside
 In much older versions of ChromeOS (sometime before R40), Chrome would not only
 handle creating its own crash reports, it would also handle uploading them.
 We changed that behavior because Chrome's uploading is not as robust: it starts
 uploading immediately, lacks delays/rate limiting, it tries only once, and if it
 fails at all, it throws away the report entirely.
 By queueing the report with crash-reporter, it avoids all those problems.
 ***

 ## mount_failure_collector

 Collects information on failures to mount or unmount partitions. This is invoked
 via [chromeos_startup] or [chromeos_shutdown] when the umount/mount operation
 fails.

 TODO(sarthakkukreti): Expand on this section

 ## udev_collector

 This collects crash/error events triggered by [udev] events.
 It is invoked via the [udev rules] and relies heavily on callbacks in the
 [crash_reporter_logs.conf] file.

 The program name is `udev`.

 udev_collector also collects connectivity firmware dumps. Unlike generic
 coredumps, these are stored in the daemon-store directory of
 fbpreprocessord. The collected firmware dumps are attached to feedback reports
 instead of being uploaded to the crash-reporter server.

 These reports are largely device specific as they try to capture whatever state
 the device/firmware needs to triage.

 TODO: Add devcoredump details if we ever enable them.

 ### Bluetooth Firmware Crashdump Collector

 This collects a device crashdump whenever a bluetooth controller crash occurs.
 It is part of udev_collector. The program name is `bt_firmware`.

 *   Whenever a bluetooth controller crash occurs, the host bluetooth kernel
     stack captures the crashdump and reports it via the devcoredump interface.
 *   This triggers a udev event which is captured by [udev_collector] and sent to
     [udev_bluetooth_util] for further processing.
 *   This dump is then locally parsed by [bluetooth_devcd_parser] which extracts
     the necessary information like controller type, firmware version, program
     counter, stack trace, etc., and generates a unique signature.
 *   This parsed crashdump along with logs is stored in a spool directory as
     specified under `Crash Report Storage` in the [Crash Reporter] document.

 ## user_collector

 Collects all userland crashes where the kernel dumps core.
 Basically any program that segfaults, aborts, violates a seccomp policy, or is
 otherwise unceremoniously killed.

 *   When a process crashes, the kernel invokes crash_reporter with various
     important runtime attributes (e.g. the pid, the uid, etc...).
     The kernel writes a full core dump of the process to stdin.
 *   At this point, the failing process is frozen until crash_reporter exits.
     That means any parent that is monitoring the child won't be notified until
     we finish processing.
     This is often a critical path operation if a service needs to be restarted.
     *   Chrome reports are ignored normally; see the [chrome_collector] section
         for more details as to why.
 *   core2md is run to convert the full coredump to a minidump (`.dmp`).
     This process involves reading the core file contents to determine the
     number of threads, the register sets of all threads, and the contents of
     each thread's stack.
     This is fundamental to our out-of-process design.
 *   When a crash occurs, we consider the effective user ID of the process which
     crashed to determine where to save it.
     If the crashed process was running as `chronos`, we enqueue its crash to
     `/home/user/<user_hash>/crash/` which is on the user-specific cryptohome
     when a user is logged in since it might have user PII in it.
     If the crashed process was running as any other user, we enqueue the crash
     in `/var/spool/crash`.
 *   The name of the crashing program is used to determine if we should gather
     additional diagnostic information.
     [crash_reporter_logs.conf] contains a list of executables and shell commands
     to run to gather more details.
     Any output from them will automatically be attached to the crash report as
     a `.log` file.
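
The spool directory choice described above can be sketched as a small function. The function name `spool_dir` is hypothetical, and the fallback to `/var/spool/crash` for a `chronos` crash when no user is logged in is an assumption made for this example; the text above only covers the logged-in case.

```python
def spool_dir(crashed_user, user_hash):
    """Pick where a user crash is enqueued.

    crashed_user: name of the effective user of the crashed process.
    user_hash: cryptohome hash of the logged-in user, or None if nobody is
        logged in.
    """
    if crashed_user == "chronos" and user_hash is not None:
        # May contain user PII, so keep it in the user's cryptohome.
        return f"/home/user/{user_hash}/crash/"
    # Any other user. Also used here when no user is logged in, which is an
    # assumption of this sketch.
    return "/var/spool/crash"
```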

 ## vm_collector

 Used to process crash reports generated inside VMs. This is mostly a wrapper
 around writing the right collection of files to the right directory, as most
 useful crash information has to be gathered inside the VM, but it has
 responsibility for gathering any VM logs stored on the host.

 This collector writes to the new `/home/root/<user_hash>/crash` spool directory,
 as the daemons that interact directly with VMs to get crash information
 intentionally don't have the permissions required to access either of the
 existing spool directories.

 # Anomaly Detectors

 The [anomaly_detector] service is spawned early during boot via
 [anomaly-detector.conf].  It monitors various syslog files and tries to
 match a set of regexes. A match triggers a collection or a D-Bus signal,
 depending on the regex.

 A number of anomalies are sampled -- that is, we do not upload a report every
 time the anomaly occurs, but instead only 1 in every N times, where N is a value
 specific to that kind of crash. In this case, we generally also attach a
 "weight" field (with value N) to the crash report to indicate to the crash
 server that that report should count as N reports. This sampling is necessary in
 order to minimize load on the server and keep our total daily reports under 10
 million.
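
To make the sampling and weighting concrete, here is a minimal sketch. The function name `maybe_sample` and the dictionary representation of a report are hypothetical; only the "1 in N" sampling and the `weight` field with value N come from the description above.

```python
import random


def maybe_sample(report, n, rng=random):
    """Keep roughly 1 in every n reports and tag kept ones with weight n.

    Returns a copy of the report with a "weight" field, or None if this
    occurrence is dropped. rng may be a random.Random instance for
    reproducible runs.
    """
    if rng.randrange(n) != 0:
        return None
    sampled = dict(report)
    sampled["weight"] = n
    return sampled
```

Because each kept report counts as N, the sum of weights over many occurrences approximates the true number of occurrences while only about 1/N of them are uploaded.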

 *   Collection:

     *   [crash_reporter] is invoked for a specific collector, and is fed
         relevant lines via stdin.

 *   Signal:

     *   A D-Bus signal is emitted on a specific service.  Other processes may
         register for delivery of this signal. The service offers no
         methods.

 See sections below for more details on each collector and signal.

 The anomaly detector runs one collector at a time, and waits for it to
 finish running fully before processing more syslog entries.

 As a special case, only the first instance of each kernel warning is collected
 during a session (from boot to shutdown or crash).  A count of each warning is
 reported separately via a sparse UMA histogram.
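
The "first instance only, but count every instance" behavior for kernel warnings could look roughly like this. The class name `WarningDeduper` is hypothetical, and the in-memory counter merely stands in for the sparse UMA histogram.

```python
from collections import Counter


class WarningDeduper:
    """Tracks kernel warnings seen during one session."""

    def __init__(self):
        # Stand-in for the per-warning counts reported via UMA.
        self.counts = Counter()

    def should_collect(self, signature):
        """Count the warning; return True only for its first occurrence."""
        self.counts[signature] += 1
        return self.counts[signature] == 1
```

A fresh `WarningDeduper` would be created at each boot, so the same warning is collected again in a later session.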

 ## crash_reporter_failure_collector

 Collects log messages indicating that crash reporter itself crashed. Anomaly
 detector will invoke this collector at most once an hour, to prevent crash loops
 in crash reporter from generating an infinite set of calls to crash reporter.
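
The "at most once an hour" limit can be sketched as a simple rate limiter. The class and method names are hypothetical; timestamps are passed in explicitly so the logic can be exercised without waiting.

```python
class HourlyLimiter:
    """Allows an action at most once per min_interval_s seconds."""

    def __init__(self, min_interval_s=3600.0):
        self._min_interval_s = min_interval_s
        self._last_allowed_s = None

    def allow(self, now_s):
        """Return True if the collector may be invoked at time now_s."""
        if (self._last_allowed_s is not None
                and now_s - self._last_allowed_s < self._min_interval_s):
            return False
        self._last_allowed_s = now_s
        return True
```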

 ## generic_failure_collector

 Responsible for collecting information on suspend failures and
 service failures. The architecture is generic and adaptable: it allows
 arbitrary weights, log names for [crash_reporter_logs.conf], etc.

 You can use this with any anomaly that can be passed to crash_reporter as a
 single line, optionally with additional data collected via
 [crash_reporter_logs.conf].

 ### service failures

 Collects warnings from init (e.g. Upstart) for non-ARC services that failed
 to start up or exited unexpectedly at runtime.
 This catches syntax errors in the init scripts and daemons that simply exit
 non-zero but didn't otherwise trigger an abort or crash.

 The program name is `service-failure`.

 *   Lines from `init:` are processed.
 *   The standard upstart syntax is:
     `<daemon> <job phase> process (<pid>) terminated with status <status>`.
 *   All non-normal exits are recorded this way.
 *   The signature is constructed from the exit status and service name.
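
Parsing a line in the syntax above might look like the following sketch. The regular expression follows the quoted upstart format; the daemon name in the sample line and the exact layout of the signature string are made up for illustration, since the text only says the signature is built from the exit status and the service name.

```python
import re

# Matches: init: <daemon> <job phase> process (<pid>) terminated with status <status>
_LINE_RE = re.compile(
    r"init: (?P<daemon>\S+) (?P<phase>\S+) process \((?P<pid>\d+)\) "
    r"terminated with status (?P<status>\d+)"
)


def parse_service_failure(line):
    """Return parsed fields plus an illustrative signature, or None."""
    match = _LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Hypothetical signature layout: exit status followed by service name.
    fields["signature"] = f"exit{fields['status']}-{fields['daemon']}"
    return fields


# Sample line with a made-up daemon name.
example = parse_service_failure(
    "init: example-daemon main process (1234) terminated with status 2"
)
```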

 ### arc service failures

 Similar to the above "service failures" except that it collects ARC
 service failures. ARC services are services with names starting with "arc-".
 Separate logic is needed for ARC services because their system log
 messages are kept in a separate file, `/var/log/arc.log`.

 The program name is `arc-service-failure`.

 ### suspend failures

 When the system fails to suspend, we generate a report along with some log
 information on why the suspend failure happened.

 TODO(dbasehore): Expand on this section

 ### recovery failures

 When the cryptohome recovery process fails we generate a report with
 cryptohomed logs. This happens in these cases:
 *   Generation of the recovery request fails.
 *   Derivation of the recovery secret fails.

 The program name is `cryptohome`.

 ### auth failures

 When auth failures occurred during the previous life cycle of tcsd, we
 generate a report along with the failed TPM commands.

 TODO(chingkang): Expand on this section

 ### modem failures

 When the modem rejects a user request to perform an operation on the modem, we
 generate a report (e.g. failure to connect to a network).

 ### tethering failures

 When tethering fails due to an unexpected error, we generate a report
 (e.g. session closed due to internal error or downstream link down).

 ## kernel_warning_collector

 Collects WARN() messages from anywhere in the depths of the kernel.
 Could be drivers, subsystems, or core logic.

 The program name is `kernel-warning` or `kernel-xxx-warning` (where `xxx` is a
 common subsystem/area) and might be referred to as `kcrash`.

 *   Whenever the kernel uses `WARN()` or `WARN_ON(...)` or any similar helper,
     it generates a standard log message including stack traces.
 *   By default, `kernel-warning` is used everywhere, but the location of drivers
     in the backtrace are used to further refine the name.
 *   The stack signature uses the same algorithm as the [kernel_collector].

 ## missed_crash_collector

 Invoked via [crash_reporter_parser]. Collects log information when the kernel
 invokes [crash_reporter] for a Chrome crash, but then Chrome does not invoke
 crash_reporter within a reasonable timeframe (currently, 60 seconds).
 Includes chrome logs and syslogs.
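
The timeout logic can be illustrated with a small tracker. This sketch assumes that the kernel invocation and the later Chrome invocation are matched by pid, which is an assumption of the example; the class and method names are hypothetical.

```python
class MissedCrashTracker:
    """Flags Chrome crashes the kernel saw but Chrome never reported."""

    def __init__(self, timeout_s=60.0):
        self._timeout_s = timeout_s
        self._pending = {}  # pid -> time the kernel invoked crash_reporter

    def kernel_invoked(self, pid, now_s):
        """Record that the kernel invoked crash_reporter for a Chrome pid."""
        self._pending[pid] = now_s

    def chrome_invoked(self, pid):
        """Chrome reported the crash itself, so it is no longer pending."""
        self._pending.pop(pid, None)

    def expire(self, now_s):
        """Return pids whose timeout elapsed without a Chrome invocation."""
        missed = [pid for pid, started in self._pending.items()
                  if now_s - started >= self._timeout_s]
        for pid in missed:
            del self._pending[pid]
        return missed
```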

 ## selinux_violation_collector

 Collects [SELinux] policy violations.

 The program name is `selinux-violation`.

 *   Lines from the audit subsystem are processed.
 *   Fields from each line are extracted (such as `name=` and `scontext=`) and
     used to create the magic signature.
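
Field extraction from an audit line can be sketched with a `key=value` regular expression. The sample line is fabricated for illustration, and the way `scontext` and `name` are joined into a signature is hypothetical; only the use of those fields comes from the description above.

```python
import re

# key=value pairs where the value is either quoted or a run of non-spaces.
_FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(audit_line):
    """Return a dict of key=value fields, with surrounding quotes removed."""
    return {key: value.strip('"') for key, value in _FIELD_RE.findall(audit_line)}


def make_signature(fields):
    """Hypothetical signature layout built from scontext and name."""
    return "-".join(fields.get(key, "unknown") for key in ("scontext", "name"))


# Fabricated sample line, for illustration only.
sample_line = (
    'audit: type=1400 audit(1700000000.123:45): avc:  denied  { read } for  '
    'pid=321 comm="example" name="config" scontext=u:r:example:s0 '
    'tcontext=u:object_r:example_file:s0 tclass=file permissive=0'
)
sample_fields = extract_fields(sample_line)
sample_signature = make_signature(sample_fields)
```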


 ## Out-Of-Memory kill signal (OOM kill)

 On detection of OOM-kill attempts in the kernel, [anomaly_detector] sends a
 D-Bus signal on `/org/chromium/AnomalyEventService`.  This is currently used by
 [memd] to collect a number of memory-manager related stats and events.

 [anomaly_detector] does not try to confirm that the kill is successful.

 [ARC++]: ../../arc/
 [ARCVM]: ../../arc/vm/
 [BERT]: https://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
 [EC]: https://chromium.googlesource.com/chromiumos/platform/ec
 [elogtool]: https://review.coreboot.org/plugins/gitiles/coreboot/+/HEAD/util/cbfstool/
 [Google Breakpad]: https://chromium.googlesource.com/breakpad/breakpad
 [Crash Reporter]: ../README.md
 [crashpad]: https://chromium.googlesource.com/crashpad/crashpad
 [memd]: ../../metrics/memd/
 [pstore]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/v4.17/Documentation/admin-guide/ramoops.rst
 [SELinux]: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
 [udev]: https://en.wikipedia.org/wiki/Udev

 [anomaly_detector]: ../anomaly_detector.cc
 [anomaly-detector.conf]: ../init/anomaly-detector.conf
 [arc_java_collector]: ../arc_java_collector.cc
 [arcpp_cxx_collector]: ../arcpp_cxx_collector.cc
 [arcvm_kernel_collector]: ../arcvm_kernel_collector.cc
 [arcvm_cxx_collector]: ../arcvm_cxx_collector.cc
 [bert_collector]: ../bert_collector.cc
 [chrome_collector]: ../chrome_collector.cc
 [chromeos_startup]: ../../init/chromeos_startup
 [chromeos_shutdown]: ../../init/chromeos_shutdown
 [core_collector]: ../core-collector/
 [crash-boot-collect.conf]: ../init/crash-boot-collect.conf
 [crash_collector.cc]: ../crash_collector.cc
 [crash_reporter]: ../crash_reporter.cc
 [crash_reporter_logs.conf]: ../crash_reporter_logs.conf
 [crash_reporter_parser]: ../crash_reporter_parser.cc
 [crash_sender]: ../crash_sender.cc
 [ec_collector]: ../ec_collector.cc
 [kernel_collector]: ../kernel_collector.cc
 [kernel_warning_collector]: ../kernel_warning_collector.cc
 [selinux_violation_collector]: ../selinux_violation_collector.cc
 [service_failure_collector]: ../service_failure_collector.cc
 [udev rules]: ../99-crash-reporter.rules
 [udev_collector]: ../udev_collector.cc
 [udev_bluetooth_util]: ../udev_bluetooth_util.cc
 [bluetooth_devcd_parser]: ../bluetooth_devcd_parser.cc
 [unclean_shutdown_collector]: ../unclean_shutdown_collector.cc
 [user_collector]: ../user_collector.cc
 [user_collector_base.cc]: ../user_collector_base.cc