For each major class of crash reports, we define a dedicated collector. This is a simple way to encapsulate all related logic in a single module.
When we run crash_reporter, depending on its mode, it simply iterates through all registered collectors.
The crash_collector.cc code isn't a real collector; it's the base class that holds common logic for all collectors. Similarly, user_collector_base.cc isn't a real collector; it's the base class that holds common logic for all user-related collectors.
The core_collector program is just a utility tool and not a collector in the same sense as the others. It probably should have used a different naming convention.
Each collector is designed to generate and queue crash reports. They get uploaded periodically by crash_sender.
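As a rough illustration of this structure, each collector can be thought of as exposing a common interface that the main binary iterates over. The class and function names below are hypothetical, not the actual crash_reporter API; this is only a sketch of the pattern.

```cpp
// Hypothetical sketch of the collector pattern described above; the real
// classes live in crash_collector.cc and friends and differ in detail.
#include <memory>
#include <string>
#include <vector>

class CollectorSketch {
 public:
  virtual ~CollectorSketch() = default;
  // Returns true if a report was generated and queued for crash_sender.
  virtual bool Collect() = 0;
  virtual std::string name() const = 0;
};

void RunAllCollectors(
    const std::vector<std::unique_ptr<CollectorSketch>>& collectors) {
  for (const auto& collector : collectors) {
    // Each collector encapsulates its own logic; a failure in one does not
    // prevent the others from running.
    collector->Collect();
  }
}
```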
These are the collectors that run once at boot. They are triggered via the crash-boot-collect.conf init service. They do not, by design, block the boot of the system. They are run in the background as a non-critical service.
This collects Boot Error Record Table (BERT) failures.
The dump collected might be referred to as bertdump. The kernel exposes the BERT table at /sys/firmware/acpi/tables/BERT and the BERT data at /sys/firmware/acpi/tables/data/BERT; the collector reads both.
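A minimal sketch of the sysfs side of this, assuming we simply read both files and treat a non-empty table as a sign that a BERT dump is present (the real collector does more validation):

```cpp
// Illustrative only: read the BERT table and its data from sysfs.
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

std::optional<std::string> ReadFile(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return std::nullopt;
  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}

std::optional<std::string> ReadBertDump() {
  auto table = ReadFile("/sys/firmware/acpi/tables/BERT");
  auto data = ReadFile("/sys/firmware/acpi/tables/data/BERT");
  if (!table || !data || table->empty()) return std::nullopt;
  // A real collector would parse the table header before trusting the data.
  return *table + *data;
}
```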
This collects EC (Chrome OS Embedded Controller) failures.

The program name is embedded-controller and might be referred to as eccrash. The EC panic information is exposed via debugfs under /sys/kernel/debug/cros_ec/; a report is generated when /sys/kernel/debug/cros_ec/panicinfo is created.
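Sketched in the same spirit (hypothetical helper and output path, not the real ec_collector code), the collection step boils down to checking for the debugfs file and copying its contents into a report:

```cpp
// Illustrative only: collect EC panic info if the kernel created the file.
#include <filesystem>
#include <fstream>

bool MaybeCollectEcPanic() {
  const std::filesystem::path kPanicInfo =
      "/sys/kernel/debug/cros_ec/panicinfo";
  if (!std::filesystem::exists(kPanicInfo)) {
    return false;  // No EC panic info available; nothing to report.
  }
  std::ifstream in(kPanicInfo, std::ios::binary);
  std::ofstream out("eccrash.dump", std::ios::binary);  // Stand-in for the real queue path.
  out << in.rdbuf();
  return true;
}
```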
This is a meta crash collector: it collects already-collected ephemeral crashes into persistent storage. This is useful for handling crash reports in situations where we may not have access to persistent storage (e.g. early boot).
This collects kernel (and BIOS) crashes that caused the system to reboot.
It is built on top of pstore and doesn't support any other data source. We currently support the ramoops and efi backend drivers.

The program name is kernel and might be referred to as kcrash.
For ramoops, CrOS firmware (e.g. coreboot) dedicates a chunk of RAM. For efi, the EFI firmware provides data in its own NVRAM space. When the kernel crashes, its log buffer (what printk writes to and what dmesg reads from) is saved into that backing store, and at the next boot the records appear under /sys/fs/pstore/. We look for dmesg-ramoops-* & dmesg-efi-* records and generate a report for each one. If only console output is available, we fall back to console-ramoops-* and hope the events just before the reset are enough to triage the problem.
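A simplified sketch of the scan over /sys/fs/pstore/ (record parsing, compression handling, and report formatting are omitted, and the function name is made up):

```cpp
// Illustrative only: find pstore records worth reporting.
#include <filesystem>
#include <string>
#include <vector>

std::vector<std::filesystem::path> FindKernelCrashRecords() {
  namespace fs = std::filesystem;
  std::vector<fs::path> panics, consoles;
  for (const auto& entry : fs::directory_iterator("/sys/fs/pstore")) {
    const std::string name = entry.path().filename().string();
    if (name.rfind("dmesg-ramoops-", 0) == 0 ||
        name.rfind("dmesg-efi-", 0) == 0) {
      panics.push_back(entry.path());    // One report per panic record.
    } else if (name.rfind("console-ramoops-", 0) == 0) {
      consoles.push_back(entry.path());  // Fallback when no panic record exists.
    }
  }
  return panics.empty() ? consoles : panics;
}
```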
Collects unclean shutdown events.

At boot, a file is created on the stateful partition (/var/lib/crash_reporter/pending_clean_shutdown) to indicate that the system hasn't gone through a clean shutdown. On a clean shutdown, crash_reporter is invoked with the --clean_shutdown flag and the stateful partition file is removed to indicate the system has gone through a clean shutdown. If the file is still present at the next boot, the previous shutdown is counted as unclean.
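The marker-file protocol is simple enough to sketch directly (path as above; the function names are hypothetical):

```cpp
// Illustrative only: the pending_clean_shutdown marker-file protocol.
#include <filesystem>
#include <fstream>

const char kMarker[] = "/var/lib/crash_reporter/pending_clean_shutdown";

bool WasShutdownUnclean() {        // Called early at boot, before MarkBoot().
  return std::filesystem::exists(kMarker);
}

void MarkBoot() {                  // Called at boot, after the check.
  std::ofstream marker(kMarker);   // Creating the stream creates the marker file.
}

void MarkCleanShutdown() {         // Called via the --clean_shutdown flag.
  std::filesystem::remove(kMarker);
}
```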
Here are the collectors that are triggered on demand while the OS is running. They are invoked either by the kernel or by other programs.
Collects Java crashes from programs inside the ARC++ container or ARCVM.
Collects crashes from Android NDK programs inside the ARC++ container. It does not handle crashes from ARC++ support daemons that run outside of the container as those are collected like any other userland crash via the main user_collector.
arcpp_cxx_collector shares a lot of code with user_collector so it can overlay ARC++-specific processing details.
Collects crashes of the Android Linux kernel in ARCVM.
When the ARCVM Linux kernel crashes, it dumps logs to /sys/fs/pstore/dmesg-ramoops-0
in ARCVM. It's a pstore file, so the backing file exists on Chrome OS as /home/root/<hash>/crosvm/*.pstore. arcvm_kernel_collector receives the content of this file from ArcCrashCollector and the ARC bridge via Mojo (or possibly reads the ring buffer in the pstore file directly) and processes it.
Collects crashes of machine-code binaries (i.e. non-Java crashes, mainly from C++ programs) in ARCVM.
When a machine-code binary crashes, the Linux kernel detects the crash and invokes arc-native-crash-dispatcher via /proc/sys/kernel/core_pattern. arc-native-crash-dispatcher calls arc-native-crash-collector32 or arc-native-crash-collector64, which dump crash files in /data/vendor/arc_native_crash_reports in ARCVM. A Java daemon, ArcCrashCollector, in ARCVM monitors this directory and, when new files appear, sends them to the ARC bridge of the Chrome browser via Mojo. Dump files are passed as FDs. Finally, the ARC bridge invokes crash_reporter with the FDs.
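As a rough sketch of the watch-and-forward step, here is a simple polling scan standing in for the actual Java ArcCrashCollector daemon and its Mojo plumbing (directory path as above; everything else is invented for illustration):

```cpp
// Illustrative only: scan the dump directory and hand new files over as FDs.
#include <fcntl.h>
#include <unistd.h>
#include <filesystem>
#include <functional>
#include <set>
#include <string>

void ScanDumpDirectoryOnce(
    const std::string& dir, std::set<std::string>* already_forwarded,
    const std::function<void(int fd, const std::string& name)>& forward) {
  for (const auto& entry : std::filesystem::directory_iterator(dir)) {
    const std::string name = entry.path().filename().string();
    if (!already_forwarded->insert(name).second) continue;  // Seen earlier.
    int fd = open(entry.path().c_str(), O_RDONLY);
    if (fd >= 0) {
      forward(fd, name);  // In the real flow this crosses Mojo to the ARC bridge.
      close(fd);
    }
  }
}
```

A caller would invoke this periodically (or hook a real file watcher) with `/data/vendor/arc_native_crash_reports` and a callback that forwards the FD.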
Collects Chrome browser crashes. The browser will hand us the minidump directly, so we only attach system metadata and queue it.
crash_reporter will be called by the kernel for Chrome crashes like any other user_collector crash, but we actually ignore these invocations. Chrome is supposed to catch the crash in its parent process and handle it itself; it links in Google Breakpad or Crashpad directly to do so. This is because Chrome is better suited to know what memory regions to ignore (e.g. large heaps, file memory maps, or graphics buffers), as well as what metadata to attach (e.g. the last URL visited; whether the process was a renderer, browser, plugin, or other kind of process; chrome://flags; etc.). Otherwise Chrome coredumps can easily consume 3GB+ of memory!
This does mean the system may miss crashes if Chrome's handling itself is buggy.
Collects information on failures to mount or unmount partitions. This is invoked via chromeos_startup or chromeos_shutdown when the umount/mount operation fails.
TODO(sarthakkukreti): Expand on this section
This collects crash/error events triggered by udev events. It is invoked via the udev rules and relies heavily on callbacks in the crash_reporter_logs.conf file.
The program name is udev.
These reports are largely device specific as they try to capture whatever state the device/firmware needs to triage.
TODO: Add devcoredump details if we ever enable them.
Collects all userland crashes where the kernel dumps core. Basically any program that segfaults, aborts, violates a seccomp policy, or is otherwise unceremoniously killed.
The kernel hands us the core dump, which we convert into a minidump (.dmp). This process involves reading the core file contents to determine the number of threads, the register sets of all threads, and the threads' stacks' contents. This is fundamental to our out-of-process design. If the crashed process was running as chronos, we enqueue its crash to /home/user/<user_hash>/crash/, which is on the user-specific cryptohome when a user is logged in, since it might have user PII in it. If the crashed process was running as any other user, we enqueue the crash in /var/spool/crash. Relevant log snippets may also be gathered and saved to a .log file.
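A sketch of the spool-directory decision described above; identifying the crashing user is reduced to a uid lookup here, and the real logic handles more cases:

```cpp
// Illustrative only: pick the crash spool directory based on the crashing user.
#include <pwd.h>
#include <sys/types.h>
#include <string>

std::string ChooseCrashDirectory(uid_t crashing_uid,
                                 const std::string& user_hash) {
  struct passwd* pw = getpwuid(crashing_uid);
  if (pw && std::string(pw->pw_name) == "chronos" && !user_hash.empty()) {
    // Chrome/user crashes may contain PII, so they stay inside the cryptohome.
    return "/home/user/" + user_hash + "/crash/";
  }
  return "/var/spool/crash";  // Crashes from other users go to the system spool.
}
```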
Used to process crash reports generated inside VMs. This is mostly a wrapper around writing the right collection of files to the right directory, as most useful crash information has to be gathered inside the VM, but it has responsibility for gathering any VM logs stored on the host.
This collector writes to the new /home/root/<user_hash>/crash
spool directory, as the daemons that interact directly with VMs to get crash information intentionally don't have the permissions required to access either of the existing spool directories.
The anomaly_detector service is spawned early during boot via anomaly-detector.conf. It monitors various syslog files and tries to match a set of regexes. A match triggers a collection or a D-Bus signal, depending on the regex.
A number of anomalies are sampled -- that is, we do not upload a report every time the anomaly occurs, but instead only 1 in every N times, where N is a value specific to that kind of crash. In this case, we generally also attach a “weight” field (with value N) to the crash report to indicate to the crash server that that report should count as N reports. This sampling is necessary in order to minimize load on the server and keep our total daily reports under 10 million.
Collection: most of the detectors below invoke crash_reporter to generate a report.
Signal: the OOM-kill detector instead emits a D-Bus signal.
See the sections below for more details on each collector and signal.
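A toy version of the regex matching and 1-in-N sampling described above; the pattern and sample rate are invented for illustration, and the real detector keys N off the specific kind of anomaly:

```cpp
// Illustrative only: sample 1 in N matches and attach a weight to the report.
#include <cstdlib>
#include <optional>
#include <regex>
#include <string>

struct AnomalyReport {
  std::string signature;
  int weight;  // Tells the crash server to count this report N times.
};

std::optional<AnomalyReport> MaybeCollect(const std::string& syslog_line) {
  static const std::regex kPattern("some-daemon: fatal mishap: (.*)");  // Hypothetical.
  constexpr int kSampleRate = 50;  // Hypothetical N for this anomaly type.
  std::smatch match;
  if (!std::regex_search(syslog_line, match, kPattern)) return std::nullopt;
  if (rand() % kSampleRate != 0) return std::nullopt;  // Drop (N-1)/N of the matches.
  return AnomalyReport{match[1].str(), kSampleRate};
}
```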
The anomaly detector runs one collector at a time, and waits for it to finish running fully before processing more syslog entries.
As a special case, only the first instance of each kernel warning is collected during a session (from boot to shutdown or crash). A count of each warning is reported separately via a sparse UMA histogram.
Collects log messages indicating that crash reporter itself crashed. Anomaly detector will invoke this collector at most once an hour, to prevent crash loops in crash reporter from generating an infinite set of calls to crash reporter.
Responsible for collecting information on suspend failures and service failures. The architecture is generic and adaptable: it allows arbitrary weights, log names for crash_reporter_logs.conf, etc.
You can use this with any anomaly that can be passed to crash_reporter as a single line, optionally with additional data collected via crash_reporter_logs.conf.
Collects warnings from init (e.g. Upstart) for non-ARC services that failed to start up or exited unexpectedly at runtime. This catches syntax errors in the init scripts and daemons that simply exit non-zero but didn't otherwise trigger an abort or crash.
The program name is service-failure.

Only warnings from init: are processed. The messages look like: <daemon> <job phase> process (<pid>) terminated with status <status>.

Similar to the above “service failures” except that it collects ARC service failures. ARC services are services whose names start with “arc-”. Separate logic is needed because the ARC services' system log messages are kept in a separate file, /var/log/arc.log.
The program name is arc-service-failure.
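A sketch of parsing the init failure line shown above; the exact field handling in the real collector differs, and the struct here is only for illustration:

```cpp
// Illustrative only: parse
// "<daemon> <job phase> process (<pid>) terminated with status <status>".
#include <optional>
#include <regex>
#include <string>

struct ServiceFailure {
  std::string daemon;
  std::string phase;
  int pid;
  int status;
};

std::optional<ServiceFailure> ParseServiceFailure(const std::string& line) {
  static const std::regex kPattern(
      R"((\S+) (\S+) process \((\d+)\) terminated with status (\d+))");
  std::smatch m;
  if (!std::regex_search(line, m, kPattern)) return std::nullopt;
  return ServiceFailure{m[1].str(), m[2].str(), std::stoi(m[3].str()),
                        std::stoi(m[4].str())};
}
```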
When the system fails to suspend, we generate a report along with some log information on why the suspend failure happened.
TODO(dbasehore): Expand on this section
When there were auth failures during the previous life cycle of tcsd, we generate a report along with the failed TPM commands.
TODO(chingkang): Expand on this section
Collects WARN() messages from anywhere in the depths of the kernel. Could be drivers, subsystems, or core logic.
The program name is kernel-warning or kernel-xxx-warning (where xxx is a common subsystem/area) and might be referred to as kcrash.
Whenever the kernel uses WARN() or WARN_ON(...) or any similar helper, it generates a standard log message including stack traces. By default, kernel-warning is used everywhere, but the location of drivers in the backtrace is used to further refine the name.
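One way to picture the name refinement; the path-to-name mapping below is invented for illustration and does not reproduce the actual table:

```cpp
// Illustrative only: refine "kernel-warning" based on the warning's source path.
#include <string>

std::string RefineWarningName(const std::string& backtrace_top_path) {
  // e.g. a warning from "drivers/gpu/drm/..." might become "kernel-drm-warning".
  if (backtrace_top_path.find("drivers/gpu/drm/") != std::string::npos)
    return "kernel-drm-warning";    // Hypothetical mapping.
  if (backtrace_top_path.find("drivers/net/wireless/") != std::string::npos)
    return "kernel-wifi-warning";   // Hypothetical mapping.
  return "kernel-warning";          // Default when no subsystem is recognized.
}
```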
Invoked via crash_reporter_parser. Collects log information when the kernel invokes crash_reporter for a Chrome crash, but Chrome does not invoke crash_reporter itself within a reasonable timeframe (currently 60 seconds). Includes Chrome logs and syslogs.
Collects SELinux policy violations.
The program name is selinux-violation.

Fields from each violation's audit message (such as name= and scontext=) are extracted and used to create the magic signature.
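A sketch of that field extraction; the signature format and helper below are invented for illustration, and a real implementation would anchor the keys more carefully:

```cpp
// Illustrative only: build a signature from selected fields of an audit record.
#include <regex>
#include <string>

std::string ExtractField(const std::string& line, const std::string& key) {
  // Matches key=value where the value runs until the next space.
  std::smatch m;
  if (std::regex_search(line, m, std::regex(key + R"(=(\S+))"))) return m[1].str();
  return "";
}

std::string SelinuxSignature(const std::string& audit_line) {
  return ExtractField(audit_line, "scontext") + "-" +
         ExtractField(audit_line, "name");
}
```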
On detection of OOM-kill attempts in the kernel, anomaly_detector sends a D-Bus signal on /org/chromium/AnomalyEventService. This is currently used by memd to collect a number of memory-manager related stats and events.
anomaly_detector does not try to confirm that the kill is successful.