# ChromiumOS Crash Reporting (original design doc)
*2011-05-15*
*** note
This is the original design doc for this project.
Some aspects are out of date or no longer accurate, but it is still useful for
historical reference purposes.
We don't plan on updating or "fixing" content here as the current state of the
project is better reflected in other files.
Start with the [README](../README.md).
***
[TOC]
# Objective
We intend to design a system for generating statistics and diagnostic
information on a variety of crashes.
Our goals for devices that are opted into automatically submitting usage
feedback and crash reports to Google:
* Detect crashes in any system process and accumulate a User Metrics counter
of the number of user space crashes, usage time between crashes, and crashes
per day/week.
* Detect unclean shutdowns on the system and accumulate a User Metrics
counter of the number of unclean shutdowns, usage time between unclean
shutdowns, and unclean shutdowns per day/week.
* Detect kernel crashes (Linux kernel panics) on the system and accumulate a
User Metrics counter of the number of kernel crashes, usage time between
crashes, and crashes per day/week.
* Generate diagnostic information for any user space process that crashes with
enough information to generate stack traces server-side.
* Generate other diagnostic information that is application specific for
specific user space process crashes.
* Generate diagnostic information for any kernel crash (panic) with kernel
debug messages including the kernel-generated stack trace at the point of
crash.
* Upload diagnostics in spite of system failures (unusable components like
Chrome or X) and temporary network failures, while applying a system-wide
upload rate throttle.
* Store crash diagnostics for user-specific crashes in the cryptohome, which
is encrypted, so that diagnostics with sensitive information that must be
stored after the user logs out (because of a failure described above) cannot
be viewed by any other user.
This means we will need to upload such crash reports later, once the user has
provided their credentials and is logged in.
Our non-goals:
* Recognizing a wide variety of very bad user space problems such as a Chrome,
X, or Window Manager process that immediately exits and leaves the machine
unusable.
*> Are you talking about processes that exit with a failed assert, or
something else?
If the former, it seems like we'd be able to report that.*
*> I'm not opposed to adding this to the longer-term goals, but I'm not sure
what interface would be appropriate here - looking through syslogs for
errors, looking through application specific logs?*
# Background
Our goal is to provide a stable platform.
We need to be able to diagnose failures that do not occur in our labs or that
are otherwise hard to reproduce automatically.
## Existing User Space Crash Reporting
[Google Breakpad] is used by most Google applications on Mac, Windows, and Linux
for x86 and ARM architectures to send "minidump" crash reports: very small
reports that contain enough information to produce a stack trace of all threads
running at the time of the crash.
[Google Breakpad] does (as of Q1 2010) support ARM but is not yet used in
production.
Chrome in ChromeOS currently uses [Google Breakpad] and sends crash reports
with product ID "Chrome_ChromeOS".
The Canonical Ubuntu Linux project uses [Apport] to handle user space crashes.
This is a Python package that intercepts all core file writes, invokes gdb, and
collects stack dumps into a directory which it then sends out using an Anacron
job.
It relies on Python and debug information being present on the target.
The Linux kernel creates core files for processes that encounter unhandled
signals.
As of 2.6 kernels, the file location and naming can be customized by changing
[/proc/sys/kernel/core_pattern].
Once core files are written they can be manually inspected by a developer or
advanced user.
Additionally, this kernel parameter can be set to open a pipe to a user space
process, which then receives on its stdin the core file that would otherwise
have been written to disk.
We will rely on this mechanism to get diagnostic information and signaling for
all user space processes.
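As an illustration, a handler could be registered like this; the handler path
and argument layout below are hypothetical, while the `%` format specifiers
are documented in core(5):

```cpp
// Sketch only: register a core-dump pipe handler (hypothetical handler path).
#include <fstream>

int main() {
  std::ofstream pattern("/proc/sys/kernel/core_pattern");
  // A leading '|' tells the kernel to spawn the handler and stream the core
  // dump to its stdin instead of writing a core file.
  // %p = PID, %s = signal number, %u = real UID, %e = executable name.
  pattern << "|/sbin/crash_reporter --user=%p:%s:%u:%e";
  return pattern.good() ? 0 : 1;
}
```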
On Windows, Microsoft has created the [WINQUAL] service which allows developers
to retrieve crash diagnostics for their applications.
When a Windows application crashes and does not handle the crash itself, the
operating system prompts the user if they would like to send this particular
crash report, and uploads it upon receiving consent.
The [WINQUAL] service then aggregates and shows the reports.
Crash reports can be sent as full dumps or as minidumps using the same format
that [Google Breakpad] uses.
## Existing Kernel Space Crash Reporting
[Linux Kernel Crash Dump] ([LKCD]) is a set of kernel patches and a user-space
application that enables a panicked Linux kernel to write crash diagnostics to
its swap partition and then diagnose the crash and store it in a simplified
form.
It provides a command-line utility for diagnosing kernel state, but requires a
fairly large file to be uploaded to diagnose the running kernel state remotely.
The patch was last updated in 2005; it is an invasive kernel patch that is
difficult to maintain and will never go upstream.
A [kexec]-based dump is a method where the kernel "exec"s a new, known-stable
kernel into a reserved area of memory without performing a system reset first.
The new kernel writes out the state and data of the old kernel to disk, whilst
operating only from the reserved memory area.
Once all relevant state has been written, the rest of memory is reclaimed and
initialized and a full system boot is completed.
The patches for kexec are already upstream.
http://www.kerneloops.org/ collects crash dumps and provides a dashboard for
all kernel developers to find crashes common across all versions, as well as
specific to vendors/distributors.
Provided it has enough server-side capacity to handle crash dumps from
ChromeOS-scale numbers of machines, this is an option.
kerneloops provides a user space process that runs at startup, prompts the user
if they want to upload the kernel crash, and uploads an analyzed result.
Ubuntu uses [Apport] and [LKCD] to handle kernel crashes.
It invokes lcrash to perform crash analysis on the vmcore image.
## Existing Firmware Space Crash Reporting
[Firmware event logs] can be stored in non-volatile storage.
Traditionally, problems during firmware initialization, as well as kernel
panics and other problems, can be recorded here.
# Requirements and Scale
## Crash Diagnostic Information Collection
There can be different systems for recording kernel and user space crash
diagnostic information.
We ideally want stack traces at time of crash for either kind of crash.
Some kinds of kernel crashes (e.g. crashes in interrupt handlers) by their
nature will not be able to generate/persist any diagnostic information.
For user space crashes, we would need:
* Identification of executable
* The context (parameters, environment, cwd)
* Stack trace (this is nice to have but also difficult in general)
We will use rate limiting to avoid flooding our servers with crash diagnostics.
We will limit to 8 crash reports per machine per day.
We need to build executables and libraries with debug information and upload
the debug symbols to the crash server for any binaries whose stack traces we
want properly symbolized.
## Crash Statistics Collection
We would like to have statistics on how often crashes are occurring in the
field.
For every release of ChromeOS on every device we would like to know how
frequent unclean shutdowns, user space process crashes, and kernel crashes are.
Ideally we can know information on occurrences per individual user, for
instance, knowing that 1% of users experience over 5 kernel panics per week.
We will generate frequency data for these kinds of events in the course of a
day and per week.
## Protecting User Privacy
We must err on the side of getting too little information if the alternative is
to potentially send sensitive information from a user who has not enabled
reporting.
As such, we should be careful, for instance, to not send kernel core files as
the kernel core may have information for a variety of users.
We also must avoid sending log files that may capture the accumulated activities
of multiple users.
We will send a unique but anonymous identifier per device so that potentially
related crashes occurring on the same device can be correlated, and to help
eliminate crashes from buggy/broken devices.
User space processes which crash and which interact closely with the user, such
as Chrome, the window manager, entd, and others, are more likely to have
sensitive data in memory at the time of the crash.
For this reason, we encrypt the diagnostics stored to disk for all executables
which run as the Linux user 'chronos' (which means they are started when the
user logs in and terminated upon logout).
Since the encryption is based on the user's password, the only way a user's
crash diagnostics can be sent is when they are currently logged in to the
device.
# Design Ideas
We will separate kernel and user space diagnostic gathering in implementation.
Both, however, need to adhere to our EULA with the user.
During the out-of-box experience, the owner chooses whether crashes on this
device should be uploaded to Google servers.
We must never send a crash report if they do not give consent.
They may rescind their consent at any time: if a crash report was enqueued
while the user had consented, and consent is rescinded before the report is
sent, the crash report must be discarded.
## User space crash handling
* Upon a crash occurring, the kernel invokes crash_reporter, indicating the
name and process ID of the crashing process, and pipes a full core dump of the
process to its stdin.
* Chrome links in [Google Breakpad] directly and handles its own crashes and
uploads them.
It generates its own statistics for various kinds of crashes and sends
application specific information (last URL visited, whether the process was
a renderer, browser, plugin, or other kind of process).
The system crash handling mechanism will ignore Chrome crashes since they
are already handled internally.
* crash_reporter will invoke core2md to convert the full core dump to the
minidump format.
This process involves reading the core file contents to determine number of
threads, register sets of all threads, and threads' stacks' contents.
We created core2md by modifying [Google Breakpad] specifically for ChromeOS's
use.
[Google Breakpad] is normally linked directly into individual executables
whose crashes we want to generate diagnostics for.
Since ChromeOS has hundreds of these executables, catching signals can
interfere with executables' own code, and some executables are only
delivered to Google in binary form, we found the conversion of full core
files from the kernel to minidump files to be a superior way to generate
crash diagnostics for the entire system.
* When a crash occurs, we consider the effective user ID of the process which
crashed to indicate if the crash report should be encrypted due to having
higher risk of containing sensitive information.
If the crashed process was running as `chronos` we enqueue its crash to
`/home/chronos/user/crash` which is on the cryptohome when a user is logged
in and so it will be encrypted.
If the crashed process was running as any other user, we enqueue the crash
in `/var/spool/crash`.
* In the future, encrypted crash reports will go to
`/home/root/<user_hash>/crash/`. This directory is still part of the
cryptohome, but can be accessed without running as chronos. This will allow
both creating and uploading crash reports with lower privileges.
* The name of the crashing executable is used to determine whether we should
gather additional diagnostic information.
The file `/etc/crash_reporter.conf` contains a list of executables and shell
commands for them.
If an executable crashes which is listed in this file, the shell commands
listed will be executed as root and their output will be sent in the crash
report.
For instance when the update_engine (auto updater) daemon crashes, this
allows us to send the daemon's logs (listing attempts and application-level
logs) in the crash report.
* To enqueue a crash, we generate a .dmp file which is the minidump of the
crash.
We store the logs above in a .log file.
We store other information, such as the name of the executable that crashed
and its crash timestamp, in a .meta file.
These three files share the same basename to form one report.
The basename includes the crashing executable's name and the time of the crash
to help developers diagnose crashes on non-production devices (see the sketch
after this list).
* Crash statistics are generated by `crash_reporter` emitting a D-Bus signal
that `metrics_daemon` receives.
That daemon generates and emits user metrics to Chrome.
* These crash reports are sent by a crash sending agent that is also used by
the kernel crash collector.
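The enqueueing steps above can be sketched as follows; the directory paths are
the ones given in this document, but the function names and the exact .meta
keys are illustrative rather than the shipped format:

```cpp
// Sketch of spool-directory selection and report enqueueing (illustrative).
#include <pwd.h>
#include <sys/types.h>

#include <ctime>
#include <fstream>
#include <string>

std::string SpoolDirectoryFor(uid_t crashed_euid) {
  // Crashes of the logged-in user ("chronos") go to the encrypted
  // cryptohome; everything else goes to the system spool directory.
  const struct passwd* chronos = getpwnam("chronos");
  if (chronos != nullptr && crashed_euid == chronos->pw_uid)
    return "/home/chronos/user/crash";
  return "/var/spool/crash";
}

void EnqueueReport(uid_t crashed_euid, const std::string& exec_name) {
  // The basename (executable name + crash time) is shared by the .dmp,
  // .log, and .meta files that together form one report.
  const std::string basename = SpoolDirectoryFor(crashed_euid) + "/" +
                               exec_name + "." +
                               std::to_string(std::time(nullptr));
  std::ofstream meta(basename + ".meta");
  meta << "exec_name=" << exec_name << "\n"       // Illustrative keys.
       << "payload=" << basename << ".dmp\n"
       << "done=1\n";  // Written last so half-finished reports are skipped.
}
```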
## Termina virtual machine crashes
* ChromeOS has over time grown a number of virtual machines, including
Termina, a VM for running Linux applications that ChromeOS won't support
natively. The user space crash handling described above won't catch any
crashes there.
* Inside Termina we gather information about crashes using the normal
user space crash flow described above.
* Once the crash information is gathered inside Termina, instead of writing it
to a spool directory for the crash sender, it gets sent out of Termina to a
daemon (cicerone) running on the host.
* This daemon then invokes a VM collector on the host and passes it the
information from Termina, which writes out the crash report.
* Cicerone has intentionally limited privileges due to its interaction with
untrusted VMs, which means it (and any process it invokes) can't write
directly to the regular spool directories. Instead we write the crash report
to `/home/root/<user_hash>/crash/` which only requires being a member of the
group `crash-user-access`.
## Kernel crashes
* Upon a kernel panic occurring (which can happen when the kernel crashes with
unexpected memory accesses, and also on oops or BUG() calls), a procedure is
called which copies the current contents of the kernel debug buffer into a
region of memory called "kcrash" memory.
* This memory can be accessed from user space by reading the
`/sys/kernel/debug/preserved/kcrash` file.
* This kcrash memory is handled specially by the ChromeOS firmware when
reboots occur.
* Upon writing to this memory area, the kernel panic handler causes the system
to reboot.
* Upon restarting, crash_reporter checks the kcrash memory area, copies out
its data to a crash report, analyzes the crash report for stack traces that
signify the cause of the error, generates a hash/fingerprint of the stack,
and enqueues that information in the kernel crash report.
It then clears the kcrash area.
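A minimal sketch of the boot-time collection step, assuming the kcrash path
above; for brevity it fingerprints the whole dump, whereas the text above
describes hashing the extracted stack trace:

```cpp
// Sketch: collect a preserved kernel crash at boot (path from this doc).
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

int main() {
  std::ifstream kcrash("/sys/kernel/debug/preserved/kcrash");
  if (!kcrash)
    return 0;  // No preserved crash area: nothing to collect.
  std::stringstream buffer;
  buffer << kcrash.rdbuf();
  const std::string dump = buffer.str();
  if (dump.empty())
    return 0;  // The previous boot did not panic.
  // Fingerprint the crash so the server can bucket identical panics.
  const size_t signature = std::hash<std::string>{}(dump);
  // ... enqueue 'dump' and 'signature' as a kernel crash report, then
  // clear the kcrash area so the crash is reported exactly once.
  (void)signature;
  return 0;
}
```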
## Unclean shutdowns
* Upon startup, crash_reporter is run.
It creates a file on the stateful partition to indicate that the current state
is startup without clean shutdown.
* Upon clean shutdown, crash_reporter is run.
The stateful partition file is removed to indicate that a clean shutdown last
occurred.
* If, upon startup, the file already exists before crash_reporter attempts to
create it, the previous boot did not end with a clean shutdown.
This signals an unclean shutdown.
A signal is enqueued for metrics_daemon to emit user metrics about this
unclean shutdown (see the sketch after this list).
* No diagnostics are currently collected for unclean shutdowns.
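A minimal sketch of this marker-file protocol; the marker path is hypothetical
since only "a file on the stateful partition" is specified above:

```cpp
// Sketch of the unclean-shutdown marker file (hypothetical marker path).
#include <cstdio>
#include <fstream>

const char kMarkerPath[] = "/var/lib/crash_reporter/pending_clean_shutdown";

// Run at startup: returns true if the previous boot shut down uncleanly.
bool CheckForUncleanShutdown() {
  const bool unclean = std::ifstream(kMarkerPath).good();
  std::ofstream touch(kMarkerPath);  // (Re)create the marker for this boot.
  return unclean;
}

// Run at clean shutdown: removing the marker records an orderly exit.
void MarkCleanShutdown() {
  std::remove(kMarkerPath);
}
```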
## Crash sending agent
* Runs hourly and checks `/var/spool/crash`, `/home/chronos/user/crash`, and
`/home/root/<user_hash>/crash` for reports, sends those, and removes them if
successfully sent.
* Rate limits to 32 crash diagnostic uploads in 24 hours across the entire
system (see the sketch after this list).
* We rely upon the Google crash server to collect user space crash diagnostics
for further analysis.
We already know that it scales well to large numbers of Google Toolbar and
Chrome desktop users.
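The rate limit could be tracked with one timestamp file per past upload, as in
this sketch; the directory name and helper function are illustrative:

```cpp
// Sketch: system-wide upload rate limit via timestamp files (illustrative).
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

constexpr int kMaxUploadsPer24h = 32;
const fs::path kTimestampDir = "/var/lib/crash_sender";  // Hypothetical.

// Returns true if another upload is allowed, recording the attempt.
bool CheckAndRecordUpload() {
  const auto now = fs::file_time_type::clock::now();
  int recent = 0;
  for (const auto& entry : fs::directory_iterator(kTimestampDir)) {
    if (now - entry.last_write_time() < std::chrono::hours(24))
      ++recent;  // One empty file was touched per previous upload.
    else
      fs::remove(entry.path());  // Expire entries older than 24 hours.
  }
  if (recent >= kMaxUploadsPer24h)
    return false;
  const auto stamp = std::chrono::duration_cast<std::chrono::seconds>(
                         now.time_since_epoch()).count();
  // Touch a uniquely named timestamp file to record this upload.
  std::ofstream stamp_file(kTimestampDir / std::to_string(stamp));
  return true;
}
```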
# Alternatives Considered
One/some of these alternatives may indeed be what we implement in the longer
run.
## User space diagnostics
We considered linking [Google Breakpad] into every process with extra logic that
determines where and how to store crash dumps.
This was our first implementation.
Unfortunately we cannot affect the linking of every process (since some are
delivered to Google in binary form only).
Also the act of installing a signal handler in every process can be disruptive.
This could also be done at the libc level, as Android did by adding to Bionic
(their libc replacement) the ability to catch unhandled segfaults in-process
and signal a debugger process in the system.
While possible, installing this into every process seems tricky: the timing of
library initialization would be delicate, and we would need to watch for
infinite loops (what if the crash sending process itself crashes?).
[Apport]: https://wiki.ubuntu.com/Apport
[Firmware event logs]: https://github.com/dhendrix/firmware-event-log/blob/wiki/FirmwareEventLogDesign.md
[Google Breakpad]: https://chromium.googlesource.com/breakpad/breakpad
[kexec]: https://en.wikipedia.org/wiki/Kexec
[Linux Kernel Crash Dump]: http://lkcd.sourceforge.net/
[LKCD]: http://lkcd.sourceforge.net/
[WINQUAL]: https://en.wikipedia.org/wiki/Winqual
[/proc/sys/kernel/core_pattern]: http://man7.org/linux/man-pages/man5/core.5.html