Starting from COS LTS 73 (cos-dev-73-11647-29-0), COS images support the kernel crash dump feature. When enabled, this feature captures a full kernel memory dump in the event of a kernel crash and saves it locally on the instance’s boot disk. You can download the report and attach it to a Google Cloud Platform Support Case to help debug the crash. You do not need to analyze the report yourself.
The Kernel Crash Dump Collection tool is based on the open source kdump solution, and operates only within the guest OS. It includes a secondary dump-capture kernel, a dump-capture userspace, and userspace tools for managing the kdump functionality.
Before you begin, there are some limitations:
Kernel crash dump collection requires Secure Boot to be disabled. The Secure Boot feature is disabled by default; if it is enabled on the COS instance, you can disable it with:
$ gcloud compute instances stop [INSTANCE_NAME]
$ gcloud compute instances update [INSTANCE_NAME] --no-shielded-secure-boot
$ gcloud compute instances start [INSTANCE_NAME]
The kdump feature reserves a fixed amount of system memory (64 MB to 512 MB, depending on machine size) that cannot be used for any other purpose.
kdump depends on the boot disk, so it may fail if the boot disk is full or corrupted.
While the instance is booted into the dump-capture kernel, it is inaccessible to the user, because many userspace components (such as sshd, kubelet, konlet and cloud-init) are not started in the dump-capture kernel. The best way to view the instance’s activity during kdump is to inspect its serial port output.
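As a minimal sketch of inspecting the serial port output mentioned in the last limitation, the following helper narrows the console log down to kdump-related lines. It assumes the gcloud CLI is installed and authenticated; the function name `filter_kdump_lines` and the choice of "kdump" as the search keyword are illustrative assumptions, not part of the COS tooling.

```shell
#!/bin/sh
# Keep only the serial console lines that mention kdump (case-insensitive).
# The function name and the "kdump" keyword are illustrative choices.
filter_kdump_lines() {
  grep -i 'kdump' || true   # do not fail when nothing matches
}

# Usage (needs gcloud; add --zone if no default zone is configured):
#   gcloud compute instances get-serial-port-output [INSTANCE_NAME] | filter_kdump_lines
```

Filtering is optional; reading the unfiltered output may be more useful when the failure happens before any kdump message is printed.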
GKE has used COS 73 since version 1.13.5-gke.7. The kernel crash dump collection feature is available only on GKE clusters running 1.13.5-gke.7 or newer.
Enabling Kernel Crash Dump requires a node reboot, so we recommend that you create a new node pool with Kernel Crash Dump enabled and then migrate your workloads to it.
To create a node pool with Kernel Crash Dump enabled:
$ gcloud container node-pools create kdump-enabled --cluster=[CLUSTER_NAME] \
    --node-labels=cloud.google.com/gke-kdump-enabled=true
Deploy the DaemonSet, which runs on nodes with the cloud.google.com/gke-kdump-enabled=true label. It enables Kernel Crash Dump and then reboots the node.
$ kubectl create -f \
    https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-node-tools/master/enable-kdump/cos-enable-kdump.yaml
$ kubectl get pods --selector=name=enable-kdump -n kube-system
You should get a response similar to:
NAME                 READY   STATUS    RESTARTS   AGE
enable-kdump-68bmw   1/1     Running   0          6m
$ kubectl logs enable-kdump-68bmw enable-kdump -n kube-system
You should get a response similar to:
kdump enabled: true
kdump ready: true
kdump kernel loaded: true
kdump kernel /boot/kdump/vmlinuz is loaded with command line parameter: systemd.unit=kdump-save-dump.service noinitrd console=ttyS0 root=PARTUUID=E3438A34-19F5-3044-897F-4F5428D985F4 maxcpus=1
kdump is enabled and ready. No reboot required.
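With many nodes, reading each pod's log by hand gets tedious. The sketch below checks the log text for the two status lines shown in the sample output above. The function name `kdump_ready` is hypothetical, and the check assumes the DaemonSet keeps printing its status in exactly that wording.

```shell
#!/bin/sh
# Succeed only if the log text on stdin reports kdump as ready and the
# dump-capture kernel as loaded. The matched strings are taken from the
# sample output above; the function name is hypothetical.
kdump_ready() {
  logs=$(cat)
  printf '%s\n' "$logs" | grep -q 'kdump ready: true' &&
    printf '%s\n' "$logs" | grep -q 'kdump kernel loaded: true'
}

# Usage: check every enable-kdump pod in kube-system.
#   for pod in $(kubectl get pods --selector=name=enable-kdump -n kube-system -o name); do
#     if kubectl logs "$pod" -c enable-kdump -n kube-system | kdump_ready; then
#       echo "$pod: ready"
#     else
#       echo "$pod: NOT ready"
#     fi
#   done
```

A pod that has just rebooted its node may not have logged its final status yet, so a "NOT ready" result is worth rechecking after a few minutes.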
You must keep the DaemonSet running on the node pools so that new nodes created in the pool have the changes applied automatically. Node creation can be triggered by node auto-repair, manual or automatic upgrades, and autoscaling.
To disable Kernel Crash Dump, you will need to recreate the node pool without deploying the provided DaemonSet, and migrate your workloads to the new node pool.
To create the new node pool with Kernel Crash Dump disabled:
$ gcloud container node-pools create kdump-disabled --cluster=[CLUSTER_NAME]
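Both enabling and disabling involve migrating workloads from the old node pool to a new one. One possible way to do that, sketched below, is to cordon and then drain every node in the old pool. The function name `drain_node_pool` is hypothetical; the sketch assumes kubectl is configured for the cluster and relies on the cloud.google.com/gke-nodepool label that GKE sets on its nodes. Depending on your workloads, kubectl drain may need additional flags.

```shell
#!/bin/sh
# Cordon and drain every node of one GKE node pool so that workloads are
# rescheduled onto other pools. The function name is hypothetical.
drain_node_pool() {
  pool=$1
  nodes=$(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" -o name)
  # Cordon all nodes first so evicted pods do not land on another old node.
  for node in $nodes; do
    kubectl cordon "${node#node/}"
  done
  # --ignore-daemonsets lets the drain proceed past DaemonSet-managed pods.
  for node in $nodes; do
    kubectl drain "${node#node/}" --ignore-daemonsets
  done
}

# Usage: drain_node_pool [OLD_POOL_NAME]
```

Once the old pool is empty and the workloads are healthy on the new pool, the old pool can be deleted.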
To enable the Kernel Crash Dump Collection tool on a GCE COS instance, run the kdump_helper enable command on the instance, then reboot the system.
Note: Rebooting is required.
$ sudo kdump_helper enable
$ sudo reboot
In the event of a kernel crash, the crash dump will be stored on the instance’s local boot disk.
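After the reboot, one way to confirm that a dump-capture kernel is loaded, and to see how much memory is reserved for it, is to read the standard Linux kexec files under /sys/kernel. This is a sketch that assumes the COS kernel exposes those files; the function name `kdump_sysfs_status` and its optional directory argument (included to make testing easy) are illustrative and not part of kdump_helper.

```shell
#!/bin/sh
# Report whether a crash (dump-capture) kernel is loaded and how much memory
# is reserved for it, using standard Linux sysfs files. The function name
# and the optional directory argument are illustrative.
kdump_sysfs_status() {
  dir=${1:-/sys/kernel}
  loaded=$(cat "$dir/kexec_crash_loaded")   # 1 when a crash kernel is loaded
  bytes=$(cat "$dir/kexec_crash_size")      # reserved crash memory in bytes
  echo "crash kernel loaded: $loaded"
  echo "reserved memory: $((bytes / 1024 / 1024)) MB"
}

# Usage on the instance:
#   kdump_sysfs_status
```

A loaded value of 0 after the reboot suggests that kdump did not come up; the serial port output is the next place to look.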
To disable the Kernel Crash Dump Collection tool on a GCE COS instance, use the kdump_helper disable command, then reboot:
$ sudo kdump_helper disable
$ sudo reboot
Existing crash dumps are not deleted automatically.
The sosreport tool collects crash dumps along with other debugging information. See the sosreport documentation for instructions on sharing the report with Google.
The dump file can be inspected with the crash utility.
To remove all existing reports from the instance, run:
$ sudo kdump_helper cleanup
The serial port output of the COS instance contains the logs from the dump-capture kernel and indicates what went wrong: