Configuring kdump and crash and analyzing kernel failures

  

The Linux kernel is very stable, but crashes are still inevitable. Capturing a memory image when the kernel crashes helps to analyze what happened before the crash, find the cause, and fix the error, which further improves the stability of the system.

One Introduction to kdump
kdump is currently the most effective memory image collection mechanism for Linux. It is widely used in the products of the major Linux vendors and plays an irreplaceable role in kernel debugging. kdump is a kexec-based crash capture mechanism that saves the memory image of the kernel at the time of the crash; a programmer then analyzes that file to find the cause of the crash and improve the system. kdump can write the memory image to a local disk, or to a device on another machine over protocols such as NFS and SSH.

The mechanism has two components, kexec and kdump. kexec is a fast-boot tool that starts a new kernel from the context of the running (production) kernel without the time-consuming BIOS checks, which makes kernel debugging easier for developers. kdump is the memory dump tool built on top of it. When kdump is enabled, the production kernel reserves part of its memory. When the production kernel crashes, kexec boots a new kernel inside that reserved region. Because this does not go through a normal system restart, the memory of the crashed production kernel is preserved and can be dumped.

Two kdump installation and configuration

2.1 Installation package
The tools used by kdump are in the kexec-tools package, and kernel-debuginfo is used to analyze vmcore files. Starting with RHEL 5, kexec-tools is installed by default in the distribution. To debug the vmcore file generated by kdump, you need to install the kernel-debuginfo package manually. Check that both packages are installed, and make sure the kernel-debuginfo version is the same as the kernel version.
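
The package check described above might look like the following on an RPM-based system. This is only a sketch: the helper name debuginfo_matches is hypothetical, and it assumes that `rpm -q kernel-debuginfo` prints a single name of the form kernel-debuginfo-<release>, where <release> is what `uname -r` reports.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a kernel-debuginfo package name
# (as printed by `rpm -q kernel-debuginfo`) matches a kernel release
# string (as printed by `uname -r`).
debuginfo_matches() {
    pkg=$1
    release=$2
    [ "${pkg#kernel-debuginfo-}" = "$release" ]
}

# Only query the package database where rpm is available.
if command -v rpm >/dev/null 2>&1; then
    rpm -q kexec-tools                  # tools used by kdump
    pkg=$(rpm -q kernel-debuginfo)      # needed to analyze the vmcore file
    if debuginfo_matches "$pkg" "$(uname -r)"; then
        echo "kernel-debuginfo matches the running kernel"
    else
        echo "kernel-debuginfo is missing or does not match $(uname -r)"
    fi
fi
```

If several kernel-debuginfo versions are installed, `rpm -q` prints more than one line and the comparison reports a mismatch, so the output should be read by hand in that case.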

2.2 Configuring the crashkernel boot parameter
Add the kernel parameter "crashkernel=Y@X" to the /boot/grub/grub.conf file, where Y is the amount of memory reserved for the kdump capture kernel and X is the start address of the reserved region. For i386 and x86_64, edit /etc/grub.conf and add "crashkernel=128M" at the end of the kernel line. It is also recommended not to use crashkernel=auto, because the auto setting introduced in RHEL 6 has since been abandoned.
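
After rebooting with the new parameter, it is worth confirming that the running kernel actually received it. The sketch below reads /proc/cmdline and, on kernels that expose it, /sys/kernel/kexec_crash_size. The helper name crashkernel_value is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the crashkernel= value found in a kernel
# command line, or nothing if the parameter is absent. The body runs in
# a subshell so that disabling globbing does not leak to the caller.
crashkernel_value() (
    set -f
    for arg in $1; do
        case $arg in
            crashkernel=*) printf '%s\n' "${arg#crashkernel=}" ;;
        esac
    done
)

# Inspect the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    value=$(crashkernel_value "$(cat /proc/cmdline)")
    echo "crashkernel=${value:-<not set>}"
fi

# Size of the reserved region in bytes (0 means nothing was reserved).
if [ -r /sys/kernel/kexec_crash_size ]; then
    cat /sys/kernel/kexec_crash_size
fi
```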

2.3 Configure kdump parameters
By default kdump stores the dump file in the /var/crash/ directory on the local disk. The target can be a directory on a local file system, a block device, or another machine reached over the network; to change it, edit /etc/kdump.conf:

cat /etc/kdump.conf
…
#raw /dev/sda5 (to write to a raw device, uncomment this line)
#ext4 /dev/sda3 (file system type and partition to write to)
#ext4 LABEL=/boot (devices can also be identified by LABEL or UUID)
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#net my.server.com:/export/tmp (write over NFS)
#net user@my.server.com (write over SSH)
path /var/crash (storage path of the memory image file)
core_collector makedumpfile -c --message-level 1 -d 31 (controls what is saved: -c compresses the data with makedumpfile; --message-level 1 sets the message level so that only progress information is shown; -d 31 skips every page type that can be excluded, including zero pages, cache pages, private cache pages, user data and free pages)
#default shell (the action taken when kdump fails to save the memory image; the default is to mount the root file system and run /sbin/init, and it can be changed to reboot, halt, poweroff, shell and so on)
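
Because most entries in /etc/kdump.conf are commented out, it can help to list only the directives that are in effect. A minimal sketch, assuming the file uses # for comments as in the entries above; the helper name active_directives is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the directives that are actually in effect,
# skipping comment lines and blank lines.
active_directives() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

conf=/etc/kdump.conf
if [ -r "$conf" ]; then
    active_directives "$conf" || echo "no active directives in $conf"
fi
```

With the example configuration above, only the path and core_collector lines would be printed.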

2.4 Set the kdump service to start at boot.
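The commands for this step are not shown in the text. On SysV init systems such as RHEL 5 and 6 the usual pair is chkconfig and service, and on systemd-based distributions it is systemctl; both are given below as comments because they need root. The helper capture_kernel_loaded is a hypothetical addition that reads /sys/kernel/kexec_crash_loaded, which the kernel sets to 1 once a capture kernel has been loaded.

```shell
#!/bin/sh
# Run as root; pick the pair that matches the init system.
#   chkconfig kdump on && service kdump start          # SysV init (RHEL 5/6)
#   systemctl enable kdump && systemctl start kdump    # systemd

# Hypothetical helper: succeed when the given status file contains 1.
capture_kernel_loaded() {
    status_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$status_file" ] && [ "$(cat "$status_file")" = "1" ]
}

if capture_kernel_loaded; then
    echo "capture kernel is loaded"
else
    echo "capture kernel is not loaded"
fi
```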

2.5 Verify that the configuration is successful.
Trigger a kernel crash from the command line. If a memory image file is generated in the specified directory, the configuration is successful.
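
One common way to trigger the crash is the SysRq interface, which matches the /proc/sysrq-trigger write and the sysrq_handle_crash function that appear later in this article. The two commands are left as comments because they bring the machine down immediately. The helper list_vmcores is hypothetical and assumes the default path /var/crash.

```shell
#!/bin/sh
# WARNING: the two commands below crash the machine immediately.
# Run them as root, and only on a test system.
#   echo 1 > /proc/sys/kernel/sysrq     # enable the SysRq interface
#   echo c > /proc/sysrq-trigger        # force a kernel crash

# Hypothetical helper: after the machine comes back up, list the dump
# files found under the configured path.
list_vmcores() {
    find "${1:-/var/crash}" -type f -name 'vmcore*' 2>/dev/null
}

list_vmcores /var/crash || echo "no dump directory found"
```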

Three Introduction to crash
crash is a widely used tool for analyzing Linux kernel crash dump files. Knowing how to use it is very important for analyzing and locating kernel crashes, and for kernel developers it has become indispensable.

Four crash configuration and initial use

4.1 Check the installation package
Confirm that the crash package is installed, together with the kernel-debuginfo package that matches the kernel being analyzed.

4.2 Use crash to analyze the memory dump file.
crash takes two arguments, the debug kernel and the dump file, that is, the kernel with debugging information and the kernel dump file generated by the crash. The fields in the startup output have the following meanings:

KERNEL: location and version of the debug kernel
DUMPFILE: the memory dump image being analyzed
CPUS: number of CPUs in the machine
DATE: time at which the kernel crash occurred
UPTIME: how long the kernel had been running normally
LOAD AVERAGE: system load at the time of the crash
TASKS: number of tasks running in the system when the kernel crashed
NODENAME: host name of the machine whose kernel crashed
RELEASE: release version of the kernel
VERSION: other version information of the kernel
MACHINE: CPU architecture and frequency
MEMORY: memory size of the system whose kernel crashed
PANIC: type of crash. Possible values include SysRq, a crash caused by a system request; Oops, which means the kernel behaved in an unpredictable or incorrect way and will kill the corresponding process, after which the kernel may return to normal or be left in an uncertain state that leads to a panic; and Panic, a serious and unrecoverable error such as an illegal address access, forced loading or unloading of a kernel module, or a hardware error
PID: process ID of the process that caused the crash
COMMAND: name of the process that caused the crash
TASK: memory address of the task structure of the process that caused the crash
CPU: number of the CPU on which that process was running
STATE: run state of the process at the time of the crash

This information is enough for a first analysis of the cause of the crash. There are three error conditions in kernel mode: bug, oops and panic. A bug is a minor error. An oops is an error tied to a process, and that process has to be killed; if the process holds locks or semaphores at that point they are never released, which can leave the system unstable. A panic is a serious error in which the entire system goes down. Deeper analysis requires more commands for tracing and searching, along with some understanding of how the kernel works and of kernel programming.
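
A sketch of the invocation, with the debug kernel as the first argument and the dump file as the second. It assumes the usual location where RPM-based distributions install the debug kernel from kernel-debuginfo, and it assumes the analysis runs on a machine whose running kernel is the same release as the crashed one. The helper debug_vmlinux and the default vmcore path are placeholders for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the path where RPM-based distributions
# usually install the debug kernel from kernel-debuginfo.
debug_vmlinux() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Placeholder dump location; substitute the directory kdump created.
vmcore=${VMCORE:-/var/crash/example/vmcore}
vmlinux=$(debug_vmlinux "$(uname -r)")

if command -v crash >/dev/null 2>&1 && [ -r "$vmlinux" ] && [ -r "$vmcore" ]; then
    crash "$vmlinux" "$vmcore"     # first: debug kernel, second: dump file
else
    echo "would run: crash $vmlinux $vmcore"
fi
```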

4.3 Common crash commands
help         # view help for a command; the man command can also be used
h            # view the command history, equivalent to history in the shell
log          # print the kernel log held in the memory image
bt           # show the call stack of the current thread
foreach bt   # show the call stacks of all threads
ps           # show process information at the time of the crash
vm           # show virtual memory information for the current kernel context
files        # show the files opened in the current kernel context
exit or q    # exit crash
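
The commands can also be collected in a file and passed to crash with its -i option, which runs the commands in the file before accepting interactive input. The sketch below only writes such a file. The helper name write_crash_cmds, the file location and the choice of commands are assumptions for illustration, and the crash arguments in the final comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: write a crash command file that runs a few of the
# commands listed above and then exits.
write_crash_cmds() {
    cat > "$1" <<'EOF'
log
bt
ps
files
exit
EOF
}

cmds=${TMPDIR:-/tmp}/crash-cmds.txt
write_crash_cmds "$cmds"
echo "wrote $cmds"
# Then, with the debug kernel and the dump file as arguments:
#   crash -i "$cmds" <debug-kernel> <dump-file> > report.txt
```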

4.4 Use the bt command to view the stack
The [exception RIP: sysrq_handle_crash+22] field shows where the exception occurred: at offset 22 in the kernel function sysrq_handle_crash. As you can see, this is also related to the file system. The CS:0010 field shows the contents of the code segment register. Its two lowest bits are 0, which means the current privilege level (CPL) of the process was 0 and the error occurred in kernel space.
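
The privilege-level check can be done with one line of shell arithmetic, since on x86 the CPL is held in the two lowest bits of CS. The helper name cpl_from_cs is hypothetical; it takes the hexadecimal CS value as printed by bt.

```shell
#!/bin/sh
# Hypothetical helper: extract the privilege level from a hexadecimal
# CS selector value by masking the two lowest bits.
cpl_from_cs() {
    echo $(( 0x$1 & 3 ))
}

cpl_from_cs 0010    # CS value from the bt output; prints 0 (kernel space)
```

A result of 3 would instead indicate that the code was running in user space.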

4.5 The log command prints the system message buffer, which may contain clues to the crash (the screenshot omits some lines and shows only the end)
As in the bt output, the kernel prints the call stack of the system call under Call Trace, which can be used for analysis.

4.6 The ps command displays the status of processes. In the output, the > flag marks the active process.
swapper is the kernel's idle task. It is part of the kernel, its PID is 0, and each CPU has its own swapper. The output also shows the status of bash, the process that caused the kernel to crash; its process ID is 1728.
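
When the process list is long, the active tasks can be filtered out of a saved copy of the ps output. This sketch assumes the output was redirected to a text file from inside crash and that the > flag sits in the first column of each active line. The helper name active_tasks and the file name ps.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of a saved ps listing that
# start with ">", the flag for an active process.
active_tasks() {
    grep '^>' "$1"
}

# Usage, assuming the ps output was saved to ps.txt beforehand:
#   active_tasks ps.txt
```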

4.7 The files command shows the files opened in the current kernel context
The output shows the file involved in the crash, /proc/sysrq-trigger.

4.8 The vm command shows virtual memory information

4.9 The mount command shows the mounted file systems

4.10 The net command shows basic network information

Five Conclusion
When the Linux kernel crashes, kdump and similar mechanisms can capture the memory at the time of the crash and generate a dump file, vmcore. By analyzing the vmcore file, a kernel developer can diagnose the cause of the crash and improve the operating system code. crash is a widely used tool for analyzing kernel crash dump files. Used together, kdump and crash make it possible to track down many problems.
