How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux


Issue

  • How do I configure kexec/kdump on RHEL?
  • How much disk space is required for kdump to dump the vmcore?
  • Need RCA of kernel crash/panic
  • How do I troubleshoot a server crash/reboot?
  • Want root cause of a system reboot
  • How do I capture a vmcore on my server?
  • My server hung or became unresponsive, how to troubleshoot?
  • Problem collecting a core file with kdump on a host
  • How much time is required to capture vmcore?
  • System freezes unexpectedly, how to troubleshoot?

Environment

  • Red Hat Enterprise Linux (RHEL) 5
  • Red Hat Enterprise Linux (RHEL) 6

Resolution

For RHEL 3 and RHEL 4, netdump must be used. Refer to How do I configure netdump on Red Hat Enterprise Linux 3 and 4?
For Xen guests, xendump must be used. Refer to How do I configure Xendump on Red Hat Enterprise Linux 5?
For KVM and RHEV, refer to How to capture vmcore dump from a KVM guest?

Note: KVM and RHEV guests are not required to use the above method; it is an additional option for capturing a vmcore when the VM is unresponsive.

Contents

  1. Background / Overview
  2. Prerequisites
  3. Installing kdump
  4. Adding Boot Parameters
  5. Specifying Kdump Location
  6. Dumping Directly to a Device
  7. Dumping to a file on Disk
  8. Dumping to a Network Device using NFS
  9. Dumping to a Network Device using SSH
  10. Dumping to a SAN Device (For RHEL5)
  11. Dumping to a SAN Device (For RHEL6 with blacklist of multipath)
  12. Dumping to a SAN Device (For RHEL6 with multipath device)
  13. Sizing Local Dump Targets
  14. Specifying Page Selection and Compression
  15. Clustered Systems
  16. Testing
  17. Time Required to Capture vmcore
  18. Controlling when kdump is activated
  19. Reducing the size of the vmcore when uploading to Red Hat Support
  20. Comments

Background / Overview

kexec is a fastboot mechanism that allows booting a Linux kernel from the context of an already running kernel without going through the BIOS. Since BIOS checks at startup can be very time consuming (especially on big servers with numerous peripherals), kexec can save a lot of time for developers who need to reboot a machine often for testing purposes. Using kexec for rebooting into a normal kernel is simple, but not within the scope of this article. See the kexec(1) man page.

kdump is a reliable kernel crash-dumping mechanism that utilizes the kexec software. The crash dumps are captured from the context of a freshly booted kernel; not from the context of the crashed kernel. Kdump uses kexec to boot into a second kernel whenever the system crashes. This second kernel, often called a capture kernel, boots with very little memory and captures the dump image.

The first kernel reserves a section of memory that the second kernel uses to boot. Be aware that the memory reserved for the kdump kernel at boot time cannot be used by the standard kernel, which changes the actual minimum memory requirements of Red Hat Enterprise Linux. To compute a system's actual minimum, refer to Red Hat Enterprise Linux 6 technology capabilities and limits for the documented minimum memory requirement and add the amount of memory reserved for kdump.
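
As an illustration only (the figures are hypothetical), if the documented minimum for a given configuration were 1 GB and crashkernel=128M is reserved, the practical minimum would be about 1 GB + 128 MB = 1152 MB.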

Using kdump allows booting the capture kernel without going through the BIOS; hence the contents of the first kernel's memory are preserved, which is essentially the kernel crash dump.

The following instructions must be followed in order to start capturing kernel cores with kdump.


Prerequisites

  • For dumping cores to a network target, access to a server over NFS or ssh is required.
  • Whether dumping locally or to a network target, a device or directory with enough free disk space is needed to hold the core. See the "Sizing Local Dump Targets" section below for more information on how much space will be needed.
  • For configuring kdump on a system running a Xen kernel, it is required to have a regular kernel of the same version as the running Xen kernel installed on the system. (If the system is 32-bit with more than 4GB of RAM, kernel-pae should be installed alongside kernel-xen instead of kernel.) Note: The kernel need only be installed. You can continue running the Xen kernel, and no reboot is required.


Installing kdump

Verify the kexec-tools package is installed:

# rpm -q kexec-tools

If it is not installed, proceed to install it via yum:

# yum install kexec-tools

On IBM Power (ppc64) and IBM System z (s390x), the capture kernel is provided in a separate package called kernel-kdump which must be installed for kdump to function:

# yum install kernel-kdump

This package is not necessary (and in fact does not exist) on other architectures.


Adding Boot Parameters


Red Hat provides a KDump Helper tool to help you set up kdump on RHEL 5/6 kernels. Given a minimal amount of information, the tool generates an all-in-one script that sets up kdump with a very basic configuration, or with extended configurations for a number of particular scenarios such as system hangs, processes stuck in D state, or soft lockups. Running the generated script will determine the correct crashkernel= parameter and add it to the currently active grub menu line. Read the KDump Helper Blog and leave feedback at KDump Helper App Info. The KDump Helper automates the following steps.


The option crashkernel must be added to the kernel command line parameters in order to reserve memory for the kdump kernel:

  • For i386 and x86_64 architectures on RHEL 5, edit /boot/grub/grub.conf and append crashkernel=128M@16M to the end of the kernel line. (Note: using @16M on RHEL 6 has caused kdump to fail.)
  • For RHEL 6 i386 and x86_64 systems, use crashkernel=128M

It may be possible to use less than 128M, but testing with only 64M has proven unreliable.

For more information regarding the crashkernel parameter on RHEL 6, refer to How should the crashkernel parameter be configured for using kdump on RHEL6?

The following is an example of /boot/grub/grub.conf with the kdump options added for RHEL 5:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,0)
#          kernel /boot/vmlinuz-version ro root=/dev/hda1
#          initrd /boot/initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Client (2.6.17-1.2519.4.21.el5)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.17-1.2519.4.21.el5 ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
        initrd /boot/initrd-2.6.17-1.2519.4.21.el5.img

Or for RHEL 6:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/mapper/vg_example-lv_root
#          initrd /initrd-[generic-]version.img
# boot=/dev/vda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-71.7.1.el6.x86_64)
       root (hd0,0)
       kernel /vmlinuz-2.6.32-71.7.1.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root rd_LVM_LV=vg_example/lv_root rd_LVM_LV=vg_example/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet
       initrd /initramfs-2.6.32-71.7.1.el6.x86_64.img

If you are using a Xen kernel on RHEL 5, you will need to add the crashkernel parameter to the end of the kernel command line (the xen.gz line), not the module line, even though the module line references the vmlinuz Linux kernel.

For RHEL 5 when using a Xen kernel:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,0)
#          kernel /boot/vmlinuz-version ro root=/dev/hda1
#          initrd /boot/initrd-version.img
# boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.18-194.17.1.el5xen)
       root (hd0,0)
       kernel /xen.gz-2.6.18-194.17.1.el5 crashkernel=128M@16M
       module /vmlinuz-2.6.18-194.17.1.el5xen ro root=/dev/myvg/rootvol
       module /initrd-2.6.18-194.17.1.el5xen.img

After adding the crashkernel parameter the system must be rebooted for the crashkernel memory to be reserved for use by kdump. This reboot can be performed now or after the below steps to configure kdump have been completed.

Please note: If the kdump service is not configured to start on boot, the crashkernel= memory will not be set aside. In order to fully configure and bring the service online, the crashkernel= parameter must be in place and the chkconfig kdump on command must be executed prior to a reboot.
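
After the reboot, the reservation can be verified; a quick check on RHEL 6 (the addresses and size shown are illustrative and will differ on your system):

# grep -i "crash kernel" /proc/iomem
  03000000-0affffff : Crash kernel
# cat /sys/kernel/kexec_crash_size
134217728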


Specifying the Kdump Location

The location of the kdump vmcore can be specified in /etc/kdump.conf. You can either dump directly to a device, to a file, or to some location on the network via NFS or SSH. For RHEL 6, if a target location is not specified in the configuration, default values will be used, resulting in cores being saved to /var/crash on the root file system. For information about supported dump targets see What targets are supported for use with kdump?
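
For example, a minimal /etc/kdump.conf that dumps to a local file system might contain only a target, a path, and a core collector (a sketch; the UUID is a placeholder, and each directive is described in the sections below):

ext4 UUID=<uuid-of-dump-filesystem>
path /var/crash
core_collector makedumpfile -d 31 -c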


Dumping Directly to a Device

Kdump can be configured to dump directly to a device by using the raw directive in /etc/kdump.conf. The syntax to be used is:

raw <devicename>

For example:

raw /dev/sda1

This will overwrite any data that was previously on the device.


Dumping to a file on Disk

kdump can be configured to mount a partition and dump to a file on disk. This is done by specifying the filesystem type followed by the device in /etc/kdump.conf. The device may be specified as a device node, a filesystem label, or a filesystem UUID, in the same manner as /etc/fstab. For example:

    ext3 /dev/sda1

will mount `/dev/sda1` as an ext3 device and dump the core to `/var/crash/` directory (creating it if necessary), whereas:

    ext3 LABEL=/boot

will mount the device that is ext3 with the label `/boot` and use that to dump the core.

The label may need to be set manually for storage devices that have been configured after Red Hat Enterprise Linux has been installed. For example, the following will set the label 'crash' on the storage device '/dev/sdd1':

    e2label /dev/sdd1 crash

To view the label for a storage device, run 'e2label' with the device as the only argument:

    e2label /dev/sdd1

An easy way to find how to specify the device is to look at what you're currently using in /etc/fstab (the filesystem you're dumping to does not need to be persistently mounted via fstab). The default directory in which the core will be dumped is <filesystem>/var/crash/<date>/ where <date> is the current date at the time of the crash dump. This can be changed by using the path directive in /etc/kdump.conf. For example:

    ext3 UUID=f15759be-89d4-46c4-9e1d-1b67e5b5da82 
    path /usr/local/cores

will dump the vmcore to <filesystem>/usr/local/cores/ instead of the default <filesystem>/var/crash/ location.
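
To find the UUID or label of a device for use here, blkid can be used (the sample output below reuses the example UUID from above):

    # blkid /dev/sda1
    /dev/sda1: UUID="f15759be-89d4-46c4-9e1d-1b67e5b5da82" TYPE="ext3"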


Dumping to a Network Device using NFS

To configure kdump to dump to an NFS mount, edit /etc/kdump.conf and add a line with the following format:

net <nfs server>:</nfs/mount>

For example:

net nfs.example.com:/export/vmcores

This will dump the vmcore to /export/vmcores/<hostname>-<date>/ on the server nfs.example.com. The client system must have write access to this mount point.

Please note that NFSv4 is supported from RHEL 6.3 onwards.

When dumping to a network location over a bonded interface, it may be necessary to define the bonding module options in the kdump.conf file. See kdump doesn't accept module options from ifcfg-* files for more information.
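
On the NFS server side, an export along these lines is required (a minimal sketch reusing the nfs.example.com example above; client.example.com and the export options are placeholders to adapt, though no_root_squash is typically needed because the capture environment writes as root):

# /etc/exports on nfs.example.com
/export/vmcores client.example.com(rw,no_root_squash,sync)

# apply the change and verify the export:
# exportfs -r
# showmount -e localhost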


Dumping to a Network Device using SSH

SSH has the advantage of encrypting network communications while dumping. For this reason this is the best solution when you're required to dump a vmcore across a publicly accessible network such as the Internet or a corporate WAN:

net <user>@<ssh server>

For example:

net kdump@crash.example.com

In this case, kdump will use scp to connect to the crash.example.com server as the kdump user. It will copy the vmcore to the /var/crash/<hostname>-<date>/ directory. The kdump user will need the necessary write permissions on the remote server. Additionally, when first configuring kdump to use SSH, it will attempt to use the mktemp binary on the target system to ensure write permissions in the target path. If your kdump target server is running an operating system without the mktemp binary, you will need to use a different method to save a vmcore to that target.
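
On the target server, the dump user and a writable destination must exist before running the propagate step below. One possible preparation, reusing the kdump user from the example above (adjust the path and ownership to your site's policy):

# useradd -m kdump
# passwd kdump
# mkdir -p /var/crash
# chown kdump:kdump /var/crash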

To make this change take effect, run one of the following commands:

In RHEL 6 and earlier:

# service kdump propagate
Generating new ssh keys... done.
kdump@crash.example.com's password:
/root/.ssh/kdump_id_rsa.pub has been added to
~kdump/.ssh/authorized_keys2 on crash.example.com

In RHEL 7 and later (using systemd):

# kdumpctl propagate
Using existing keys...
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@crashtarget's password: 
Number of key(s) added: 1
Now try logging into the machine, with:   "ssh 'root@crashtarget'"
and check to make sure that only the key(s) you wanted were added.
/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on crashtarget

Make sure the free disk space of the partition or network location specified for storing the vmcore is at least as large as the total physical memory of the system.
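
A quick way to compare the two is to check installed RAM against the free space on the dump target (the output value is illustrative; substitute your configured target path):

# grep MemTotal /proc/meminfo
MemTotal:        8061404 kB
# df -h /var/crash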


Dumping to a SAN Device (For RHEL5)
  1. Get the wwid for the SAN paths:

    # /sbin/scsi_id -g -u -s /block/sd<x>
    
  2. Blacklist this LUN from multipath by editing /etc/multipath.conf:

    blacklist {
      wwid "3600601f0d057000019fc7845f46fe011"  
    }
    
  3. Reload multipath configuration:

    # /etc/init.d/multipathd reload
    
  4. Now create a partition on the LUN; make sure to select the correct one:

    # fdisk -l  
    # /sbin/scsi_id -g -u -s /block/sd<x>
    # fdisk /dev/sd<x>
    
  5. Re-read the partition table so the kernel registers the new partition:

    # partprobe /dev/sd<x>
    
  6. Validate the partition is there:

    # fdisk -l 
    
  7. Put an ext3/ext4/xfs filesystem on it:

    # mkfs.ext3 /dev/sd<x>1
    
  8. Now, let's get a udev rule in place:

    # cat 99-crashlun.rules
    KERNEL=="sd*", BUS=="scsi", ENV{ID_SERIAL}=="3600601f0d057000019fc7845f46fe011", SYMLINK+="crashsan%n"
    
  9. Trigger udev in a way that does not affect everything else:

    # echo change > /sys/block/sd<x>/sd<x>1/uevent
    
  10. Validate that the udev rule worked, looking for /dev/crashsan1:

    # ls /dev/
    
  11. Now update /etc/fstab adding the following to the end of the file:

    /dev/crashsan1         /var/crash       ext3    defaults    0 0
    
  12. Validate that the file system will mount automatically:

    # mount -a 
    # mount
    
  13. Edit /etc/kdump.conf accordingly:

    ext3 /dev/crashsan1
    
  14. Restart kdump:

    # service kdump restart
    
  15. Make sure sysrq is enabled and test the crash. WARNING! This will crash the system, so do it at a planned time if this is a production system.

    # echo 'c' > /proc/sysrq-trigger
    
  16. Once the system boots back, check to confirm that it worked.

    # tree /var/crash/
    /var/crash/
    |-- 2012-08-03-13:57
    |   `-- vmcore
    `-- lost+found
    
  17. This was validated on RHEL 5:

    # cat /etc/redhat-release 
    Red Hat Enterprise Linux Server release 5.8 (Tikanga)
    
    # uname -a
    Linux somecoolserver.redhat.com 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
    


Dumping to a SAN Device (For RHEL6 with blacklist of multipath)

Note: This is a workaround, and its behavior depends on the environment. Treat the following steps as a reference only; this method is not supported by Red Hat.

  1. Get the wwid for the SAN paths:

    # /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sd<X>
    
  2. Blacklist this lun from multipath by editing /etc/multipath.conf:

    blacklist {
      wwid "3600601f0d057000019fc7845f46fe011"  
    }
    
  3. Reload the multipath configuration:

    # /etc/init.d/multipathd reload  
    
  4. Now let's get a partition created on our LUN. Be sure to select the right one:

    # fdisk -l  
    # /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sd<X>
    # fdisk /dev/sd<x>
    
  5. Re-read the partition table so the kernel registers the new partition:

    # partprobe /dev/sd<x>
    
  6. Validate the partition is there:

    # fdisk -l 
    
  7. Put an ext3/ext4/xfs file system on it:

    # mkfs.ext3 /dev/sd<x>1
    
  8. Comment any unnecessary wwid entries in the following two files using the "#" character:

    • Switch into the multipath configuration directory:

      # cd /etc/multipath
      
    • Edit the wwids file and comment out the unnecessary wwid entries (the following is an example):

      # vi wwids
      {output truncated}
      # /3600144f08c3d8b000000511256f00001/
      
    • Edit the bindings file and do the same (the following is an example):

      # vi bindings
      {output truncated}
      # mpathc 3600144f08c3d8b000000511256f00001
      
  9. Add the multipath configuration to the initial ramdisk (initramfs):

    # dracut --force --add multipath --include /etc/multipath /etc/multipath
    
  10. Now update /etc/fstab adding the following to the end of the file using the UUID:

    • Check the uuid with blkid:

      # blkid
      
    • Ex: /etc/fstab:

      UUID=4262c8fc-23ad-42b2-9c5d-af9c64d5bb92    /var/crash    ext3    defaults        0 0
      
  11. Validate that the filesystem will mount automatically:

    # mount -a 
    # mount
    
  12. Edit /etc/kdump.conf accordingly:

    ext3 UUID=4262c8fc-23ad-42b2-9c5d-af9c64d5bb92
    
  13. Restart kdump and chkconfig it on:

    # service kdump restart
    # chkconfig kdump on
    
  14. Make sure sysrq is enabled and test the crash. WARNING! This will crash the system, so do it at a planned time if this is a production system.

    # echo 'c' > /proc/sysrq-trigger
    
  15. Once the system boots back, check to confirm that it worked:

    # tree /var/crash/
    /var/crash/
    ├── 127.0.0.1-2013-02-12-21:11:03
    │   └── vmcore
    └── lost+found
    

Note: The environment used for validation is shown below.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)

# uname -a
Linux xxxxx 2.6.32-279.22.1.el6.x86_64 #1 SMP Sun Jan 13 09:21:40 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep kexec
kexec-tools-2.0.0-245.el6.x86_64

# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64

Dumping to a SAN Device (For RHEL6 with multipath device)

Note: This method is supported by Red Hat. Please read the following carefully.

This configuration is only valid with kexec-tools-2.0.0-245.el6.x86_64 and later; with an older kexec-tools package, a multipath device cannot be used as a kdump target.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.3 (Santiago)

# uname -a
Linux xxxxx 2.6.32-279.22.1.el6.x86_64 #1 SMP Sun Jan 13 09:21:40 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep kexec
kexec-tools-2.0.0-245.el6.x86_64

# rpm -qa | grep multipath
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64

Check the multipath status:

# multipath -ll
mpathf (3600144f08c3d8b000000511a51b10002) dm-7 
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 12:0:0:1 sdk 8:160 active ready running
  |- 13:0:0:1 sdm 8:192 active ready running
  |- 14:0:0:1 sdo 8:224 active ready running
  |- 15:0:0:1 sdq 65:0  active ready running
  `- 16:0:0:1 sds 65:32 active ready running

Now create a partition on the LUN; make sure you select the right one:

# fdisk -l  
# fdisk /dev/mapper/mpath<x>

Re-read the partition table and reload the multipath maps so the new partition appears:

# partprobe /dev/mapper/mpath<x>
# multipath -r

Validate the partition is there

# fdisk -l 

Put an ext3 file system on it (ext4 would also work):

# mkfs.ext3  /dev/mapper/mpath<x>p1

Now update /etc/fstab, adding the following to the end of the file using the UUID.

Check the UUID with the blkid command:

# blkid
# vi /etc/fstab
  Ex:
        UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386           /var/crash            ext3     defaults,_netdev 0 0    <--- for iSCSI multipath
        UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386           /var/crash            ext3     defaults         0 0    <--- for SAN multipath

Validate that the partition will mount automatically

# mount -a 
# mount

Now edit /etc/kdump.conf accordingly

ext3 UUID=b2d74f2e-2dbf-4714-9787-ba1c147c4386

Restart kdump and enable it with chkconfig:

# service kdump restart

# chkconfig kdump on

Make sure sysrq is enabled and test the crash. WARNING! This will crash the system, so do it at a planned time if this is a production system.

# echo 'c' > /proc/sysrq-trigger

Once the system boots back, check to confirm that it worked:

# tree /var/crash/
/var/crash/
├── 127.0.0.1-2013-02-12-21:11:03
│   └── vmcore
└── lost+found


Sizing Local Dump Targets

The size of the core file, and therefore the amount of disk space necessary to store it, will depend on how much RAM is in use and what type of data is stored there. The only sure way to guarantee a successful dump is to have free space on disk at least equal to physical RAM. However, using the core_collector options (see the "Specifying Page Selection and Compression" section below) you can compress the core dump and remove specific types of pages from it. This should save a large amount of space, but again it depends on how the system is being used. The compression ratio achieved using the "-c" option is entirely dependent on the content stored in RAM; some content will compress better than other content.

The best way to determine the space requirements is to test the dump under typical system usage by using the "c" SysRq to crash the system and generate a sample core. Dumping to a dedicated dump server via NFS or SSH using the "net" option in kdump.conf (see the "Dumping to a Network Device" sections above) can help eliminate the need for reserved local storage and reduce overall dump storage requirements. Centralized network dump servers reduce overall storage needs through economies of scale, specifically the improbability that all the systems sharing the central dump server will need the storage during overlapping periods.


Specifying Page Selection and Compression

On large memory systems, it is advisable both to discard pages that are not needed and to compress the remaining pages. This is done in kdump.conf with the core_collector command. At this point, the only fully supported core collector is makedumpfile. Its options can be viewed with makedumpfile --help. The -d option specifies which types of pages should be left out. The option is a bit mask, with each page type assigned a value as follows:

zero  pages   = 1
cache pages   = 2
cache private = 4
user  pages   = 8
free  pages   = 16

In general, these pages may not contain relevant information. To set all these flags and leave out all these pages, use a value of -d 31. However, if there are no size/space/time constraints, use a value of -d 1 to strip zero pages only. The -c option tells makedumpfile to compress the remaining data pages.

# throw out zero pages (containing no data)
# core_collector makedumpfile -d 1 
# throw out all trivial pages
# core_collector makedumpfile -d 31         
# compress all pages, but leave them all
# core_collector makedumpfile -c            
# throw out zero pages and compress (recommended)
core_collector makedumpfile -d 1 -c       
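
Because -d takes a bit mask, the values can be combined by addition. For example, to discard zero pages and free pages while keeping cache and user pages:

# zero pages (1) + free pages (16) = 17
core_collector makedumpfile -d 17 -c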

Note: When making a change to kdump.conf, a service kdump restart is required. If you will be rebooting the system later, this step can be skipped.


Clustered Systems

Cluster nodes can be fenced/rebooted before kdump has time to complete. In clustered environments it is generally necessary to configure additional time for kdump to complete before fencing. For clusters running the Red Hat High Availability or Resilient Storage Add-Ons, RHEL Advanced Platform Cluster, or Red Hat Cluster Suite, refer to How do I configure kdump for use with the RHEL High Availability Add-On? for more information.


Testing

After making the above changes, reboot the system. The 128M of memory (starting 16M into memory, in the RHEL 5 example) is left untouched by the normal system, reserved for the capture kernel. Note that the output of free -m shows 128M less memory than without this parameter, which is expected.

Now that the reserved memory region is set up, turn on the kdump init script and start the service:

#  chkconfig kdump on
#  service kdump start
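
Before forcing a crash, you can confirm that the capture kernel is actually loaded; /sys/kernel/kexec_crash_loaded reads 1 when kdump is armed (sample session on RHEL 6; output shown is illustrative):

# cat /sys/kernel/kexec_crash_loaded
1
# service kdump status
Kdump is operational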

This will create a /boot/initrd-kdump.img, leaving the system ready to capture a vmcore upon crashing. To test this, force a crash with sysrq:

Warning: This will panic your kernel, killing all services on the machine

# echo c > /proc/sysrq-trigger

(For more information about sysrq, refer to What is the SysRq facility and how do I use it?.)

This causes the kernel to panic, followed by the system restarting into the kdump kernel. When the boot process gets to the point where it starts the kdump service, the vmcore should be copied to the location specified in the /etc/kdump.conf file.


Time required to capture vmcore

Dumping time depends on the options that are used for its configuration. Refer to How to determine the time required for dumping a vmcore file with kdump?

Controlling when kdump is activated

There are several parameters that control under which circumstances kdump is activated. kdump can be activated when

  • a system hang is detected through the Non-Maskable Interrupt (NMI) Watchdog mechanism.
    This mechanism is enabled through the nmi_watchdog=1 kernel parameter. Refer to What is NMI and what can I use it for? for details.
  • a hardware NMI button is pressed.
    This mechanism is enabled by setting the sysctl kernel.unknown_nmi_panic=1.
  • an "unrecovered" NMI has occurred.
    This mechanism is enabled by setting the sysctl kernel.panic_on_unrecovered_nmi=1. The following kernel warning messages are associated with "unrecovered" NMIs:

    Uhhuh. NMI received for unknown reason <hexnumber> on CPU <CPUnumber>.
    Do you have a strange power saving mode enabled?
    Dazed and confused, but trying to continue

  • the out-of-memory killer (oom-killer) would otherwise be triggered.
    This can be configured by setting the sysctl vm.panic_on_oom=1.

Under many circumstances it is advisable to enable multiple tunables from the above list. For example, in the event of hangs, it is advisable to enable kernel.unknown_nmi_panic, kernel.softlockup_panic, and nmi_watchdog=1. This increases the likelihood that a vmcore will result from an event that an administrator may not be directly monitoring at the time.
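
As a sketch, the sysctl-based triggers above can be made persistent in /etc/sysctl.conf (enable only those appropriate for your environment; nmi_watchdog=1 is a boot parameter and belongs on the kernel line in grub.conf instead):

# excerpt from /etc/sysctl.conf
kernel.unknown_nmi_panic = 1
kernel.softlockup_panic = 1
kernel.panic_on_unrecovered_nmi = 1
vm.panic_on_oom = 1

# apply without a reboot:
# sysctl -p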

Reducing the size of the vmcore when uploading to Red Hat Support

In most cases a vmcore analysis requires only the critical kernel pages, so the remaining pages can be filtered out to reduce the size of the file for faster uploading to Red Hat Support. If the vmcore file is very large and has not already had all non-critical pages filtered out (i.e., kdump did not use makedumpfile -d 31), use the following command to filter the pages, and upload the output file for analysis.

# makedumpfile -c -d 31 <vmcore> <output file>

Keep the original vmcore file saved in case the analysis requires some of the filtered pages and in that case the full vmcore may need to be uploaded.


Comments

Console frame-buffers and X are not properly supported. On a system typically run with something like "vga=791" on the kernel line or with X running, console video will be garbled when a kernel is booted via kexec. The kdump kernel should still be able to capture a dump, and when the system reboots, video should be restored to normal.

debug_mem_level is a new parameter introduced in RHEL 6.3. It turns on debug/verbose output of kdump scripts regarding free/used memory at various points of execution; a higher level means more debugging output.

If unable to obtain a kernel dump but the machine can be rebooted, consider checking the system's RAM.

Diagnostic Steps

  • If you are dumping to local storage and use the hpsa storage module, you may run into difficulty capturing a core. In that event, please ensure you are on the latest kexec-tools package.
  • To output a list of configured dump locations, run the following egrep command:

    egrep "path|raw|nfs|ssh|ext4|ext3|ext2|minix|btrfs|xfs|auto" /etc/kdump.conf
    

