Technote (FAQ)
Question
How to examine a minidump in AIX
Answer
Introduction

The purpose of this document is to discuss a feature that is new in AIX starting with AIX 5.3 TL3. This feature is useful if an instance of AIX has crashed and no crash dump has been generated. It is also useful in situations where an instance is unstable and can only be accessed in maintenance mode.

As part of the RAS (Reliability, Availability, and Serviceability) effort in AIX, a new error log entry type, MINIDUMP_LOG, was created and added to the error templates. When errpt -a is run against /var/adm/ras/errlog, you will see this label followed by a series of hex digits. That output is not useful for examining the minidump. To examine a minidump you need access to the errlog on the affected server, or a copy of it, and to the mdmprpt command.

Limitations of minidumps

A minidump is of limited or no use when a server has hung. In that situation mdmprpt can still show what was running on each CPU, but in practice a full dump is usually needed to determine how and why the server hung.

mdmprpt command line usage

The best way to see the command line usage for mdmprpt is to pass it a flag it does not recognize, for example mdmprpt -j:

$ mdmprpt -j
mdmprpt: Not a recognized flag: j
Usage:
mdmprpt [-l seq_no] [-i filename] [-r]

Process error log entries from the supplied file(s).
-l seq_no Format the minidump at the specified sequence number in
the error log.
-i filename Uses the error log file specified by the filename param.
-r print the raw data from the error log, without formatting.

On a customer's system, running mdmprpt with no flags and either redirecting the output to a file or piping it through the more command is usually all that is necessary. When looking at a snap that contains a copy of the errlog, run

mdmprpt -i ./errlog

If there have been repeated crashes and you want to see whether they all have the same crash stack, find the sequence numbers of the MINIDUMP_LOG entries in the output of errpt -a. You can then type

mdmprpt -i ./errlog -l seqno

where seqno is one of the sequence numbers you obtained. Examine the stacks to see if they are the same.
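Collecting the sequence numbers can be scripted. The following is a minimal sketch, not an IBM-supplied tool: the function name minidump_seqnos is made up for illustration, and it assumes that each detailed entry printed by errpt -a has a "LABEL:" line followed later by a "Sequence Number:" line. Check that assumption against the output on your AIX level before relying on it.

```shell
# minidump_seqnos (hypothetical helper): read `errpt -a` output on stdin and
# print the sequence number of every MINIDUMP_LOG entry, one per line.
# Assumption: each entry has a "LABEL:" line before its "Sequence Number:" line.
minidump_seqnos() {
    awk '
        $1 == "LABEL:"                      { label = $2 }
        $1 == "Sequence" && $2 == "Number:" { if (label == "MINIDUMP_LOG") print $3 }
    '
}

# Example use against a copied error log (writes one report per minidump):
#   errpt -a -i ./errlog | minidump_seqnos | while read seqno; do
#       mdmprpt -i ./errlog -l "$seqno" > "minidump.$seqno.txt"
#   done
```

Reading from standard input keeps the helper independent of where the errlog lives, so the same function works on a live system and on a snap.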

Examining mdmprpt output

For practical purposes, only the symptom information and the crash stack for the faulting CPU are relevant. Here is an example:

MINIDUMP VERSION 4D32
***************************************************
64-bit Kernel, 58 Entries

Last Error Log Entry:
Error ID: 9D035E4D Resource Name: SYSVMM
Detail Data: 0000000000000000 4000000000000000
30066400F1000006 60EC216600000000
0000000E

Symptom Information:
Crash Location: [0000000000264E20] dbAdjTree+18
Component: COMP Exception Type: 14

Data From CPU #0 (Faulting CPU)
***************************************************
Stack Trace:
[0000000000264E20] dbAdjTree+18
[0000000000265348] dbSplit+80
[000000000026517C] dbBackSplit+E8
[0000000000264D48] dbAdjCtl+280
[0000000000264664] dbAllocDmap+60
[00000000002686E8] dbAlloc+224
[000000000025F434] pagerAllocateBl+8C
[000000000026076C] pagerAllocateEx+C0
[00000000002ADE18] pageIn+864
[00000000002AEB48] j2PagerService+660
[00000000002AB114] j2PagerThread+120
[00000000001591C4] threadentry+14

Using this crash stack, IBM support personnel can then search through the database to find what the fault may mean. In the above example, the fault is due to JFS2 file system corruption. A full crash dump would be needed to find out which file system was involved.
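When several minidumps need to be compared, it can help to reduce each report to the faulting CPU's stack. The sketch below is illustrative only: faulting_stack is a made-up name, and it assumes the report layout shown in the example above, that is, a "Data From CPU ... (Faulting CPU)" header, a "Stack Trace:" line, and stack lines that end at a blank line, at the next CPU header, or at the end of the report. Addresses are dropped so that only the function+offset entries are compared.

```shell
# faulting_stack (hypothetical helper): read mdmprpt output on stdin and print
# the function+offset entries of the faulting CPU's stack trace, one per line.
# Assumption: the report is laid out like the example in this document.
faulting_stack() {
    awk '
        /Data From CPU/           { faulting = ($0 ~ /Faulting CPU/); in_stack = 0; next }
        faulting && /Stack Trace:/ { in_stack = 1; next }
        in_stack && NF == 0       { exit }
        in_stack                  { print $NF }
    '
}

# Example use (101 and 103 are placeholder sequence numbers):
#   mdmprpt -i ./errlog -l 101 | faulting_stack > stack.101
#   mdmprpt -i ./errlog -l 103 | faulting_stack > stack.103
#   cmp -s stack.101 stack.103 && echo "same crash stack"
```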

If the minidump is corrupt, no conclusions can be drawn from it. Sometimes you may see symptom information and no crash stack. If the Exception Type is 5, the server crashed due to an I/O fault to a paging device. If the error report shows no obvious errors for disks and LVM objects, a dump would need to be examined to find which paging device was affected.
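The Exception Type can be pulled out of a report with a short filter. This is a sketch under stated assumptions: both function names are invented for illustration, the "Exception Type:" field is assumed to appear as in the Symptom Information example above, and only the single value described in this document is interpreted.

```shell
# exception_type (hypothetical helper): read mdmprpt output on stdin and print
# the value following the first "Exception Type:" field that appears.
exception_type() {
    awk '{
        for (i = 1; i <= NF - 2; i++)
            if ($i == "Exception" && $(i + 1) == "Type:") { print $(i + 2); exit }
    }'
}

# explain_exception (hypothetical helper): interpret only the value that this
# document describes; every other value is reported as not covered.
explain_exception() {
    case "$1" in
        5) echo "I/O fault to a paging device" ;;
        *) echo "not covered here" ;;
    esac
}

# Example use (seqno is a placeholder for a MINIDUMP_LOG sequence number):
#   explain_exception "$(mdmprpt -i ./errlog -l seqno | exception_type)"
```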

Conclusion
The RAS effort mentioned in the introduction is part of an ongoing effort by AIX to increase stability and to make more information available for troubleshooting when a problem occurs. The ability to look at minidump data has helped solve many issues that would otherwise go unresolved.

