SCAT 5.0 now available for Solaris 10 and x86, x64
Sun, 08/03/2008 - 21:23 by David Rubio
Great News! One of my top 6 favorite tools on Solaris, SCAT 5.0 (Solaris Crash Analysis Tool) has just been released for Solaris 10 and equally important: it now works on x86, x64 platforms. Unlike DTrace, mdb(1), mpstat(1M), prstat(1M),and the proc(1) commands, my other 5 favorites, SCAT is not bundled with Solaris; you can download it from: https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US... I admit, it does not have the general applicability of DTrace, but if you love learning or working in the Solaris kernel, it's the tool for you. It is used by Sun's kernel support engineers to determine the root cause of customer's system crash dumps. It is more user friendly than mdb(1) (but again not as general use) , and is mainly used for displaying kernel state at the time of the crash or it can be used on your live system. I will demo SCAT by intentionally causing my Solaris 10 kernel (update 5, x86 running native on my 4-year old HP laptop) to crash by clobbering the global variable in the kernel that points to the root directory vnode: rootdir. The first user of this variable which will probably be the next application that references a full pathname will cause the system to crash. This example is taken from the classic PANIC book by Chris Drake and Kimberley Brown which is still worth buying a used copy (I just saw one at Amazon for $3.85) to get the basic concepts. Lets use mdb(1) on the kernel in write mode to write garbage to rootdir:
The following messages show the kernel stack trace at the time of the BAD TRAP panic that crashed the system: ... unix:mutex_enter+d (d4d4ee80, d4d4ee90,) Jul 30 11:22:22 unknown genunix: [ID 353471 kern.notice] d4d4ee54 genunix:lookuppn+19 (d4d4ee80, d4d4ee90,) Jul 30 11:22:22 unknown genunix: [ID 353471 kern.notice] d4d4ef6c genunix:exec_common+18d (81125a8, 81125e8, 8) Jul 30 11:22:22 unknown genunix: [ID 353471 kern.notice] d4d4ef84 genunix:exece+13 (81125a8, 81125e8, 8) Jul 30 11:22:22 unknown unix: [ID 100000 kern.notice] Jul 30 11:22:22 unknown genunix: [ID 672855 kern.notice] syncing file systems... Jul 30 11:22:22 unknown genunix: [ID 904073 kern.notice] done Jul 30 11:22:23 unknown genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0d0s1, offset 189202432, content: kernel Jul 30 11:22:27 unknown genunix: [ID 409368 kern.notice] 100% done: 25174 pages dumped, compression ratio 2.38, Jul 30 11:22:27 unknown genunix: [ID 851671 kern.notice] dump succeeded
Crash image was saved to the dump device and the system is now rebooting:
Jul 30 11:22:54 unknown genunix: [ID 540533 kern.notice] SunOS Release 5.10 Version Generic_127128-11 32-bit Jul 30 11:22:54 unknown genunix: [ID 172908 kern.notice] Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved. Jul 30 11:22:54 unknown Use is subject to license terms. Jul 30 11:22:54 unknown unix: [ID 126719 kern.info] features: 1007ddf Jul 30 11:22:54 unknown unix: [ID 168242 kern.info] mem = 489980K (0x1de7f000) Jul 30 11:22:54 unknown rootnex: [ID 466748 kern.info] root nexus = i86pc Jul 30 11:22:54 unknown rootnex: [ID 349649 kern.info] pseudo0 at root Jul 30 11:22:54 unknown genunix: [ID 936769 kern.info] pseudo0 is /pseudo ...
The savecore utility is saving the dump image to the crash files and shows the panic message (BAD TRAP). You can see that it shows the bad address used which caused a page fault trap resulting in the crash:
Jul 30 11:23:48 unknown savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=d4d4edd0 addr=123 occurred in module "unix" due to a NULL pointer dereference Jul 30 11:23:48 unknown savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/unknown/*.2 ... After installing SCAT and setting my PATH to include /opt/SUNWscat/bin, lets go look at the crash files:
laptop# cd /var/crash/unknown laptop# ls bounds unix.0 unix.1 unix.2 vmcore.0 vmcore.1 vmcore.2 laptop# scat 2
Solaris[TM] CAT 5.0 for Solaris 10 32-bit x86 SV4622M, Jul 3 2008
As you can see SCAT displays a lot of useful information when first invoked. Some background on crash dump analysis: There are 3 classes of PANICs: 1. Unique panic string -> sanity check in kernel code that fails 2. BAD TRAP -> kernel had an exception fault; the most common being a reference to an illegal address 3. sync initiated -> someone entered a Stop A at OPB (Sparc) console probably because the system was hung or unresponsive.
For 1. and 2. you should enter the panic command to get the pertinent details. For 3. enter the "thread summary" command to find threads in unusual states such as too many waiting on the same mutex
SolarisCAT(vmcore.2/10x)> panic panic on cpu 0 panic string: BAD TRAP: type=e (#pf Page fault) rp=d4d4edd0 addr=123 occurred in module "unix" due to a NULL pointer dereference ==== panic user (LWP_SYS) thread: 0xd4d0fa00 PID: 994 on CPU: 0 ==== cmd: bash t_procp: 0xd446a4a8 p_as: 0xd6239718 size: 2658304 RSS: 1003520 hat: 0xd6346c10 cpuset: 0 zone: global t_stk: 0xd4d4efac sp: 0xd4d4ecd4 t_stkbase: 0xd4d4d000 t_pri: 59(IA) pctcpu: 0.043860 t_lwp: 0xd4b72700 lwp_regs: 0xd4d4efac mstate: LMS_SYSTEM ms_prev: LMS_KFAULT ms_state_start: 3.255056836 seconds earlier ms_start: 3.267271238 seconds earlier psrset: 0 last CPU: 0 idle: 127702 ticks (21 minutes 17.02 seconds) start: Wed Jul 30 11:22:22 2008 age: 0 seconds (0 seconds) syscall: #59 execve(, 0x0) (sysent: genunix:exece+0x0) tstate: TS_ONPROC - thread is being run on a processor tflg: T_PANIC - thread initiated a system panic T_DFLTSTK - stack is default size tpflg: TP_TWAIT - wait to be freed by lwp_wait TP_MSACCT - collect micro-state accounting information tsched: TS_LOAD - thread is in memory TS_DONT_SWAP - thread/LWP should not be swapped pflag: SJCTL - SIGCLD sent when children stop/continue SMSACCT - process is keeping micro-state accounting SMSFORK - child inherits micro-state accounting
From the output above, I can tell that my bash shell was issuing an execve system call which caused the system to crash. A full pathname will reference rootdir. A typical cause of BAD TRAP kernel panics is some kernel software module like a 3rd party driver stepping on another module's memory. Realize that the crash will not happen until some thread uses this stepped on memory which could be minutes later. These types of crashes are difficult to determine the root cause because the error occurs so much earlier than the crash. One generally recommended tunable to set to maybe find the stomping module is kmem_flags which are detailed in the Solaris Modular Debugger Guide Chapter 8. In my next blog, I will show how to use DTrace to do this which should be less overhead than setting kmem_flags and easier.
Continuing with the analysis on this crash (assuming I did not know the cause), lets view the legal range of kernel virtual addresses:
SolarisCAT(vmcore.2/10x)> seg S -segaddr-- -segdata-- ----address range---- ----size----- pszc type S 0xfec212d0 0xdb0a0828 0xd2802000-0xd37fffff 16769024 4.00K segmap [segkmap] S 0xfec21208 0xfec1fc94 0xd3800000-0xf9bfffff 641728512 4.00K segkmem [kvseg] S 0xfec2132c 0xfec1fc94 0xfdc00000-0xfe7fffff 12582912 4.00K segkmem [kvalloc] S 0xfec3b090 0xfec1fc94 0xfe800000-0xfecd6fff 5074944 4.00K segkmem [ktextseg] S 0xfec3b678 0xfec1fc94 0xff800000-0xffbfffff 4194304 4.00K segkmem [kdebugseg] found 5 segs, total size: 680349696, swap reserve 0
The address 123 is definitely outside this range. Lets search all of kernel memory for anything storing this address:
I see the cause! Something clobbered rootdir because this vnode pointer is storing 123 which is an illegal address. I know that exec(2) done by a shell would be referencing it. I also know that this address would be passed to mutex_enter to acquire the vnode mutex which is at the beginning of the vnode structure (<sys/vnode.h>). This is why mutex_enter had the BAD TRAP. I was surprised to see 123 stored in so many places in the kernel until I looked further and realized it was an instruction:
More useful information on SCAT; - There is a GUI form called blast - It has a very extensive help facility: just enter help to get all the commands organized into various categories or help command for help on a specific command - Its command line interface uses the Korn shell, so you can pipe SCAT commands to UNIX commands - There are a number of environment settings which can be viewed with the scatenv command - If you download the 4.1 version you get a great tutorial: BOOK.pdf that shows the analysis of about 8 different types of crashes using SCAT. - tlist lets you get extensive details on every thread or specific threads meeting a certain criteria such as waiting for a mutex, or having a certain function on its is stack.
With this tool, the Solaris 10 Internals book, DTrace, Open Solaris source browser, and a lot of time, you are well on your way to becoming a kernel guru.