Sun Solutions by Forsythe
David Rubio
Senior Consultant

Using mdb on the Solaris kernel to dump pages of memory

Wed, 04/09/2008 - 17:44 by David Rubio

I will demonstrate the power of mdb(1) by displaying kernel data structures related to virtual memory (VM), in order to show how to dump the contents of any running application's memory, or any page of memory in general. As an example, I will run a simple shell script and then locate it in the shell's heap. When I first figured out this example, it was an educated guess on my part that a shell script would end up in the heap of the shell interpreter. The file name of an interpreter script (a file starting with #!) gets passed to the invoked interpreter as one of its arguments, so the interpreter must read the file in and start interpreting it. It is logical to assume it uses dynamic memory for this.

Imagine this is a malicious script that first removes itself and then goes on to cause harm, including crashing your system. If you set up dumpadm(1M) to save all of physical memory when your system crashes (not a good idea unless you have a huge amount of swap space configured), you could still find this malicious program in the crash dump's image. I won't go that far; I will just display the script after finding it in memory on the running system. The steps should be identical on a crash dump. Here is the script:

# cat /badscript

#!/bin/sh

# pretend I am here to cause harm to your system, lets remove ourself first

/bin/rm /badscript

pwd

date

sleep 999

# /badscript > /dev/null 2>&1&

[1] 4438

# ls -l /badscript

/badscript: No such file or directory

The directory entry gets removed, but the file itself is not actually removed until the last close(), which will happen when the script exits. Applications and scripts use this technique on temp files so that the files are cleaned up even if the program gets accidentally killed later. OK, let's start looking at process info with mdb (I only have 999 seconds).
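The unlink-while-open behavior described above can be tried without root access. The following is a minimal Python sketch, assuming a POSIX system; the helper name read_after_unlink is made up for this illustration. It writes a temp file, removes its directory entry while the descriptor is still open, and then reads the contents back through that descriptor.

```python
import os
import tempfile


def read_after_unlink(text):
    """Write text to a temp file, unlink it while open, then read it back.

    Returns (path_still_exists, data_read_through_the_open_descriptor).
    Assumes POSIX unlink semantics; the function name is hypothetical.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, text)
        os.unlink(path)                       # the directory entry is gone now
        still_listed = os.path.exists(path)   # same result as the failing ls
        os.lseek(fd, 0, os.SEEK_SET)          # rewind the open descriptor
        data = os.read(fd, len(text))         # the contents are still readable
    finally:
        os.close(fd)                          # last close releases the storage
    return still_listed, data


if __name__ == "__main__":
    print(read_after_unlink(b"#!/bin/sh\nsleep 999\n"))
```

The path no longer exists after the unlink, yet the data comes back intact, which is the same situation the running badscript is in.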

Invoke mdb on the running kernel (this of course requires privilege):

# mdb -k

Loading modules: [ unix krtld genunix dtrace specfs ufs sd pcisch sgsbbc sgenv ip hook neti sctp arp usba fcp fctl emlxs nca lofs zfs random ipc nfs sppp crypto ptm logindmux md cpc fcip ]

Display info for process badscript. Alternatively, the ::ps command would have displayed all processes and I could have found badscript in the list:

> ::pgrep badscript

S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME

R   4438  17914   4438  17914      0 0x4a004000 0000060006fa8cd8 badscript

The model for mdb (which goes back to adb, one of UNIX's earliest debuggers) is to start a command with an address, either symbolic or numeric. The intent is that you want to perform some action, such as displaying memory, at that address. You can see the general format of a command with the ::help dcmd (debugger command). The very readable documentation for mdb is the Solaris Modular Debugger Guide at docs.sun.com.

In this case, the address under the ADDR label above is that of the kernel data structure that implements a process, which is called a proc structure or proc_t type. A data structure is simply a named collection of related data used to implement some object, in this case a process. The proc structure is declared in the /usr/include/sys/proc.h header file. The separate data items in a structure are referred to as fields or members. A very powerful mdb dcmd was introduced in Solaris 9: ::print. Given the address of some object (data structure) and its type, it will display the object's contents with labels matching the structure declaration. Basically: print object at address. If you follow the type with a field name, it displays just that field of the structure.

The p_as field of a proc structure points to its address space data structure (/usr/include/vm/as.h), which contains general information on a process's address space. A process's address space consists of an ordered set of segments. The seg data structure (/usr/include/vm/seg.h) represents one segment of a running process (all of which can be displayed with the pmap(1) command). A seg object records where in a process's virtual address space a contiguous range of bytes of a file is mapped. The seg structures used to be organized in a doubly linked list, but this was changed to an AVL tree (essentially a balanced binary tree) in Solaris 9 to speed up searches.

The mdb debugger has a basic construct called a walker that lets you iterate over all of the objects organized in some manner, such as a linked list. A walker outputs the address of each item in the linked list, AVL tree, or whatever organization the kernel uses for the set of objects. A walker is usually followed by a pipe that sends the addresses to a dcmd that knows how to print each object; for example, ::seg prints seg structure information.

> 0000060006fa8cd8::print proc_t p_as

p_as = 0x60006d82d40

> 0x60006d82d40::walk seg | ::seg

             SEG             BASE             SIZE             DATA OPS

     60006e645a0            10000            16000      600110feaa8 segvn_ops

     600074f3b48            36000             2000      60011103728 segvn_ops

     600111013f8            38000             4000      600110ff058 segvn_ops

     600110ac828         ff26e000             2000      600110a9b38 segvn_ops

     60010bfe288         ff280000            de000      60006d66068 segvn_ops

     60011100ea0         ff36e000             8000      60006d8c480 segvn_ops

     600058e0ca8         ff376000             2000      600110cfae0 segvn_ops

     60006e65d40         ff380000             2000      600111075f8 segvn_ops

     60006f9a240         ff390000             2000      60011106d70 segvn_ops

     600074f3560         ff3a0000             6000      60011102548 segvn_ops

     60006f9aee8         ff3b0000            34000      60007582bd8 segvn_ops

     60011108dc8         ff3f4000             2000      600110a9860 segvn_ops

     60011109488         ff3f6000             2000      60006e67f60 segvn_ops

     600110d26c0         ffbfe000             2000      60011102478 segvn_ops
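The division of labor in the pipeline above (the walker emits only addresses, the dcmd formats whatever addresses it receives) can be modeled in a few lines of Python. This is only a toy model: the Seg class, walk_seg, and seg_dcmd are hypothetical names for this sketch, a sorted list stands in for the kernel's AVL tree, and only the first three segments from the output are included.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Seg:
    addr: int   # address of the seg structure (SEG column)
    base: int   # starting virtual address (BASE column)
    size: int   # segment length in bytes (SIZE column)


def walk_seg(segs: List[Seg]) -> Iterator[int]:
    """Toy '::walk seg': yield only the address of each seg, ordered by base."""
    for seg in sorted(segs, key=lambda s: s.base):
        yield seg.addr


def seg_dcmd(segs: List[Seg], addrs: Iterable[int]) -> List[str]:
    """Toy '::seg': format one line for every address received from the pipe."""
    by_addr = {s.addr: s for s in segs}
    return ["%x %x %x" % (a, by_addr[a].base, by_addr[a].size) for a in addrs]


# First three segments from the output above, deliberately out of order.
SEGS = [
    Seg(0x600111013F8, 0x38000, 0x4000),
    Seg(0x60006E645A0, 0x10000, 0x16000),
    Seg(0x600074F3B48, 0x36000, 0x2000),
]

for line in seg_dcmd(SEGS, walk_seg(SEGS)):
    print(line)
```

Because the walker knows nothing about formatting and the dcmd knows nothing about the container, either half can be swapped out, which is why the same ::seg dcmd kept working when the kernel moved from a linked list to an AVL tree.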

The main fields in a seg structure are base and size, which indicate the starting virtual address of the segment and its size. The addresses under the DATA label point to data that is private to the segment driver named under the OPS label. The term driver generally means a set of interface functions. A segment, like much data in the kernel, is implemented as an object. Even though the kernel is primarily written in the C programming language, it uses object-oriented programming techniques. An object has an interface consisting of a set of functions that define the operations allowed on that object. An object can also have private data accessible only by these interface functions. The main benefit of object-oriented programming is data encapsulation: if all software in the kernel uses an object only through its interface, then the object can be re-implemented later without breaking that code. Otherwise, large programs like the kernel (especially so, since it needs to be efficient) end up breaking when you re-implement something, because other parts of the software were taking advantage of the implementation.

The third segment in this case is the heap. The private data of its seg object is the segvn driver's segvn_data structure (/usr/include/vm/seg_vn.h), which is printed as follows:

> 600110ff058::print "struct segvn_data"

{

    lock = {

        _opaque = [ 0 ]

    }

    segp_slock = {

        _opaque = [ 0 ]

    }

    pageprot = 0

    prot = 0xf

    maxprot = 0xf

    type = 0x2

    offset = 0

    vp = 0

    anon_index = 0

    amp = 0x60006d93788

    vpage = 0

    cred = 0x6000f19ef40

    swresv = 0x4000

    advice = 0

    pageadvice = 0

    flags = 0

    softlockcnt = 0

    policy_info = {

        mem_policy = 0x1

        mem_reserved = 0

    }

}

The segvn_data structure contains information such as how to protect the memory (which is enforced through the MMU) and how much swap space to reserve for this segment. The field we are after is amp, which is the address of yet another data structure called an anon map (/usr/include/vm/anon.h). Through it, the anon walker gives us access to the individual anon structures, one of which gets allocated every time a process first references an anonymous page (a page backed by swap), such as a heap page. The output of the anon walker shows that this process (/bin/sh) has only two pages of heap. This is consistent with the 0x4000 size of the heap segment in the ::seg output above.

A big picture of all of these data structures, showing what points to (contains the address of) what, can be found in Figure 9.11 of Solaris Internals, 2nd Edition (McDougall and Mauro).

> 0x60006d93788::walk anon

60010e4dbc0

60010e6fbd0
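The two addresses printed by the anon walker line up with the size of the heap segment. A quick Python check of that arithmetic, assuming the 8 KB (0x2000 byte) SPARC page size used throughout this walkthrough; pages_in_segment is a hypothetical helper name:

```python
PAGE_SIZE = 0x2000   # 8 KB base page size on SPARC (assumption of this sketch)


def pages_in_segment(size, page_size=PAGE_SIZE):
    """Number of base pages needed to cover a segment of `size` bytes."""
    return (size + page_size - 1) // page_size


heap_size = 0x4000   # SIZE column of the third (heap) segment
swresv = 0x4000      # swresv field of its segvn_data structure

print(pages_in_segment(heap_size))   # 2, one per address from the anon walker
print(swresv == heap_size)           # True: swap is reserved for the whole heap
```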

> 60010e4dbc0::print "struct anon"

{

    an_vp = 0x600110c4fc0

    an_pvp = 0

    an_off = 0x60026de0000

    an_poff = 0

    an_hash = 0

    an_refcnt = 0x1

}

Every page of DRAM, all of which is managed by the kernel, is identified by a unique key: a vnode and an offset. The vnode (/usr/include/sys/vnode.h) is yet another important object in the kernel; it represents a file from any type of file system. The offset indicates which page of the file. The kernel maintains a page_t (/usr/include/vm/page.h) data structure for every page of DRAM (8 KB, or 0x2000 bytes, in size on SPARC V9), so there are many thousands of them. Together they tell us which pages from disk are currently being held in DRAM. The ::page dcmd will display every such record, so you can always tell exactly what is being held in DRAM. Anonymous pages are implemented through a pseudo file system called swapfs. An anon structure gets created to store the vnode/offset pair that uniquely identifies a page of anonymous (heap) memory. If the page had been paged out or swapped out, the an_pvp and an_poff fields would be filled in to record where on a swap device the page was written. We just need to search the output of the ::page dcmd for this anonymous page's unique vnode/offset. The ! acts as a pipe symbol when piping to shell commands.

> ::page ! grep 60026de0000

000007000c71d580      600055e60c0      60026de0000        0   0   0  0  0 10

000007000e651a00      600110c4fc0      60026de0000        0   0   0  0  0 10

> ::page ! line

            PAGE            VNODE           OFFSET   SELOCK LCT COW IO FS ST

Hmmm, this is the first time two records have shown up. Notice the vnode addresses are different. These are swapfs vnodes. Let's print out each page_t structure. We are after the p_pagenum field, which is the physical DRAM page number.
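Since the key for a page is the vnode and offset together, comparing both fields against the anon structure printed earlier (an_vp = 0x600110c4fc0, an_off = 0x60026de0000) should single out one record. Here is a small Python sketch of that comparison over the two lines grep returned; parse_page_line and pages_for_anon are hypothetical helper names, and the column order is taken from the ::page header shown above.

```python
def parse_page_line(line):
    """Split one line of ::page output into its first three columns."""
    fields = line.split()
    return {
        "page": fields[0],              # address of the page_t record
        "vnode": int(fields[1], 16),    # VNODE column
        "offset": int(fields[2], 16),   # OFFSET column
    }


def pages_for_anon(lines, an_vp, an_off):
    """Return the page_t addresses whose vnode AND offset both match."""
    pages = (parse_page_line(line) for line in lines)
    return [p["page"] for p in pages
            if p["vnode"] == an_vp and p["offset"] == an_off]


PAGE_OUTPUT = [
    "000007000c71d580      600055e60c0      60026de0000        0   0   0  0  0 10",
    "000007000e651a00      600110c4fc0      60026de0000        0   0   0  0  0 10",
]

print(pages_for_anon(PAGE_OUTPUT, 0x600110C4FC0, 0x60026DE0000))
# ['000007000e651a00']
```

Matching on the offset alone, as the grep does, is what lets a page belonging to a different swapfs vnode slip into the results.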

> 000007000c71d580::print page_t

{

    p_offset = 0x60026de0000

    p_vnode = 0x600055e60c0

    p_selock = 0

    p_selockpad = 0

    p_hash = 0

    p_vpnext = 0x7000c71cf80

    p_vpprev = 0x7000c71dc80

    p_next = 0x7000c71d580

    p_prev = 0x7000c71d580

    p_lckcnt = 0

    p_cowcnt = 0

    p_cv = {

        _opaque = 0

    }

    p_io_cv = {

        _opaque = 0

    }

    p_iolock_state = 0

    p_szc = 0x3

    p_fsdata = 0

    p_state = 0x10

    p_nrm = 0x3

    p_vcolor = 0x2

    p_index = 0x8

    p_toxic = 0

    p_mapping = 0

    p_pagenum = 0x14016b

    p_share = 0

    p_sharepad = 0

    p_slckcnt = 0

    p_kpmref = 0

    p_kpmelist = 0

    p_msresv_2 = 0

}

The following complicated command is key. It takes the page number and multiplies it by the page size to get the physical DRAM address of the first page of the shell's heap. (mdb reads bare numbers as hexadecimal by default, so 2000 here means 0x2000.) It then dumps 0x400 (1K) 8-byte chunks of memory, which is one full 8 KB page, in hexadecimal. The J is called a format letter, of which there are dozens. They specify how many bytes of memory to display and in what base: octal, decimal, or hexadecimal. Use ::formats to see them all. The \ indicates that the address is to be interpreted as a physical address rather than a virtual address, which is what you are normally working with in mdb. Finally, the a in front of the J says to display the address of each 8-byte chunk in front of the data. I am searching for 2321, which represents #! in hex (see ascii(5)).

> 0x14016b*2000,400\aJ ! grep 2321
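The arithmetic packed into that command can be checked outside of mdb. A short Python sketch, assuming the 0x2000 byte page size stated earlier; phys_addr is a hypothetical helper name:

```python
PAGE_SIZE = 0x2000   # mdb reads the bare "2000" in the command as hexadecimal


def phys_addr(pagenum, page_size=PAGE_SIZE):
    """Physical address of the first byte of a page, given its page number."""
    return pagenum * page_size


print(hex(phys_addr(0x14016B)))   # 0x2802d6000, where the dump starts
print(hex(0x400 * 8))             # 0x2000, so 0x400 8-byte chunks is one page
print(bytes.fromhex("2321"))      # b'#!', the pattern handed to grep
```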

OK, let's try the second page_t record:

> 000007000e651a00::print page_t

{

    p_offset = 0x60026de0000

    p_vnode = 0x600110c4fc0

    p_selock = 0

    p_selockpad = 0

    p_hash = 0

    p_vpnext = 0x70003e7e600

    p_vpprev = 0x7000e19e780

    p_next = 0x7000e651a00

    p_prev = 0x7000e651a00

    p_lckcnt = 0

    p_cowcnt = 0

    p_cv = {

        _opaque = 0

    }

    p_io_cv = {

        _opaque = 0

    }

    p_iolock_state = 0

    p_szc = 0

    p_fsdata = 0

    p_state = 0x10

    p_nrm = 0x3

    p_vcolor = 0x1

    p_index = 0

    p_toxic = 0

    p_mapping = 0x3000372bb88

    p_pagenum = 0x17e7f4

    p_share = 0x1

    p_sharepad = 0

    p_slckcnt = 0

    p_kpmref = 0

    p_kpmelist = 0

    p_msresv_2 = 0

}

> 0x17e7f4*2000,400\aJ ! grep 2321

                0               0x2fcfe9588:    0               0x2fcfe9590:    0               0x2fcfe9598:    39af00003a0c0   0x2fcfe95a0:    ffbffca8        0x2fcfe95a8:    22530ffbffd18   0x2fcfe95b0:    224bc00000000   0x2fcfe95b8:    0               0x2fcfe95c0:    0               0x2fcfe95c8:    0               0x2fcfe95d0:    3a0d000000000   0x2fcfe95d8:    3a17000000000   0x2fcfe95e0:    1300000008      0x2fcfe95e8:    80000003968d    0x2fcfe95f0:    3968d00000000   0x2fcfe95f8:    7d              0x2fcfe9600:    7d              0x2fcfe9608:    0               0x2fcfe9610:    23212f62696e2f73 0x2fcfe9618:                   680a0a2320707265 0x2fcfe9620:                   74656e6420492061 0x2fcfe9628:                   6d20686572652074 0x2fcfe9630:                   6f20636175736520 0x2fcfe9638:                   6861726d20746f20 0x2fcfe9640:                   796f757220737973 0x2fcfe9648:                   74656d2c206c6574 0x2fcfe9650:                   732072656d6f7665 0x2fcfe9658:

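I spot it! In the output, each value follows its address label, and the value after the 0x2fcfe9610: label begins with 2321. Dump it out as a string, since I know a script is ASCII text: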

> 0x2fcfe9610\s

0x2fcfe9610:    #!/bin/sh

# pretend I am here to cause harm to your system, lets remove ourself first

/bin/rm /badscript

pwd

date

sleep 999

> $q

#
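The \s format did the decoding for me, but the 8-byte words in the grep output can also be turned back into text by hand. A minimal Python sketch, using the nine words that follow the 0x2fcfe9610 label; decode_words is a hypothetical helper name. SPARC is big-endian, so the hex digits of each word are already in memory order.

```python
# The nine 8-byte words from the grep output, starting at 0x2fcfe9610.
WORDS = [
    "23212f62696e2f73", "680a0a2320707265", "74656e6420492061",
    "6d20686572652074", "6f20636175736520", "6861726d20746f20",
    "796f757220737973", "74656d2c206c6574", "732072656d6f7665",
]


def decode_words(words):
    """Join big-endian 8-byte hex words and decode the bytes as ASCII."""
    # mdb drops leading zeros, so pad each word back to 16 hex digits.
    raw = b"".join(bytes.fromhex(w.rjust(16, "0")) for w in words)
    return raw.decode("ascii")


print(decode_words(WORDS))
```

This prints the #!/bin/sh line, a blank line, and the start of the comment, matching the beginning of the string dump above.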

I realize this may be a bit complicated. Next time I will try to start with the basics, like "what is software?", in order to start describing DTrace and its capabilities.



Comments

Call me a geek, but this is some cool stuff.

Do you have any other demos of this nature? I'm more interested in the problem than the solution, though, as I'd like to figure them out for myself. Perhaps a running series of mdb brain-teasers, with solutions posted later?

Chad Mynhier