Using mdb on the Solaris kernel to dump pages of memory
Wed, 04/09/2008 - 17:44 by David Rubio
I will demonstrate the power of mdb(1) by displaying the kernel data structures related to virtual memory (VM), showing how to dump the contents of any running application's memory or, more generally, any page of memory. As an example I will run a simple shell script and then locate it in the shell's heap. When I first worked out this example, it was an educated guess on my part that a shell script would end up in the heap of the shell interpreter. The file name of an interpreter script (a file starting with #!) gets passed to the invoked interpreter as one of its arguments, so the interpreter must read the file in and start interpreting it. It is logical to assume it uses dynamic memory for this.
Imagine this is a malicious script that first removes itself and then goes on to cause harm, including crashing your system. If you set up dumpadm(1M) to save all of physical memory when your system crashes (not a good idea unless you have a humongous amount of swap space configured), you could still find this malicious program in the crash dump's image. I won't go that far; I will just display the script after finding it in memory on the running system. The steps should be identical on a crash dump. Here is the script:
# cat /badscript
#!/bin/sh
# pretend I am here to cause harm to your system
# lets remove ourself first
/bin/rm /badscript
pwd
date
sleep 999
# /badscript > /dev/null 2>&1&
[1] 4438
# ls -l /badscript
/badscript: No such file or directory
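The same behaviour (name gone, contents still reachable through an open descriptor) can be reproduced from user space. This is a minimal Python sketch for any POSIX system, not part of the original session; the function name read_after_unlink is made up for illustration.

```python
import os
import tempfile


def read_after_unlink(text):
    """Write text to a temp file, unlink it while open, then read it back.

    Returns (data read through the still-open descriptor,
             whether the path still exists after the unlink).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, text.encode())
        os.unlink(path)                      # the directory entry is gone...
        still_listed = os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)         # ...but the open descriptor still works
        data = os.read(fd, len(text.encode())).decode()
    finally:
        os.close(fd)                         # last close: the storage can be freed
    return data, still_listed


if __name__ == "__main__":
    print(read_after_unlink("sleep 999\n"))
```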
The directory entry gets removed, but the file itself is not actually removed until the last close(), which will happen when the script exits. This is a technique applications and scripts use on temporary files so that the files are cleaned up even if the program gets accidentally killed later. OK, let's start looking at process info with mdb (I only have 999 seconds):
Invoke mdb on the running kernel (this of course requires privilege):
The model for mdb (which goes back to adb, UNIX's first debugger) is to start a command with an address, either symbolic or numeric. The intent is that you want to perform some action, such as displaying memory, at that address. You can see the general format of a command with the ::help dcmd (debugger command). The very readable documentation for mdb is the Solaris Modular Debugger Guide at docs.sun.com.
In this case the address, which is under the ADDR label above, is that of the kernel data structure that implements a process, called a proc structure (type proc_t). A data structure is simply a named collection of related data used to implement some object, in this case a process. The proc structure is declared in the /usr/include/sys/proc.h header file. The separate data items in a structure are referred to as fields or members. A very powerful mdb dcmd was introduced in Solaris 9: ::print. Given the address of some object (data structure) and its type, it displays the object's contents with labels matching the structure declaration. Basically: print the object at this address. If you follow the type with a field name, it displays just that field of the structure.
The p_as field of a proc structure points to its address space data structure (/usr/include/vm/as.h), which contains general information on a process's address space. A process's address space consists of an ordered set of segments. The seg data structure (/usr/include/vm/seg.h) represents one segment of a running process (all of which can be displayed with the pmap(1) command). A seg object records where in a process's virtual memory a contiguous range of bytes of a file is mapped. The seg structures used to be organized in a doubly linked list, but this was changed to an AVL tree (essentially a balanced binary tree) in Solaris 9 to speed up searches.
mdb has a basic construct called a walker that lets you iterate over all of the objects organized in some collection, such as a linked list. A walker outputs the address of each item in the linked list, AVL tree, or whatever organization the kernel uses for that set of objects. A walker is usually followed by a pipe that sends the addresses to a dcmd that knows how to print each object; for example, ::seg prints seg structure information.
> 0000060006fa8cd8::print proc_t p_as
p_as = 0x60006d82d40
> 0x60006d82d40::walk seg | ::seg
        SEG     BASE  SIZE        DATA OPS
60006e645a0    10000 16000 600110feaa8 segvn_ops
600074f3b48    36000  2000 60011103728 segvn_ops
600111013f8    38000  4000 600110ff058 segvn_ops
600110ac828 ff26e000  2000 600110a9b38 segvn_ops
60010bfe288 ff280000 de000 60006d66068 segvn_ops
60011100ea0 ff36e000  8000 60006d8c480 segvn_ops
600058e0ca8 ff376000  2000 600110cfae0 segvn_ops
60006e65d40 ff380000  2000 600111075f8 segvn_ops
60006f9a240 ff390000  2000 60011106d70 segvn_ops
600074f3560 ff3a0000  6000 60011102548 segvn_ops
60006f9aee8 ff3b0000 34000 60007582bd8 segvn_ops
60011108dc8 ff3f4000  2000 600110a9860 segvn_ops
60011109488 ff3f6000  2000 60006e67f60 segvn_ops
600110d26c0 ffbfe000  2000 60011102478 segvn_ops
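The walker-then-dcmd pipeline can be pictured as a generator feeding a formatter. The following is only a loose analogy in Python, not how mdb is implemented; the names walk_seg, seg_dcmd and seg_containing are invented for illustration, and the data is a subset of the rows printed above.

```python
# A few rows copied from the ::seg output: (seg address, base, size, data).
SEGS = [
    (0x60006e645a0, 0x10000, 0x16000, 0x600110feaa8),
    (0x600074f3b48, 0x36000, 0x2000, 0x60011103728),
    (0x600111013f8, 0x38000, 0x4000, 0x600110ff058),
    (0x600110d26c0, 0xFFBFE000, 0x2000, 0x60011102478),
]


def walk_seg(segs):
    """Play the role of a walker: emit one address per object."""
    for addr, _base, _size, _data in segs:
        yield addr


def seg_dcmd(addr, segs):
    """Play the role of a dcmd: format the object found at addr."""
    for seg, base, size, data in segs:
        if seg == addr:
            return f"{seg:x} {base:x} {size:x} {data:x}"
    raise KeyError(hex(addr))


def seg_containing(vaddr, segs):
    """Return the seg address whose [base, base + size) range holds vaddr."""
    for seg, base, size, _data in segs:
        if base <= vaddr < base + size:
            return seg
    return None


if __name__ == "__main__":
    # Equivalent in spirit to "addr::walk seg | ::seg".
    for addr in walk_seg(SEGS):
        print(seg_dcmd(addr, SEGS))
```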
The main fields in a seg structure are base and size, which give the starting virtual address of the segment and its size. The addresses under the DATA label point to the private data of what is referred to as a segment driver. The term driver generally means a set of interface functions. A segment, like much data in the kernel, is implemented as an object. Even though the kernel is primarily written in the C programming language, it uses object-oriented programming techniques. An object has an interface consisting of a set of functions that define the operations allowed on that object. An object can also have private data accessible only by these interface functions. The main benefit of object-oriented programming is data encapsulation: if all software in the kernel uses an object only through its interface, the object can be re-implemented later without breaking the code that uses it. Otherwise, large programs like the kernel (especially so, since it needs to be efficient) end up breaking when you re-implement something, because other parts of the software have taken advantage of the implementation. The third segment in this case is the heap. Its private data is the segment driver's segvn_data structure (/usr/include/vm/segvn.h), which is printed as follows:
> 600110ff058::print "struct segvn_data"
{
    lock = {
        _opaque = [ 0 ]
    }
    segp_slock = {
        _opaque = [ 0 ]
    }
    pageprot = 0
    prot = 0xf
    maxprot = 0xf
    type = 0x2
    offset = 0
    vp = 0
    anon_index = 0
    amp = 0x60006d93788
    vpage = 0
    cred = 0x6000f19ef40
    swresv = 0x4000
    advice = 0
    pageadvice = 0
    flags = 0
    softlockcnt = 0
    policy_info = {
        mem_policy = 0x1
        mem_reserved = 0
    }
}
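The "interface functions plus private data" arrangement described before this dump can be sketched in a few lines. This is an illustration in Python rather than kernel C, and every name in it (Seg, VN_OPS, OTHER_OPS, seg_getprot) is hypothetical; it only shows callers going through an ops table while the private data stays opaque to them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Seg:
    base: int
    size: int
    ops: Dict[str, Callable]   # the interface: the only way callers touch the segment
    data: Any                  # private data, meaningful only to the ops functions


def vn_getprot(seg):
    # This driver keeps its private data in a dict.
    return seg.data["prot"]


def other_getprot(seg):
    # A different driver is free to lay out its private data differently.
    return seg.data[0]


VN_OPS = {"getprot": vn_getprot}
OTHER_OPS = {"getprot": other_getprot}


def seg_getprot(seg):
    """Generic caller: dispatches through the ops table, never reads seg.data."""
    return seg.ops["getprot"](seg)


if __name__ == "__main__":
    heap = Seg(0x38000, 0x4000, VN_OPS, {"prot": 0xF})
    print(hex(seg_getprot(heap)))
```

Because seg_getprot never looks inside data, either driver can change its private layout without touching the caller.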
The segvn_data structure contains information such as how to protect the memory (which is enforced through the MMU) and how much swap space to reserve for this segment. The field we are after is amp, which is the address of yet another data structure called an anon map (/usr/include/vm/anon.h). Through the anon walker, the anon map gives us access to the individual anon structures, one of which gets allocated every time a process first references an anonymous page (a page backed by swap), such as a heap page. The output of the anon walker shows that this process (/bin/sh) only has two pages of heap. This is consistent with the segment print output above.
A big picture of all of these data structures, and of what points to (contains the address of) what, can be found in Figure 9.11 of Solaris Internals, 2nd Edition (McDougall and Mauro).
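To make the numbers in the segvn_data dump concrete, here is a small Python sketch that decodes prot, type and swresv. The bit values are the ones I recall from Solaris's sys/mman.h (PROT_USER being a kernel-only bit); treat them as assumptions and check the header on your system. The helper names are made up.

```python
# Assumed values from <sys/mman.h>; PROT_USER is only defined for the kernel.
PROT_BITS = [
    ("PROT_READ", 0x1),
    ("PROT_WRITE", 0x2),
    ("PROT_EXEC", 0x4),
    ("PROT_USER", 0x8),
]
MAP_TYPES = {0x1: "MAP_SHARED", 0x2: "MAP_PRIVATE"}
PAGESIZE = 0x2000  # 8 KB base page size, as stated for this SPARC system


def decode_prot(prot):
    """Return the names of the protection bits set in prot."""
    return [name for name, bit in PROT_BITS if prot & bit]


def decode_type(seg_type):
    """Return the mapping type name, or the raw hex value if unknown."""
    return MAP_TYPES.get(seg_type, hex(seg_type))


def pages(nbytes, pagesize=PAGESIZE):
    """Number of whole pages covered by nbytes."""
    return nbytes // pagesize


if __name__ == "__main__":
    print(decode_prot(0xF), decode_type(0x2), pages(0x4000))
```

With the values above, swresv = 0x4000 works out to two 8 KB pages, matching both the segment size and the two anon structures.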
> 0x60006d93788::walk anon
60010e4dbc0
60010e6fbd0
> 60010e4dbc0::print "struct anon"
{
    an_vp = 0x600110c4fc0
    an_pvp = 0
    an_off = 0x60026de0000
    an_poff = 0
    an_hash = 0
    an_refcnt = 0x1
}
Every page of DRAM, all of which is managed by the kernel, is identified by a unique key: a vnode and an offset. The vnode (/usr/include/sys/vnode.h) is yet another important object in the kernel, representing a file from any type of file system. The offset indicates which page of the file. The kernel maintains thousands of page_t (/usr/include/vm/page.h) data structures to describe every page of DRAM; together they tell us which pages from disk are currently being held in DRAM. There is basically one page_t record for every page of DRAM (8 KB, or 0x2000 bytes, in size on SPARC V9). The ::page dcmd will display every such record, so you can always tell exactly what is being held in DRAM. Anonymous pages are implemented through a pseudo file system called swapfs. An anon structure gets created to store the vnode/offset pair that uniquely identifies this page of anonymous (heap) memory. If the page had been paged out or swapped out, the an_pvp and an_poff fields would be filled in to record where on a swap device the page was written. We just need to search the output of the ::page dcmd for this anonymous page's unique vnode/offset. The ! acts as a pipe symbol when piping to shell commands.
> ::page ! grep 60026de0000
000007000c71d580 600055e60c0 60026de0000      0   0   0  0  0 10
000007000e651a00 600110c4fc0 60026de0000      0   0   0  0  0 10
> ::page ! line
            PAGE       VNODE      OFFSET SELOCK LCT COW IO FS ST
Hmmm, this is the first time two records have shown up. Notice the vnode addresses are different. These are swapfs vnodes. Let's print out each page_t structure. We are after p_pagenum, which is the physical DRAM page number.
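The grep matched on the offset alone, which is why two records came back. A minimal Python sketch of matching on the full vnode/offset key, using the two rows above and the an_vp and an_off values from the anon structure as data (the function names are made up for illustration):

```python
# From the "struct anon" print: the key that names the heap page.
AN_VP, AN_OFF = 0x600110C4FC0, 0x60026DE0000

# (page_t address, vnode, offset) from the two ::page rows.
PAGE_ROWS = [
    (0x000007000C71D580, 0x600055E60C0, 0x60026DE0000),
    (0x000007000E651A00, 0x600110C4FC0, 0x60026DE0000),
]


def pages_with_offset(rows, offset):
    """What the grep did: match on the offset column only."""
    return [page for page, _vp, off in rows if off == offset]


def pages_matching(rows, vnode, offset):
    """Match on the complete (vnode, offset) key."""
    return [page for page, vp, off in rows if vp == vnode and off == offset]


if __name__ == "__main__":
    print([hex(p) for p in pages_with_offset(PAGE_ROWS, AN_OFF)])
    print([hex(p) for p in pages_matching(PAGE_ROWS, AN_VP, AN_OFF)])
```

Only the second record carries the vnode stored in an_vp.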
> 000007000c71d580::print page_t
{
    p_offset = 0x60026de0000
    p_vnode = 0x600055e60c0
    p_selock = 0
    p_selockpad = 0
    p_hash = 0
    p_vpnext = 0x7000c71cf80
    p_vpprev = 0x7000c71dc80
    p_next = 0x7000c71d580
    p_prev = 0x7000c71d580
    p_lckcnt = 0
    p_cowcnt = 0
    p_cv = {
        _opaque = 0
    }
    p_io_cv = {
        _opaque = 0
    }
    p_iolock_state = 0
    p_szc = 0x3
    p_fsdata = 0
    p_state = 0x10
    p_nrm = 0x3
    p_vcolor = 0x2
    p_index = 0x8
    p_toxic = 0
    p_mapping = 0
    p_pagenum = 0x14016b
    p_share = 0
    p_sharepad = 0
    p_slckcnt = 0
    p_kpmref = 0
    p_kpmelist = 0
    p_msresv_2 = 0
}
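Turning a page number into a physical address is a single multiplication by the page size. A small Python sketch of that arithmetic, assuming the 8 KB (0x2000) page size mentioned earlier; the function names are mine:

```python
PAGESIZE = 0x2000  # 8 KB pages, as stated for this system


def page_to_phys(pagenum, pagesize=PAGESIZE):
    """Physical address of the first byte of a page."""
    return pagenum * pagesize


def phys_to_page(paddr, pagesize=PAGESIZE):
    """Split a physical address into (page number, offset within the page)."""
    return divmod(paddr, pagesize)


if __name__ == "__main__":
    print(hex(page_to_phys(0x14016B)))
    pagenum, offset = phys_to_page(0x2FCFE9610)
    print(hex(pagenum), hex(offset))
```

For the page_t printed above, p_pagenum = 0x14016b gives physical address 0x2802d6000. Going the other way, the address 0x2fcfe9610 that the script is eventually dumped from lies at offset 0x1610 in page number 0x17e7f4, so it presumably belongs to the other page_t record, whose ::print output is not shown here.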
The following complicated command is key. It takes the page number and multiplies it by the page size to get the physical DRAM address of the first page of the shell's heap. It then dumps 0x400 (1K) 8-byte chunks of memory in hexadecimal. The J is called a format letter, of which there are dozens. Format letters specify how many bytes of memory to display and in what base: octal, decimal, or hexadecimal. Use ::formats to see them all. The \ indicates that the address is to be interpreted as a physical address rather than the virtual address you normally work with in mdb. Finally, the a in front of the J says to display the address of each 8-byte chunk in front of the data. I am searching for 2321, which represents #! in hex (see ascii(5)).
I spot it! Dump it out as a string, since I know a script is ASCII text:
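As a rough picture of what that search looks like, here is a Python sketch that formats a buffer as address-prefixed 8-byte hex chunks and locates the #! marker. It imitates the idea only, not mdb's exact output format; the sample buffer, base address and function names are invented.

```python
def hex_chunks(data, base, width=8):
    """Yield (address, hex string) for each width-byte chunk of data."""
    for off in range(0, len(data), width):
        yield base + off, data[off:off + width].hex()


def find_marker(data, base, marker=b"#!"):
    """Return the address of the first occurrence of marker, or None."""
    idx = data.find(marker)
    return None if idx < 0 else base + idx


if __name__ == "__main__":
    sample = b"\0" * 16 + b"#!/bin/sh\n"    # invented stand-in for a heap page
    for addr, chunk in hex_chunks(sample, 0x1000):
        print(f"{addr:#x}: {chunk}")
    print(b"#!".hex(), hex(find_marker(sample, 0x1000)))
```

One caveat with searching a chunked dump for the text 2321: if the # ended one 8-byte chunk and the ! started the next, the two bytes would land in different columns and a plain text match would miss them.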
> 0x2fcfe9610\s
0x2fcfe9610:    #!/bin/sh
# pretend I am here to cause harm to your system, lets remove ourself first
/bin/rm /badscript
pwd
date
sleep 999
> $q
#
I realize this may be a bit complicated. I will try to start with the basics next time like "what is software?" in order to start describing DTrace and its capabilities.
Comments
Call me a geek, but this is some cool stuff.
Do you have any other demos of this nature? I'm more interested in the problem than the solution, though, as I'd like to figure them out for myself. Perhaps a running series of mdb brain-teasers, with solutions posted later?
Chad Mynhier