Sun Solutions by Forsythe
David Rubio
Senior Consultant

Virtual Memory in Solaris 10

Sun, 03/23/2008 - 17:56 by David Rubio

Virtualization is the big buzzword these days. I would like to describe one of the older virtualization techniques around: Virtual Memory (VM) on Solaris 10. Every 32-bit application, command, or utility (e.g. ls, vi, fmd, acroread, an Oracle process, etc.) is given 2 to the 32nd power = 4 GB of virtual memory to potentially use. 64-bit applications are given 16 exabytes (EB) of virtual memory, which is about 4 billion times more than 32-bit programs get. How can my system with only 8 GB of DRAM be running 120 processes that each have 4 GB of memory and maybe 8 more that have 16 EB? The magic is that each process is seeing virtual memory, not real or physical memory.

The kernel's VM subsystem manages DRAM in units called pages, which are currently 4 KB on x86 and x64 systems and 8 KB on SPARC V9 (UltraSPARC) based systems. The kernel is the guts or core of the Solaris 10 operating system. It manages a system's resources such as CPUs, memory and I/O devices.

Each processor (core) holds the contents of memory, or the addresses of (virtual) memory (called pointers in C), in what are called registers (the fastest memory on our systems). These registers doubled in size from 32 bits to 64 bits when SPARC V9 came out in the mid to late 90's. The size of an instruction like ld or add did not change: SPARC instructions remained 32 bits. The x86 and AMD64 processors, by contrast, have variable-length instructions. So the numbers 4 GB and 16 EB come from the fact that a register can hold either a 32-bit virtual address (just half of the register is used) or a 64-bit virtual address.
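The arithmetic behind those numbers is easy to check. Here is a small Python sketch (Python is my choice for illustration, it is not part of Solaris) that works through the address space sizes and the 8 GB, 120-process example, using binary units (1 GB = 2 to the 30th bytes):

```python
# Address-space arithmetic for 32-bit and 64-bit processes (binary units).
GIB = 2**30   # bytes in one gigabyte
EIB = 2**60   # bytes in one exabyte

space_32 = 2**32   # bytes addressable with a 32-bit virtual address
space_64 = 2**64   # bytes addressable with a 64-bit virtual address

print(space_32 // GIB, "GB per 32-bit process")
print(space_64 // EIB, "EB per 64-bit process")
print(space_64 // space_32, "times larger for 64-bit")

# The example system from the text: 8 GB of DRAM, 120 32-bit processes.
dram = 8 * GIB
virtual_total = 120 * space_32
overcommit = virtual_total // dram
print(overcommit, "times more 32-bit virtual memory than DRAM")

# Number of 8 KB pages needed to cover one full 32-bit address space.
pages_per_space = space_32 // 8192
print(pages_per_space, "pages of 8 KB")
```

The point of the last two numbers is that no system could back every virtual page with DRAM at once, which is why only the actively used pages are kept resident.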

This is a common description I have heard over the years of what virtual memory is: "it is the sum of swap space and memory". I hate this definition, which totally misses the point. A simpler, more accurate description is that it is the memory given to a process: a range of addresses from 0 up to 4 GB (for 32-bit programs). These are not real DRAM addresses, hence the term virtual memory. This range of addresses is referred to as the process's address space. Some of the benefits of virtual memory are:

1. We can run programs (and many of them) that are larger than the DRAM we have.
2. Managing DRAM is easier for the OS (it just keeps track of the pages).
3. It's cheaper, because most of the program can be out on disk.

Only the actively referenced pages are kept in DRAM. These pages are referred to as the resident set or working set. RSS in the output of prstat is the size of the process's resident set.

The virtual memory address space is broken into segments such as text (the code segment), data (variables), heap (dynamically allocated memory), shared libraries (shared functions), and stack (frames containing function call details like input arguments). The VM subsystem does not map any segment at address zero. This is done in order to catch a common bug in C/C++ programs called a null pointer dereference: any reference to address zero while a thread runs causes a segmentation violation, the process dumps core, and it dies. On SPARC, text is the segment at the lowest address (but not address zero). For 32-bit x86 programs, the stack segment is at the lowest virtual address.

The segments are further broken into fixed-size units called pages. On a page-by-page basis we map their virtual addresses to actual physical DRAM page addresses. This is done through what are called page tables or mapping tables. The page tables, which are loaded by the kernel, are used by a piece of the processor called the Memory Management Unit (MMU) to do the virtual address to physical address translations. MMUs in turn use a fully associative or set associative cache called a Translation Lookaside Buffer (TLB) to speed up these translations. I plan on describing caches in a future blog. Programs get loaded into virtually contiguous pages of memory, but these pages are not physically contiguous.
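To make the page-by-page mapping concrete, here is a toy model in Python. It is only a sketch of the idea and not how the Solaris kernel or a real MMU stores its tables: the dictionary, the frame numbers 731 and 12, and the PageFault exception are all made up for illustration. A virtual address is split into a virtual page number and an offset, the page number is looked up, and the offset is carried over unchanged. Address zero is left unmapped, so touching it fails just like a null pointer dereference.

```python
PAGE_SIZE = 8192  # 8 KB pages, as on SPARC V9


class PageFault(Exception):
    """Raised when a virtual page has no mapping (hypothetical name)."""


def translate(page_table, vaddr, page_size=PAGE_SIZE):
    """Translate a virtual address to a physical address via a toy page table."""
    vpn, offset = divmod(vaddr, page_size)   # virtual page number, byte offset
    if vpn not in page_table:
        raise PageFault(hex(vaddr))
    return page_table[vpn] * page_size + offset


# Made-up mappings: virtual pages 8 and 9 (addresses 0x10000 and 0x12000)
# live in physical frames 731 and 12. Virtually contiguous, physically not.
page_table = {8: 731, 9: 12}

print(hex(translate(page_table, 0x10004)))   # inside virtual page 8
print(hex(translate(page_table, 0x12000)))   # first byte of virtual page 9

try:
    translate(page_table, 0)                 # address zero is never mapped
except PageFault as fault:
    print("fault at", fault)
```

A real TLB would sit in front of the lookup and cache the most recently used page number to frame pairs, so most translations never touch the page tables at all.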

Wow, I guess there is more to this than I thought! If you want more details, look at page 27 and Chapters 8-13 of Solaris Internals, 2nd Edition, by McDougall and Mauro. Anyway, here are some commands that show some of the things I have been talking about:

See if the binary executable file is 32 or 64-bit:

# file /usr/sbin/cron

/usr/sbin/cron: ELF 32-bit MSB executable SPARC Version 1, dynamically linked, stripped

# file /usr/bin/sparcv9/sort

/usr/bin/sparcv9/sort:  ELF 64-bit MSB executable SPARCV9 Version 1, dynamically linked, stripped

# file /usr/bin/amd64/sort

/usr/bin/amd64/sort: ELF 64-bit LSB executable AMD64 Version 1, dynamically linked, stripped
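For the curious, the 32-bit/64-bit and MSB/LSB parts of that output come from two bytes near the start of the ELF header: byte 4 is the class (1 = 32-bit, 2 = 64-bit) and byte 5 is the data encoding (1 = LSB, 2 = MSB). Below is a minimal Python sketch that reads just those two bytes. The function names are my own, and this covers only a tiny part of what file actually reports.

```python
def elf_class(header: bytes) -> str:
    """Describe the class and byte order from the first bytes of an ELF file."""
    if len(header) < 6 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")
    order = {1: "LSB", 2: "MSB"}.get(header[5], "unknown byte order")
    return f"ELF {bits} {order}"


def describe(path: str) -> str:
    """Read the start of a file on disk and describe it (helper for convenience)."""
    with open(path, "rb") as f:
        return elf_class(f.read(16))


# A hand-built header that looks like a 32-bit big-endian (SPARC style) binary.
sample = b"\x7fELF\x01\x02\x01" + bytes(9)
print(elf_class(sample))
```

On a real system you would call describe("/usr/sbin/cron") or similar, using whatever path exists on your machine.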

To determine your machine's page size:

# pagesize

8192
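A program can ask the same question at run time. As a small sketch in Python, the standard mmap module exposes the page size of the machine the interpreter is running on. The value you see depends on your hardware (8192 on the SPARC box above, typically 4096 on x86).

```python
import mmap

page_size = mmap.PAGESIZE   # page size in bytes for this machine
print(page_size)

# Page sizes are powers of two, so exactly one bit is set.
is_power_of_two = page_size > 0 and (page_size & (page_size - 1)) == 0
print(is_power_of_two)
```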

Find running cron's pid:

# pgrep cron

239

Show arguments and environment variables passed to cron:

# pargs -e 239

239:    /usr/sbin/cron
envp[0]: LOGNAME=root
envp[1]: PATH=/usr/sbin:/usr/bin
envp[2]: SMF_FMRI=svc:/system/cron:default
envp[3]: SMF_METHOD=/lib/svc/method/svc-cron
envp[4]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[5]: TZ=US/Central

To tell if an already running process is using 32-bit (_ILP32) or 64-bit (_LP64) virtual memory:

# pflags 239
239:    /usr/sbin/cron
        data model = _ILP32  flags = ORPHAN|MSACCT|MSFORK
 /1:    flags = ASLEEP  pollsys(0xffbffcb8,0x2,0xffbffc30,0x0)

# mdb
>

[1]+  Stopped                 mdb
# ps
   PID TTY         TIME CMD
 18935 console     0:00 bash
 18911 console     0:00 mdb
 18912 console     0:00 ps
# pflags 18911
18911:  mdb
        data model = _LP64  flags = MSACCT|MSFORK
 /1:    flags = STOPPED  kill(0x0,0x18)
        why = PR_JOBCONTROL  what = SIGTSTP
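The names _ILP32 and _LP64 describe which C types are 32 or 64 bits wide: in ILP32 the int, long and pointer types are all 32 bits, while in LP64 the long and pointer types are 64 bits. As a rough sketch, a Python program can report the data model of its own interpreter process by checking the native sizes of a C long and a C pointer with the struct module. The function name is mine, and the "other" case covers platforms that follow neither model.

```python
import struct


def data_model() -> str:
    """Report the C data model of the running interpreter process."""
    long_bytes = struct.calcsize("l")   # native size of a C long
    ptr_bytes = struct.calcsize("P")    # native size of a C pointer (void *)
    if long_bytes == 4 and ptr_bytes == 4:
        return "_ILP32"
    if long_bytes == 8 and ptr_bytes == 8:
        return "_LP64"
    return "other"


print("data model =", data_model())
```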

Show the virtual memory layout of cron's segments. Notice below that the text starts at address 0x10000 (64 KB). Notice the big gap in addresses after the heap segment, which is currently unmapped but available virtual memory. The files under the Mapped File label are the segments' backing store. The backing store for anon, heap, and stack segments is swap space. Backing store is where pages come from or go to during page-ins and page-outs. Read-only segments like text do not need to go through a page-out: the memory is simply freed if the page daemon decides to steal a text page. Notice that all of the sizes are multiples of 8 KB, because this is a SPARC V9 machine with an 8 KB page size. Notice that not quite all of the C library text is resident in DRAM (816 KB of 888 KB) and that its memory is protected read/execute. The second libc.so.1 segment is the data segment of the C library, and its memory is protected read/write/execute.

# pmap -x 239

239:    /usr/sbin/cron
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
00010000      40      40       -       - r-x--  cron
0002A000      16      16      16       - rwx--  cron
0002E000      56      56      56       - rwx--    [ heap ]
FEF76000       8       8       -       - rwxs-    [ anon ]
...
FF180000     888     816       -       - r-x--  libc.so.1
FF26E000      32      32      32       - rwx--  libc.so.1
...
FFBFE000       8       8       8       - rw---    [ stack ]
-------- ------- ------- ------- -------
total Kb    2848    2616     256       -
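The observations about this output can be checked with a few lines of Python. The sketch below parses only the rows shown above (the rows hidden behind "..." are not available, so the totals cannot be recomputed) and assumes that the anon mapping at FEF76000 really is the next mapping above the heap. The parsing function is my own and is not a general pmap parser.

```python
PAGE_KB = 8  # SPARC V9 page size in KB

ROWS = """\
00010000      40      40       -       - r-x--  cron
0002A000      16      16      16       - rwx--  cron
0002E000      56      56      56       - rwx--    [ heap ]
FEF76000       8       8       -       - rwxs-    [ anon ]
FF180000     888     816       -       - r-x--  libc.so.1
FF26E000      32      32      32       - rwx--  libc.so.1
FFBFE000       8       8       8       - rw---    [ stack ]
"""


def parse(text):
    """Turn pmap -x rows into dictionaries (address, sizes in KB, mode, name)."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 7:
            continue
        rows.append({"addr": int(f[0], 16), "kbytes": int(f[1]),
                     "rss": int(f[2]), "mode": f[5], "name": " ".join(f[6:])})
    return rows


rows = parse(ROWS)

# Every mapping size is a whole number of 8 KB pages.
all_page_multiples = all(r["kbytes"] % PAGE_KB == 0 for r in rows)

# The unmapped gap between the end of the heap and the next shown mapping.
heap = next(r for r in rows if r["name"] == "[ heap ]")
heap_end = heap["addr"] + heap["kbytes"] * 1024
next_addr = min(r["addr"] for r in rows if r["addr"] > heap["addr"])
gap = next_addr - heap_end

# libc text pages that are mapped but not currently resident in DRAM.
libc_text = next(r for r in rows if r["name"] == "libc.so.1" and r["mode"] == "r-x--")
not_resident_pages = (libc_text["kbytes"] - libc_text["rss"]) // PAGE_KB

print(all_page_multiples, hex(heap_end), hex(gap), not_resident_pages)
```

The gap works out to nearly the whole 4 GB address space, which is the room the heap has to grow into.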

Show the sizes of text, data, and bss for a binary executable. Data contains the initialized variables, while bss contains all of the uninitialized variables. Bss and heap are combined into one segment. Text, again, contains the compiled code or instructions of the program:

# /usr/ccs/bin/size /usr/sbin/cron

35523 + 15014 + 5394 = 55931
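These byte counts line up with the pmap output once you round up to whole pages. The Python sketch below does the rounding with the 8 KB page size of this SPARC machine. Treat the match as illustrative: the real mapping sizes come from the ELF program headers and their alignment, which size does not show.

```python
import math

PAGE = 8192                             # bytes per page on this SPARC machine
text, data, bss = 35523, 15014, 5394    # from the size output above


def pages(nbytes, page=PAGE):
    """Number of whole pages needed to hold nbytes."""
    return math.ceil(nbytes / page)


total = text + data + bss
text_kb = pages(text) * PAGE // 1024
data_kb = pages(data) * PAGE // 1024

print(total)      # the sum that size prints on the right-hand side
print(text_kb)    # compare with the Kbytes of cron's r-x-- mapping in pmap
print(data_kb)    # compare with the Kbytes of cron's rwx-- mapping in pmap
```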

The pstack command shows a user stack back trace (more commonly called a stack trace) of each thread of a process (in this case my shell). I will describe all the numbers in a future blog, but for now the output shows that bash started executing in the function named _start, which called main, which called reader_loop, and so on, until it gave up the CPU in waitid to wait for the pstack command to exit.

# pstack $$

28083:  bash
 ff1c6d10 waitid   (7, 0, ffbffa40, f)
 ff1b9a34 waitpid  (ffffffff, ffbffb84, c, 0, 0, d4400) + 60
 00044564 ???????? (0, 1, ff00, c, fc00, 0)
 0004360c wait_for (7112, 4, e7fa8, d448c, 0, 0) + 128
 00033904 execute_command_internal (b, c0800, c1000, ffffffff, c1000, e7da8) + 44c
 0003331c execute_command (e7b08, c6400, e7ea8, 33000, 0, c6400) + 50
 00025c50 reader_loop (c0800, d4800, d3400, e7b08, c0800, 0) + 230
 00023b24 main     (0, c0800, c1000, 1, c0800, d4400) + 994
 00023178 _start   (0, 0, 0, 0, 0, 0) + 108
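Since pstack prints the most recent frame first, reading the call chain in the order the calls happened means reading the output from the bottom up. Here is a small Python sketch that pulls the function names out of the trace above and reverses them. The regular expression is my own guess at the line format based on this one sample, not a documented pstack format.

```python
import re

TRACE = """\
28083:  bash
 ff1c6d10 waitid   (7, 0, ffbffa40, f)
 ff1b9a34 waitpid  (ffffffff, ffbffb84, c, 0, 0, d4400) + 60
 00044564 ???????? (0, 1, ff00, c, fc00, 0)
 0004360c wait_for (7112, 4, e7fa8, d448c, 0, 0) + 128
 00033904 execute_command_internal (b, c0800, c1000, ffffffff, c1000, e7da8) + 44c
 0003331c execute_command (e7b08, c6400, e7ea8, 33000, 0, c6400) + 50
 00025c50 reader_loop (c0800, d4800, d3400, e7b08, c0800, 0) + 230
 00023b24 main     (0, c0800, c1000, 1, c0800, d4400) + 994
 00023178 _start   (0, 0, 0, 0, 0, 0) + 108
"""

# A frame line: hex address, function name (or ????????), then the arguments.
FRAME = re.compile(r"^\s*([0-9a-f]{8,16})\s+(\S+)\s*\(")


def call_chain(trace):
    """Return function names from oldest caller to most recent callee."""
    names = []
    for line in trace.splitlines():
        m = FRAME.match(line)
        if m:
            names.append(m.group(2))
    return list(reversed(names))


chain = call_chain(TRACE)
print(" -> ".join(chain))
```

The ???????? entry is kept as-is: it is a frame whose function name pstack could not resolve.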

What I plan on showing in my next blog is how you can use mdb to dump the physical contents of any page of a running process. I will display the contents of a shell script while it is running. This is an example that used to go over well in my Solaris 10 Internals classes.