Sun Solutions by Forsythe
Jarod Jenson
Chief Technology Architect

Oracle Temp Table Trouble

Wed, 12/19/2007 - 21:07 by Jarod Jenson

So now that the first boring blog entry is out of the way, I think it is time to get to the fun stuff. I will try and highlight as many customer success stories as possible, but NDA's make it non-trivial.

A few months ago, I visited a customer that was having significant performance problems on a very large (think 25K ) DSS style Oracle database. We found a number of issues that, once corrected, greatly improved performance the database, and especially for one key batch job. One issue we found was particularly interesting, and I'll detail it here since it is something that could be affecting many customers. It is a relatively simple example of DTrace (nothing earth shattering), but something that would have been exceedingly difficult prior to DTrace—and this truly speaks to the power of DTrace. I'll keep this as brief as possible, so let's get started.

This particular database made heavy usage of Oracle temporary tables (and each of the files is 100GB in size). The customer had already identified that the performance of the temporary tables was most likely a contributor to their overall problem, but diagnosing it further was proving difficult. Of course, the first thing we look at is I/O performance to the file systems and devices that house the temporary tables. I rarely dive in with DTrace from the get-go, but I tend to lean on conventional tools to help me decide how to best apply DTrace. Here is a snippet from iostat(1M) for the relevant devices.

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
86.8 262.8 29141.2 2305.5 0.0 1.8 0.0 5.2 0 54 d110
91.7 303.1 29597.5 2612.8 0.0 1.9 0.0 4.9 0 56 d120
85.0 225.0 28868.8 2010.3 0.0 1.8 0.0 5.9 0 51 d130
87.0 248.0 29269.7 2192.0 0.0 1.9 0.0 5.6 0 53 d140
89.9 277.9 29588.0 2407.1 0.0 1.9 0.0 5.1 0 53 d150

Looking at that iostat(1M), I suppose the I/O performance could be better; but that isn't too shabby (the reads are really affecting our service times—spinning copper ain't fast). However, if I look at the Oracle AWR Report for temp tables in the instance, we can see that Oracle claims our I/O performance isn't even in the ballpark of what iostat(1M) is reporting (TEMP tables only show read I/O latency in AWR).

Tablespace Filename
------------------------ ----------------------------------------------------
Av Av Av Av Buffer Av Buf
Reads Reads/s Rd(ms) Blks/Rd Writes Writes/s Waits Wt(ms)
-------------- ------- ------ ------- ------------ -------- ---------- ------
TEMP /u01/oradata/temp_01.dbf 46,517 12 66.1 5.9 46,363 12 0 N/A
TEMP /u02/oradata/temp_02.dbf 82,037 21 79.3 6.4 81,895 21 0 N/A
TEMP /u03/oradata/temp_03.dbf 72,631 18 35.5 7.0 35,007 9 0 N/A

Well, iostat(1M) is showing only our time from bdev_strategy() to biodone(). We still have a number of layers of software we have to stroll through before we get there (file system, SVM, multi-pathing, etc.). The simplest way to get some idea of what is happening on our way out the box is to use DTrace of course.

# dtrace -n io:::start'/(args[0]->b_flags & B_READ) == 0 && args[2]->fi_pathname == "/u01/oradata/temp_01.dbf"/{@[stack()] = count()}'
dtrace: description 'io:::start' matched 6 probes
^C


genunix`bdev_strategy+0x44
ufs`lufs_write_strategy+0x11c
ufs`ufs_putapage+0x378
ufs`ufs_putpages+0x2a4
genunix`segmap_release+0x344
ufs`wrip+0x850
ufs`ufs_write+0x580
genunix`pwrite+0x22c
unix`syscall_trap+0xac
5131

genunix`bdev_strategy+0x44
md_stripe`md_stripe_strategy+0x2f4
md`mdstrategy+0xd4
ufs`lufs_write_strategy+0x11c
ufs`ufs_putapage+0x378
ufs`ufs_putpages+0x2a4
genunix`segmap_release+0x344
ufs`wrip+0x850
ufs`ufs_write+0x580
genunix`pwrite+0x22c
unix`syscall_trap+0xac
5131

Whoa, wait just a second. Something here is definitely not right. This isn't what I expected at all. Shouldn't Oracle be using the directio code path (ufs_directio_write)? The file systems are mounted with forcedirectio and we know that Oracle is using O_DSYNC; then directio should be a no-brainer, right?

Well, one thing I do know is that we can't get directio to a file with holes (let alone full concurrent directio). Why don't we take a quick peek to see if any of the files in question have holes?

The following script should help:

<SNIP has_holes.d>
#!/usr/sbin/dtrace -s

syscall::pwrite:entry
{
self->file = fds[arg0].fi_pathname;
}

fbt::bmap_has_holes:return
/ arg1 == 1 /
{
self->holey = 1;
}

syscall::pwrite:return
/ self->holey /
{
@[self->file] = count();
self->holey = 0;
}

syscall::pwrite:return
{
self->file = 0;
}
</SNIP>

And we give this a shot:

# ./has_holes.d
dtrace: script './has_holes.d' matched 4 probes
^C

/u03/oradata/temp_03.dbf 62
/u01/oradata/temp_01.dbf 79
/u02/oradata/temp_02.dbf 85
/u04/oradata/temp_04.dbf 104

Well then, it certainly seems like we may be on to something. Just for fun, let's look at lockstat(1M) (of course implemented as a DTrace consumer) to see if by chance this is causing us any grief in the form of lock contention (something we'll get without full concurrent directio semantics).

# lockstat -D 25 -w sleep 10

[SNIP]

R/W writer blocked by writer: 781 events in 10.066 seconds (78 events/sec)

Count indv cuml rcnt nsec Lock Hottest Caller
-------------------------------------------------------------------------------
333 43% 43% 0.00 1531174376 0x300325bbdc0 ufs_write+0x264
174 22% 65% 0.00 794978361 0x300325bb970 ufs_write+0x264
97 12% 77% 0.00 256393350 0x300325bbae0 ufs_write+0x264
51 7% 84% 0.00 101298840 0x300325bb240 ufs_write+0x264
49 6% 90% 0.00 61664589 0x300325bbc50 ufs_write+0x264
45 6% 96% 0.00 65377586 0x300325bb3b0 ufs_write+0x264
23 3% 99% 0.00 37298990 0x300325bb520 ufs_write+0x264
5 1% 99% 0.00 30051030 0x300325bb800 ufs_write+0x264
3 0% 100% 0.00 47081638 0x300325c20f8 ufs_write+0x264
1 0% 100% 0.00 19835412 0x300325c2268 ufs_write+0x264

[SNIP]

Youch! Definitely no love here. So what we have is an issue with Oracle temporary tables, sparseness, and directio semantics. According to AskTom (which by the way I love), we see that this is by design.

The recommendation I made here was to create the tablespaces using conventional Oracle mechanisms (not temporary tables), drop the tablespace but retain the underlying storage, and to then create the temporary tablespaces using the REUSE option to the CREATE statement - this avoids the sparse file problem.

Once this was implemented, the batch job runtime dropped to just 45% of its original time (actually, the overall improvement with all recommendations (not detailed here) was that the batch job ran in about 1/5th of the time it previously took). Sweet.

Like I said earlier, this is a trivial example of DTrace in action, but something that would have been pure conjecture and tested by trial and error in the past. Getting to this recommendation took just a short time and was virtually guaranteed to be successful.

So if you are running Oracle and have temporary tablespaces, I would highly recommend trying the has_holes.d script from above to see if you could be impacted by this issue as well.



Comments

I wonder if the same happens

I wonder if the same happens (no direct io to files with holes) if we use Oracle's option: filesystemio_options = setall. I guess it's the same but ....

I rarely dive in with

I rarely dive in with DTrace from the get-go, but I tend to lean on conventional tools to help me decide how to best apply DTrace.

That's the one thing that I'm glad to see written down somewhere! That's the trap with a tool as powerful as DTrace - everything is possible with it, but it's less than helpful without having an inkling of where to start looking.

Related Entries