DTrace at its best!

January 23, 2009

Using DTrace's destructive actions, you can do things to your operating system you never thought of before, like backing up files with a ZFS snapshot right before a user deletes them, or halting any process just before it exits.

With the help of the files in the directory /usr/demo/dtrace, the DTrace Toolkit, and my colleagues, who showed me the best probe for intervening before a process really ends, I wrote and tested the following short DTrace script to halt certain processes before they exit. It can be very useful if you encounter a large number of short-lived processes that you could not analyze otherwise. In the following example, we are looking only for processes running with user ID 4 (username adm) and with the executable name date. I saved it as stopper.d.

#!/usr/sbin/dtrace -ws

syscall::rexit:entry
/(execname == "date") && uid == 4/
{
    printf ("%d(%d): %d %d %d %d, %s, >%s<: %Y", pid, ppid, uid,
        curpsinfo->pr_projid, curpsinfo->pr_zoneid,
        curpsinfo->pr_dmodel, cwd, curpsinfo->pr_psargs, walltimestamp);
/*  stack(); */
/*  ustack(); */
/*  system ("pmap -x %d", pid); */
    printf ("\nStopping Process %d ...", pid);
    stop();
    printf (" done.");
    system ("ps -eo user,pid,ppid,s,zone,projid,pri,class,nice,args | nawk '$2==\"%d\"{print}'", pid);
}

Be warned! Adapt the filter rules carefully on a test system before using the script on the system on which you actually want to halt processes! Use the script at your own risk – I cannot guarantee anything!
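
If you need to adapt the predicate to a different user, you can look up the numeric user ID first; a quick way (adm is just the user name from the example above):

$ getent passwd adm | cut -d: -f3
4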

For listing the stopped processes, you can use the following command:

$ ps -eo user,pid,ppid,s,zone,projid,pri,class,nice,args | \
nawk '$4=="T" && /date/{print}'

And for ending these stopped processes, you can use this one:

$ ps -eo user,pid,ppid,s,zone,projid,pri,class,nice,args | \
nawk '$4=="T" && /date/{system ("kill -9 "$2)}'
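
If, instead of killing the stopped processes, you would rather let them continue (and therefore exit normally), the prun proc tool should be able to set them running again – an untested sketch along the same lines:

$ ps -eo user,pid,ppid,s,zone,projid,pri,class,nice,args | \
nawk '$4=="T" && /date/{system ("prun "$2)}'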

For testing, I created a script named start-50-date-processes.ksh that starts 50 date processes at roughly the same time. I then started the DTrace script above as user root:

./stopper.d

and afterwards started the test script as user adm:

./start-50-date-processes.ksh

A sample output looks like:

$ ./stopper.d
dtrace: script './stopper.d' matched 1 probe
dtrace: allowing destructive actions
CPU     ID  FUNCTION:NAME
1   3413  rexit:entry 21922(5058): 4 3 0 1, /var/adm/bin, >date<: 2009 Jan 23 13:47:37
Stopping Process 21922 ... done.
adm 21922  5058 T   global     3  57   IA 24 date
1   3413  rexit:entry 22005(5058): 4 3 0 1, /var/adm/bin, >date<: 2009 Jan 23 13:47:41
Stopping Process 22005 ... done.
adm 22005  5058 T   global     3  47   IA 24 date
0   3413  rexit:entry 22090(22089): 4 3 0 1, /var/adm/bin, >date<: 2009 Jan 23 13:47:56
Stopping Process 22090 ... done.
adm 22090     1 T   global     3  44   IA 20 date
... (some more lines)

For stopping the DTrace script, just press <ctrl>c in the window where you started it. Stopping the script will not let the stopped processes continue – they remain in the "T" (stopped) state until they are killed.

The DTrace script will run faster if you comment out its last line (where it executes the ps command for each stopped process).

And here’s the test script (for starting 50 date processes) which I executed as user adm:

$ cat start-50-date-processes.ksh
#!/bin/ksh
i=50
while [[ i -gt 0 ]]; do
    date &
    (( i = i - 1 ))
    # or (( i-- )) with ksh93 on Solaris 10, or OpenSolaris
done

(n)awk: print matching lines and some more

January 21, 2009

I think some of you will find the following (n)awk one-liner useful. I use it from time to time and wanted to document it so that I do not have to work out where to place the "a++" part each time.

It prints the lines of a file that match a certain pattern, plus a number of lines that follow. In this example, I am searching /var/adm/messages for lines that contain the string "usb" and printing each matching line plus the 4 lines after it (to print more following lines, lower the number in the first pair of curly brackets further, e.g. a=-10, as shown below). A line number is also printed for each line:

$ nawk '/usb/{a=-5}{if (a<0){print NR, $0};a++}' /var/adm/messages
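
For example, to print each matching line plus the 9 following lines instead of 4, the start value simply becomes -10:

$ nawk '/usb/{a=-10}{if (a<0){print NR, $0};a++}' /var/adm/messages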

How to collect relevant data for application core analysis

January 21, 2009

Not sure if you knew already – there is a tool available that collects all the data relevant for analyzing application cores, for example by Sun's service and support engineers. It's called pkgapp (not pkgadd 😉 ) and can be downloaded directly via this link.

When started (after extracting it into a directory such as /opt/pkgapp) using:

$ /opt/pkgapp/pkgapp -c /path_to_core_file -p /path_to_executable

it collects all libraries used by that executable, the output of several proc tools, and more, and packs everything into a tar file in a new directory below the current directory.

If you would like to test it, you can just create a core with the gcore command:

$ gcore PID

where PID is the process ID of a currently running process for which you would like to create the core file. The process will not be killed; it continues running.

A sample session looks like the following (hostname and hostid replaced by dummy entries):

$ id
uid=0(root) gid=0(root)
$ /usr/bin/sleep 30&
[1]     17825
$ gcore 17825
gcore: core.17825 dumped
$ /opt/pkgapp/pkgapp -c core.17825 /usr/bin/sleep
* ----------------------------------------------------------------------------------
* Sun Microsystems RSD pkgapp 3.0 Solaris                               [01/21/2009]
* ----------------------------------------------------------------------------------
* OS release                            [5.11]
* Platform                              [SUNW,Sun-Blade-1000]
* Checking [-c] is a core or pid        [using core /var/tmp/pkgapp/core.17825]
* Checking corefile for a valid pldd    [pldd is good with 4 elements]
* Process root                          [/usr/bin/sleep]
* Databin parameter [-s] checks         [reset to /var/tmp/pkgapp]
* Databin found                         [/var/tmp/pkgapp]
* Databin writable check                [success]
* Databin used/created is               [/var/tmp/pkgapp/pkgapp-012109-02]
* Creating temp area                    [/tmp/pkgapp.18627/]
* Checking if corefile is truncated     [core seems truncated - may not be useful]
* Checking if corefile cksum = filesz   [core cksum matches file size, may be fine]
* Process binary                        [sleep]
* Checking usage history                [not recently run]
* sleep binary bit version              [32]
* Checking path [-p] to binary name     [failed, path includes binary name]
* Resetting path [-p] parameter         [/usr/bin]
* Checking path [-p] is a directory     [success]
* Locating sleep                        [success]
* Checking located sleep is 32 bit      [success]
* Binary located                        [/usr/bin/sleep]
* Adding binary to pkgapp.pldd          [success]
* Grabbing pldd                         [success]
* Grabbing pstack                       [success]
* Grabbing pmap                         [success]
* Grabbing pcred                        [success]
* Grabbing pflags                       [success]
* Grabbing pargs                        [success]
* Not Including the core/gcore
* Javatools [-j] not set                [skipped]
* Grabbing /var/adm/messages            [success]
* Grabbing uname -a                     [success]
* Grabbing date/time                    [success]
* Grabbing showrev -p                   [success]
* Grabbing pkginfo -l                   [success]
* Grabbing /etc/release                 [success]
* Grabbing coreadm                      [success]
* Grabbing ulimit                       [success]
* Grabbing libs                         [success]
* Making lib paths app/                 [success]
* Making lib paths libs/                [success]
* Processing file 1 of 48
* Processing file 2 of 48
* Processing file 3 of 48
...
* Processing file 46 of 48
* Processing file 47 of 48
* Processing file 48 of 48
* Linking libraries                     [success]
* Libraries linked                      [48 ttl]
*
* Using hostid for naming .tar.gz       [12345678]
* Writing file                          [pkgapp-12345678-sol01-090121-094026.tar.gz]
*
* Done gathering files
* Writing dbxrc & opencore.sh files     [success]
* Writing manifest-090121-094026.log    [success]
* Writing pkgapp-args-090121-094026     [success]
* Creating final tarfile                [success]
* Compressing tarfile                   [success]
* End of runtime logging
* Saving history info                   [/var/tmp/pkgapp-history/history-090121-094026.log]
* Saving runtime log                    [/var/tmp/pkgapp-history/runtime-090121-094026.log]
* Removing [-r] temp area/files         [left alone]
*
* Operations Complete
----------------------------------------------------------------------------------
Upload the following file(s) to your supportfiles.sun.com Cores Directory at Sun
1) File(s) located in directory /var/tmp/pkgapp/pkgapp-012109-02
[ pkgapp-12345678-sol01-090121-094026.tar.gz ]
2) File(s) located in directory /var/tmp/pkgapp/
[ core.17825 ]
Note: pkgapp has not included the core.17825 with the above pkgapp tar
Please rename the file appropriately and upload to the same location
Thank you.
Sun Software Technology Service Center (STSC)
...
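
Before uploading, you can peek into the generated archive to see what pkgapp has collected; a quick check, using the file names from the sample session above:

$ cd /var/tmp/pkgapp/pkgapp-012109-02
$ gzcat pkgapp-12345678-sol01-090121-094026.tar.gz | tar tvf - | head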

Mac OS X: Moving the cursor word by word

January 20, 2009

Tired of trying <ctrl>, <alt>, and <cmd> together with the left or right arrow key to find out how to move the cursor word by word, either for moving the insertion point or for extending the selection?

Maybe the small table below is a useful reminder of the word-by-word (w/w) and beginning/end-of-line (b/eol) movement shortcuts when working with Mac OS X:

Application   Version  move cursor w/w  change selection w/w  move cursor b/eol  change selection b/eol
Textedit      1.5      <alt>            <shift><alt>          <cmd>              <shift><cmd>
Firefox       3.0.5    <alt>            <shift><alt>          ?                  ?
Mac Mail      3.5      <alt>            <shift><alt>          <cmd>              <shift><cmd>
StarOffice    9        <alt>            <shift><cmd>          <cmd>              ?
NeoOffice     2.2.5    <cmd>            <shift><cmd>          <fn>               <shift><fn>
Adium         1.3.2    <alt>            <shift><alt>          <ctrl>             <shift><ctrl>
Nedit (X11)   ?        <ctrl>           <shift><ctrl>         ?                  ?

Time for a Change

January 20, 2009

Read this book. Reduce your impact on the environment: try to consume 25% less electrical energy this year, ask for FSC-certified wood products, use your bike (with or without a trailer) for distances of less than 3 km / 2 miles, and when you do drive a car, drive more slowly.

Why? Read this book. Or maybe because you need a new challenge. Or just because it's fun to impress your neighbors (who are used to seeing you in a car) by coming home from shopping or from work on a bike.

Solaris booting from ZFS, explained by Lori Alt

January 15, 2009

Lori Alt, project lead for the ZFS boot project, explains in this video how booting Solaris from ZFS works.

The first 5 minutes cover the main features of ZFS. At about 05:45, Lori starts describing the Solaris boot process (SPARC and x86) and the special considerations when booting from ZFS, including some remarks on GRUB. A great primer if you want to know what happens when Solaris boots. The video is also available in m4v format for viewing on an iPod, for example.

Programmer’s top 25…

January 14, 2009

… programming errors with regard to security were published a few days ago here, grouped into three categories.

I believe it’s a good idea to read through the list from time to time, or to use it when planning QA.

Read a related article on Heise Online here. For the original German version, click here.

Mac OS X: often used UNIX keys on the German keyboard

January 10, 2009

Enough is enough. Enough testing of <ctrl>, <alt>, and <cmd> with every possible key to find out where the [, ], |, \, {, } or even ~ keys are on my German keyboard, or using <alt><cmd>t to display the character map in the Terminal application just to pick those characters from there.

So here it is, my personal keyboard map reminder for Mac OS X:

|  pipe symbol  <alt>7
\  backslash  <alt><shift>7 = <alt>/
[  left (opening) square bracket  <alt>5
]  right (closing) square bracket  <alt>6
{  left (opening) curly bracket  <alt>8
}  right (closing) curly bracket  <alt>9
~  Tilde  <alt>n followed by the space key
@  "At" symbol  <alt>l (owercase L)

My favorite is the @ symbol! If you have just worked on a PC (where the @ symbol is typed by pressing <AltGr> – the key just to the right of the space bar – together with the letter q), and on a Mac you now press the key to the right of the space bar (the cmd key) together with the letter q, you will quit the current program – sometimes without a confirmation. That's the hard way of learning how to enter the @ symbol on a Mac.

The good news is: if you run OpenSolaris inside VirtualBox, you can press the same key combinations as on a PC (use <right alt> on the Mac instead of <AltGr>) to get the UNIX special characters, with the exception of the pipe (|), less than (<), and greater than (>) symbols. Those can be reached by using the ^ key instead of the < key.

ZFS and the uberblock, part 2

January 9, 2009

In my last blog entry, I described how to use DTrace to learn more about how and where the ZFS uberblock is written. I showed that an uberblock update causes 4 writes to 4 different areas, and that only after 128 write cycles is an old uberblock overwritten by a new one. Now, using zdb, I will show how often various zpools update their uberblocks, so we can estimate after what time a USB flash drive without wear leveling would probably be worn out.

Using zdb -uuuv, as in the following script:

#!/bin/ksh
TMPDIR=/var/tmp
LOGFILE=${TMPDIR}/ub-stats.out
echo "-----" >> ${LOGFILE}
date +"date+time: %Y-%m-%d %H:%M:%S" >> ${LOGFILE}
for ZPOOL in $(zpool list -o name -H); do
    echo ${ZPOOL} >> ${LOGFILE}
    zdb -uuuv ${ZPOOL} >> ${LOGFILE}
done

This records important information about the uberblocks of each zpool and appends it to a file (for example once per minute, via crontab). A timestamp is included to help with further analysis.
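
A minimal crontab entry for collecting one sample per minute could look like this (assuming the script above was saved as /var/tmp/ub-stats.ksh and made executable – the path is just an example):

* * * * * /var/tmp/ub-stats.ksh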

After some hours or days, we can extract the data for the zpools of interest, import it into a spreadsheet, and create graphs showing how often new uberblocks are written to disk. Here's how we can print the collected data:

nawk '/date/{date=$2; time=$3}
/pool/{pool=$NF}
/txg/{printf ("%s %s\t%s\t%s\n", date, time, pool, $NF)}' ub-stats.out

A sample output might look like:

2009-01-06 22:14:00     dpool   66273
2009-01-06 22:14:00     dpool-2 58707
2009-01-06 22:14:00     rpool   33165
2009-01-06 22:15:00     dpool   66273
2009-01-06 22:15:00     dpool-2 58707
2009-01-06 22:15:00     rpool   33167
2009-01-06 22:16:00     dpool   66273
2009-01-06 22:16:00     dpool-2 58707
2009-01-06 22:16:00     rpool   33169
2009-01-06 22:17:00     dpool   66273
2009-01-06 22:17:00     dpool-2 58707
2009-01-06 22:17:00     rpool   33171

We can easily extract just one pool, using nawk '$3=="dpool"{print}' for example, read the resulting file into OpenOffice, and create an x/y chart showing the txg numbers over time. For rpool, I got the following picture:

txg over time (rpool)
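
The two nawk steps can also be combined into one pipeline that keeps just a single pool and writes it to a file ready for import; a sketch, assuming the column layout of the extracted output shown above:

$ nawk '/date/{date=$2; time=$3}
/pool/{pool=$NF}
/txg/{printf ("%s %s\t%s\t%s\n", date, time, pool, $NF)}' ub-stats.out | \
nawk '$3=="rpool"{print}' > rpool-txg.txt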

The data (you can even see it in the table above) shows that for rpool, txg is incremented about twice per minute. Each additional manual "sync" command increments txg by one and also writes the uberblock to disk. Assume that a USB flash drive without wear leveling wears out after data has been written to the same location 10,000 times (the minimum guaranteed number of cycles for a standard USB flash drive mentioned in this document). Knowing that the same block is overwritten 2*60*24/128 = less than 23 times per day, we can calculate 10,000/22.5 to get the number of days until such a USB stick would be worn out. The result is about 444 days (= 1 year and 79 days). Assuming the USB stick only wears out after 100,000 writes to the same location, the result would be 4444 days, which is more than 12 years! At the end of this blog entry, I will explain why the USB stick might live even much longer than those 444 or 4444 days.
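
The arithmetic is easy to double-check with bc; the following simply reproduces the numbers used above:

$ echo "scale=1; 2*60*24/128; 10000/22.5; 100000/22.5" | bc
22.5
444.4
4444.4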

Interestingly, for another pool, named dpool, I got a completely different graph:

txg over time (dpool)

What I know is that on this pool, a small number of files (about 10 to 20) is created at about 06:00 am and again at about 07:00 am, and at other times there is little or no write activity. Using another DTrace script, we can watch in real time and see that the uberblock's txg number is indeed incremented every 30 seconds, but is only written to disk when there has been a change to the pool it belongs to or, for example, when a sync command is issued on the command line. Here's the script I used (a colleague recommended also watching spa_sync, as this is the call that syncs a transaction group, so I added two more probes for these events):

io:genunix:bdev_strategy:start
{
    printf ("%Y %s %d %d", walltimestamp, execname, args[0]->b_blkno, args[0]->b_bcount);
}

fbt:zfs:spa_sync:entry
{
    printf ("%Y %s %d", walltimestamp, execname, args[1]);
}

fbt:zfs:spa_sync:return,
fbt:zfs:vdev_uberblock_sync:entry
{
    printf ("%Y %s", walltimestamp, execname);
}

fbt:zfs:uberblock_update:entry
{
    printf ("%Y %d %d %d", walltimestamp,
        args[0]->ub_rootbp.blk_birth, args[0]->ub_rootbp.blk_fill,
        args[2]);
}

fbt:zfs:uberblock_update:return
{
    printf ("%Y %d %d", walltimestamp, args[0], args[1]);
}

And here's a sample output. I removed certain columns, including the hour (19, i.e. 7 p.m.), from the original output; the times shown are minutes:seconds. For better readability, I have marked all entries and comments concerning the write of the rpool uberblock to disk in green, and all actions I performed in another terminal window while the DTrace script was running in red.

           spa_sync:entry 11:09 sched 41533           <- spa_sync() rpool
bdev_strategy:start 11:09 sched 25671451 512
bdev_strategy:start 11:09 sched 4310260 512
bdev_strategy:start 11:09 sched 25571476 3072
bdev_strategy:start 11:09 sched 25671452 14848
spa_sync:entry 11:09 sched 69698           <- spa_sync() dpool-2
uberblock_update:entry 11:09 67082 178 69698       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
spa_sync:return 11:09 sched
spa_sync:entry 11:09 sched 76452           <- spa_sync() dpool
uberblock_update:entry 11:09 75007 122 76452       <- updating dpool. Note that txg is 76452
whereas blk_birth is 75007: means that
this file system has not been changed for
(76452-75007)/2+1 = 723m 30s = 12h 3m 30s
So the last change to dpool was
at about 7:08 am. This is exactly the time
stamp as shown in zdb -uuuv dpool. And the
txg number shown in that output is the one
that is read from disk, so it is 75007:
(zdb output follows; dtrace output will be continued below)
---
$ zdb -uuuv dpool
Uberblock
magic = 0000000000bab10c
version = 13
txg = 75007
guid_sum = 10502025224250020770
timestamp = 1231481321 UTC = Fri Jan  9 07:08:41 2009
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:fd8c54a00:200> DVA[1]=<0:1f8b7f8800:200> DVA[2]=<0:c02e0600:200> fletcher4 lzjb LE contiguous birth=75007 fill=122 cksum=a0e45aac0:42a5af596ce:e0b8e86ece60:201ef17f0b98da
---
(dtrace output continued)
uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
spa_sync:return 11:09 sched
bdev_strategy:start 11:09 sched 4310261 27136
bdev_strategy:start 11:09 sched 25571482 15360
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 11:09 sched 25671511 512
bdev_strategy:start 11:09 sched 25671517 5632
uberblock_update:entry 11:09 41533 159 41533       <- updating rpool. txg == blk_birth!
uberblock_update:return 11:09 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 11:09 sched
      bdev_strategy:start 11:09 sched 378 1024        <- write to block #378!
bdev_strategy:start 11:09 sched 890 1024
bdev_strategy:start 11:09 sched 39005050 1024
bdev_strategy:start 11:09 sched 39005562 1024
spa_sync:return 11:09 sched
spa_sync:entry 11:39 sched 41534           <- spa_sync() rpool
uberblock_update:entry 11:39 41533 159 41534       <- updating rpool. txg != blk_birth!
uberblock_update:return 11:39 85 0                  <- args[1]=0: no sync! skip writing to block #380!
spa_sync:return 11:39 sched
spa_sync:entry 11:39 sched 76453           <- spa_sync() dpool
uberblock_update:entry 11:39 75007 122 76453
uberblock_update:return 11:39 85 0
spa_sync:return 11:39 sched
spa_sync:entry 11:39 sched 69699           <- spa_sync() dpool-2
uberblock_update:entry 11:39 67082 178 69699
uberblock_update:return 11:39 85 0
spa_sync:return 11:39 sched
spa_sync:entry 12:09 sched 69700           <- spa_sync() dpool-2
uberblock_update:entry 12:09 67082 178 69700
uberblock_update:return 12:09 85 0
spa_sync:return 12:09 sched
spa_sync:entry 12:09 sched 76454           <- spa_sync() dpool
uberblock_update:entry 12:09 75007 122 76454
uberblock_update:return 12:09 85 0
spa_sync:return 12:09 sched
spa_sync:entry 12:09 sched 41535           <- spa_sync() rpool
bdev_strategy:start 12:09 sched 25571530 1024
bdev_strategy:start 12:09 sched 4310376 1024
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 12:09 sched 25671620 512
bdev_strategy:start 12:09 sched 25671626 5120
uberblock_update:entry 12:09 41535 159 41535       <- updating rpool. txg == blk_birth!
uberblock_update:return 12:09 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 12:09 sched
bdev_strategy:start 12:09 sched 382 1024        <- write to block #382!
bdev_strategy:start 12:09 sched 894 1024
bdev_strategy:start 12:09 sched 39005054 1024
bdev_strategy:start 12:09 sched 39005566 1024
spa_sync:return 12:09 sched
(sync command issued in another terminal window)

(end sync command)
spa_sync:entry 12:23 sched 76455           <- spa_sync() dpool
uberblock_update:entry 12:23 75007 122 76455
uberblock_update:return 12:23 85 0
spa_sync:return 12:23 sched
spa_sync:entry 12:23 sched 69701           <- spa_sync() dpool-2
uberblock_update:entry 12:23 67082 178 69701
uberblock_update:return 12:23 85 0
spa_sync:return 12:23 sched
spa_sync:entry 12:23 sched 41536           <- spa_sync() rpool
bdev_strategy:start 12:23 sched 25671636 1536
bdev_strategy:start 12:23 sched 25671639 2560
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 12:23 sched 25671718 3584
bdev_strategy:start 12:23 sched 25682884 2048
uberblock_update:entry 12:23 41536 159 41536       <- updating rpool. txg == blk_birth!
uberblock_update:return 12:23 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 12:23 sched
bdev_strategy:start 12:23 sched 384 1024        <- write to next block, #384!
bdev_strategy:start 12:23 sched 896 1024
bdev_strategy:start 12:23 sched 39005056 1024
bdev_strategy:start 12:23 sched 39005568 1024
spa_sync:return 12:23 sched
bdev_strategy:start 12:53 sched 234913576 12288
bdev_strategy:start 12:53 sched 234910656 16384
spa_sync:entry 12:53 sched 41537           <- spa_sync() rpool
bdev_strategy:start 12:53 sched 25671725 14848
bdev_strategy:start 12:53 sched 25571590 1024
bdev_strategy:start 12:53 sched 4310548 1024
bdev_strategy:start 12:53 sched 25571588 1024
spa_sync:entry 12:53 sched 69702           <- spa_sync() dpool-2
bdev_strategy:start 12:53 sched 234915243 512
bdev_strategy:start 12:53 sched 306224021 512
bdev_strategy:start 12:53 sched 234913640 12288
bdev_strategy:start 12:53 sched 234915250 1024
spa_sync:entry 12:53 sched 76456           <- spa_sync() dpool
uberblock_update:entry 12:53 75007 122 76456
uberblock_update:return 12:53 85 0
spa_sync:return 12:53 sched
bdev_strategy:start 12:53 sched 234915252 1536
bdev_strategy:start 12:53 sched 234915256 7168
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 12:53 sched 379625154 1536
bdev_strategy:start 12:53 sched 379625160 4608
uberblock_update:entry 12:53 41537 159 41537       <- updating rpool. txg == blk_birth!
uberblock_update:return 12:53 85 1                  <- args[1]=1: sync!
uberblock_update:entry 12:53 69702 178 69702
uberblock_update:return 12:53 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 12:53 sched
bdev_strategy:start 12:53 sched 396 1024
bdev_strategy:start 12:53 sched 908 1024
bdev_strategy:start 12:53 sched 390704524 1024
bdev_strategy:start 12:53 sched 390705036 1024
spa_sync:return 12:53 sched
vdev_uberblock_sync:entry 12:54 sched
bdev_strategy:start 12:54 sched 386 1024        <- write to next block, #386
bdev_strategy:start 12:54 sched 898 1024
bdev_strategy:start 12:54 sched 39005058 1024
bdev_strategy:start 12:54 sched 39005570 1024
spa_sync:return 12:54 sched
(wrote file in dpool-2, with :w in vi)
0 6668 bdev_strategy:start 2009 Jan 9 19:13:08 sched 234913600 20480
(done writing)
spa_sync:entry 13:23 sched 41538           <- spa_sync() rpool
uberblock_update:entry 13:23 41537 159 41538       <- updating rpool. txg != blk_birth!
uberblock_update:return 13:23 85 0                  <- args[1]=0: no sync! skip writing to block #388!
spa_sync:return 13:23 sched
spa_sync:entry 13:23 sched 69703           <- spa_sync() dpool-2
bdev_strategy:start 13:23 sched 234915299 512
bdev_strategy:start 13:23 sched 306224077 512
bdev_strategy:start 13:23 sched 234915300 512
bdev_strategy:start 13:23 sched 234915302 1024
spa_sync:entry 13:23 sched 76457           <- spa_sync() dpool
uberblock_update:entry 13:23 75007 122 76457
uberblock_update:return 13:23 85 0
spa_sync:return 13:23 sched
bdev_strategy:start 13:23 sched 234915304 1536
bdev_strategy:start 13:23 sched 234915308 7168
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 13:23 sched 379625170 6144
bdev_strategy:start 13:23 sched 379625184 6144
uberblock_update:entry 13:23 69703 178 69703       <- updating dpool-2. txg == blk_birth!
uberblock_update:return 13:23 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 13:23 sched
bdev_strategy:start 13:23 sched 398 1024
bdev_strategy:start 13:23 sched 910 1024
bdev_strategy:start 13:23 sched 390704526 1024
bdev_strategy:start 13:23 sched 390705038 1024
spa_sync:return 13:23 sched
(rm file in dpool-2)

(done rm)
spa_sync:entry 13:53 sched 69704           <- spa_sync() dpool-2
bdev_strategy:start 13:53 sched 234915348 512
bdev_strategy:start 13:53 sched 306224124 512
bdev_strategy:start 13:53 sched 234915364 1024
bdev_strategy:start 13:53 sched 306224140 1024
spa_sync:entry 13:53 sched 76458           <- spa_sync() dpool
uberblock_update:entry 13:53 75007 122 76458
uberblock_update:return 13:53 85 0
spa_sync:return 13:53 sched
spa_sync:entry 13:53 sched 41539           <- spa_sync() rpool
bdev_strategy:start 13:53 sched 25571606 15360
bdev_strategy:start 13:53 sched 25571642 1024
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 13:53 sched 379625196 6144
bdev_strategy:start 13:53 sched 379625210 6144
uberblock_update:entry 13:53 69704 178 69704       <- updating dpool-2. txg == blk_birth!
uberblock_update:return 13:53 85 1                  <- args[1]=1: sync!
bdev_strategy:start 13:53 sched 25671818 2560
bdev_strategy:start 13:53 sched 4310632 2560
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 13:53 sched 25671823 3584
bdev_strategy:start 13:53 sched 25682892 2048
uberblock_update:entry 13:53 41539 159 41539       <- updating rpool. txg == blk_birth!
uberblock_update:return 13:53 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 13:53 sched
bdev_strategy:start 13:53 sched 400 1024
bdev_strategy:start 13:53 sched 912 1024
bdev_strategy:start 13:53 sched 390704528 1024
bdev_strategy:start 13:53 sched 390705040 1024
spa_sync:return 13:53 sched
vdev_uberblock_sync:entry 13:54 sched
bdev_strategy:start 13:54 sched 390 1024        <- write to block #390
bdev_strategy:start 13:54 sched 902 1024
bdev_strategy:start 13:54 sched 39005062 1024
bdev_strategy:start 13:54 sched 39005574 1024
spa_sync:return 13:54 sched
spa_sync:entry 14:23 sched 41540           <- spa_sync() rpool
uberblock_update:entry 14:23 41539 159 41540       <- updating rpool. txg != blk_birth!
uberblock_update:return 14:23 85 0                  <- args[1]=0: no sync! skip writing to block #392!
spa_sync:return 14:23 sched
spa_sync:entry 14:23 sched 69705           <- spa_sync() dpool-2
uberblock_update:entry 14:23 69704 178 69705       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 14:23 85 0                  <- args[1]=0: no sync!
spa_sync:return 14:23 sched
spa_sync:entry 14:23 sched 76459           <- spa_sync() dpool
uberblock_update:entry 14:23 75007 122 76459       <- updating dpool. txg != blk_birth!
uberblock_update:return 14:23 85 0                  <- args[1]=0: no sync!
spa_sync:return 14:23 sched
spa_sync:entry 14:53 sched 76460           <- spa_sync() dpool
uberblock_update:entry 14:53 75007 122 76460       <- updating dpool. txg != blk_birth!
uberblock_update:return 14:53 85 0                  <- args[1]=0: no sync!
spa_sync:return 14:53 sched
spa_sync:entry 14:53 sched 69706           <- spa_sync() dpool-2
uberblock_update:entry 14:53 69704 178 69706       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 14:53 85 0                  <- args[1]=0: no sync!
spa_sync:return 14:53 sched
spa_sync:entry 14:53 sched 41541           <- spa_sync() rpool
uberblock_update:entry 14:53 41539 159 41541       <- updating rpool. txg != blk_birth!
uberblock_update:return 14:53 85 0                  <- args[1]=0: no sync! skip writing to block #394!
spa_sync:return 14:53 sched
spa_sync:entry 15:23 sched 41542           <- spa_sync() rpool
uberblock_update:entry 15:23 41539 159 41542       <- updating rpool. txg != blk_birth!
uberblock_update:return 15:23 85 0                  <- args[1]=0: no sync! skip writing to block #396!
spa_sync:return 15:23 sched
spa_sync:entry 15:23 sched 69707           <- spa_sync() dpool-2
uberblock_update:entry 15:23 69704 178 69707       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 15:23 85 0                  <- args[1]=0: no sync!
spa_sync:return 15:23 sched
spa_sync:entry 15:23 sched 76461           <- spa_sync() dpool
uberblock_update:entry 15:23 75007 122 76461       <- updating dpool. txg != blk_birth!
uberblock_update:return 15:23 85 0                  <- args[1]=0: no sync!
spa_sync:return 15:23 sched
(created new file in rpool, using the touch command)

(done creating file)
spa_sync:entry 15:53 sched 76462           <- spa_sync() dpool
uberblock_update:entry 15:53 75007 122 76462       <- updating dpool. txg != blk_birth!
uberblock_update:return 15:53 85 0                  <- args[1]=0: no sync!
spa_sync:return 15:53 sched
spa_sync:entry 15:53 sched 69708           <- spa_sync() dpool-2
uberblock_update:entry 15:53 69704 178 69708       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 15:53 85 0                  <- args[1]=0: no sync!
spa_sync:return 15:53 sched
spa_sync:entry 15:53 sched 41543           <- spa_sync() rpool
bdev_strategy:start 15:53 sched 25571658 15360
bdev_strategy:start 15:53 sched 25571688 1024
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 15:53 sched 25671855 4096
bdev_strategy:start 15:53 sched 25682908 2048
uberblock_update:entry 15:53 41543 159 41543       <- updating rpool. txg == blk_birth!
uberblock_update:return 15:53 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 15:53 sched
bdev_strategy:start 15:53 sched 398 1024        <- write to block #398!
bdev_strategy:start 15:53 sched 910 1024
bdev_strategy:start 15:53 sched 39005070 1024
bdev_strategy:start 15:53 sched 39005582 1024
spa_sync:return 15:54 sched

I think this is a very nice example of what you can find out with DTrace. You may ask: "Why did the rpool graph at the top of this blog entry show about two rpool updates per minute, whereas the DTrace output shows updates much less often (which, by the way, would further extend the life of the USB flash drive)?" The reason is that the zdb script itself wrote its output to a file under /var/tmp – which lives in rpool – every minute, causing changes to rpool that did not happen while the DTrace script was running.

ZFS and the uberblock

January 6, 2009

Inspired by Constantin's comment about USB sticks wearing out, left on Matthias's blog entry about an eco-friendly home server, I tried to find out more about how, and how often, the ZFS uberblock is written.

Using DTrace, it’s not that difficult:

We start by finding out which DTrace probes exist for the uberblock:

$ dtrace -l | grep -i uberblock
31726        fbt               zfs            vdev_uberblock_compare entry
31727        fbt               zfs            vdev_uberblock_compare return
31728        fbt               zfs          vdev_uberblock_load_done entry
31729        fbt               zfs          vdev_uberblock_load_done return
31730        fbt               zfs          vdev_uberblock_sync_done entry
31731        fbt               zfs          vdev_uberblock_sync_done return
31732        fbt               zfs               vdev_uberblock_sync entry
31733        fbt               zfs               vdev_uberblock_sync return
34304        fbt               zfs          vdev_uberblock_sync_list entry
34305        fbt               zfs          vdev_uberblock_sync_list return
34404        fbt               zfs                  uberblock_update entry
34405        fbt               zfs                  uberblock_update return
34408        fbt               zfs                  uberblock_verify entry
34409        fbt               zfs                  uberblock_verify return
34416        fbt               zfs               vdev_uberblock_load entry
34417        fbt               zfs               vdev_uberblock_load return

So there are two probes on uberblock_update: fbt:zfs:uberblock_update:entry and fbt:zfs:uberblock_update:return!
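
If you just want to see how often this function is called before digging deeper, a quick, non-destructive aggregation one-liner does the job; let it run for a while and press <ctrl>c to print the count:

$ dtrace -n 'fbt:zfs:uberblock_update:entry { @[probefunc] = count(); }'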

Now we can find out more about it by searching the OpenSolaris sources: when searching for the definition of uberblock_update in the onnv project, we find one hit, at line 49 of the file uberblock.c, and when clicking on it, we see:

source extract: line 49 of file uberblock.c

Now, when searching again for the definitions of the first two arguments of uberblock_update (args[0] and args[1], i.e. the uberblock and the vdev), we get:

For uberblock, the following hits are shown:

When clicking on the link on the definition of struct uberblock (around line 53 in file uberblock_impl.h), we get:

For the members of struct vdev, it's not that easy: we get a long hit list when searching for the definition of vdev in the source browser. But if we then search for "struct vdev" within that list, using the browser's search function, we get:

When clicking on the definition of struct vdev (around line 108 in file vdev_impl.h), we can see all the members of this structure.

Here are all the links, plus one more for struct blkptr (a member of struct uberblock), again in one place:

Now we are prepared to access the data via DTrace, by printing the arguments and members as in the following example:

printf ("%d %d %d", args[0]->ub_timestamp, args[1]->vdev_id, args[2]);

So here is a sample DTrace script that prints as much information as possible for each uberblock_update event, and also prints any related I/O (the hope being that, by showing both at the same time, we can see where and how often the uberblocks are written):

io:genunix:default_physio:start,
io:genunix:bdev_strategy:start,
io:genunix:biodone:done
{
    printf ("%d %s %d %d", timestamp, execname,
        args[0]->b_blkno, args[0]->b_bcount);
}

fbt:zfs:uberblock_update:entry
{
    printf ("%d %s, %d, %d, %d, %d", timestamp, execname,
        pid, args[0]->ub_rootbp.blk_prop, args[1]->vdev_asize, args[2]);
}

The lines for showing the I/O are derived from the DTrace scripts for I/O analysis in the DTrace Toolkit.

Although I was unable to print members of struct vdev (the second argument to uberblock_update()) with the fbt:zfs:uberblock_update:entry probe (I also tried fbt:zfs:uberblock_update:return, but had other problems with that one), the results of running this script with

$ dtrace -s zfs-uberblock-report-02.d

are quite interesting. Here's an extract (long lines shortened):

  0  33280  uberblock_update:entry 102523281435514 sched, 0, 922..345, 0, 21005
0   5510     bdev_strategy:start 102523490757174 sched 282 1024
0   5510     bdev_strategy:start 102523490840779 sched 794 1024
0   5510     bdev_strategy:start 102523490873844 sched 18493722 1024
0   5510     bdev_strategy:start 102523490903928 sched 18494234 1024
0   5498            biodone:done 102523491215729 sched 282 1024
0   5498            biodone:done 102523491576878 sched 794 1024
0   5498            biodone:done 102523491873015 sched 18493722 1024
0   5498            biodone:done 102523492232464 sched 18494234 1024
...
0  33280  uberblock_update:entry 102553280316974 sched, 0, 922..345, 0, 21006
0   5510     bdev_strategy:start 102553910907205 sched 284 1024
0   5510     bdev_strategy:start 102553910989248 sched 796 1024
0   5510     bdev_strategy:start 102553911022603 sched 18493724 1024
0   5510     bdev_strategy:start 102553911052733 sched 18494236 1024
0   5498            biodone:done 102553911344640 sched 284 1024
0   5498            biodone:done 102553911623733 sched 796 1024
0   5498            biodone:done 102553911981236 sched 18493724 1024
0   5498            biodone:done 102553912250614 sched 18494236 1024
...
0  33280  uberblock_update:entry 102583279275573 sched, 0, 922..345, 0, 21007
0   5510     bdev_strategy:start 102583540376459 sched 286 1024
0   5510     bdev_strategy:start 102583540459265 sched 798 1024
0   5510     bdev_strategy:start 102583540492968 sched 18493726 1024
0   5510     bdev_strategy:start 102583540522840 sched 18494238 1024
0   5498            biodone:done 102583540814677 sched 286 1024
0   5498            biodone:done 102583541091636 sched 798 1024
0   5498            biodone:done 102583541406962 sched 18493726 1024
0   5498            biodone:done 102583541743494 sched 18494238 1024

Using the following (n)awk one-liners:

$ nawk '/uberblock/{print}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a==2)){print}}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a>=1)&&(a<=5)){print}}' zfs-ub-report-02.d.out

we can print:

  • only the uberblock_update lines, or
  • just the next line after the line that matches the uberblock_update entry, or
  • the matching entry itself plus the 4 lines that follow it.

When running the script for a while and capturing its output, we can later analyze at which block number the first block after uberblock_update() is written. The numbers are always even, the lowest is 256 and the highest is 510, with a block size of 1024. The block numbers always proceed as 256, 258, 260, and so forth, until they reach 510; then they start over at 256. So every 128th iteration ((510-256)/2+1 = 128 – yes, it's one more, as we have to include the first element after subtracting the first from the last), the first block is overwritten again. The same is true for blocks 768…1022, 18493696…18493950 and 18494208…18494462 (the third and fourth block ranges will differ for different zpool sizes).
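
One way to convince yourself of this cycling is to pull out just the block number of the first write after each uberblock_update and look at the sorted, unique values; a sketch, assuming the field positions of the output shown above (the block number is the second-to-last field of the bdev_strategy line):

$ nawk '/uberblock/{a=0}{a++; if (a==2) print $(NF-1)}' zfs-ub-report-02.d.out | sort -n | uniq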

Now that we understand how and in which order the uberblocks are written, we are prepared to examine after how many days the uberblock area of a USB stick without wear leveling would probably be worn out. More on that, and on how we can use zdb for it, in my next blog entry.

Some more links on this topic: