Archive for the ‘Solaris’ Category

ZFS and the uberblock, part 2

January 9, 2009

In my last blog entry, I described how to use DTrace to learn more about how and where the ZFS uberblock is written. I showed that an uberblock update causes 4 writes to 4 different areas, and only after 128 write cycles will an old uberblock be overwritten by a new one. Now, using zdb, I will show how often various zpools update their uberblocks so we can examine after which time a USB flash without wear leveling would probably be worn out.

Using zdb -uuuv, as in the following script:

#!/bin/ksh
TMPDIR=/var/tmp
LOGFILE=${TMPDIR}/ub-stats.out
echo "-----" >> ${LOGFILE}
date +"date+time: %Y-%m-%d %H:%M:%S" >> ${LOGFILE}
for ZPOOL in $(zpool list -o name -H); do
echo ${ZPOOL} >> ${LOGFILE}
zdb -uuuv ${ZPOOL} >> ${LOGFILE}
done

, we can show important information about the uberblocks of each zpool and write it to a file (for example once per minute, per crontab). A timestamp is useful for further analysis.

After some some hours or days, we can extract the data of certain zpools, import it into a spreadsheet and create graphs to show how often new uberblocks are written to disk. Here’s how we can print the data for rpool:

nawk '/date/{date=$2; time=$3}
/pool/{pool=$NF}
/txg/{printf ("%s %s\t%s\t%s\n", date, time, pool, $NF)}' ub-stats.out

A sample output might look like:

2009-01-06 22:14:00     dpool   66273
2009-01-06 22:14:00     dpool-2 58707
2009-01-06 22:14:00     rpool   33165
2009-01-06 22:15:00     dpool   66273
2009-01-06 22:15:00     dpool-2 58707
2009-01-06 22:15:00     rpool   33167
2009-01-06 22:16:00     dpool   66273
2009-01-06 22:16:00     dpool-2 58707
2009-01-06 22:16:00     rpool   33169
2009-01-06 22:17:00     dpool   66273
2009-01-06 22:17:00     dpool-2 58707
2009-01-06 22:17:00     rpool   33171

We can easily extract just one pool, using nawk ‘$3=="dpool"{print}’, for example, read the resulting file into OpenOffice, and create a graph of type x/y to show the txg numbers over time. For rpool, I got the following picture:

txg over time (rpool)

The data (you can even see it in the above table) shows that for rpool, txg is incremented at about two times per minute. Each additional manual "sync" command increments txg by one and also writes the uberblock to disk. Assuming that a USB flash drive without wear leveling wears out after data is written to the same location 10.000 times (which is the minimum guaranteed number of cycles for a standard USB flash drive mentioned in this document), knowing that the same block is overwritten 2*60*24/128 = less than 23 times after one day, we can calculate 10.000/22.5 to get the number of days until such a USB stick would be worn out. The result is about 444 days (=1 year 79 days). Assuming the USB stick is worn out only after 100.000 writes to the same location, the result would be 4444 days which is more than 12 years! At the end of this blog entry, I will explain why the USB stick might even live much longer than those 444 or 4444 days.

Interestingly, for another pool, name dpool, I got a completely different graph:

txg over time (dpool)

What I know is that on this pool, at about 06:00 am, a small number of files (about 10 to 20) is been created, again at 07:00 am, and during other times, there is no or not much write activity. Using another dtrace script, we can watch in real time and see that the uberblock’s txg number is indeed increased every 30 seconds but only written to disk when there is a change to the pool to which it belongs or, for example, when a sync command is issued on the command line. Here’s the script I used (a colleague recommended to also include spa_sync as this is the call for synching a transaction group, so I added two more probes for watching these events):

io:genunix:bdev_strategy:start
{
printf ("%Y %s %d %d", walltimestamp, execname, args[0]->b_blkno, args[0]->b_bcount);
}
fbt:zfs:spa_sync:entry
{
printf ("%Y %s %d", walltimestamp, execname, args[1]);
}
fbt:zfs:spa_sync:return,
fbt:zfs:vdev_uberblock_sync:entry
{
printf ("%Y %s", walltimestamp, execname);
}
fbt:zfs:uberblock_update:entry
{
printf ("%Y %d %d %d", walltimestamp,
args[0]->ub_rootbp.blk_birth, args[0]->ub_rootbp.blk_fill,
args[2]);
}
fbt:zfs:uberblock_update:return
{
printf ("%Y %d %d", walltimestamp, args[0], args[1]);
}

And here’s a sample output. I removed certain columns, including the hour (19 p.m.) from the original output. The times are minutes:seconds. For better readability, I have marked all entries and comments regarding writing the uberblock update for rpool to disk in green color, and all actions I performed in another terminal window while the dtrace script was running in red color.

           spa_sync:entry 11:09 sched 41533           <- spa_sync() rpool
bdev_strategy:start 11:09 sched 25671451 512
bdev_strategy:start 11:09 sched 4310260 512
bdev_strategy:start 11:09 sched 25571476 3072
bdev_strategy:start 11:09 sched 25671452 14848
spa_sync:entry 11:09 sched 69698           <- spa_sync() dpool-2
uberblock_update:entry 11:09 67082 178 69698       <- updating dpool-2. txg != blk_birth!
uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
spa_sync:return 11:09 sched
spa_sync:entry 11:09 sched 76452           <- spa_sync() dpool
uberblock_update:entry 11:09 75007 122 76452       <- updating dpool. Note that txg is 76452
whereas blk_birth is 75007: means that
this file system has not been changed for
(76452-75007)/2+1 = 723m 30s = 12h 3m 30s
So the last change to dpool was
at about 7:08 am. This is exactly the time
stamp as shown in zdb -uuuv dpool. And the
txg number shown in that output is the one
that is read from disk, so it is 75007:
(zdb output follows; dtrace output will be continued below)
---
$ zdb -uuuv dpool
Uberblock
magic = 0000000000bab10c
version = 13
txg = 75007
guid_sum = 10502025224250020770
timestamp = 1231481321 UTC = Fri Jan  9 07:08:41 2009
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:fd8c54a00:200> DVA[1]=<0:1f8b7f8800:200> DVA[2]=<0:c02e0600:200> fletcher4 lzjb LE contiguous birth=75007 fill=122 cksum=a0e45aac0:42a5af596ce:e0b8e86ece60:201ef17f0b98da
---
(dtrace output continued)
uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
spa_sync:return 11:09 sched
bdev_strategy:start 11:09 sched 4310261 27136
bdev_strategy:start 11:09 sched 25571482 15360
.
... some more IO of various blocks and sizes
.
bdev_strategy:start 11:09 sched 25671511 512
bdev_strategy:start 11:09 sched 25671517 5632
uberblock_update:entry 11:09 41533 159 41533       <- updating rpool. txg == blk_birth!
uberblock_update:return 11:09 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 11:09 sched
      bdev_strategy:start 11:09 sched 378 1024        <- write to block #378!
bdev_strategy:start 11:09 sched 890 1024 bdev_strategy:start 11:09 sched 39005050 1024 bdev_strategy:start 11:09 sched 39005562 1024 spa_sync:return 11:09 sched spa_sync:entry 11:39 sched 41534 <- spa_sync() rpool uberblock_update:entry 11:39 41533 159 41534 <- updating rpool. txg != blk_birth! uberblock_update:return 11:39 85 0 <- args[1]=0: no sync! skip writing to block #380!
spa_sync:return 11:39 sched spa_sync:entry 11:39 sched 76453 <- spa_sync() dpool uberblock_update:entry 11:39 75007 122 76453 uberblock_update:return 11:39 85 0 spa_sync:return 11:39 sched spa_sync:entry 11:39 sched 69699 <- spa_sync() dpool-2 uberblock_update:entry 11:39 67082 178 69699 uberblock_update:return 11:39 85 0 spa_sync:return 11:39 sched spa_sync:entry 12:09 sched 69700 <- spa_sync() dpool-2 uberblock_update:entry 12:09 67082 178 69700 uberblock_update:return 12:09 85 0 spa_sync:return 12:09 sched spa_sync:entry 12:09 sched 76454 <- spa_sync() dpool uberblock_update:entry 12:09 75007 122 76454 uberblock_update:return 12:09 85 0 spa_sync:return 12:09 sched spa_sync:entry 12:09 sched 41535 <- spa_sync() rpool bdev_strategy:start 12:09 sched 25571530 1024 bdev_strategy:start 12:09 sched 4310376 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:09 sched 25671620 512 bdev_strategy:start 12:09 sched 25671626 5120 uberblock_update:entry 12:09 41535 159 41535 <- updating rpool. txg == blk_birth! uberblock_update:return 12:09 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:09 sched bdev_strategy:start 12:09 sched 382 1024 <- write to block #382!
bdev_strategy:start 12:09 sched 894 1024 bdev_strategy:start 12:09 sched 39005054 1024 bdev_strategy:start 12:09 sched 39005566 1024 spa_sync:return 12:09 sched
(sync command issued in another terminal window)

(end sync command)
spa_sync:entry 12:23 sched 76455 <- spa_sync() dpool uberblock_update:entry 12:23 75007 122 76455 uberblock_update:return 12:23 85 0 spa_sync:return 12:23 sched spa_sync:entry 12:23 sched 69701 <- spa_sync() dpool-2 uberblock_update:entry 12:23 67082 178 69701 uberblock_update:return 12:23 85 0 spa_sync:return 12:23 sched spa_sync:entry 12:23 sched 41536 <- spa_sync() rpool bdev_strategy:start 12:23 sched 25671636 1536 bdev_strategy:start 12:23 sched 25671639 2560 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:23 sched 25671718 3584 bdev_strategy:start 12:23 sched 25682884 2048 uberblock_update:entry 12:23 41536 159 41536 <- updating rpool. txg == blk_birth! uberblock_update:return 12:23 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:23 sched bdev_strategy:start 12:23 sched 384 1024 <- write to next block, #384!
bdev_strategy:start 12:23 sched 896 1024 bdev_strategy:start 12:23 sched 39005056 1024 bdev_strategy:start 12:23 sched 39005568 1024 spa_sync:return 12:23 sched bdev_strategy:start 12:53 sched 234913576 12288 bdev_strategy:start 12:53 sched 234910656 16384 spa_sync:entry 12:53 sched 41537 <- spa_sync() rpool bdev_strategy:start 12:53 sched 25671725 14848 bdev_strategy:start 12:53 sched 25571590 1024 bdev_strategy:start 12:53 sched 4310548 1024 bdev_strategy:start 12:53 sched 25571588 1024 spa_sync:entry 12:53 sched 69702 <- spa_sync() dpool-2 bdev_strategy:start 12:53 sched 234915243 512 bdev_strategy:start 12:53 sched 306224021 512 bdev_strategy:start 12:53 sched 234913640 12288 bdev_strategy:start 12:53 sched 234915250 1024 spa_sync:entry 12:53 sched 76456 <- spa_sync() dpool uberblock_update:entry 12:53 75007 122 76456 uberblock_update:return 12:53 85 0 spa_sync:return 12:53 sched bdev_strategy:start 12:53 sched 234915252 1536 bdev_strategy:start 12:53 sched 234915256 7168 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:53 sched 379625154 1536 bdev_strategy:start 12:53 sched 379625160 4608 uberblock_update:entry 12:53 41537 159 41537 <- updating rpool. txg == blk_birth! uberblock_update:return 12:53 85 1 <- args[1]=1: sync! uberblock_update:entry 12:53 69702 178 69702 uberblock_update:return 12:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:53 sched bdev_strategy:start 12:53 sched 396 1024 bdev_strategy:start 12:53 sched 908 1024 bdev_strategy:start 12:53 sched 390704524 1024 bdev_strategy:start 12:53 sched 390705036 1024 spa_sync:return 12:53 sched vdev_uberblock_sync:entry 12:54 sched bdev_strategy:start 12:54 sched 386 1024 <- write to next block, #386
bdev_strategy:start 12:54 sched 898 1024 bdev_strategy:start 12:54 sched 39005058 1024 bdev_strategy:start 12:54 sched 39005570 1024 spa_sync:return 12:54 sched (wrote file in dpool-2, with :w in vi) 0 6668 bdev_strategy:start 2009 Jan 9 19:13:08 sched 234913600 20480 (done writing) spa_sync:entry 13:23 sched 41538 <- spa_sync() rpool uberblock_update:entry 13:23 41537 159 41538 <- updating rpool. txg != blk_birth! uberblock_update:return 13:23 85 0 <- args[1]=0: no sync! skip writing to block #388!
spa_sync:return 13:23 sched spa_sync:entry 13:23 sched 69703 <- spa_sync() dpool-2 bdev_strategy:start 13:23 sched 234915299 512 bdev_strategy:start 13:23 sched 306224077 512 bdev_strategy:start 13:23 sched 234915300 512 bdev_strategy:start 13:23 sched 234915302 1024 spa_sync:entry 13:23 sched 76457 <- spa_sync() dpool uberblock_update:entry 13:23 75007 122 76457 uberblock_update:return 13:23 85 0 spa_sync:return 13:23 sched bdev_strategy:start 13:23 sched 234915304 1536 bdev_strategy:start 13:23 sched 234915308 7168 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:23 sched 379625170 6144 bdev_strategy:start 13:23 sched 379625184 6144 uberblock_update:entry 13:23 69703 178 69703 <- updating dpool-2. txg == blk_birth! uberblock_update:return 13:23 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 13:23 sched bdev_strategy:start 13:23 sched 398 1024 bdev_strategy:start 13:23 sched 910 1024 bdev_strategy:start 13:23 sched 390704526 1024 bdev_strategy:start 13:23 sched 390705038 1024 spa_sync:return 13:23 sched (rm file in dpool-2)

(done rm) spa_sync:entry 13:53 sched 69704 <- spa_sync() dpool-2 bdev_strategy:start 13:53 sched 234915348 512 bdev_strategy:start 13:53 sched 306224124 512 bdev_strategy:start 13:53 sched 234915364 1024 bdev_strategy:start 13:53 sched 306224140 1024 spa_sync:entry 13:53 sched 76458 <- spa_sync() dpool uberblock_update:entry 13:53 75007 122 76458 uberblock_update:return 13:53 85 0 spa_sync:return 13:53 sched spa_sync:entry 13:53 sched 41539 <- spa_sync() rpool bdev_strategy:start 13:53 sched 25571606 15360 bdev_strategy:start 13:53 sched 25571642 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:53 sched 379625196 6144 bdev_strategy:start 13:53 sched 379625210 6144 uberblock_update:entry 13:53 69704 178 69704 <- updating dpool-2. txg == blk_birth! uberblock_update:return 13:53 85 1 <- args[1]=1: sync! bdev_strategy:start 13:53 sched 25671818 2560 bdev_strategy:start 13:53 sched 4310632 2560 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:53 sched 25671823 3584 bdev_strategy:start 13:53 sched 25682892 2048 uberblock_update:entry 13:53 41539 159 41539 <- updating rpool. txg == blk_birth! uberblock_update:return 13:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 13:53 sched bdev_strategy:start 13:53 sched 400 1024 bdev_strategy:start 13:53 sched 912 1024 bdev_strategy:start 13:53 sched 390704528 1024 bdev_strategy:start 13:53 sched 390705040 1024 spa_sync:return 13:53 sched vdev_uberblock_sync:entry 13:54 sched bdev_strategy:start 13:54 sched 390 1024 <- write to block #390
bdev_strategy:start 13:54 sched 902 1024 bdev_strategy:start 13:54 sched 39005062 1024 bdev_strategy:start 13:54 sched 39005574 1024 spa_sync:return 13:54 sched spa_sync:entry 14:23 sched 41540 <- spa_sync() rpool uberblock_update:entry 14:23 41539 159 41540 <- updating rpool. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! skip writing to block #392!
spa_sync:return 14:23 sched spa_sync:entry 14:23 sched 69705 <- spa_sync() dpool-2 uberblock_update:entry 14:23 69704 178 69705 <- updating dpool-2. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! spa_sync:return 14:23 sched spa_sync:entry 14:23 sched 76459 <- spa_sync() dpool uberblock_update:entry 14:23 75007 122 76459 <- updating dpool. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! spa_sync:return 14:23 sched spa_sync:entry 14:53 sched 76460 <- spa_sync() dpool uberblock_update:entry 14:53 75007 122 76460 <- updating dpool. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! spa_sync:return 14:53 sched spa_sync:entry 14:53 sched 69706 <- spa_sync() dpool-2 uberblock_update:entry 14:53 69704 178 69706 <- updating dpool-2. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! spa_sync:return 14:53 sched spa_sync:entry 14:53 sched 41541 <- spa_sync() rpool uberblock_update:entry 14:53 41539 159 41541 <- updating rpool. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! skip writing to block #394!
spa_sync:return 14:53 sched spa_sync:entry 15:23 sched 41542 <- spa_sync() rpool uberblock_update:entry 15:23 41539 159 41542 <- updating rpool. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! skip writing to block #396!
spa_sync:return 15:23 sched spa_sync:entry 15:23 sched 69707 <- spa_sync() dpool-2 uberblock_update:entry 15:23 69704 178 69707 <- updating dpool-2. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! spa_sync:return 15:23 sched spa_sync:entry 15:23 sched 76461 <- spa_sync() dpool uberblock_update:entry 15:23 75007 122 76461 <- updating dpool. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! spa_sync:return 15:23 sched (created new file in rpool, using the touch command)

(done creating file)
spa_sync:entry 15:53 sched 76462 <- spa_sync() dpool uberblock_update:entry 15:53 75007 122 76462 <- updating dpool. txg != blk_birth! uberblock_update:return 15:53 85 0 <- args[1]=0: no sync! spa_sync:return 15:53 sched spa_sync:entry 15:53 sched 69708 <- spa_sync() dpool-2 uberblock_update:entry 15:53 69704 178 69708 <- updating dpool-2. txg != blk_birth! uberblock_update:return 15:53 85 0 <- args[1]=0: no sync! spa_sync:return 15:53 sched spa_sync:entry 15:53 sched 41543 <- spa_sync() rpool bdev_strategy:start 15:53 sched 25571658 15360 bdev_strategy:start 15:53 sched 25571688 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 15:53 sched 25671855 4096 bdev_strategy:start 15:53 sched 25682908 2048 uberblock_update:entry 15:53 41543 159 41543 <- updating rpool. txg == blk_birth! uberblock_update:return 15:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 15:53 sched bdev_strategy:start 15:53 sched 398 1024 <- write to block #398! bdev_strategy:start 15:53 sched 910 1024 bdev_strategy:start 15:53 sched 39005070 1024 bdev_strategy:start 15:53 sched 39005582 1024 spa_sync:return 15:54 sched

I think this is a very nice example of what you can find out with DTrace. You may ask, "Why did the rpool graph at the top of this blog entry show an update in rpool 2 times per minute whereas the DTrace output shows an update not that often (which, by the way, would further extend the life of the USB flash drive…)?" This is because the output of the zdb script was written to file in /var/tmp every minute, causing a change in rpool more often than during the run time of the DTrace script!

ZFS and the uberblock

January 6, 2009

Inspired by Constantin‘s comment on USB sticks wearing out Matthias‘s blog entry about an eco-friendly home server, I tried to find out more about how and how often the ZFS uberblock is written.

Using DTrace, it’s not that difficult:

We start by finding out which DTrace probes exist for the uberblock:

$ dtrace -l | grep -i uberblock
31726        fbt               zfs            vdev_uberblock_compare entry
31727        fbt               zfs            vdev_uberblock_compare return
31728        fbt               zfs          vdev_uberblock_load_done entry
31729        fbt               zfs          vdev_uberblock_load_done return
31730        fbt               zfs          vdev_uberblock_sync_done entry
31731        fbt               zfs          vdev_uberblock_sync_done return
31732        fbt               zfs               vdev_uberblock_sync entry
31733        fbt               zfs               vdev_uberblock_sync return
34304        fbt               zfs          vdev_uberblock_sync_list entry
34305        fbt               zfs          vdev_uberblock_sync_list return
34404        fbt               zfs                  uberblock_update entry
34405        fbt               zfs                  uberblock_update return
34408        fbt               zfs                  uberblock_verify entry
34409        fbt               zfs                  uberblock_verify return
34416        fbt               zfs               vdev_uberblock_load entry
34417        fbt               zfs               vdev_uberblock_load return

So there are two probes on uberblock_update: fbt:zfs:uberblock_update:entry and fbt:zfs:uberblock_update:return!

Now we can find out more about it by searching the OpenSolaris sources: When searching for definition of uberblock_update in project onnv, we find one hit for line 49 in file uberblock.c, and when clicking on it, we see:

source extract: line 49 of file uberblock.c

Now, when searching again for the definitions of the first two arguments (args[0| and args[1|) of uberblock_update (which is uberblock and vdev), we get:

For uberblock, the following hits are shown:

When clicking on the link on the definition of struct uberblock (around line 53 in file uberblock_impl.h), we get:

For the members of struct vdev, it’s not that easy. First, we get a long hit list when searching for the definition of vdev in the source browser. But if we search for "struct vdev" in that list, using the browser’s search function, we get:

When clicking on the definition of struct vdev (around line 108 in file vdev_impl.h), we can see all the members of this structure.

Here are all the links, plus one more for struct blkprt (a member of struct uberblock), again in one place:

Now we are prepared to access the data via DTrace, by printing the arguments and members as in the following example:

printf ("%d %d %d", args[0]->ub_timestamp, args[1]->vdev_id, args[2]);

So a sample final DTrace script to print out as much information in the event of an uberblock_update as we can, and also print out any relevant I/O (hoping that from showing both at the same time, we can see where and how often the uberblocks are written):

io:genunix:default_physio:start,
io:genunix:bdev_strategy:start,
io:genunix:biodone:done
{
printf ("%d %s %d %d", timestamp, execname,
args[0]->b_blkno, args[0]->b_bcount);
}
fbt:zfs:uberblock_update:entry
{
printf ("%d %s, %d, %d, %d, %d", timestamp, execname,
pid, args[0]->ub_rootbp.blk_prop, args[1]->vdev_asize, args[2]);
}

The lines for showing the I/O are derived from DTrace scripts for I/O analysis in the DTrace Toolkit

Although I was unable to print out members of struct vdev (the second argument to uberblock_update() ) with the fbt:zfs:uberblock_update:entry probe (I also tried fbt:zfs:uberblock_update:return but had other problems with that one), the results when running this script, using
$ dtrace -s zfs-uberblock-report-02.d
, are quite interesting. Here’s an extract (long lines shortened):

  0  33280  uberblock_update:entry 102523281435514 sched, 0, 922..345, 0, 21005
0   5510     bdev_strategy:start 102523490757174 sched 282 1024
0   5510     bdev_strategy:start 102523490840779 sched 794 1024
0   5510     bdev_strategy:start 102523490873844 sched 18493722 1024
0   5510     bdev_strategy:start 102523490903928 sched 18494234 1024
0   5498            biodone:done 102523491215729 sched 282 1024
0   5498            biodone:done 102523491576878 sched 794 1024
0   5498            biodone:done 102523491873015 sched 18493722 1024
0   5498            biodone:done 102523492232464 sched 18494234 1024
...
0  33280  uberblock_update:entry 102553280316974 sched, 0, 922..345, 0, 21006
0   5510     bdev_strategy:start 102553910907205 sched 284 1024
0   5510     bdev_strategy:start 102553910989248 sched 796 1024
0   5510     bdev_strategy:start 102553911022603 sched 18493724 1024
0   5510     bdev_strategy:start 102553911052733 sched 18494236 1024
0   5498            biodone:done 102553911344640 sched 284 1024
0   5498            biodone:done 102553911623733 sched 796 1024
0   5498            biodone:done 102553911981236 sched 18493724 1024
0   5498            biodone:done 102553912250614 sched 18494236 1024
...
0  33280  uberblock_update:entry 102583279275573 sched, 0, 922..345, 0, 21007
0   5510     bdev_strategy:start 102583540376459 sched 286 1024
0   5510     bdev_strategy:start 102583540459265 sched 798 1024
0   5510     bdev_strategy:start 102583540492968 sched 18493726 1024
0   5510     bdev_strategy:start 102583540522840 sched 18494238 1024
0   5498            biodone:done 102583540814677 sched 286 1024
0   5498            biodone:done 102583541091636 sched 798 1024
0   5498            biodone:done 102583541406962 sched 18493726 1024
0   5498            biodone:done 102583541743494 sched 18494238 1024

Using the following (n)awk one-liners:

$ nawk '/uberblock/{print}}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a==2)){print}}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a>=1)&&(a<=5)){print}}' zfs-ub-report-02.d.out

, we can print:

  • only the uberblock_update lines, or
  • just the next line after the line that matches the uberblock_update entry, or
  • all 4 lines after that entry, including the entry itself.

When running the script for a while and capturing its output, we can later analyze at which block number the first block after uberblock_update() is written, and we can see that the numbers are always even, the lowest number is 256 and the highest number is 510, with a block size of 1024. Those block numbers always go from 256, 258, 260, and so forth, until they reach 510. Then, they start with 256 again. So every (510-256)/2+1 = 128th iteration (yes, it’s one more, as we have to include the first element after subtracting the first from the last element), the first block is overwritten again. The same is true for blocks 768…1022, 18493696…18493950 and 18494208…18494462 (the third and fourth block ranges should be different for different zpool sizes).

Now that we understand how and in which order the uberblocks are written, we are prepared to examine after how many days the uberblock area of a USB stick without wear leveling would probably be worn out. More on that and how we can use zdb for that, in my next blog entry.

Some more links on this topic:

A compact primer on RBAC

January 5, 2009

A nice and compact primer on RBAC (Role-Based Access Control), with some examples for a quick start, is available here.

ZFS on external disks: Why free disk space might not increase when deleting files…

January 2, 2009

Interesting: After copying about 30,000 files from an internal disk (OpenSolaris 2008.11) to a new directory of an external USB disk and later removing a similar number of files, the free disk space as reported by zpool status or in the column “Available” of df -k did not increase! And the df -k output showed that the number of total blocks of the only file system on that disk had decreased! What had happened?

Here’s the explanation:

For backup purposes, I wanted to use an external 2.5″ USB disk with a capacity of about 186GB. On that USB disk, I had created a zpool, using:
$ zpool create -f dpool-2 c5t0d0

For a first backup, I copied the mentioned 30,000 files with a total size of about 75GB from one system to a USB disk connected to another server (running OpenSolaris 2008.11 as well) via the network. I used a simple scp -pr which preserves modes and times but not uids and gids. One possible procedude for preserving file ownership as well, as mentioned here, for example, would be:
cd source_dir; tar -cf - .|ssh user@targethost "cd target_dir; tar -xvf -".

Anyway, I thought it would be a good idea to export that USB disk and import it on another server (also on OpenSolaris 2008.11) to check if there were any problems. I also wanted to run a
$ zpool scrub dpool-2
to verify the data integrity. Everything went smoothly.

OK. So I exported the zpool again, connected the disk to the server on which the original data is located, imported the zpool and started copying the same data which I had copied before (via the network) locally, using
$ tar -cf - | ( cd /dpool-2; tar -xpf - )

As this finished without errors as well, and a comparison using /usr/bin/diff or the default /usr/gnu/bin/diff only showed a problem with one of the files that had an interesting character in its file name (the problem went away after changing LANG from en_US.UTF-8 to C), I decided to remove the old 30,000 files (the ones which I had copied via the network). I had removed files on ZFS before, using rm -rf, and it always was amazingly fast, compared to PCFS or UFS. But this time, removing those files took quite a while, with some disk activity.

After all the “old” files were removed, I discovered the following:

  • Free space as in df -k or zpool status dropped from 112GB to 36GB!
  • Used space as in zpool status increased from 75GB to 150GB!
  • df -k reported a total size of the only file system on that zpool of just 108GB, compared to 183GB as it was reported before!

The solution was to execute the following command:
$ zfs list -t snapshot
The output showed several snapshots, taken during the time the disk was connected to the second server (on which I had set up time slider for creating automated snapshots). When I connected the disk to the final server, the snapshots were still there, and when removing files, the snapshots were still valid so it would have been possible to create the deleted files again. So the easy solution for recreating the free space again was just to remove all those snapshots, using zfs destroy, as in the following example:
$ zfs destroy dpool-2@zfs-auto-snap:frequent-2009-01-02-18:45
After the final snapshot for dpool-2 was removed, things showed up in df -k and zpool status as expected. “Problem” solved!

A great start into the new year 2009!

December 31, 2008

I am sure you agree – at least after watching the video on Brendan’s blog entry about I/O performance analysis on the Sun Storage 7000 series. The direct link is here.

Happy New Year to everyone!

Multibooting OpenSolaris 2008.11, Kubuntu 8.10, and Windows XP (one disk)

December 30, 2008

A while after I published my blog entries on Multibooting Solaris and Windows XP and on Multibooting Solaris, Ubuntu, and Windows XP, I thought it was time for some instructions on multibooting OpenSolaris 2008.11, Kubuntu 8.10, and Windows XP.

On my laptop, a Fujitsu-Siemens S6120D (Pentium M 1.4 GHz, 512 MB RAM, 40 GB IDE disk), I wanted to replace my old Solaris Nevada installation by the new OpenSolaris 2008.11, and then also an old Ubuntu 6.10 by Kubuntu 8.10.

After the OpenSolaris installation, I could boot into OpenSolaris and Windows XP just fine. But when I installed Kubuntu (using ext3 as the / filesystem), with the default install options, the grub that had been installed by OpenSolaris had been overwritten by Kubutu’s grub. Even after adding old OpenSolaris entries to the new grub, booting into OpenSolaris failed. After installing OpenSolaris again, that problem was solved, but I could not boot into Kubuntu! The following message was shown on screen:

Booting 'Ubuntu 8.10, kernel 2.6.27-7-generic'
root            (hd0,6)
Filesystem type is ext2fs, partition type 0x83
kernel          /boot/vmlinuz-2.6.27-7-generic root=UUID=69585a5b-ba20-43da-984c-148c40f82a7e ro quiet splash
Error 2: Bad file or directory type
Press any key to continue...

Apparently, neither of the two grub installations could read the root filesystem of its “opposite” operating system. The solution for this problem in my case was to chainload the Linux grub from the Solaris grub. I used the following procedure after installing OpenSolaris:

  1. Live boot from the Kubuntu CD and install, or use the “install to disk” menu item (2nd menu item in Kubuntu’s entry screen).
  2. Configure all necessary settings, as in the following screen shots:


    Start by choosing the installer’s language:


    Select a city close to your desired time zone:


    Select a keyboard layout:


    In the Prepare disk space screen, a proposal like the following might be shown:


    Select the manual partitioning feature to overwrite the old Linux partition:


    The Prepare partitions screen is shown, for example:


    Customize that screen according to your needs, like the following:


    Enter data into all required fields of the Who are you? screen:


    The Ready to install screen is shown. Please DO NOT CLICK on the Install button at this point in time. Instead, click on Advanced:


    This will show up the following screen:


    Change the default Device for boot loader installation to the device name of the Linux partition as shown in the partitioning screen (in my case /dev/sda7). This is to install the boot manager not in the master boot record but in the partition in which Kubuntu is installed.
    Press OK to continue, and then Install in the screen that was shown before (Step 6 of 6). This will install Kubuntu.

  3. After the installation has finished, reboot into OpenSolaris and change file
    /rpool/boot/grub/menu.lst (yes, that’s the location of menu.lst in OpenSolaris 2008.11) as follows:
    Add the following lines to boot from the OpenSolaris grub into the Ubuntu grub:
    menu Kubuntu 8.10
    root (hd0,6)
    chainloader +1

    Note that (hd0,6) corresponds to partition number 7 of the first disk (=device /dev/sda7 in my case).

  4. Reboot. You should now be able to boot into Kubuntu’s grub and from there into the various menu items configured during the Kubuntu installation (Kubuntu, Kubuntu failsafe/single user, memtest, and even Windows XP).

If it doesn’t work (e.g. if you installed the Kubuntu grub boot manager to the master boot record), you do not have to install Solaris again. Just do the following to overwrite the Kubuntu grub by the Solaris grub:

  1. Boot from the OpenSolaris 2008.11 CD.
  2. After booting has finished, open a terminal window and log in as root (password is “opensolaris”).
  3. Run the following command:
    $ format
    Notice the device name (e.g. c3t0d0). Install the Solaris grub into the master boot record, using the “-m” option of the installgrub command. For the last argument, add s0 to the device name: /usr/sbin/installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c3t0d0s0
    This will install the boot manager into the master boot record.
  4. Reboot. You should now be able to boot into Solaris or Windows.
  5. Boot into Solaris and add the Kubuntu lines as mentioned above.
  6. If the Kubuntu grub had not been installed into the Kubuntu partition before, you might be able to do this from the Kubuntu live CD or a Linux rescue CD (I haven’t tried that yet). If not, install Kubuntu again using the steps described in the first part of this blog entry.

Some remarks:

This is how the two FAT partitions look like after mounting them in OpenSolaris:

/dev/dsk/c3d0p0:1        55958     34547     21411  62% /p/sda1
/dev/dsk/c3d0p0:2      8481752   7093052   1388700  84% /p/sda8

These are the /etc/vfstab entries for mounting the two FAT partitions from OpenSolaris:

/dev/dsk/c3d0p0:1  /dev/rdsk/c3d0p0:1  /p/sda1  pcfs  2  yes  -
/dev/dsk/c3d0p0:2  /dev/rdsk/c3d0p0:2  /p/sda8  pcfs  2  yes  -

This is how my partition table looks like, from a Linux fdisk /dev/sda point of view:

Disk /dev/sda: 40.0 GB, 40007761920 bytes
255 heads, 63 sectors/track, 4864 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xbb88bb88
Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1           7       56196    6  FAT16
/dev/sda2               8        1224     9775552+  bf  Solaris
/dev/sda3            1225        4864    29238300    5  Extended
/dev/sda5            1225        3100    15068938+   7  HPFS/NTFS
/dev/sda6            3101        3164      514048+  82  Linux swap / Solaris
/dev/sda7            3165        3806     5156833+  83  Linux
/dev/sda8            3807        4864     8498353+   b  W95 FAT32

Please back up all your files before installing a new OS to your system! And as always in my blog, no guarantee, warranty, or similar, for anything! This is just a (hopefully error-free) documentation of how things worked for me. Maybe you’ll find it useful.

OpenSolaris 2008.11: useful packages to install

December 25, 2008

Just a list of OpenSolaris packages I often install (using the pkg install command or the Package Manager, or the “Add More Software” icon) after OpenSolaris 2008.11 installation:

Package name Contents
openoffice OpenOffice
SUNWgnome-img-editor GIMP, the image manipulation program
SUNWvncviewer VNC Viewer
SUNWtvnc Tight VNC
SUNWcurl Curl
SUNWgnome-config-editor The Gnome configuration editor
SUNWj6dev Java 6 SDK
sunstudioexpress Sun Studio (Compiler and Tools)
SUNWgcc GNU Compiler
SUNWgnome-common-devel Gnome Development Tools
SUNWapch22 Apache 2.2 web server
amp AMP (Apache, MySQL, PHP) development kit
amp-dev AMP development cluster
SUNWgnu-mc GNU Midnight Commander

For unattended zones installation, some more packages should be installed if the zone locale is not “C”: This is done in file /usr/lib/brand/ipkg/pkgcreatezone. I added the following lines after line 179:

#
# Get some locale and language packages:
#
pkglist="$pkglist SUNWloc SUNWlang-enUS"

Multiple versions of Firefox packages on one Solaris system

December 25, 2008

Up to now, in cases where I had to use multiple versions of Firefox on the same Solaris system, I installed the latest version as pkgadd, and the other versions as tar, for example in /usr/local. This is because the Solaris contributed builds always are published under the same name (which is probably OK for most users).

But it is also possible to install multiple versions of Firefox (3.0.5, 3.0.4, 2.0.0.19, and so forth) on the same system, after extracting and then changing each package before installation. Here’s how it works:

  1. Expand the pkg.bz2 file to pkg, using bzip2 -dc.
  2. Create a temporary directory where the package content will be stored.
  3. Using the -s pkgadd option, spool (extract) the package into this directory.
  4. In the spool directory, rename the old Firefox directory to a new name (for example firefox to firefox-305).
  5. Change the pkginfo file to the new package name (e.g. MOZfirefox to MOZfirefox-305).
  6. Change the pkgmap file to match the new directoy name and the new size and checksum of the pkginfo file.
  7. Install the renamed package, using pkgadd -d . MOZfirefox-305, for example.

Here’s how it looks like after the installation:

/tmp# ls -ald extr*
drwxr-xr-x 3 root root 188 2008-12-23 22:11 extracted-firefox-20081223-221103
drwxr-xr-x 3 root root 188 2008-12-23 22:15 extracted-firefox-20081223-221529
drwxr-xr-x 3 root root 190 2008-12-24 12:02 extracted-firefox-20081224-120152
/opt/sfw/lib# pkginfo | grep MOZ
application MOZfirefox-20019                firefox 2.0.0.19 for Solaris 10
application MOZfirefox-304                  firefox 3.0.4 for Solaris 11
application MOZfirefox-305                  firefox 3.0.5 for Solaris 11
/opt/sfw/lib# ls -l
drwxr-xr-x 13 root bin 46 2008-12-24 12:02 firefox-20019
drwxr-xr-x 13 root bin 47 2008-12-23 22:15 firefox-304
drwxr-xr-x 13 root bin 47 2008-12-23 22:15 firefox-305

I wrote a script which performs all these steps (except installation). I have tested it on various Solaris releases from Solaris 10 to OpenSolaris. So far, I have not seen any errors.

As always: No guarantee or similar. Please back up your files before using the script.

OpenSolaris 2008.11: More on unattended zones installation and the sysidcfg file

December 22, 2008

After my posting about a new zone for a web server, I did some more tests with the sysidcfg file and found that the following entries work (thanks, Jerry, for sending me a simple sample sysidcfg file):

system_locale=C
works fine.

system_locale=en_US.UTF-8
works after I added the following to file /usr/lib/brand/ipkg/pkgcreatezone in the global zone (after line 179) before starting the installation of the zone:

#
# Get some locale and language packages:
#
pkglist="$pkglist SUNWloc SUNWlang-enUS"

I found out the packages by looking into /usr/share/locale and /usr/lib/locale and searching for locales using (for example):
$ pkg search en_US.UTF-8

An example of a package that contains several languages is: SUNWinstall-l10n.


root_password=(short password string from Solaris 10’s /etc/shadow file)
works fine. The old password string will be put into the new /etc/shadow file of the 2008.11 zone but can later be changed from the command line (using the passwd command) and will then be replaced by a long string.


keyboard=...
is apparently not needed, as the terminal and keyboard works fine for me even without that parameter.


name_service=NONE
works fine.

name_service=DNS {domain_name=domain.TLD
name_server=DNS_IP_ADDR_1,DNS_IP_ADDR_2
search=domain.TLD
}

works fine as well. In case the system cannot verify some of the information, it displays just one confirmation screen (as usual, after configuring the same settings by hand) and after choosing “No” to not enter the information again and using <esc>-2, the rest of the sysidcfg configuration remains intact and will not be asked again, and shortly afterwards the login prompt is displayed.


terminal=xterms
works fine.

Building an energy efficient and virus-free office PC with OpenSolaris 2008.11

December 21, 2008

I chose the Intel D945GCLF board with an Intel Atom processor 230 with a TDP of just 4W. The northbridge/GPU apparently needs more power, so it is cooled with a fan whereas the CPU does not need one. Unfortunately, the fan is really loud. The following picture shows the original configuration with an additional case fan for CPU and RAM:

I looked at various northbridge coolers (for example the Zalman ZM-NBF47, mentioned in this blog entry by Matthias Pfützner) but then found a much cheaper and very elegant solution: I just removed the existing northbridge/GPU fan and installed a standard 80mm quiet fan (about 1000/min) on top of the cooler, using just one of the two cooler holes by mounting the fan with a 3mm screw and nut:

I also added a small part of a plastic packaging to the fan so more air would flow downwards towards the CPU cooler (this reduced the CPU temperature by 2 to 3 degrees Celsius):

The whole PC with 2 GB RAM and DVD-ROM drive needs just 30W which is OK, I think.

For the OpenSolaris 2008.11 installation, my first attempt was to install it on an external 200GB USB drive. But contrary to my installation experience on my Intel D201GLY2A (see my other blog entry), this time the USB ports got renamed at every next cold boot attempt, so I had to try all USB ports until the disk drive was found. Which is really funny if your USB disk has two connectors (an additional one for extra power). The good thing was that the power supply of the case had an additional SATA power outlet. And as the D945GCLF mainboard has two SATA connectors, and my external USB disk was a SATA type disk drive, I could remove the disk from its USB case and install it inside the computer case. As it only has 5.25″ and 3.5″ drive bays, I attached an additional u-shaped sheet of metal in the 5.25/3.5″ drive bay so that the 2.5″ disk is supported from below and has a solid, flat metal surface to lay on, and the two mounting screws on just one side of the disk are sufficient to keep it in place:

Installation went smoothly from the OpenSolaris 2008.11 CD. Network and Sound worked right from the start. Up to now, I only had a problem with printing from Firefox: The fonts were either pixel fonts with a very coarse resolution or the font hinting was not implemented (the spaces between characters did not look correctly). I have not yet analyzed that problem so far.