Economical Shared Home Directories with Solaris and ZFS

At the $DAYJOB, we have a cluster of build systems that we use to test source trees on various platforms, including FreeBSD, Linux and Solaris. Some of these machines are fairly old, and have relatively puny hard drives by current standards, yet they continue to do their job just fine. We’ve simply begun to run low on local storage, as the source trees grow and developers need to work with multiple branch and tag checkouts.

Rather than try to boost local storage with extra drives (not a surefire solution everywhere), we focused our attention on a shared storage solution. The idea was to reuse some decommissioned hardware and create enough storage for every build system to be able to mount its home directories and give developers the space they need to do their work. We’d been anticipating the availability of Sun’s new ZFS filesystem in the 6/06 release of Solaris 10. It’s a promising solution, offering superior performance and reliability without the need for expensive RAID hardware. Oh, and we get built-in compression too, which maximizes our use of available space. All was not roses, however, as we ran into a pretty significant performance issue, but overall, it’s been a success. More on that below.

The hardware we had on hand was a modest 1U dual-Xeon server with a single internal ATA disk, and an empty Sun D1000 SCSI shelf with 12 drive bays. The D1000 has two differential SCSI channels and can be configured as a split bus, with six devices on each channel. We wanted to maximize the available I/O bandwidth, so we went with the split bus setup. Having only one PCI slot in our 1U server, we needed a dual-channel differential SCSI card. An LSI U40HVD was just the ticket: $50 on eBay. Also on eBay, we picked up a couple of VHDCI-to-HD68 cables for $28 each to complete the connection. Lastly, it was back to eBay for a lot of twelve 36GB, 10K rpm SCA SCSI drives for $420. Total outlay for hardware: $526.

After installing Solaris on the internal ATA disk, we had 12 disks at our disposal. To maximize available space while retaining fault tolerance, we chose the RAID-Z model. Setting aside one disk on each channel as a spare, we were left with 10 disks, which should yield somewhere around 300GB of usable space.
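For the curious, the back-of-the-envelope math behind that estimate can be sketched in a few lines of shell. This assumes single-parity RAID-Z, where each vdev gives up roughly one disk's worth of space to parity; the function name is made up for illustration. Reported figures can differ from this, for instance because `zpool list` counts raw space including parity.

```shell
#!/bin/sh
# Rough RAID-Z capacity estimate (hypothetical helper, not a ZFS tool).
# Each single-parity raidz vdev loses about one disk's worth of space.
raidz_usable_gb() {
    vdevs=$1            # number of raidz vdevs in the pool
    disks_per_vdev=$2   # disks in each vdev
    disk_gb=$3          # nominal size of one disk, in GB
    echo $(( vdevs * (disks_per_vdev - 1) * disk_gb ))
}

# Two 5-disk vdevs built from 36GB drives, as in the layout described above.
raidz_usable_gb 2 5 36
```

That prints 288, which is where "somewhere around 300GB" comes from.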

The zpool creation involved two raidz vdevs, since the ZFS Administration Guide recommends that a raidz vdev contain fewer than 10 disks for optimal performance.

# zpool create homepool \
     raidz c0t0d0 c1t0d0 c0t1d0 c1t1d0 c0t2d0 \
     raidz c1t3d0 c0t3d0 c1t4d0 c0t4d0 c1t5d0

The final tally was 332 GB of available space. Cool! Making the filesystem hierarchy was as easy as running a shell loop:

# for i in 1 2 3 5 6 7 9 10 11 12 15 16 17 18 20 21 22; do
    zfs create homepool/buildbox-${i};
    zfs set compression=on homepool/buildbox-${i};
    zfs set sharenfs=anon=0 homepool/buildbox-${i};
  done

I don’t recall exactly how long that took, but I’ll wager that I would not have run out of fingers if I had been counting.

The whole process of creating ZFS storage was delightfully easy. Each filesystem will employ compression and will be automatically mounted at boot and shared via NFS. We pass the “anon=0” option to sharenfs so that the root user can access files on the mount (developers need to “sudo make install” in their source trees, for example).
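From a client's point of view, each of these filesystems is just an NFS export. Here is a rough sketch of what the mount command looks like on each of our platforms. The hostname "homeserver" and the /home mountpoint are placeholders, and the export path assumes the filesystems were left at their default mountpoints under /homepool. The function only prints the command rather than running it.

```shell
#!/bin/sh
# Print (not run) the NFS mount command for one export on a given client OS.
# "homeserver" is a placeholder hostname, not the real server name.
nfs_mount_cmd() {
    os=$1; export_path=$2; mountpoint=$3
    case "$os" in
        SunOS)         echo "mount -F nfs homeserver:$export_path $mountpoint" ;;
        Linux|FreeBSD) echo "mount -t nfs homeserver:$export_path $mountpoint" ;;
        *)             echo "unsupported OS: $os" >&2; return 1 ;;
    esac
}

# Example: the command a Solaris build box would use for its own export.
nfs_mount_cmd SunOS /homepool/buildbox-1 /home
```

On a real client you would pass `$(uname -s)` as the first argument and put the equivalent entry in /etc/vfstab or /etc/fstab so the mount comes back after a reboot.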

Mounting each export was not terribly difficult either, but there were some pitfalls. I quickly ran into a performance problem with NFS on ZFS, in the ZFS Intent Log (ZIL). As explained in the post linked below, the ZIL tracks file modifications that have immediate integrity requirements (such as fsync). Each write operation has a sequence number, and when a synchronous command comes in, the ZIL commits all I/O blocks up to that sequence number. However, as noted in this excellent blog post from Sun kernel engineer Roch Bourbonnais, the current state of the code is such that an fsync issued by one process causes *all* pending data for the filesystem to be flushed to disk before the fsync can return. When copying a large number of relatively small files (like a source checkout) over NFS, the sheer number of fsyncs causes a lot of blocking, making the overall operation painfully slow. I had to resort to tarring up the home directory contents, copying the tarball to the server and unpacking it locally.
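The tarball workaround went roughly like the sketch below. The function names, hostname and paths are placeholders rather than the exact commands I ran. The point is that the thousands of small file writes happen on the server's local disks, and only one large file ever crosses the network.

```shell
#!/bin/sh
# Pack a home directory into a single tarball (run on the build machine).
# Keep the tarball outside the directory being archived.
pack_home() {
    src_dir=$1; tarball=$2
    # cd first so the archive paths are relative to the home directory
    ( cd "$src_dir" && tar cf - . ) > "$tarball"
}

# Unpack the tarball into the target directory (run on the server, so the
# many small writes hit local disk instead of going over NFS).
unpack_home() {
    tarball=$1; dest_dir=$2
    mkdir -p "$dest_dir" && ( cd "$dest_dir" && tar xf - ) < "$tarball"
}

# Typical flow (hostname and paths are placeholders):
#   pack_home /home /var/tmp/home.tar
#   scp /var/tmp/home.tar homeserver:/var/tmp/
#   ...then on the server:
#   unpack_home /var/tmp/home.tar /homepool/buildbox-1
```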

During the process of converting the build machines to use the NFS exports, the server inexplicably locked up, seemingly due to problems with the ZFS filesystems. No clients could access the mounts, and even locally on the server, a ‘df’ would hang on the ZFS filesystems. There were no messages on the console, and since I did not have the foresight to launch the kernel debugger at boot time, I was stuck. I ended up having to do a warm reset to recover. The silver lining was that I had no data corruption as a result. Still, it was disturbing and I was unable to come up with an explanation.

A couple of non-ZFS related pitfalls also cropped up, namely that I had to synchronize all the UIDs with the server, and that on Solaris hosts, I had to tweak domain search order settings as noted in this blog post. Also on Solaris hosts, I had to disable the “auto_home” feature in /etc/auto_master, otherwise the /home or /export/home mountpoint would be occupied “invisibly”, preventing the NFS mount from succeeding after boot.
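Disabling auto_home boils down to commenting out the /home line in the master map. A minimal sketch, assuming the map has a stock-looking entry whose first field is /home (the function name is made up, and it writes to stdout so the original file is left alone):

```shell
#!/bin/sh
# Comment out the /home entry in an automounter master map.
# Output goes to stdout; the input file is not modified.
disable_auto_home() {
    master_map=$1
    # Prefix any line whose first field is exactly /home with '#'
    awk '$1 == "/home" { print "#" $0; next } { print }' "$master_map"
}

# Typical flow (review the result before copying it into place):
#   disable_auto_home /etc/auto_master > /var/tmp/auto_master.new
#   cp /var/tmp/auto_master.new /etc/auto_master
#   automount -v
```

Lines that are already commented out are left untouched, since their first field starts with '#'.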

So, in the end, we have an operational home directory server that cost us only around $500 to set up, which provides enough storage for the foreseeable future, and is easily manageable (I can use ZFS to reserve space for some filesystems at the expense of under-utilized ones, for example). I’m sure that future updates to ZFS will take care of the performance problems, and I’m now booting with kmdb in the background in case that mysterious hang manifests again.
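As an example of that kind of space juggling, ZFS has a "reservation" property to guarantee space to a filesystem and a "quota" property to cap one. The sketch below only prints the commands. The sizes and the choice of filesystems are made up for illustration, and the helper names are hypothetical.

```shell
#!/bin/sh
# Print the zfs commands that would guarantee space to one filesystem and
# cap another. Sizes and filesystem names below are illustrative only.
POOL=homepool

reserve_cmd() { echo "zfs set reservation=$2 $POOL/$1"; }  # guarantee space
quota_cmd()   { echo "zfs set quota=$2 $POOL/$1"; }        # cap usage

reserve_cmd buildbox-5 40G
quota_cmd   buildbox-20 10G
```

Pipe the output to sh on the server once the numbers look right.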

I’d like to thank the folks from #opensolaris on Freenode for helping track down the cause of the NFS performance problem.
