User:Tim/NFS server build


NOTE: this page is obsolete; the configuration actually used was a Supermicro motherboard/chassis, LSI HBAs, and OpenIndiana using ZFS.

The goal of this project is to create 2 NFS servers of identical capacity from standard parts. One of the two may be built first to ensure the parts are compatible and to make any necessary adjustments to the build, after which the other will be built, keeping in mind any lessons learned from the first.


Build Lists

This plan is nearing implementation; the three planned ordering stages are listed here.

stage 1 - $4,000 - first system with only a few test disks, to check compatibility

stage 2 - $3,000 - disks to fill the bays, plus 2 spare data drives and 1 spare boot drive, plus revised parts (fan cable and second controller card)

stage 3 - $7,000 - approximately a combination of the above two lists, plus any revisions made to get the first system behaving as desired.

Parts List

63TB configuration with parts compatible with Solaris - $7,000 or $7,600

63TB configuration with i7-980 and 24GB RAM, LSI controllers and Supermicro board - $6,600 - After consulting the OT4 storage group, this configuration should be more compatible with ESXi and Solaris for ZFS, rather than Linux software RAID, which should behave better with consumer SATA disks

63TB configuration with i7-980 and 24GB RAM, HighPoint controller and ASUS board - $6,500 - Uses a chipset with more PCIe lanes in order to be able to expand to an external enclosure (the main controller runs at x16 for the internal array; graphics and the external controller run at x8)

Drives

Storage: between 10 and 20 of one of the following:

3TB drives:

Hitachi Deskstar 3TB - 5400rpm, $130, cheapest 3TB on newegg, but also best rated

Hitachi Deskstar 3TB - 7200rpm, $160, cheapest 7200rpm 3TB on newegg

Western Digital Caviar Green 3TB - 5400rpm, $150, cheapest western digital on newegg, but reviews are not as good

2TB drives, if we don't need as much storage, may be slightly better in GB per dollar:

Western Digital Caviar Green 2TB - 5400rpm, $80, $75 on tigerdirect

Samsung EcoGreen 2TB - 5400rpm, $80, $75 on tigerdirect

Hitachi Deskstar 2TB - 5400rpm, $80

Seagate Barracuda Green 2TB - 5900rpm, $80


Boot drives: 0 or 2 (or 2 and 1?) of one of the following:

WD Scorpio Blue mobile 160GB - $42, cheap (slow) 2.5" drive to fit in smaller bays, mobile drives may run cooler than other 2.5" drives

Seagate Barracuda 250GB - $37, cheap (slow) 3.5" hard drive to let the OS boot without the main array operating, but giving up 3.5" hotswap bays

SSD drives are not intended to be used in RAID, because they keep an internal map of used blocks. When writes occur, the drive tries to remap them onto less-used blocks (wear leveling), since flash storage wears out with repeated writes. However, if the entire disk is always in use (as is the case in Linux software RAID, except possibly for RAID 0, which has absolutely no protective qualities), the drive can't do this remapping well: some blocks will never be rewritten, so the pool of unused blocks shrinks to almost nothing. Writes then land on the same very small set of blocks, which receive far more write wear than the rest of the drive, leading to early failure. A possible workaround is LVM mirroring, but it requires a third physical volume for the mirror log (perhaps 2 SSDs and 1 laptop drive? a rough sketch is shown after the SSD options below).

OCZ Vertex 30GB SSD - $85, cheap but reputable solid state drive with okay performance (per Tom's Hardware Guide) and enough size for OS, will greatly outperform rotational disks for booting

OCZ Vertex 2 40GB SSD - $105, slightly more space and significantly higher specifications than the above OCZ Vertex

OCZ Vertex 2 60GB SSD - $120, slightly more space, and better sustained write, nearly equivalent to Vertex 2E series

OCZ Vertex 3 60GB SSD - $145, SATA III boosts speed to nearly double the Vertex 2, if motherboard supports SATA III
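
If we go the SSD route, here is a minimal sketch of the LVM mirroring idea mentioned above, assuming two SSDs plus a small third drive to hold the disk-based mirror log; the device names, partition numbers, and 20GB size are placeholders, not a tested recipe:

  # hypothetical devices: /dev/sdx and /dev/sdy are the SSDs, /dev/sdz is the small drive for the mirror log
  pvcreate /dev/sdx1 /dev/sdy1 /dev/sdz1
  vgcreate vg_os /dev/sdx1 /dev/sdy1 /dev/sdz1
  # -m 1 keeps one mirror copy; the default disk-based mirror log needs a third PV,
  # and listing the PVs explicitly constrains where the legs and the log are allocated
  lvcreate -L 20G -m 1 -n lv_root vg_os /dev/sdx1 /dev/sdy1 /dev/sdz1
  mkfs.ext4 /dev/vg_os/lv_root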

SATA Controller Card

This is a little tricky. Since RAID uses all disks simultaneously, especially when building or rebuilding an array, we want the controller card to be able to saturate the disks' throughput (current 3TB drives seem to top out around 150MB/s sequential transfer, and each PCIe 1.0 lane gives 250MB/s, so for 4 drives an x1 controller won't cut it except for random access). Unfortunately, a lot of SATA controller cards use a PCIe x1 interface, and PCI (though it can theoretically handle 533MB/s if the clock and bus width are right) isn't up to the task of sequential throughput either.
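
As a back-of-envelope check of those numbers (assuming roughly 150MB/s sequential per drive, 250MB/s per PCIe 1.0 lane, and 500MB/s per PCIe 2.0 lane):

  echo $((  8 * 150 ))   # 8 drives: 1200 MB/s aggregate sequential
  echo $(( 16 * 150 ))   # 16 drives: 2400 MB/s aggregate sequential
  echo $((  1 * 250 ))   # PCIe 1.0 x1: 250 MB/s, not even enough for 2 drives
  echo $((  8 * 250 ))   # PCIe 1.0 x8: 2000 MB/s, fine for 8 drives, just short of 16
  echo $((  8 * 500 ))   # PCIe 2.0 x8: 4000 MB/s, comfortable headroom for 16 drives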

1 of the following:

HighPoint RocketRAID mSASx3 PCIe x8 - $350, has hardware raid capability (which we don't plan on using), uses different cables (mSAS, not included, $35) to connect 4 disks to one port on the card, but more importantly supports 12 disks in a single card, allowing us to use only 1 SATA card, in case having multiple controller cards installed runs into conflicts and keeps us from using more than 1 card at a time

TYAN mSASx4 PCIe x8 - $400, similar reasons, can connect to 16 disks with one card, if we desperately need the connectors, but PCIe x8 is just under the needed bandwidth to serve 16 disks at 150MB/s each

HighPoint RocketRAID mSASx4 PCIe 2.0 x16 - $450, similar reasons, but has enough bus bandwidth to nearly saturate all 16 of its SAS connections (even at SSD speeds)

HighPoint RocketRAID mSASx6 PCIe 2.0 x16 - $620, similar reasons, lets us populate a case with more than 20 total drives in use, especially server cases that already have hotswap bays which connect via mSAS instead of individual SATA


OR 1 or 2 of the following:

HighPoint RocketRAID 8xSATA/SAS (should be SATA II or maybe III) PCIe 2.0 x8 - $137, PCIe 2.0 has a massive 500MB/s per lane (backwards compatible with PCIe 1.0 at 250MB/s per lane); it is unclear whether it can handle 600MB/s or 300MB/s simultaneous throughput to each drive, but either is more than adequate


AND 1 of the following, if needed:

Rosewill 4xSATA II PCIe x4 - $70, inexpensive SATA II controller with 4 ports, and 4 PCIe lanes to handle sequential throughput, also has 2 eSATA connectors (but with bus saturation, probably not that useful)

HighPoint RocketRAID 4xSATA/SAS (should be SATA II) PCIe x4 - $90, controller with RAID features (might be partially hardware RAID), similar to the above, but mentions staggered spinup and hot plug; the manufacturer gets generally good reviews

Intel 4xSATA/SAS (600MB/s, should be SATA II or maybe III) PCIe 2.0 x4 - $156, appears to be a hardware RAID controller (though with very limited RAID modes, which we don't need), able to simultaneously serve 4 drives at 500MB/s each out of the SATA III maximum of 600MB/s, far beyond our current needs

Case

Norco 4U rackmount server case - $425, has 24 built-in 3.5" hotswap drive bays, is cheaper with more bays than getting separate bays to populate a standard case, though it doesn't look as nice

Lian Li aluminum ATX full tower - $290, requires tools for assembly, but drive bays are probably easier to add to tooled cases due to their more standard mounts, recommended by Jon S.

Lian Li aluminum ATX toolless full tower - $320, similar case but toolless, with 12 5.25" bays.

XCLIO steel chassis ATX full tower - $210, cheaper case, also with 12 5.25" bays

Drive bays

up to 4 of SNT 5 in 3 3.5" to 5.25" hotswap cage - $100, activity, power, temperature, fan lights, puts 5 drives in the space of 3 bays for up to 20 drives for 51TB with hot spare

up to 12 of Kingwin hotswap rack - $19, has activity and power lights for each disk, activity light will help for locating a failed disk

up to 6 of Thermaltake 3 in 2 3.5" to 5.25" hotswap cage - $70, activity, power, temperature, fan lights, 3 drives in 2 bays for up to 18 drives in the case for 45TB with a hot spare, or 15 with two bays left over for optical drive or something else, reviews suggest fan is loud

up to 6 of Icy Dock 3 in 2 3.5" to 5.25" hotswap cage - $85, activity, power, failure lights, temperature alarm, up to 18 drives, reviews suggest fan isn't as loud as above

0 or 2 of StarTech 2.5" in PCI slot bay - $30, allows us to put SSD or small HDD boot drives into PCI slots (case has 10 of them, graphics card uses 2, SATA controller card uses 1), intended to use a PCI connector on motherboard for stabilization, but we may want to give the other cards more breathing room

0 or 2 of Koutech dual 2.5" and 3.5" in 5.25" - $31, fits both a HDD and a SSD in a single bay, activity lights for both

0 or 2 of SilverStone 2.5" to 3.5" adapter - $17, hotswap adapter for putting an SSD into a 3.5" hotswap bay

0 or 1 of SNT dual 2.5" hotswap rack - $24, hotswap bays with activity lights designed for SSDs, but we would need to also get an adapter to put a 3.5" tray into a 5.25" bay

My preferred combinations:

server case with 24 bays built in, two 2.5" in PCI bays - 24 disks

four 5 in 3 3.5" bays, two 2.5" in PCI bays - 20 disks

three 5 in 3 3.5" bays, two 2.5"/3.5" combo bays, one single 3.5" bay - 18 disks

three 5 in 3 3.5" bays, one triple 3.5" bay, one double 2.5" bay (with 3.5" to 5.25" adapter) - 18 disks

three 5 in 3 3.5" bays, one triple 3.5" bay, two 2.5" in PCI bays, one optical drive - 18 disks, optical drive

two 2.5"/3.5" combo bays, three 5 in 3 3.5" bays, one optical drive - 17 disks, optical drive

internally mounted 2.5" drives, four 5 in 3 3.5" bays - 20 disks

Power supply

PC Power and Cooling "Silencer" 910W continuous - $150, single 74 amp 12V rail (82 amps peak), for spinup of a medium number of drives

PC Power and Cooling "Turbo-Cool" 1200W continuous - $450, 100 amp 12V rail (115 amps peak), for even more strenuous spinup requirements

Thermaltake Toughpower Grand 850W continuous - $280, split 12V rails dedicate 40 peak amps to CPUs, and 85 peak amps to everything else, 1200W peak total output, main advantage is modular cables

Placement

There are 3 small side rooms in the area that Matt, Jon, and the summer students use, each with ethernet ports and air vents. With the doors closed, any objectionable noise the machines make due to extra case fans, drive bay fans, or disks should be attenuated considerably. However, there is currently some equipment related to chimpanzee research in all of them, and we would want to clear some space for a table and chair for any required maintenance.

System configuration

Instead of a proprietary embedded OS with a web interface, this machine will run a full-fledged desktop operating system, likely Ubuntu 10.10 (but possibly FreeNAS, PC-BSD, or Solaris, each of which has stable support for ZFS), for ease of maintenance/configuration. Hardware RAID controllers are somewhat expensive (especially when handling more than 4 disks) and tend to use proprietary methods to label disks, so the plan is to use Linux's built-in software RAID drivers, which are more transparent and portable. As such, the disks only need to be connected to a standard SATA port; everything else is done in software. However, most motherboards don't have 12-20 SATA connectors, so a SATA controller card will be required for some of the storage array disks.

FreeNAS is an OS based on FreeBSD that is intended for use in an embedded NAS device, and as such does most things through a web interface rather than a local login. One of its more interesting attributes is that it uses ZFS instead of a filesystem layered on top of (and therefore ignorant of) software RAID, which provides both RAID-like tolerance of disk failure and additional features to detect data corruption (a read or write error that isn't caught by the disk hardware, very unlikely for a single disk, but the chance increases with many disks). Unfortunately, FreeNAS allocates some swap space on every single disk in the array (no idea what it would need to swap out), which could cause a system crash/hang if a disk with something swapped to it fails (a reboot should fix it, but ideally we would want to replace disks without rebooting/downtime).

If we want ZFS but want to use the OS directly from the machine (with control of partitioning), Solaris (where ZFS was originally developed) is now free and open source, though unlike FreeNAS it does not yet support 64-bit architecture as far as I know. We could also use a desktop variant of FreeBSD (which does support 64-bit architecture). There is a beta version of ZFS for Linux, currently under development for Lawrence Livermore National Laboratory, which we may want to investigate, or move to in the future when the implementation is completed. Btrfs is also similar to ZFS, but is developed on Linux (and is not yet up to the feature set of ZFS, notably parity-based redundancy like RAID 6, which is something we would want to use), so it may be another option in the future. Btrfs on a software RAID 6 volume may provide some of the relevant benefits of ZFS, but without being aware of the RAID redundancy, it could not easily correct files in which corruption is detected.
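
For illustration, a double-parity ZFS pool (raidz2, roughly the RAID 6 analogue) with a hot spare might be created as below; the pool name and the Solaris-style device names are placeholders that vary by OS, so treat this as a sketch rather than a recipe:

  # raidz2 keeps two parity devices per group, so any two disks in the group can fail
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 spare c0t8d0
  zpool status tank      # shows layout, spare state, and resilver progress
  zpool scrub tank       # a periodic scrub verifies checksums and repairs silent corruption from redundancy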

The most robust configuration would be two boot hard disks, connected directly to the motherboard, in a software RAID mirror (or 2 SSDs in an LVM mirror, though we would need to find a location for the LVM mirror log) for booting and the operating system, with as many 3TB (or 2TB) drives as can fit in the case as a RAID 6 array or ZFS filesystem (two drives from this array can fail without losing any data), plus a hot spare (a disk already in the machine but unused until a disk fails, at which point it is used immediately to rebuild and keep the array fully redundant). 18 disks gives 45TB, 20 gives 51TB, and 24 gives 63TB.
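
Here is a minimal sketch of that layout with Linux software RAID, using hypothetical device names: two small boot disks plus an 18-disk storage array (17 RAID 6 members and 1 hot spare, i.e. 45TB with 3TB drives):

  # boot/OS mirror on the two small drives connected to the motherboard
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # main array: RAID 6 across 17 members plus 1 hot spare (17 - 2 parity = 15 data disks)
  mdadm --create /dev/md1 --level=6 --raid-devices=17 --spare-devices=1 /dev/sd[c-s]1 /dev/sdt1
  cat /proc/mdstat       # watch the initial build/resync progress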

Alternate configurations include using LVM to put the root filesystem on the same array as the main storage, with the drawback that an event that takes the storage partition down also takes the entire operating system down, including the utilities to diagnose the problem (such as failure of the controller card, or boot image issues).

A configuration that would keep all disks the same, without sacrificing array performance or bootability in the event of the storage array going down, would be to use 2 3TB disks connected to the motherboard in a separate software RAID 1, at the cost of 2 disks' worth of capacity from the main array.

One of the concerns for a RAID 6 array on such a large amount of storage is that it may take a long time to build or rebuild, that is, to make the array able to suffer 2 disk failures without losing data. This must be done when the array is first created or when a disk is replaced (including being "replaced" by the hot spare). There is another, more exotic setup that would alleviate this concern, though not without drawbacks: the array could instead be RAID 1+0, where the disks are first paired, each pair containing a perfect copy of its partner, and data is then written across these pairs with blocks distributed round-robin. Because all of the redundancy consists simply of disk pairs being mirror images, it is relatively simple, and therefore fast, to build or rebuild the array. This is currently the highest general-performance RAID setup, but you only get half the space: in our case, 15TB for 10 disks, 18TB for 12, 27TB for 18, and 30TB for 20. The other main drawback is that failure of two drives in the same pair causes data loss, so 2 failed disks have a chance of making the array irretrievable (RAID 6 requires 3 lost disks before it is irretrievable, but absolutely any 3 cause this, while RAID 1+0 can tolerate one lost from each pair, if you are lucky). However, as long as the bottleneck for building the array is the disks themselves, and not the bandwidth available to the SATA controller card(s), the build time of a RAID 6 array should, theoretically, depend only on the size of a single disk, not on how many are in the array.
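
For comparison, the RAID 1+0 alternative in mdadm (again with placeholder device names) pairs the disks and stripes across the pairs:

  # 20 disks in the "near 2" layout: every block mirrored within a pair, data striped across the 10 pairs (30TB with 3TB drives)
  mdadm --create /dev/md1 --level=10 --layout=n2 --raid-devices=20 /dev/sd[c-v]1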

If additional storage is required in the future, replacing the disks one at a time (waiting for the array to rebuild each time) with larger-capacity drives is probably the best option: once the last disk is replaced, the array can be expanded to fill the drives, followed by a filesystem resize, resulting in a larger filesystem still containing all data, with downtime only to unmount the filesystem and resize it (which would be easier if the root filesystem were not on the array). The old disks can then be repurposed and placed in older machines to give them more local storage capacity (some of our machines are using 300GB or smaller disks), though we will probably have more disks than places to use them. Additionally, an external enclosure using eSATA with a port multiplier, such as this, would allow us even more total expansion for the NFS servers if needed. Such an enclosure requires only 1 PCIe expansion slot (if available; server boards often don't have many PCIe slots), and we could use LVM2 to combine the new and old arrays into one large storage partition despite their differing sizes. Another possible option is a USB 3.0 enclosure, but in either case we would likely want the enclosure to report all its disks separately and use software RAID again, rather than relying on the enclosure to do RAID on its disks, as it may not support RAID 6 or a hot spare.
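
The disk-at-a-time upgrade described above might look roughly like the following with mdadm and XFS; device names and the mount point are placeholders, and each replacement must finish rebuilding before the next one starts:

  # repeat for each member disk, one at a time
  mdadm /dev/md1 --fail /dev/sdc1 --remove /dev/sdc1
  mdadm /dev/md1 --add /dev/sdc1     # the new, larger disk in the same slot
  cat /proc/mdstat                   # wait for the resync to finish before touching the next disk
  # once every member has been replaced, grow the array to the new component size, then the filesystem
  mdadm --grow /dev/md1 --size=max
  xfs_growfs /srv/storage            # placeholder mount point; resize2fs would be the ext4 equivalent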

Backup plan

The plan is to build two of these for a very specific reason: in case one somehow manages to die, even temporarily, the other should keep a copy of everything on it. Additionally, we could keep old files on the backup server for a period of time after they are deleted from the main server: a daily job could copy new files from the entire volume without deleting old files, and a weekly (or longer) job would take care of removing old files from the backup server.
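
A hedged sketch of that schedule as crontab entries using rsync; the paths, the backup-server hostname, and the times are placeholders:

  # daily at 02:00: copy new and changed files to the backup server, never deleting anything there
  0 2 * * * rsync -a /srv/storage/ backup-server:/srv/storage/
  # weekly on Sunday at 04:00: a --delete pass expires files that were removed from the main server
  0 4 * * 0 rsync -a --delete /srv/storage/ backup-server:/srv/storage/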

Backup between the two servers may be accomplished over a connection separate from the WUSTL network, if the servers are located in the same lab. The upgrade to a gigabit network may make this irrelevant, depending on whether rsync can saturate the network (considering that the gigabit network clock is fully a third of high CPU clocks, it may not be possible to saturate the network with 1 machine). If they can saturate the network, having a separate connection would leave the machines more responsive on the WUSTL network during backup, especially for operations not involving the main storage array.

Build plan

The most careful plan is to build the first machine incrementally, especially with regards to disks, ensuring that the particular drives ordered will work as expected, so that we don't end up buying 12-26 disks only to find out they won't work with the SATA controller card. In particular, we could order 1 disk each of several types and check them for compatibility and performance (and whether the activity lights on the drive bays work with them) before deciding on the disks to order for the rest of the array.

Some research has gone into correlated drive failure; one of the best predictors of failure rate is the particular batch a drive was manufactured in. As such, if all the disks are from the same batch and one fails, it is somewhat more likely that another disk or two will fail during the rebuild process, potentially making the array irretrievable. Disks from more than one manufacturer will not have correlated failures, diminishing this risk, so we may want to populate the array with disks of 2 or even 3 different manufacturers and models.

Many of the other parts do not need to be high end or cutting edge, so there is less chance of compatibility problems. Linux software RAID should not place a large burden on the CPU, so the CPU and RAM can be fairly modest. However, given the amount of disk bandwidth available that can't fit through the NFS bottleneck, it makes sense to spend a little more in order to run processing locally on the system.

Miscellaneous Notes

mdadm metadata v0.90 may be required to be able to boot from an array (the /boot partition), but it limits member devices to 2TB; v1.2 should probably be used on the main array.
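
For example (hypothetical device names), the boot array would be created with the old metadata format explicitly, while the main array uses the newer one:

  # 0.90 metadata lives at the end of the device, so the bootloader can read /boot as if it were a plain partition
  mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # the main array gets 1.2 metadata, which has no 2TB member limit
  mdadm --create /dev/md1 --metadata=1.2 --level=6 --raid-devices=17 --spare-devices=1 /dev/sd[c-s]1 /dev/sdt1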

Ubuntu's default install image doesn't include mdadm, but the alternate install image has the tools needed to set up a software RAID 1 boot.

May need to install from USB stick or temporarily attach an optical drive, since all drive bays may have hotswap trays.

ext4 with the default blocksize has a maximum size of 16TB, so we need to use XFS or a larger blocksize. XFS has the additional advantage of a proven defrag tool (so more file access can be sequential, freeing up disk bandwidth for resyncing or local processing), and it fares well against ext4 in benchmarks, except for creating/deleting lots of files at once (thousands per second), which we likely won't need.
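
A sketch of the XFS route on the main array (placeholder device and mount point):

  mkfs.xfs /dev/md1                 # no 16TB ceiling at the default 4KB blocksize, unlike ext4
  mount /dev/md1 /srv/storage
  xfs_info /srv/storage             # reports geometry, useful for confirming stripe alignment
  xfs_fsr /srv/storage              # the defrag tool, could run from cron during quiet hours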

Ubuntu's udev may rename block devices if the system is rebooted after a device failure, but this should not pose a problem for mdadm when recognizing array devices.

For processing involving very high file IO, especially creating large (hundreds of MB) files and then reading from them, it would be better to write the file on the machine running the script, so that it doesn't need to go over the network. This is a general rule, not something specific to this project, because even with a gigabit network, the network can saturate before a modern rotational drive does (and will saturate long before a single SSD would). We could also build the primary NFS server with lots of RAM and processing power and run processing locally on it; the RAID array could have throughput above that of an SSD due to its parallel nature.

There are bays that put up to 6 2.5" SATA drives into one 5.25" bay; however, the maximum capacity I have found for 2.5" SATA internal drives (including SSDs) that fit the 9.5mm height is 1TB at $110 a drive, which works out to barely more total storage than using 3.5" 3TB drives, for a much higher price. Having many more drives in parallel makes it potentially a much higher-performance setup (assuming we can squeeze in enough controller cards, each with enough dedicated lanes to handle the full throughput), but we have no use for such performance (even locally, the machine would have trouble saturating the IO), especially at that price. There are server cases with hotswap bays for 2.5" disks, but they don't go beyond 72 bays, which merely matches the amount of storage we can get from 3.5" drives. Also, having very large numbers of drives means we would need to group the disks into multiple arrays for robustness, which complicates the setup.
