User:Tim/NFS server build

From Van Essen Lab

Revision as of 09:01, 24 June 2011

The goal of this project is to create 2 identical NFS servers from standard parts, having roughly 30TB of robust storage each, for a cost of around $4000 each. One of the two will be built first to ensure compatibility of parts, and to make any necessary adjustments to the build, following which the other will be built to the final working specifications of the first.



Drives

Storage: between 10 and 20 of one of the following:

Western Digital Caviar Green 3TB - 5400rpm, $150, cheapest 3TB from Western Digital - appears to come with a minimal 2 port SATA controller card, but likely will not serve our purposes (would need to use 3 or 4 of them)

Hitachi Deskstar 3TB - 7200rpm, $180, cheapest 7200rpm 3TB on newegg

Hitachi Deskstar 3TB - 5400rpm, $130, cheapest 3TB on newegg

Seagate Barracuda XT 3TB - 7200rpm, $215, cheapest Seagate 3TB on newegg

Boot drives: 0 or 2 of one of the following:

OCZ Vertex 30GB SSD - $85, cheap but reputable solid state drive with okay performance (per Tom's Hardware Guide) and enough size for OS, will greatly outperform rotational disks for booting

OCZ Vertex 2 40GB SSD - $105, slightly more space and significantly higher specifications than the above OCZ Vertex

OCZ Vertex 2 60GB SSD - $120, slightly more space, and better sustained write, nearly equivalent to Vertex 2E series

OCZ Agility 3 60GB SSD - $135, SATA III boosts speed to nearly double the Vertex 2, if motherboard supports SATA III

OCZ RevoDrive 50GB PCIe SSD - $215, PCIe interface lets us free up drive mounting points and SATA ports, these are normally used for extreme throughput for video editing, so extremely fast, but also more expensive

Seagate Barracuda 250GB - $37, cheap (slow) hard drive to let the OS boot without the SATA controller expansion card operating

SATA Controller Card

This is a little tricky. Because RAID uses all disks simultaneously, especially when building or rebuilding an array, we want the controller card to be able to saturate the combined disk throughput (3TB drives seem to have about a 150MB/s max sequential transfer, and each lane of PCIe 1.0 gives 250MB/s, so for 4 drives, an x1 controller won't cut it except for random access). Unfortunately, a lot of SATA controller cards use a PCIe x1 interface, and PCI (though it can theoretically handle 533MB/s if the clock and bus widths are right) probably isn't quite up to the task of sequential access (which is used during RAID array build/rebuild and when accessing large files). However, this is less of a concern for serving files through NFS, which on a gigabit network should saturate somewhere around 120MB/s anyway.
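The lane arithmetic above can be checked with a quick sketch; the 150MB/s and 250MB/s figures are the estimates from this section, not measured values:

```shell
# Rough bandwidth check for a SATA controller card (figures from this page).
DRIVE_MBPS=150         # approximate max sequential transfer of a 3TB drive
PCIE1_LANE_MBPS=250    # PCIe 1.0 bandwidth per lane
DRIVES=4
LANES=1

needed=$((DRIVES * DRIVE_MBPS))          # combined sequential throughput
available=$((LANES * PCIE1_LANE_MBPS))   # what an x1 card can carry

echo "needed=${needed}MB/s available=${available}MB/s"
if [ "$needed" -gt "$available" ]; then
    echo "an x${LANES} controller cannot sustain sequential access to ${DRIVES} drives"
fi
```

The same arithmetic with LANES=4 shows why the x4 cards below are comfortable for 4 drives.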

A small side consideration: hard drives have internal caches that can return recently read or written data much faster than the platters can, fast enough to saturate the SATA link (300 or 600MB/s for SATA II and III). If the card's interface can't handle that rate, the cache won't be very effective.

1 of the following:

HighPoint RocketRAID mSASx3 PCIe x8 - $350, has hardware RAID capability (which we don't plan on using); uses different cables (mSAS, not included, $50, there's also a $22 one out of stock) to connect 4 disks to one port on the card, but more importantly supports 12 disks on a single card, allowing us to use only 1 SATA card in case having multiple controller cards installed causes conflicts and keeps us from using more than 1 card at a time

TYAN mSASx4 PCIe x8 - $400, similar reasons, can connect to 16 disks with one card, if we desperately need the connectors, but PCIe x8 is just under the needed bandwidth to serve 16 disks at 150MB/s each

HighPoint RocketRAID mSASx4 PCIe 2.0 x16 - $450, similar reasons, but has enough bus bandwidth to nearly saturate all 16 of its SATA III connections (even at SSD speeds)

OR 1 or 2 of the following:

HighPoint RocketRAID 8xSATA/SAS (should be SATA II or maybe III) PCIe 2.0 x8 - $137, PCIe 2.0 has a massive 500MB/s per lane (backwards compatible with PCIe 1.0 at 250MB/s per lane); it is unclear whether it can sustain 600MB/s or 300MB/s simultaneously to each drive, but either is more than adequate

AND 1 of the following if needed, or 3 or 4 of the following on their own:

Rosewill 4xSATA II PCIe x4 - $70, inexpensive SATA II controller with 4 ports, and 4 PCIe lanes to handle sequential throughput, also has 2 eSATA connectors (but with bus saturation, probably not that useful)

HighPoint RocketRAID 4xSATA/SAS (should be SATA II) PCIe x4 - $90, controller with RAID features (might be partially hardware RAID); similar to the above, but mentions staggered spinup and hot plug, and the manufacturer gets generally good reviews

Intel 4xSATA/SAS (600MB/s, should be SATA II or maybe III) PCIe 2.0 x4 - $156, appears to be a hardware RAID controller (though with very limited RAID modes, which we don't need), able to simultaneously serve 4 drives at 500MB/s each of the 600MB/s SATA III maximum, far beyond our current needs


Case

Lian Li aluminum ATX toolless full tower - $320, toolless case, 12 5.25" bays, recommended by Jon S

Lian Li aluminum ATX full tower - $290, very similar case, but requires tools for assembly.

XCLIO steel chassis ATX full tower - $210, cheaper case, also with 12 5.25" bays

Drive bays

up to 4 of SNT 5 in 3 3.5" to 5.25" hotswap cage - $100, activity, power, temperature, fan lights, puts 5 drives in the space of 3 bays for up to 20 drives for 51TB with hot spare

up to 12 of Kingwin hotswap rack - $19, has activity and power lights for each disk, activity light will help for locating a failed disk

up to 6 of Thermaltake 3 in 2 3.5" to 5.25" hotswap cage - $70, activity, power, temperature, fan lights, 3 drives in 2 bays for up to 18 drives into the case for 45TB with a hot spare, or 15 with two bays left over for optical drive or something else, reviews suggest fan is a bit loud

up to 6 of Icy Dock 3 in 2 3.5" to 5.25" hotswap cage - $85, activity, power, failure lights, temperature alarm, up to 18 drives, as above, but has an odd setup for the data cables that may not support SATA III (but that shouldn't matter, rotational drives can barely saturate SATA I), reviews suggest fan isn't as loud

0 or 2 of Koutech dual 2.5" and 3.5" in 5.25" - $31, fits both a HDD and a SSD in a single bay, activity lights for both, may not support SATA III for the SSD

0 or 2 of SilverStone 2.5" to 3.5" adapter - $17, hotswap adapter for putting an SSD into a 3.5" hotswap bay

0 or 1 of SNT dual 2.5" hotswap rack - $24, hotswap bays with activity lights designed for SSDs, but we would need to also get an adapter to put a 3.5" tray into a 5.25" bay

My preferred combinations:

3 5 in 3 HDD bays, 1 triple HDD bay, 1 double SSD bay - 18 disks

2 SSD/HDD combo bays, 3 5 in 3 HDD bays, 1 single HDD bay - 18 disks

2 SSD/HDD combo bays, 3 5 in 3 HDD bays, 1 optical drive - 17 disks

internally mounted SSDs, 4 5 in 3 HDD bays - 20 disks

internally mounted SSDs, 12 single HDD bays - 12 disks - for more room for airflow around disks, fewer small (noisy) fans, less chance of multiple disk failure

11 single HDD bays, 1 double SSD bay - 11 disks, more airflow

Power supply

PC Power and Cooling "Silencer mkII" 650W - $95, single high current 12V rail at up to 46 amps, 12 WD Caviar Green drives only need 21 amps if all 12 spin up at once

PC Power and Cooling "Silencer" 910W - $150, massive single 74 amp 12V rail (82 amps peak), for simultaneous spinup of large numbers of higher performance drives
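As a sanity check on the rail ratings, a sketch of the spin-up arithmetic, assuming the 1.75A per-drive spin-up draw implied by the 21A figure above (values in hundredths of an amp to keep the shell arithmetic integral):

```shell
# Spin-up current check on the 12V rail (figures from this page).
SPINUP_CENTIAMPS=175    # assumed per-drive spin-up draw: 21A / 12 drives = 1.75A
DRIVES=12
RAIL_AMPS=46            # Silencer mkII 650W continuous 12V rating

needed=$((DRIVES * SPINUP_CENTIAMPS))
printf 'simultaneous spin-up: %d.%02dA of %dA available\n' \
    $((needed / 100)) $((needed % 100)) "$RAIL_AMPS"
```

Even simultaneous spin-up of all 12 Caviar Greens leaves a wide margin on the 650W supply; the 910W unit only matters for larger counts of higher-performance drives.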

System configuration

Instead of a proprietary embedded OS with a web interface, this machine will run a full-fledged desktop operating system, likely Ubuntu 10.10, for ease of maintenance and configuration. Hardware RAID controllers are expensive (especially if handling more than 4 disks), and tend to use proprietary methods to label disks, so the plan is to use Linux's built-in software RAID drivers, which are more transparent and portable. As such, the disks only need to be connected to a standard SATA port; everything else is done in software. However, most motherboards don't have 12-14 SATA connectors, so a SATA controller card will be required for some of the storage array disks.

The most robust and high-performance configuration would be two SSDs, connected directly to the motherboard, in software RAID 1 for booting and the operating system, with as many 3TB drives as can fit in the case in a RAID 6 array (two drives from this array can fail without losing any data), plus a hot spare (a disk already in the machine but unused until a disk fails, at which point it is used immediately to rebuild so the array returns to full redundancy). Assuming only 10 bays are available for the 3TB drives, this gives a capacity of 7 disks, or 21TB of raw space for storage. Without the hot spare, this increases to 24TB, and if the SSDs can go somewhere other than a drive bay, 12 bays would give 27 or 30TB of raw space (depending on the hot spare).
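The raw-capacity figures above follow from RAID 6 reserving two disks' worth of space for parity; a minimal sketch of the arithmetic:

```shell
# Raw capacity of the RAID 6 layouts discussed above, with 3TB drives.
raid6_tb() {  # usage: raid6_tb <total_disks> <hot_spares>
    echo $(( ($1 - 2 - $2) * 3 ))   # RAID 6 reserves 2 disks' worth for parity
}

echo "10 bays, hot spare: $(raid6_tb 10 1)TB"
echo "10 bays, no spare:  $(raid6_tb 10 0)TB"
echo "12 bays, hot spare: $(raid6_tb 12 1)TB"
echo "12 bays, no spare:  $(raid6_tb 12 0)TB"
```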

Alternate configurations include using LVM to put the root filesystem on the same array as the main storage. The drawback is that an event that takes the storage partition down (such as failure of the controller card) also takes the entire operating system down, but it allows us to use 12 3TB drives with no separate boot drives, for 27 or 30TB of raw space, with no real performance hit.
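A sketch of what the root-on-LVM layout might look like; the device and volume names are hypothetical, and this assumes the array has already been created as /dev/md0:

```shell
# Sketch only -- /dev/md0, "storage", and the sizes are hypothetical.
pvcreate /dev/md0                       # the RAID 6 array as an LVM physical volume
vgcreate storage /dev/md0
lvcreate -L 50G -n root storage         # root filesystem lives on the array
lvcreate -l 100%FREE -n data storage    # everything else for the NFS export
```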

A configuration that would keep all disks identical without sacrificing performance or bootability in the event of the storage array going down would be to use 2 of the 3TB disks, connected to the motherboard, in a separate software RAID 1, for 21 or 24TB again.

One of the concerns for a RAID 6 array over such a large amount of storage is that it may take a long time to build or rebuild, that is, to make the array able to suffer 2 disk failures without losing data. This must be done when the array is first created, or when a disk is replaced (including being "replaced" by the hot spare). A more exotic setup would alleviate this concern, though not without drawbacks: a RAID 1+0 array, where the disks are first paired so that each disk holds a perfect copy of its partner, and data is then written across the pairs with blocks going to different pairs in round-robin fashion. Because all of the redundancy comes from disks simply being mirror images, the array is relatively fast to build or rebuild, and this is currently the highest general-performance RAID setup. But you only get half the space: in our case, 15TB for 10 disks, or 18TB for 12 (depending again on the boot drive setup). The other main drawback is that failure of two drives in the same pair causes data loss, so 2 failed disks have a chance of making the array irretrievable (RAID 6 requires 3 lost disks before it is irretrievable, but absolutely any 3 cause this, while RAID 1+0 can tolerate one lost disk from each pair, if you are lucky).
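The two layouts could be created with mdadm roughly as follows; the device names are hypothetical placeholders, and this is a sketch rather than the final commands:

```shell
# Sketch only -- /dev/md1 and /dev/sd[b-m] (12 disks) are hypothetical.
# RAID 6 across 11 disks plus 1 hot spare: any 2 failures survivable.
mdadm --create /dev/md1 --level=6 --raid-devices=11 --spare-devices=1 /dev/sd[b-m]

# RAID 1+0 across the same 12 disks: half the space, faster rebuilds.
mdadm --create /dev/md1 --level=10 --raid-devices=12 /dev/sd[b-m]
```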

If additional storage is required in the future, an external drive enclosure with an eSATA interface and port multiplier may be the best option, such as this. Such an enclosure will require only 1 PCIe expansion slot, and we could use LVM2 to combine the new and old arrays into one large storage partition, even though they are different sizes. Another possible option is a USB 3.0 enclosure, but in either case, we would likely want the enclosure to report all its disks separately, and use software RAID again.
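If LVM2 is used as in the configurations above, folding a new external array into the existing storage might look like this sketch (device names, volume names, and the mount point are hypothetical, and this assumes XFS on the data volume):

```shell
# Sketch only -- names and paths are hypothetical.
pvcreate /dev/md2                          # the new array in the enclosure
vgextend storage /dev/md2                  # grow the volume group onto it
lvextend -l +100%FREE /dev/storage/data    # grow the data volume over the new space
xfs_growfs /mnt/data                       # XFS can grow while mounted
```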

Backup plan

The plan is to build two of these for a very specific reason: in case one somehow manages to die, even temporarily, the other should keep a copy of everything on it. Additionally, the plan is to keep old files on the backup server for a period of time after they are deleted from the main server: a daily job will copy new files from the entire volume, and a weekly (or longer) job will take care of removing old files from the backup server.
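The two jobs might be sketched with rsync as follows, run from cron; the hostname and paths are hypothetical placeholders:

```shell
# Sketch only -- "backup-server" and /storage are hypothetical.
# Daily: copy new and changed files to the backup server, never deleting,
# so files removed from the main server linger on the backup.
rsync -a /storage/ backup-server:/storage/

# Weekly (or longer): the same transfer with --delete also prunes files
# that no longer exist on the main server.
rsync -a --delete /storage/ backup-server:/storage/
```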

Backup between the two servers may be accomplished over a separate connection from the WUSTL network, if desired, since the servers are located in the same lab. The upgrade to gigabit network may make this irrelevant, depending on whether rsync can saturate the network (considering the gigabit network clock is only a third of the CPU clock, it may not be possible to saturate the network with 1 machine). If they can saturate the network, having a separate connection would leave the machines more responsive on the WUSTL network during backup, especially for operations not involving the main storage array.

Build plan

The most careful plan is to build the first machine incrementally, especially with regards to disks, ensuring that the particular drives ordered will work as expected, so that we don't end up buying 12 disks to find out they won't work with the SATA controller card. In particular, we could order 1 disk of several types, and check them each for compatibility and performance (and whether the activity light on the drive bay works with them) before deciding on the disks to order for the rest of the array.

Many of the other parts do not need to be high end or cutting edge, so there is less chance of compatibility problems. The other main concern is motherboard SATA/BIOS compatibility with large drives. Linux software RAID should not place a large burden on the CPU, so the CPU and RAM can be fairly modest.

Miscellaneous Notes

mdadm metadata v0.90 may be required to be able to boot from an array (the /boot partition), but it limits devices to 2TB; v1.2 should probably be used on the main array.
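A sketch of how the metadata versions would be specified; the device names are hypothetical:

```shell
# Sketch only -- device names are hypothetical.
# Boot mirror: old 0.90 metadata so the bootloader can read /boot
# (fine here, since boot devices are well under the 2TB limit).
mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Main array: current 1.2 metadata, no 2TB device limit.
mdadm --create /dev/md1 --metadata=1.2 --level=6 --raid-devices=12 /dev/sd[c-n]
```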

The Ubuntu default install doesn't include mdadm, but the live system or the alternate install CD seems to have the tools needed to set up a software RAID 1 boot.

May need to install from USB stick or temporarily attach an optical drive, since all drive bays will have hotswap trays.

ext4 with the default blocksize has a maximum filesystem size of 16TB, so we need to use XFS or a larger blocksize. XFS has the additional advantage of a proven defrag tool (so more file access can be sequential, though that is unlikely to matter for NFS usage due to network saturation), and fares well against ext4 in benchmarks, except for creating/deleting lots of files at once (thousands per second), which we likely won't need.
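The 16TB figure falls out of ext4's on-disk format; a quick check of the arithmetic, assuming the default 4KiB block size and 32-bit block numbers (bash 64-bit arithmetic):

```shell
# Where ext4's default ~16TB ceiling comes from.
blocks=$(( 1 << 32 ))               # 32-bit block numbers: 2^32 addressable blocks
block_size=4096                     # default ext4 block size in bytes
max_bytes=$(( blocks * block_size ))
echo "ext4 max filesystem size: $(( max_bytes >> 40 ))TiB"
# mkfs.xfs on the array device would sidestep the limit entirely
```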

Ubuntu's udev may rename block devices if the machine is rebooted after a device failure; this should be tested to see if it poses a problem.

For processing involving very high file IO, especially creating large (hundreds of MB) files and then reading from them, it would be good to write the file on the machine running the script, so that nothing needs to be written over the network. This is a general rule, not something specific to this project: even with a gigabit network, the network will saturate before a modern rotational drive does (and long before a single SSD would). As such, if we build a high-end computer or two to run intensive tasks on, we may want to set them up with local SSD storage for IO performance, or a RAID array of decent-performance, modestly sized disks, and have processing scripts use it as a scratch directory (under a subdirectory named for their process ID, deleting their files when done). An even higher-performance, but smaller, option would be a ramdisk, that is, using a portion of the RAM as a temporary filesystem, but space is severely limited.
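The scratch-directory convention might be sketched like this; SCRATCH_ROOT and the ramdisk mount are hypothetical, with /tmp standing in so the sketch runs anywhere:

```shell
# Sketch of the per-process scratch-directory convention described above.
# SCRATCH_ROOT would be a local SSD/RAID mount (or tmpfs ramdisk) on a real node.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"
workdir="$SCRATCH_ROOT/job.$$"        # one subdirectory per process ID
mkdir -p "$workdir"
# ... heavy file IO happens here, on local disk, not over NFS ...
rm -rf "$workdir"                     # clean up when the job is done

# Ramdisk variant (root only): mount -t tmpfs -o size=8G tmpfs /scratch
```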
