Raspberries and Bananas on NFS

This sounds like one of my Cooking Recipes. It is (mostly) gluten free and should not cause a metabolic mishap unless you are severely allergic to Linux. If you suffer from that particular malady, I am not sure what you are trying to do with a Razzie in the first place.

Rationale - IO to SD Cards Sucks Bricks Sideways Through a Thin Straw

This may sound rude, but it is in fact the engineering description of the problem. It is by no means Razzie specific. It stinks equally badly on any platform except Chromebooks and similar hardware where the interface is mostly an emulation. There, the SoC is wired directly to the flash and there is no true SD card interface involved. If you are dealing with the real SD card interface, it is slow (a couple of MB/s), the IO Ops/s are laughable, it can stall for considerable periods when doing MLC flash "maintenance" and, most importantly, it uses an unreliable electrical interface. This is especially the case on a RaspberryPi. The electrical contacts on its micro SD card interface leave a lot to be desired. I have no idea what the guys at Sony (which manufactures them under contract) did here, as their own devices do not have anything like the RaspberryPi rate of SD card failures.

One option to deal with the IO issues is to meticulously isolate everything that does IO and try to minimize it. This option is limited in its applicability - you have to write your data somewhere and you have to read executables and data in order for your RaspberryPi to be useful. This is the point where it is worth reconsidering whether to tether it permanently via its Ethernet interface and put its SD card interface on extended leave.

You can scale a RaspberryPi fleet to armada size this way. The Raspberry is limited to 100 Mbit Ethernet (in fact less - it hangs off the USB bus on the SoC, so its IO Ops/s, while higher than those of the SD card interface, are not very high). A modest desktop-class Linux machine with sufficient memory and Gigabit Ethernet can happily support a few hundred IoT RaspberryPis (especially if you run them off a de-duplicated filesystem). You also get to reuse tons of old SD cards which would otherwise go in the bin because they do not have sufficient capacity or are too slow to hold a modern OS.

None of this is rocket science, by the way. It is not new either. In the days when disks were expensive and lots of workstations ran diskless, doing what is described here was part of the sysadmin's job.

Creating the NFS filesystem

It is best to start with a fresh install. Grab Raspbian or Raspbian Lite and install (at least) the NFS client package:
# apt-get install nfs-client 

Mount the directory where you intend to keep your boot images on the server
mount myserver:/exports/boot /mnt 

Copy Raspbian to the server. The ability of cp to preserve attributes and links is quite helpful here.

mkdir /mnt/raspbian-master
cp -rax / /mnt/raspbian-master

Edit the fstab in this newly created master:
proc            /proc           proc    defaults          0       0
# change this to your server IP address    /       nfs rw,ac,wsize=4096,rsize=4096,nolock      0       0
none            /tmp           tmpfs    defaults        0       0
none            /var/tmp       tmpfs    defaults        0       0

Note - we have moved /tmp and /var/tmp to a tmpfs in order to minimize the amount of traffic to the NFS server, including the high-cost operations (stat, lock, etc.).
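For illustration, here is the same fstab written out by a small shell function. This is only a sketch: the function name make_nfsroot_fstab, the server address 192.0.2.10 (taken from the documentation address range) and the export path are assumptions of mine, so substitute your own.

```shell
#!/bin/sh
# Print an fstab for an NFS-root client.
# $1 = NFS server address (hypothetical example: 192.0.2.10)
# $2 = path of this client's root tree as exported by the server
make_nfsroot_fstab() {
    server="$1"
    rootpath="$2"
    cat <<EOF
proc            /proc           proc    defaults          0       0
${server}:${rootpath}    /       nfs rw,ac,wsize=4096,rsize=4096,nolock      0       0
none            /tmp           tmpfs    defaults        0       0
none            /var/tmp       tmpfs    defaults        0       0
EOF
}

# Example, run on the Pi while the server export is still mounted on /mnt:
# make_nfsroot_fstab 192.0.2.10 /exports/boot/raspbian-master > /mnt/raspbian-master/etc/fstab
```

Generating the file this way keeps the mount options identical across images and leaves only the server and path to vary.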

We now have a base Raspbian image on the server at our disposal. We will use it as a master to create our Raspberry fleet. We can repeat the same procedure for other ARM SoC builds such as Bananian on the BananaPi, BeagleBoard, etc.

Setting Up a DHCP and NFS server

If you intend to use root on NFS for more than a few Razzies, running a proper DHCP server is pretty much a given. While in theory you can hardcode IPs, paths, etc into the boot params for each of them (as described in the diskless boot HOWTO) it will become unmanageable and unmaintainable very quickly.
apt-get install isc-dhcp-server

Edit the DHCP server config in /etc/dhcp/dhcpd.conf - there are some example entries there to begin with. You can kill them all and just do a simple basic config for your network:
option domain-name "kot-begemot.co.uk";
option domain-name-servers;
subnet netmask {
  option broadcast-address;
  option domain-name-servers;
  option routers;
}

Restart the DHCP server:
/etc/init.d/isc-dhcp-server restart
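As a rough sketch of what a filled-in subnet declaration looks like, the function below prints one from its arguments. The function name and every address in the example are hypothetical (the 192.0.2.0/24 documentation range), so treat them as placeholders for your own network.

```shell
#!/bin/sh
# Print a dhcpd.conf subnet declaration.
# $1 = network, $2 = netmask, $3 = broadcast address,
# $4 = default router, $5 = DNS server (all hypothetical example values)
dhcp_subnet_stanza() {
    cat <<EOF
subnet $1 netmask $2 {
  option broadcast-address $3;
  option domain-name-servers $5;
  option routers $4;
}
EOF
}

# Example (as root):
# dhcp_subnet_stanza 192.0.2.0 255.255.255.0 192.0.2.255 192.0.2.1 192.0.2.1 >> /etc/dhcp/dhcpd.conf
# dhcpd -t -cf /etc/dhcp/dhcpd.conf    # syntax check before restarting
```

Running the syntax check first is cheap insurance: a typo in dhcpd.conf otherwise takes the whole fleet off the network at the next restart.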

Install the nfs-kernel-server package on your Linux machine. Some Linux distributions disable the portmapper and other key components essential for NFS - you need to enable them in order for NFS to work. Create a basic exports file:
# /etc/exports
/exports  ,async,no_root_squash,no_subtree_check,nohide,fsid=root),async,no_root_squash,no_subtree_check,nohide,fsid=root),async,no_root_squash,no_subtree_check,nohide,fsid=root)

I traditionally put everything which will be shipped out via NFS under /exports (as was the recommended practice for SunOS many years ago before my beard was white). There are other alternative locations. It is better if the location is on a separate filesystem (see the Dark Magic section for more details). Restart NFS.

/etc/init.d/nfs-kernel-server restart
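A minimal sketch of a single exports entry, for readers who want to see the shape of the line. The client network 192.0.2.0/24 and the rw flag are assumptions of mine (a root filesystem has to be writable); the remaining options are the ones used above.

```shell
#!/bin/sh
# Print one /etc/exports line.
# $1 = exported directory, $2 = client network in CIDR form (hypothetical)
exports_line() {
    printf '%s  %s(rw,async,no_root_squash,no_subtree_check,nohide,fsid=root)\n' "$1" "$2"
}

# Example (as root):
# exports_line /exports 192.0.2.0/24 >> /etc/exports
# exportfs -ra    # re-read /etc/exports without restarting the NFS server
```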

Create a Network Boot SD Card

The Raspberry is quite different from all other ARM SoCs - it uses a two-stage boot with a text config file. Other ARM SoCs like Exynos, Rockchip or Mediatek all use a boot script which you have to compile from its text form using appropriate tools.

So all you need to do on a Raspberry is change cmdline.txt in the DOS partition on the SD card. Remove the current root= entry, add root=/dev/nfs, and add an ip= entry to pick up data from DHCP. It is also worth changing the elevator to noop (though this will not matter much, as you are not doing block IO to speak of).

dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/nfs ip=dhcp elevator=noop fsck.repair=yes rootwait

It is possible to hardcode the ip params instead, but you are not going to get far with that. It also means that you can no longer use the same image for all Raspberries.
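The cmdline.txt edit can be scripted. The sketch below assumes a stock line that contains root=, rootfstype= and elevator= entries, which is my assumption about what your image ships with. The function name rewrite_cmdline is hypothetical too. It does not add an elevator= entry if none exists, and it does not touch an existing ip= entry.

```shell
#!/bin/sh
# Rewrite a kernel command line (read from stdin) for NFS root:
#  - replace the root= entry with root=/dev/nfs ip=dhcp
#  - drop rootfstype= (meaningless for NFS)
#  - switch the elevator to noop
rewrite_cmdline() {
    sed -E 's#(^| )root=[^ ]*#\1root=/dev/nfs ip=dhcp#; s#(^| )rootfstype=[^ ]*##; s#(^| )elevator=[^ ]*#\1elevator=noop#'
}

# Example, with the DOS partition mounted on /boot:
# cp /boot/cmdline.txt /boot/cmdline.txt.orig
# rewrite_cmdline < /boot/cmdline.txt.orig > /boot/cmdline.txt
```

Keeping the original file next to the new one makes it easy to go back to SD card boot if the network side misbehaves.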

If we are doing this for a BananaPi we need to compile the script instead. We can find the source of the script on the boot partition. It is called boot.cmd. The boot partition is usually not mounted, so we need to mount it first:
mount /dev/mmcblk0p1 /boot

Install u-boot tools to be able to compile the script:
sudo apt-get install build-essential u-boot-tools uboot-mkimage

Edit the setenv line in boot.cmd to use nfs root:
setenv bootargs console=ttyS0,115200 console=tty0 console=tty1 root=/dev/nfs ip=dhcp elevator=deadline rootwait

Compile the script into binary form:
mkimage -A arm -O linux -T script -C none -a 0 -e 0 -n "BananaPi boot script" -d boot.cmd /boot/boot.scr

In theory, we should be able to copy it around. In practice, this does not work, so we have to grab an image of the beginning of the SD card, including the DOS partition, so we can clone it later. As the sizes may differ, we should get them using fdisk to make sure we are copying the correct portion.
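One way to grab that image is sketched below. Instead of reading the numbers off fdisk by hand, it takes the partition start and size from sysfs, which reports both in 512-byte sectors (fdisk -l shows the same figures). The function names, device names and output file name are all assumptions for illustration.

```shell
#!/bin/sh
# Number of 512-byte sectors from the start of the card to the end of a
# partition. $1 = partition start sector, $2 = partition size in sectors
head_sectors() {
    echo $(( $1 + $2 ))
}

# Copy the head of the card (partition table plus first partition) to a file.
# $1 = whole-card device name, $2 = first partition name, $3 = output image
grab_card_head() {
    disk="$1"
    part="$2"
    out="$3"
    start=$(cat "/sys/block/${disk}/${part}/start")
    size=$(cat "/sys/block/${disk}/${part}/size")
    dd if="/dev/${disk}" of="$out" bs=512 count="$(head_sectors "$start" "$size")"
}

# Example (as root):
# grab_card_head mmcblk0 mmcblk0p1 bananapi-netboot-head.img
```

The resulting file can later be written to a fresh card with dd in the opposite direction.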

Done - your boot sequence is now set for diskless boot.

Creating the Base Image

If you are wondering whether you are done - no, you are not. The image you created in the beginning is the one you fall back to in real emergencies. It is better to keep out of it both the software you really want to use to command Things and anything "unsafe". You clone it once by copying the whole tree using cp -rax on the server to create your local master.
mkdir /exports/raspbian-base
cp -rax /exports/raspbian-master/* /exports/raspbian-base

We also need to adjust the fstab NFS location in the new image to make sure it points to the correct path. We can now configure this as the image to boot from by adding/changing an entry to your DHCP server.
host model-2-2 {
        hardware ethernet b8:27:eb:6a:f3:6d; # adjust this to be your MAC address!!!
        fixed-address; # adjust for your network
        option root-path "/exports/raspbian-base";
        next-server; # this is the server you will be booting from - adjust to your IP address
}

You can now reload your DHCP server config and boot your Raspberry from the SD card you prepared in the previous step. If you do not know the MAC address of your Raspberry, you can try to boot anyway. The boot will fail, but you will get an entry in the syslog of the Linux machine telling you the Raspberry's MAC address.
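To fish the MAC address out of the log, something along these lines should do. It relies on the "DHCPDISCOVER from <mac> via <interface>" lines that the ISC DHCP server writes. The function name seen_macs is hypothetical, and the log path is an assumption that may differ on your distribution.

```shell
#!/bin/sh
# Read syslog text on stdin and print each MAC address that sent a
# DHCPDISCOVER, one per line, without duplicates.
seen_macs() {
    grep -o 'DHCPDISCOVER from [0-9a-fA-F:]*' | awk '{ print $3 }' | sort -u
}

# Example (as root, assuming the log lives in /var/log/syslog):
# seen_macs < /var/log/syslog
```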

Install all the packages needed. If you will be using NFSv4 anywhere (this sometimes shows up for v3 on Linux as well) you need all user IDs and group IDs to be the same across images. Thus, it is better to install key software at the very beginning: the UIDs and GIDs that packages allocate for their system accounts are usually too low-numbered to be fed from the network, so they are always read from the local filesystem instead.

Setting up a New Raspberry

Setting up a New Raspberry becomes trivial.
  1. Copy the Base Image to a new directory. Adjust filesystem and other parameters as needed - you will need to make sure that the root fs entry in the fstab matches the root fs entry in the DHCP config file. While in theory, you should be able to pivot to a new root filesystem mid-boot in Linux, in practice, on Raspbian the most you can do is change the options - specifically to turn off locking and set the read and write sizes (as in your fstab).
  2. Copy the DOS part of the SD card image with the network boot commands to a spare SD card (it can be very small - only MB in size). You need only the FIRST (DOS) partition and you can in fact copy only the files - no need to dd the whole image. Try to boot the Pi in order to get the MAC address in the log.
  3. Create a new entry in the DHCP config file for your new Pi and reload the DHCP server config:
host model-2-3 {
        hardware ethernet b8:27:eb:15:26:87;
        option root-path "/exports/boot/raspberry-m2-3";
}

Reboot using your new filesystem. You can repeat this procedure as many times as you like (provided you have enough disk space). If you are using the optimization procedure suggested in the Dark Magic section below, you are likely to be able to run 100 Pis with ease (if you need to). It is faster than running off most SD cards and, most importantly, significantly more reliable.
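The steps above lend themselves to a small helper. This is a sketch only: the function name dhcp_host_entry is made up, and the example reuses the host label, MAC address and paths from the text above. Remember that the fstab inside the new tree still has to be pointed at the new path by hand.

```shell
#!/bin/sh
# Print a dhcpd.conf host entry for one Pi.
# $1 = host label, $2 = MAC address, $3 = root path on the NFS server
dhcp_host_entry() {
    cat <<EOF
host $1 {
        hardware ethernet $2;
        option root-path "$3";
}
EOF
}

# Example (as root, on the server):
# cp -ax /exports/raspbian-base /exports/boot/raspberry-m2-3
# dhcp_host_entry model-2-3 b8:27:eb:15:26:87 /exports/boot/raspberry-m2-3 >> /etc/dhcp/dhcpd.conf
# /etc/init.d/isc-dhcp-server restart
```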

Model Specific Notes

The best RaspberryPi model to use is the 3. Pi models 1 and 2 will both work and perform reasonably (by Pi standards), but their Ethernet is seriously brain-damaged. You will see up to 50% of the available CPU being used by a single kernel workqueue thread (this is the kernel shoveling requests to/from the USB-attached Ethernet). It is still comparable to the rather decrepit SD card read/write, but nowhere near what you can and should get from NFS.

Compared to that, the model 3 shines - I guess they did make some changes to the way the USB interacts with the platform. Kernel threads handling the workqueue to/from the Ethernet take 1-2% (that is normal - a low-end x86 would use a similar amount). The whole system flies too.

It is not just NFS by the way - if we try other network intensive loads like for example running the Pi as an X Terminal the picture will be similar. The performance on the 1 and 2 is significantly worse than expected.

Dark Magic on your NFS server

The biggest issue with root-on-NFS is the waste of space and the IO thrashing on the server. There are various ways to solve it, but probably the simplest and safest is to make use of a de-dupe capable filesystem. The key driver here is not so much the disk space (disks are cheap nowadays), but the IO latency. A deduped filesystem allows an access to, let's say, /sbin/ifconfig to be cached in the server memory and used by all clients which want to read /sbin/ifconfig. If the filesystem were not deduped, each of them would read its "own" copy of the file from disk instead, incurring a seek plus a disk read.

In the case of Linux, deduplication more or less means btrfs (on BSD we can use ZFS instead). It is worth rebuilding and repartitioning your server if need be.

Let's assume for our purposes that we have a btrfs partition on /dev/sdd1 which is mounted on /exports/boot. We first need to get dedupe working. In order to do that we will have to build the btrfs deduplication tool. Unfortunately, it does not ship with current Linux distributions.

The best way to do that is to follow the instructions on the bedup project page.

Once we have it built and installed we need to defrag the btrfs filesystem and run bedup
btrfs filesystem defragment -r /exports/boot
bedup dedup /dev/sdd1 

It is best to add this to a crontab entry, redirecting output to /dev/null (it is quite verbose), so it is done regularly. I run it on a weekly basis. If you feel you need to do it more often - do so (in general you need to do it after each image cloning). Unfortunately, neither bedup nor btrfs itself is particularly good at displaying deduplication statistics, so you end up flying blind here. The upside compared to other approaches is that even if you upgrade images, install/remove software, etc., once the offline deduplication process has run it will take care of any duplicated data. The downside is that the deduplication is not inline. So if you are doing an image upgrade, you either need to rerun it after upgrading each Raspberry or stagger the upgrades in order not to run out of disk space on the btrfs filesystem.
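A weekly entry could look like the sketch below. The function name, the Sunday 03:00 schedule and the file name /etc/cron.d/bedup are my choices, not anything bedup requires. Cron runs with a short PATH, so you may need the full path to wherever bedup got installed.

```shell
#!/bin/sh
# Print an /etc/cron.d entry that runs a command as root every Sunday at
# 03:00 and throws its output away. $1 = command line to run
weekly_cron_entry() {
    printf '0 3 * * 0 root %s >/dev/null 2>&1\n' "$1"
}

# Example (as root):
# weekly_cron_entry 'bedup dedup /dev/sdd1' > /etc/cron.d/bedup
```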

There is not that much difference in performance between NFSv3 over TCP with no locking and NFSv4. If you will just be running a dumb Razzie fleet where each of them keeps to itself, you can stay with NFSv3. If you want to share a directory across all of them, you will probably need to go to NFSv4.

Quick And Dirty NFSv4 Configuration

NFSv4 tries to be smart for you and uses names instead of UIDs and GIDs in the transactions. This is likely to result in tears in a lot of cases unless you have all of them the same. Thankfully, they all originate from a common Base Image progenitor so the likelihood that they will differ is very low.

NFSv4 requires two bits of configuration:
  • Configure your root fs on the server. It was originally intended that this is your proper root, which is clinically insane security-wise. I would suggest using /exports, or another dedicated directory, as your NFSv4 root instead. This is done by giving it fsid=0 in exports (fsid=root is equivalent). Note - all of your other file systems accessible over NFSv4 must reside under this directory. For example:
/exports  ,async,no_root_squash,no_subtree_check,nohide,fsid=root),async,no_root_squash,no_subtree_check,nohide,fsid=root),async,no_root_squash,no_subtree_check,nohide,fsid=root)
  • Configure your naming domain in /etc/idmapd.conf
Domain = kot-begemot.co.uk

The domain must be the same on all clients and servers. NFS must now be restarted everywhere:
/etc/init.d/nfs-common restart

You can now mount nfsv4 exports in your fstab on the clients:
# note - the path is relative to the fsid=0 entry in /etc/exports on the server    /shared-space       nfs4 rw,ac,wsize=4096,rsize=4096,nolock      0       0

-- AntonIvanov - 02 Mar 2017
Topic revision: r6 - 08 Mar 2017, AntonIvanov
