bcache SSD testing and tuning

These are my results from a clustered FTP service running off a LizardFS filesystem, with bcache as the SSD caching component.

EnhanceIO is known to be a very good choice. This time I also tested bcache; these are the notes and the steps.

Enabling bcache makes you lose all the data on any block device involved!

 

Setup

  • Two LizardFS storage nodes
  • Three FTP instances, all running off a LizardFS mount
  • 2GB LizardFS writeback cache
  • 400GB off a 500GB NVME SSD as cache. (950pro, consumer model, 400TBW)
  • 15TB Disk per Node
  • 64GB Ram per Node
  • 12 Cores

 

Testing notes and results

  • Bcache hot-adding of new disks is mildly underdocumented, as is destroying single bcache devices online
  • Performance counters of bcache are shit, mostly concerned with vanity metrics for collectd vs. algorithmic / tuning-relevant info. Especially bad compared to flashcache (hit ratio and fill state are OK, but e.g. you can't see the cache size(!)). This means the "current" numbers can't be interpolated well. Need to check if you can relate the hit rate to GBs transferred.
  • Bcache also assumes a large disk RAID vs. SSD; you need to enable sequential IO caching to get _any_ result with bcache if you don't have the original author's setup (if I understood correctly he used a non-BBU RAID6)
  • Increased the sequential cutoff to 64MB to be in line with the LizardFS chunk size
  • The bcache init script needed to be modified to scan NVMe devices (see the sketch after this list)

  • Flash caching is able to flatten out concurrent r/w, but also causes it

  • On heavy sequential IO I didn't get over an 8% cache fill rate, and the hit rate was somewhere around 20%, definitely decreasing with load. It might have been a lot better if bcache had made more use of the 400GB+ per-node cache, but it didn't feel inclined to, since my use case didn't match what the software expects
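On the init script point: all the script ultimately has to do is push every candidate device through the sysfs register file. A minimal sketch, not the actual Alpine script (the device globs are my assumption):

modprobe bcache
for dev in /dev/sd? /dev/nvme?n?p?; do
    echo "$dev" > /sys/fs/bcache/register 2>/dev/null || true    # devices without a bcache superblock are simply rejected
done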

It seems I had the wrong problem for bcache to help with.

Solutions that also might have worked

  • In total, EnhanceIO would again have been the best caching solution for this use case, since it can even be attached to existing filesystems (saves moving a few terabytes here or there), is easily DETACHED and generally nicer to manage.
  • I already tested EnhanceIO under similar load and it just sustains nicely. Not to forget, it is dead nowadays...
  • Flashcache would have been the performance monster; I've seen flashcache doing a similar task and it held up a lot better. I'm no longer using flashcache because it's pretty bug-ridden and heavily depends on 4K IO sizes. Flashcache also had this "too specific" problem, as it was designed for non-virtualized processes on "no-latency, never fails" FusionIO ultra-highend SSDs.
  • Didn't try LVM caching because my choice of distro (Alpine) often had tiny quirks with "active" LVM addons like thin provisioning.
  • ZFS ARC/L2ARC: Would have made me happy, but to be fair I'd have needed MOAR RAM to use this. 64GB is too little for a 400GB L2ARC; it would have worked with a 128GB L2ARC and probably performed OK.

 

Things learned in passing

  • Check_MK not properly tracking queue lengths / service time and % busy made it a little hard to interpret anything
  • The % busy counter of sar was actually showing 100% for the NVMe devices no matter whether there were 5 or 500 IOs. We found a bug.
  • NVMe vs. single spindles: 10:1 write + read speed, 25:1 read on PCIe 3
  • smartctl cannot access NVMe devices, but there's nvme-cli on GitHub
  • nvme-cli gives you access to most data you'd expect via SMART (example after this list)
  • 950pro seems to have no accessible sensors. (but will auto-throttle itself if it gets too warm...)

  • IOPS stayed around 700-800 even on Flash. Large IOs of course :)
  • An 8-lane M.2 -> PCIe adapter could hold 2 SSDs. Sadly you can't buy one yet.
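For the nvme-cli point above, this is roughly what getting at the SMART-ish data looks like; the output fields vary by drive and nvme-cli version:

nvme list                    # enumerate NVMe controllers and namespaces
nvme smart-log /dev/nvme0    # temperature, wear, media errors, power-on hours, ...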

Commands

Setup

Partitioning a factory-fresh NVMe SSD with proper alignment

localhost:~# parted /dev/nvme0n1
GNU Parted 3.2
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
(parted) mkpart primary 2048s 20G
(parted) mkpart primary 20G 420G

First partition is for /var/lib/mfs, the second is for caching.
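The same partitioning can be scripted; a minimal non-interactive equivalent of the session above (same device and sizes, adjust to your SSD):

parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart primary 2048s 20G
parted -s /dev/nvme0n1 mkpart primary 20G 420G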

 

In this step, remove all prior filesystem signatures, including bcache's own.

wipefs -a /dev/SSD0n1p2 && wipefs -a /dev/DISKa && wipefs -a /dev/DISKb
modprobe bcache
make-bcache --block 4k --bucket 2M -B /dev/DISKa /dev/DISKb -C /dev/SSD0n1p2

wipefs on Alpine is part of util-linux. It's really helpful, especially since it calculates the position of the GPT backup label; I once did that with dd at like 6am and it wasn't all that easy anymore. As you can see, anything between -B and -C is considered a backing device, and the attachment mechanism handles this nicely: my two disks ended up as two bcacheN devices, which I could then format as two independent devices.

The default parameters do not use the 4k block / 2M bucket sizes; if you built your device with the defaults, you will want to rebuild it.
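To double-check which block / bucket sizes actually landed in the superblock, bcache-super-show from bcache-tools dumps it; a quick check with the device names from above:

bcache-super-show /dev/SSD0n1p2    # cache device superblock (bucket size, cset.uuid, ...)
bcache-super-show /dev/DISKa       # backing device superblock (block size, attached cset.uuid, ...)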

If you need to retry:

wipefs -a /dev/SSD0n1p2 && wipefs -a /dev/DISKa && wipefs -a /dev/DISKb
make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/DISKa /dev/DISKb -C /dev/SSD0n1p2

bcache's make-bcache has an option to overwrite an existing header, but it will only handle ONE header, so with 3 devices it'll never work. You need to wipefs -a all of them and then build the cache. If you change parameters, you again need to include a wipe.

 

Tuning

echo writeback > /sys/block/bcache0/bcache/cache_mode
echo writeback > /sys/block/bcache1/bcache/cache_mode
echo 50 > /sys/block/bcache0/bcache/writeback_percent
for n in /sys/block/bcache*/bcache/sequential_cutoff ; do echo 64M > $n ; done
for n in /sys/block/bcache*/queue/read_ahead_kb ; do echo 4096 > $n ; done

Missing here are queue depth settings and the like on the NVMe device.
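Just to illustrate what I mean, it would be knobs like these on the NVMe queue; the values are placeholders, not recommendations I tested:

cat /sys/block/nvme0n1/queue/nr_requests              # check the current queue depth first
echo 1024 > /sys/block/nvme0n1/queue/nr_requests      # illustrative; the driver may reject values it doesn't like
echo 4096 > /sys/block/nvme0n1/queue/read_ahead_kb    # same readahead as on the bcache devices above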

 

Format, create mountpoints etc.:

mkdir /srv/{mfs1,mfs2}
mkfs.xfs -Lmfs1 /dev/bcache0
mkfs.xfs -Lmfs2 /dev/bcache1
mount -o noatime,noexec /dev/bcache0 /srv/mfs1 &&
mount -o noatime,noexec /dev/bcache1 /srv/mfs2 &&
chown mfs: /srv/mfs[0-9]
echo "/srv/mfs1
/srv/mfs2" >> /etc/mfs/mfshdd.cfg

In my test I didn't add them to fstab or anything - if you're not just testing like I was, please do add the entries and enable the service on boot :-)
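If you do set this up for real, the entries would look roughly like this. Mounting by the XFS labels set above avoids depending on the bcacheN numbering; the chunkserver service name is an assumption, check your LizardFS packaging:

echo "LABEL=mfs1 /srv/mfs1 xfs noatime,noexec 0 2" >> /etc/fstab
echo "LABEL=mfs2 /srv/mfs2 xfs noatime,noexec 0 2" >> /etc/fstab
rc-update add mfs-chunkserver default    # OpenRC on Alpine; the service name is a guess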

 

Operations

Hot-adding disks

I managed to put them in the same cache set, but the flash device was not attached to the bcacheN device...
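For reference, the attach step that should have completed this goes through sysfs (per the kernel documentation; it's exactly the part that didn't happen automatically for me). Assuming the new backing disk is /dev/DISKc:

make-bcache --block 4k -B /dev/DISKc
echo /dev/DISKc > /sys/fs/bcache/register              # in case udev / the init script didn't pick it up
ls /sys/fs/bcache/                                     # note the cache set UUID
echo <cset-uuid> > /sys/block/bcacheN/bcache/attach    # N = whatever number the new device got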

Hot-replacing broken backend disks

Undocumented, couldn't make it work. My idea was to invalidate the active part of the cache, remove the broken disk, and reformat the bcache device. Something like that.

Goal: fix it so userspace could resume. Didn't get that right, so I just left the broken device around; LizardFS flagged the filesystem as "damaged" and moved on.
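For the record, the knobs that should be involved according to the kernel docs (not something I got to work cleanly with a broken disk underneath):

echo 1 > /sys/block/bcache0/bcache/detach    # detach the backing device from the cache set
echo 1 > /sys/block/bcache0/bcache/stop      # tear down the bcache0 device entirely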

Hot-destroying broken bcache devices

Undocumented, couldn't make it work; the general consensus is that this depends on how many bcache bugfixes your kernel has.

Used to need a reboot(!!!)
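The cache set side has matching knobs under /sys/fs/bcache/<cset-uuid>/; again, this is straight from the documentation, your kernel's mileage may vary:

echo 1 > /sys/fs/bcache/<cset-uuid>/unregister    # detach all backing devices and close the cache
echo 1 > /sys/fs/bcache/<cset-uuid>/stop          # or shut the whole cache set down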

Disabling individual caches

I used this to scrub the harddisks without putting wear on the SSD (just in case the sequential cutoff wouldn't have caught it anyway).

# echo none > /sys/block/bcache0/bcache/cache_mode

 

Flushing the cache

Set writeback_percent to 0 and it'll flush the cache very quickly, no problem whatsoever.
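In practice that's just the following; dirty_data tells you when the flush is through:

echo 0 > /sys/block/bcache0/bcache/writeback_percent
watch cat /sys/block/bcache0/bcache/dirty_data    # drops to 0 once everything is written back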

Monitoring and stats

Mind how the hit rate didn't scale very well with the load, but stayed rather erratic, with the best hit rates during the idle hours - probably it cached all the filesystem health checks.
That's not completely bad since it also saves a few IOPS, but it's still far too little.
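The numbers I was working from come out of the per-device stats directories in sysfs; a quick way to eyeball them:

cat /sys/block/bcache0/bcache/stats_five_minute/cache_hit_ratio
cat /sys/block/bcache0/bcache/stats_total/cache_hits
cat /sys/block/bcache0/bcache/stats_total/cache_misses
cat /sys/block/bcache0/bcache/stats_total/cache_bypass_hits    # IO that bypassed the cache (e.g. sequential cutoff)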

Verdict

Bcache has a too limited scope for how much it's hyped.

The bcache "documentation" doesn't indicate a high focus on usability, it is a nice introduction for the first code drop. Leaving it in that state is disrespect. 

The statistics are pretty weird from a monitoring perspective. 

The operational handling is awkward, even if you ignore not being able to attach to live media, which of course sucks if you need to copy a few TB just to get them cached. It might take days of copying data just to ... do this. I'd like an alternate path where I could just move the partition's start and ask bcache to put its header there; especially since nowadays we all 2048-align our partitions, this could be a good thing. But the toolset is 100% focused on the case where you just lose and restore the data.

Like all the other "open" SSD caching solutions it does not track hot zones and instead just caches whatever was recently used. The way it associates cache and data makes it doubtful it will be able to change to a more professional caching scheme in the future.

By default it tries to skip any and all sequential IO, which limits its positive impact on latency and also shows that it assumes a large disk array in the backend. I'm sure bcache can have a positive impact for a suboptimal software RAID6 with 4 disks, but I doubt bcache would outperform any recent LSI enterprise HBA even without flash, except in some edge cases, say you have 10GB of 1KB files you're serving from a 256MB host. Pitting it against an LSI Nytro is probably a sad story. The original reasoning was that the RAID would have higher throughput than the SSD, because the author built it for Intel's great X25-E. But now it's the age of 5GB/s SSDs, and bcache was designed specifically for one that hovered around 300MB/s.

The one thing it does well is consuming very little RAM. I think it's the best choice for NAS devices; I'm thinking about using bcache on my ARMv5el NAS boxes, which only have 512MB of RAM and keep giving me trouble with SSD caching.

For use on servers it ends up in last place. If you need kernel integration, LVM cache is there, easy to use, and documented.

But if I could wish for something, it'd be that all the energy put into bcache, its hype and so on, had been used to improve EnhanceIO instead. Maybe then we'd have at least one SSD caching software that actually caches the stuff that is accessed often.