Yesterday we experienced our third storage outage on the Ceph cluster which provides network storage for the pods in our .com / DE cluster. As before, the issue seems to have been triggered by the sheer number of volumes we back up on a daily basis, each of which requires snapshotting, cloning, backing up, and then deleting.
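
For context, here's roughly what that daily dance looks like per volume, in Kubernetes terms. This is a hedged sketch rather than our actual manifests - the claim, snapshot class, and storage class names below are made up - but the pattern is the standard CSI one: take a VolumeSnapshot, clone a temporary PVC from it so the backup job can read a consistent copy, upload the backup, then delete both.

```yaml
# 1. Snapshot the live volume (hypothetical names throughout)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: plex-config-backup
spec:
  volumeSnapshotClassName: ceph-block-snapclass   # assumed snapshot class name
  source:
    persistentVolumeClaimName: plex-config
---
# 2. A temporary clone of that snapshot; the backup job mounts this, uploads
#    its contents, and then both objects are deleted.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-config-backup
spec:
  storageClassName: ceph-block                    # assumed Ceph RBD storage class
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: plex-config-backup
```

Multiply that by every volume in the cluster, every day, and you can see how the snapshot/clone churn adds up.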

What went wrong?

Near as I can tell, there may be a bug / incomplete feature in our current version of rook-ceph / ceph-csi (we run v1.13, and v1.15 is the latest) or Ceph itself (we run v17.2, and v19 is the latest), which can cause the cluster to get "stuck" on cloning/snapshotting, rapidly run out of space (we're talking growth from 4TB to 9TB in a matter of hours), and halt all pods, with some nasty consequences (corrupt databases, empty config files, etc.).

Upgrading / debugging Ceph "in production" is very risky - the consequences are far-reaching and irrevocable (which is why we're still 2 major versions behind the latest!) - but given our scale, it's also not feasible to stand up a comparable test / lab environment in which to safely rehearse an upgrade.

So what to do?

We've been trialing a "cephless" cluster design for the US cluster (.cc), and today I started the process of migrating our .com cluster to the same design. Under the "cephless" design, we take advantage of the fact that Kubernetes "ReadWriteOnce" volumes can actually be mounted by multiple pods simultaneously, provided those pods are all on the same node. Volumes are carved out of local NVMe storage on the nodes themselves using TopoLVM (a CSI driver which provisions LVM volumes from each node's local disks), rather than relying on network storage, and we perform 24h backups to mitigate the risk of data loss.
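
In Kubernetes terms, the storage side of this looks something like the sketch below. It's illustrative rather than our exact config - the class and claim names are made up, and the provisioner string varies by TopoLVM version - but the two important details are `volumeBindingMode: WaitForFirstConsumer` (the volume is only carved out on whichever node the pod actually lands on) and plain old ReadWriteOnce access.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-nvme                      # hypothetical name
provisioner: topolvm.io                   # recent TopoLVM releases; older ones used topolvm.cybozu.com
volumeBindingMode: WaitForFirstConsumer   # don't provision until the pod is scheduled to a node
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-config                       # hypothetical name
spec:
  storageClassName: topolvm-nvme
  accessModes: [ReadWriteOnce]            # single-node access; fine for multiple pods on that node
  resources:
    requests:
      storage: 100Gi
```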

Over the past 2 months, we've gradually refined this design in .cc, and have laid the groundwork for the same setup in .com. You'll have noticed "nodefinder" pods in your Kubernetes dashboard - these are like "scouts", ensuring that all your pods are placed on a node with appropriate available resources for your bundle type.
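
We won't dig into nodefinder's internals here, but the underlying Kubernetes building block for this kind of co-location is inter-pod affinity: once one of a user's pods has landed on a suitable node, the rest can be required to follow it onto the same host. A rough illustration only - the labels, names, and image below are made up, and not necessarily how nodefinder itself is wired up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: radarr
  labels:
    example.com/user: hobbit-1                  # hypothetical per-user label
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname   # "must share a node with..."
          labelSelector:
            matchLabels:
              example.com/user: hobbit-1        # ...this user's other pods
  containers:
    - name: radarr
      image: docker.io/library/busybox:stable   # placeholder image for the sketch
      command: ["sleep", "infinity"]
```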

What are the tradeoffs?

Ceph was our original cluster design, and gives us the following advantages:

  • Your pods can run on any node; they're not "bound" to the node which holds their storage.
  • We can have one big "pool" of storage (currently ~6TB) from which your volumes are allocated, and that storage can be overcommitted (i.e., we can allocate 100GB of storage per Plex user, knowing that most users will only ever use < 20GB).

However, as time has gone by, we've realized the following disadvantages:

  • An interruption to the complex mechanics of the Ceph cluster impacts all users, all apps.
  • Any sort of scheduled maintenance on the Ceph cluster (we had one planned switch outage from Hetzner last year) requires all users' workloads to be scaled down - a process taking hours of careful planning and more hours of execution.

TopoLVM, on the other hand, has the following disadvantages:

  • You can't overcommit storage from one giant shared pool, but we can "thin provision" volumes from each node's local capacity, allocating (for example) 20TB of volumes against an actual 1TB of disk - so, almost the same thing, just per-node, at a 20:1 overcommit ratio (see the sketch after this list).
  • Your pods are forever bound to the node which holds their storage, meaning they can't simply be rescheduled elsewhere if that node has a hardware issue. But since most of our users are now on semi-dedicated bundles, we take 24h backups, and no node hosts more than 8 hobbits, this isn't a significant hurdle (also, to date we've had zero node hardware failures, and 3 Ceph outages!). If we needed to replace a node, we could schedule a backup/delete/restore for each affected user.
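
For the curious, the thin provisioning happens in TopoLVM's per-node agent (lvmd), which can be pointed at an LVM thin pool with an overprovision ratio - roughly like the sketch below. The volume group and pool names are made up, and the exact schema depends on the TopoLVM version, so treat this as illustrative rather than our actual config.

```yaml
# lvmd.yaml (per node) - hypothetical names; schema as in recent TopoLVM releases
device-classes:
  - name: nvme-thin
    volume-group: vg-nvme          # the node's local NVMe volume group
    type: thin
    default: true
    thin-pool:
      name: pool0
      overprovision-ratio: 20.0    # allow ~20x more volume capacity than physical space
```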

TopoLVM brings the following worthwhile advantages to the table, though:

  • Reduced blast radius. A broken volume or disk only affects one node, leaving the other 25+ nodes happily running.
  • It's simple. Simple things are harder to break.
  • NVMe storage is faster, and having all of your pods on the same node is advantageous for many reasons - one of which is that FileBrowser will once again be able to see all your volumes, including the Aars and Plex (see the sketch below)!
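
That FileBrowser point is worth unpacking: a single pod can mount several ReadWriteOnce volumes at once, as long as every pod using them is scheduled onto the same node. A rough, hypothetical sketch (the claim names are made up, and the image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filebrowser
spec:
  containers:
    - name: filebrowser
      image: docker.io/filebrowser/filebrowser:latest   # upstream image; tag illustrative
      volumeMounts:
        - { name: plex-config,   mountPath: /srv/plex }
        - { name: radarr-config, mountPath: /srv/radarr }
        - { name: sonarr-config, mountPath: /srv/sonarr }
  volumes:
    # Each of these PVCs is ReadWriteOnce; mounting them all here works because
    # FileBrowser, Plex, and the Aars are all pinned to the same node.
    - name: plex-config
      persistentVolumeClaim: { claimName: plex-config }
    - name: radarr-config
      persistentVolumeClaim: { claimName: radarr-config }
    - name: sonarr-config
      persistentVolumeClaim: { claimName: sonarr-config }
```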

So what's the plan?

Tonight's update will "test the waters" - we'll migrate gatus (the health tab) and Riven (risk-tolerant beta users), and wait for feedback. In the meantime, snapshots/clones/backups of the old, per-app volumes are disabled, to avoid the risk of another cascading I/O issue. All going well, we'll migrate some of the smaller / less-used apps (Tautulli, Overseerr, etc.) next, and save the 3 little pigs (Plex, Emby, Jellyfin) for last!