#+title: Mastodon Redis Sentinel
#+date: 2024-03-19
In my quest to build stupidly overkill infrastructure, I've included the Floofy.tech Mastodon instance in my endeavour (it's okay, my co-owner has too). Part of this has been making sure that we have redundancy where possible, whether it be a high-availability Kubernetes cluster or multiple instances of certain networking components so that traffic can keep flowing while we work on upgrades or the like. One thing that has been missing is a piece rather critical to Mastodon and its queue runner (Sidekiq) - Redis. Mastodon uses Redis both as a cache and as a persistent store for Sidekiq jobs, and it has to be up and functional for an instance to operate. However, Mastodon had no mechanism to account for a Redis Cluster or Redis Sentinel setup, and making either work would have required some effort with another load balancer like HAProxy.
Some background on Redis high availability - there are two options: clustering or Sentinel. Clustering is the more traditional HA approach - several Redis instances work together as one homogeneous unit, redirecting clients to the nodes that contain the data they need to read or write (notably, not /proxying/ the requests, but telling the client to retry against another machine). Data is sharded so that multiple Redis instances hold the same data, meaning it stays available if some of them go down. While this is great for performance and for scaling Redis up, it comes at the cost of consistency - a write to one instance does not instantly propagate to the others (though it does relatively quickly).

Then there is Redis Sentinel, where a set of one or more Sentinel services connects to two or more Redis instances and monitors them. One Redis instance is elected the initial "master" (this is the term Redis uses, so for simplicity it's the term I'll use) and the Sentinel instance(s) provide this information to clients. Clients can check in with the Sentinels for a new master if they lose their connection to the Redis instance, and writes to the master propagate to the other instances monitored by the Sentinel(s). These secondary Redis instances are configured as replicas of the master, using the same replication mechanism as a Redis Cluster. /However/, since we only ever read from and write to one Redis instance, we worry much less about lost data. There /is/ still a window for data loss if the master drops out unexpectedly, but there are ways to shrink that window. Because of this key difference, Sentinel is suitable for use with Sidekiq and other backends that require stronger consistency than what a Redis Cluster can offer.
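To make the client side of that concrete, here's a minimal sketch of how a redis-rb client connects through Sentinel - the hostnames and master name are placeholders, not anything from Floofy.tech. Rather than being handed a Redis address, the client is given the Sentinel addresses plus the name of the monitored master, and asks the Sentinels where the current master lives (re-querying them if the connection drops).

#+begin_src ruby
require "redis"

# Illustrative Sentinel endpoints - swap in your own hosts and ports.
SENTINELS = [
  { host: "sentinel-0.example.internal", port: 26379 },
  { host: "sentinel-1.example.internal", port: 26379 },
  { host: "sentinel-2.example.internal", port: 26379 }
]

# Ask the Sentinels which instance currently holds the "mymaster" role
# and connect to it; after a failover the client re-resolves the master.
redis = Redis.new(name: "mymaster", sentinels: SENTINELS, role: :master)

redis.set("hello", "sentinel")
puts redis.get("hello")
#+end_src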
With all that said, Mastodon did not support Redis Sentinel (or Redis Cluster). While Ruby on Rails, Sidekiq, and ioredis (used for the websocket server) all support it, no configuration options were exposed to enable it in the app - so I took it upon myself to make [[https://github.com/mastodon/mastodon/pull/26571][the pull request]] enabling it. As I write this it's still a bit of a work in progress, but we'll get to the current state of the PR in a moment.
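To give a sense of what "exposing the configuration" means in practice: Sidekiq hands its =redis= options straight down to the underlying Redis client, so Sentinel support is largely a matter of plumbing values like the ones below through from Mastodon's configuration. This is an illustrative sketch of a Sidekiq initializer, not the code from the pull request.

#+begin_src ruby
require "sidekiq"

# Placeholder values - the master name and Sentinel hosts would come
# from Mastodon's configuration rather than being hard-coded.
sentinel_options = {
  name: "mymaster",
  sentinels: [
    { host: "sentinel-0.example.internal", port: 26379 },
    { host: "sentinel-1.example.internal", port: 26379 }
  ],
  role: :master # always talk to whichever instance is currently master
}

Sidekiq.configure_server { |config| config.redis = sentinel_options }
Sidekiq.configure_client { |config| config.redis = sentinel_options }
#+end_src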
This is really my first time using Ruby and Ruby on Rails, and my initial impression was not one of dislike. At least on the Ruby side - frameworks like Rails or Django I tend to be a little less happy working with, due to the amount of magic they attempt to do for a smoother developer experience. Thankfully the changes that had to be made were fairly small and contained, and the majority of my time was spent testing behaviour to make sure it did what I expected. The biggest point of concern was time to recovery - if the master Redis instance goes down, how long until a secondary is promoted, and how long after that until the Mastodon instance picks up the change? The answer is "it's configurable" and also depends on the failure mode - a graceful shutdown will trigger a graceful handover, while a sudden loss will take a couple of minutes by default to swap over.
The pull request still needs some work, and a bit of code cleanup, but it is functional, with some caveats. The documentation around using Redis Sentinel with redis-rb is a little sparse, and the biggest question mark is around DNS resolution for determining Redis Sentinel servers to connect to. It's something I'm currently working on, but in the meantime...
I deployed the patch to Floofy.tech. Something something, don't test in production, but frankly the only way to properly test this is with a setup like Floofy.tech, with real traffic. There was a minor issue where the Mastodon deployments couldn't find the Sentinels to connect to, but this was solved by restarting the pods - I can't recall exactly what went wrong, but it was likely a result of applying the necessary changes out of order. Those changes included increasing the number of replicas for our Redis Helm chart, pointing the Mastodon deployments to our custom image (the glitch-soc Mastodon fork with the Redis Sentinel patch applied), and adding the configuration to Mastodon's environment. Once the pull request is closer to completion I'll likely update the pull request I have open for the docs with instructions for migrating from a single Redis instance to a Sentinel setup, but in the meantime those are my cliff notes.
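For completeness, the environment side of that configuration boils down to telling Mastodon where the Sentinels are and what the monitored master is called. Here's a sketch of how such a value could be parsed into redis-rb options - the =REDIS_SENTINELS= and =REDIS_SENTINEL_MASTER= variable names and the comma-separated format are hypothetical, so check the pull request for what actually ships.

#+begin_src ruby
# Hypothetical: turn "sentinel-0:26379,sentinel-1:26379" into the
# sentinel hashes redis-rb expects, defaulting to the standard port.
sentinels = ENV.fetch("REDIS_SENTINELS", "").split(",").filter_map do |entry|
  host, port = entry.strip.split(":")
  next if host.nil? || host.empty?
  { host: host, port: (port || 26379).to_i }
end

redis_options = {
  name: ENV.fetch("REDIS_SENTINEL_MASTER", "mymaster"),
  sentinels: sentinels,
  role: :master
}
#+end_src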
Despite all the testing I did during the pull request process, I decided to do a quick "cut test" with Floofy.tech, deliberately simulating various failure modes for our Redis instances and watching the failover. This included powering off a Kubernetes node unexpectedly, powering it off gracefully, and deleting the pod. In all cases Redis Sentinel performed as expected: unexpected shutdowns forced a failover after a short delay, and graceful shutdowns performed a proper handover. Amusingly, the nodes rebooted so quickly that a simulated unexpected reboot didn't have much of an impact - the node had to be force-stopped instead. Overall, though, I was satisfied with the results, and with my findings in hand I called it a day.
While there is still some work to do on the pull request and patch, I'm overall happy with the current state of it and feel comfortable continuing to run it on Floofy.tech. I wouldn't /discourage/ using it on your own instance, but be mindful of the potential risks. If you do use it, I also highly recommend using it only for the Sidekiq and streaming portions of Mastodon, and creating a dedicated Redis instance for caching.