In the hope this might help someone else at some point, I present the tale of how I “broke” home assistant for several days, and eventually figured out that it was all my own fault
The TL:DR version: a malfunctioning sensor delivering hundreds of updates a second was overloading home assistant, and causing 100% disk utilisation on my docker host, interrupting other crucial services. But the cause eluded me for days, because there was little (if anything) in any logs to indicate a problem.
A few days ago I rebuilt some custom sensors (which publish MQTT for various things like door/window opens and environmental factors, using my own firmware). These had all been working in a previous configuration for about 2 years without issue, and the rebuild was mostly to add new sensors for each device. If my problems had started immediately after, I’d have known the cause and fixed it. However, I foolishly did a whole suite of maintenance and enhancement at the same time - new lovelace design, implementing a VPN for remote access, integrating some new docker containers (cloud9, watchtower, etc.).
The next day I noticed my Synology NAS making unusual noises (constant disk writes) and a very slow DSM UI. naturally I went to the various logs, looking for the cause - several kernel processes were writing 1.5 - 2 MB/s, constantly. Also, my home assistant docker container was using a lot of RAM (2-3 GB). In home assistant, there was (initially) nothing at all in the logs. I’m very glad I’d moved off a raspberry pi a few years ago, since this activity would have surely messed up any SD card rather quickly.
I hadn’t upgraded home assistant recently, so I had a pretty good idea that this was something I’d done rather than any “bug”; but I wondered: had I messed up my VPN setup (using Wireguard) and was being hacked? But calmer heads prevailed, and I could see that the problem seemed to be in home assistant exclusively - stop the container and disk utilisation immediately dropped back to normal. But what to do without logs?
Each time I restarted the container I’d see disk usage jump to 100% (and stay there), and RAM use rapidly climb (about 400MB an hour). If I waited too long, restarts from the UI became impossible and I could only stop and start the container in docker - finally I started to see errors for the recorder component in my logs. But it didn’t seem to make any sense - the home assistant database was the same size it’d always been (about 280 MB for my setup), so nothing was “filling up”. Foolishly, I’d forgotten/didn’t understand my purge settings in recorder, which were faithfully working to keep the db in check.
Armed with a little information and no actual understanding of what it meant, I worked systematically through all the things I’d changed. I thought I was being clever, undoing each in reverse order - surely I’d find the culprit quickly. Yet in the end, having undone each of the changes in home assistant, it was still broken (and I’d wasted a large amount of time). I checked the new hardware sensors too, eventually, but initially couldn’t find anything wrong at all - all connections good, code seemed OK, and any new code seemed OK too.
In desperation, I turned home assistant off and pinned my hopes on 85.0 “fixing my woes”, despite knowing full well that it really couldn’t be an issue with HA - no one else was reporting anything similar. When 85.0 arrived, it did not, of course, fix the problem. I backed up and deleted my HA db, and that actually fixed it for a while, at least the disk hammering. But the RAM use was still climbing rapidly.
Finally, I had a breakthrough. Staring at my precious HA dashboard, contemplating how I could ever live without it, I saw a sensor icon appear, briefly, to flicker. Looking directly at it, nothing seemed wrong, but in the corner of the eye it was definitely not stable. Trying to bring up the state history for this binary sensor failed, rendering some weird line graph or hanging the UI entirely.
Firing up Mosquitto, I dug into this sensor’s obscure topic and found the catastrophe: messages streaming in faster than my terminal would display them! To the credit of the protocol, the MQTT broker wasn’t breaking a sweat on this.
In the end, a missing pull-up resistor combined with a long wire to a door sensor was causing the signal to float, and oscillate at a ridiculous rate. Literally, this was the very first thing I’d actually changed in my set up, but because my recorder purge setting was keeping the size of my db in check and there was no error in any logs (except the recorder error itself, which I dismissed as a consequence of killing the HA container from Docker), it was the very last thing I checked.
Lesson: All things being equal, if no one else seems to be having a problem on the same exact version it probably isn’t home assistant’s fault, and make fault-finding easier for yourself by changing one thing at a time. And pull-up resistors aren’t optional, obviously.