An epic tale of troubleshooting

BadWombat · January 11, 2019, 8:36am

In the hope this might help someone else at some point, I present the tale of how I “broke” home assistant for several days, and eventually figured out that it was all my own fault

The TL:DR version: a malfunctioning sensor delivering hundreds of updates a second was overloading home assistant, and causing 100% disk utilisation on my docker host, interrupting other crucial services. But the cause eluded me for days, because there was little (if anything) in any logs to indicate a problem.

A few days ago I rebuilt some custom sensors (which publish MQTT for various things like door/window opens and environmental factors, using my own firmware). These had all been working in a previous configuration for about 2 years without issue, and the rebuild was mostly to add new sensors for each device. If my problems had started immediately after, I’d have known the cause and fixed it. However, I foolishly did a whole suite of maintenance and enhancement at the same time - new lovelace design, implementing a VPN for remote access, integrating some new docker containers (cloud9, watchtower, etc.).

The next day I noticed my Synology NAS making unusual noises (constant disk writes) and a very slow DSM UI. naturally I went to the various logs, looking for the cause - several kernel processes were writing 1.5 - 2 MB/s, constantly. Also, my home assistant docker container was using a lot of RAM (2-3 GB). In home assistant, there was (initially) nothing at all in the logs. I’m very glad I’d moved off a raspberry pi a few years ago, since this activity would have surely messed up any SD card rather quickly.

I hadn’t upgraded home assistant recently, so I had a pretty good idea that this was something I’d done rather than any “bug”; but I wondered: had I messed up my VPN setup (using Wireguard) and was being hacked? But calmer heads prevailed, and I could see that the problem seemed to be in home assistant exclusively - stop the container and disk utilisation immediately dropped back to normal. But what to do without logs?

Each time I restarted the container I’d see disk usage jump to 100% (and stay there), and RAM use rapidly climb (about 400MB an hour). If I waited too long, restarts from the UI became impossible and I could only stop and start the container in docker - finally I started to see errors for the recorder component in my logs. But it didn’t seem to make any sense - the home assistant database was the same size it’d always been (about 280 MB for my setup), so nothing was “filling up”. Foolishly, I’d forgotten/didn’t understand my purge settings in recorder, which were faithfully working to keep the db in check.

Armed with a little information and no actual understanding of what it meant, I worked systematically through all the things I’d changed. I thought I was being clever, undoing each in reverse order - surely I’d find the culprit quickly. Yet in the end, having undone each of the changes in home assistant, it was still broken (and I’d wasted a large amount of time). I checked the new hardware sensors too, eventually, but initially couldn’t find anything wrong at all - all connections good, code seemed OK, and any new code seemed OK too.

In desperation, I turned home assistant off and pinned my hopes on 85.0 “fixing my woes”, despite knowing full well that it really couldn’t be an issue with HA - no one else was reporting anything similar. When 85.0 arrived, it did not, of course, fix the problem. I backed up and deleted my HA db, and that actually fixed it for a while, at least the disk hammering. But the RAM use was still climbing rapidly.

Finally, I had a breakthrough. Staring at my precious HA dashboard, contemplating how I could ever live without it, I saw a sensor icon appear, briefly, to flicker. Looking directly at it, nothing seemed wrong, but in the corner of the eye it was definitely not stable. Trying to bring up the state history for this binary sensor failed, rendering some weird line graph or hanging the UI entirely.

Firing up Mosquitto, I dug into this sensor’s obscure topic and found the catastrophe: messages streaming in faster than my terminal would display them! To the credit of the protocol, the MQTT broker wasn’t breaking a sweat on this.

In the end, a missing pull-up resistor combined with a long wire to a door sensor was causing the signal to float, and oscillate at a ridiculous rate. Literally, this was the very first thing I’d actually changed in my set up, but because my recorder purge setting was keeping the size of my db in check and there was no error in any logs (except the recorder error itself, which I dismissed as a consequence of killing the HA container from Docker), it was the very last thing I checked.

Lesson: All things being equal, if no one else seems to be having a problem on the same exact version it probably isn’t home assistant’s fault, and make fault-finding easier for yourself by changing one thing at a time. And pull-up resistors aren’t optional, obviously.

tom_l · January 11, 2019, 1:47pm

It a shame that the sense of relief/accomplishment when you finally track down a fault like this doesn’t make all the hair you pulled out in frustration grow back

BadWombat · January 12, 2019, 12:30am

And I have so little to lose…

danielperna84 · January 12, 2019, 12:09pm

Reading your story makes me think it might be a good idea to implement some mechanism into Home Assistant that detects such problems and logs them. I know rsyslog does something like that where it drops log-messages if they exceed a configurable amount within a certain timeframe. They call it rate limiting.
The question would be, if such a rate limiting mechanism should be applied to the whole event bus, or per entity. For the whole event bus probably would be a lot easier to implement, but per entity would make more sense if the goal is to retain stable automations even if a single entity runs amok.

tom_l · January 12, 2019, 2:32pm

A whole event bus warning would at least be something that aids in diagnosis.

Per entity would definitely be better.

How to define what the rate limit should be is the fuzzy bit.

BadWombat · January 13, 2019, 4:06am

In a sense this wasn’t a home assistant problem at all, and maybe it shouldn’t be defended against every silly thing I do it.

But it sure would have been easier to find the issue if it were…