Detect out-of-memory condition in HassOS

I’ll admit, I have a somewhat unusual Home Assistant configuration: more than 100 loads (Vantage Lighting and Z-Wave), plus a lot of integrations. But I’ve always managed to get away with running the whole system just fine on a Raspberry Pi 3, with the recorder sending all its data to MariaDB on a Synology. It is a testament to how efficient and robust Home Assistant is. But…

For the past six months or so, Home Assistant has been repeatedly disconnecting from my Elk-M1 system (which is connected via a serial-to-USB connection). And when it disconnects, I need to reboot Home Assistant to get it working again. Annoying, but fine.

But… over time, the problem got worse. And worse.

So I decided to start debugging it. But the debug statements in the Elk code were not enough for me to figure out what was going on, so I worked out how to log into the container on HassOS and modify the source to add more debug log messages.

And in the process, I noticed that the system was really slow, and would frequently hang.

Uh oh. Was my SD card failing? Rather than buy a new SD card, I decided it was finally time to upgrade my Pi3 to something better. So I bought a Pi5 system.

It was only once I brought up the new system from a backup and looked at my System Monitor graphs that I realized what was going on.

Home Assistant’s memory use has apparently gone up over time with new code, and I had finally reached the point where my Pi3 was constantly swapping. The periods of extreme lag I was encountering (which were also killing my Elk-M1 connections) were when the swap process was going nuts trying to free up pages. And since my Pi5 has 8× the memory of my Pi3, I no longer have any issues.

Which brings me to my feature request: add a feature that monitors swap usage and, when it stays too high for too long, generates a notification. Tell users: “hey, you seem to be running low on memory; perhaps you should consider fixing this.”
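
Concretely, I’m imagining something like the automation below, but shipped by default. (This is just a sketch: it assumes the System Monitor integration is set up and exposing a sensor.swap_use_percent entity, notify.notify is a placeholder for whatever notifier you use, and the threshold and duration are arbitrary.)

- id: swap_usage_alert
  alias: 'Swap Usage Alert'
  trigger:
    platform: numeric_state
    entity_id: sensor.swap_use_percent   # from the System Monitor integration
    above: 50                            # sustained swap use, not a brief spike
    for:
      minutes: 30
  action:
  - service: notify.notify               # placeholder; substitute your notify service
    data:
      title: 'Low memory warning'
      message: "Swap usage has been above 50% for 30 minutes. You may be running low on RAM."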

(I guess I could also file a bug report on the Elk-M1 code for not properly reconnecting after it gets unsynced, but it is better to just avoid getting into that state in the first place…)

^ Guess where in this graph I upgraded the hardware? :slight_smile:

FWIW, on the Pi5 it is now zippy fast, so I guess Home Assistant retains its crown of being quite efficient – even if it is a wee bit bigger than it was before.

- id: 7511bebc-6488-4526-ae81-db5c1de58b80
  alias: 'Memory Monitor Alert'
  trigger:
    platform: numeric_state
    entity_id: sensor.memory_use_percent      # provided by the System Monitor integration
    above: 75                                 # trigger when memory use exceeds 75%
    for:
      minutes: 5                              # and stays there for 5 minutes
  action:
  - service: notify.your_notification_here    # replace with your notify service
    data:
      title: '⚠️ <b>High RAM use</b>'
      message: "75% of host memory resources used."

Excellent. I am aware that creating such an alert is quite easy. (I think that is what you are pointing out with your code? Or are you saying that such an alert is already in place?)

I am suggesting, as a feature request, that we add something like this to the list of alerts the system comes with by default, since users are not likely to set something like this up until after they have been burned by it, like I was…

Only about 8% of users are running HA on a Pi 3 or Pi 2. Running on such low-specced hardware means you have to be on top of monitoring resources. Bear in mind those 8% might not have hundreds of devices, as in your case, so they won’t run into such issues.

Given the simplicity of setting up System Monitor, the automation @tom_l kindly posted, and the low number of affected installations, do you still feel that this should be a default alert which merits a feature request?
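
(For what it’s worth, setting up System Monitor really is simple. On installs that still use YAML configuration it is a few lines in configuration.yaml, roughly like the sketch below; recent releases configure it from the UI under Settings → Devices & Services instead, so your version may differ.)

# Legacy YAML-style configuration for the System Monitor integration.
sensor:
  - platform: systemmonitor
    resources:
      - type: memory_use_percent
      - type: swap_use_percent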

Short answer: yes. :slight_smile:

Slightly longer answer: if there already exists a place where default alerts like this would go, then toss this one on the pile. It is really easy, it can save some potential pain for 8% of users in the future, and it could cut down on hard-to-narrow-down support requests and bug reports.

Very long answer:

Your user base can be roughly divided into two categories: enthusiasts and users. I am a former enthusiast: I actively worked on developing the Vantage integration I’m using (Greg Badros did most of the work; I just adapted it to my weird system). When I built my system, the Pi3 was the recommended hardware (the Pi4 was reputed not to be fully stable with Home Assistant at the time…). And one of the main selling points of Home Assistant was that you can “set it up and forget about it”: it stays reliable and functional even if your internet connection fails.

These days, I have a fully-functional system, and what I wanted was for it to continue to be fully functional without my messing with it. I’ve transitioned to being a “user”. A quite happy one.

As someone who has worked in security, I’m used to applying patches. Home Assistant ships patches, so I apply them. And, apparently, one of those patches had memory bloat. Enough bloat to break what was a fully-functional system. Even worse, the bloat broke my system in a way that was hard to spot unless I was looking for it: instead of just breaking things or throwing errors, it made what was a reliable system unreliable.

The only reason I figured out what was wrong was that I went into developer mode and managed to find the bug. That is the reaction of an enthusiast (like you!). Most users, hit with this class of failure, would conclude “Home Assistant has become unstable; I should warn my friends off of using it!”

So perhaps my request should be rephrased: I discovered that when Home Assistant ships a memory-bloat bug (and such bugs are inevitable), the end-user experience is bad. And bad in a way that only quite sophisticated users will ever figure out.

So the three possible responses that I can think of are:

  1. We should prevent memory bloat from shipping. Create some test suite that tries to detect it, and stop shipping if it is detected. I’ve worked with teams that have attempted this; it will only catch extreme bugs which probably would have been caught before shipping anyway.

  2. We should make it easy for users to detect and respond to memory bloat. That is what I’m advocating for here. As illustrated above, this seems like an easy fix (see the sketch after this list for one way it could be packaged).

  3. It is not worth fixing. It is either too hard to fix (for example, there is no good place to put the code posted above, and creating such a place would result in a cesspool of rarely-tested half-broken alerts), or we don’t expect memory bloat bugs to be common enough to matter in the wild, or we don’t care about the class of users who encounter them.
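
For concreteness, here is one way option 2 could be packaged using pieces Home Assistant already has: a template binary sensor for “swap pressure” plus the alert integration. (Again, a sketch, not a real proposal for the code: the entity names and thresholds here are made up, and it assumes System Monitor’s sensor.swap_use_percent exists.)

# Hypothetical built-in "low memory" alert, assembled from existing integrations.
template:
  - binary_sensor:
      - name: "Swap pressure"
        # on when sustained swap use exceeds 50%
        state: "{{ states('sensor.swap_use_percent') | float(0) > 50 }}"

alert:
  low_memory:
    name: Host is running low on memory
    entity_id: binary_sensor.swap_pressure
    state: "on"
    repeat: 60            # re-notify every 60 minutes while the condition holds
    notifiers:
      - notify            # i.e. notify.notify; substitute your own service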

I’m not an active developer of this system, so I have no skin in this game. If you think this is case 3, I defer to your judgement.

Tbh, I agree more with 2 than with 3. What you might fail to grasp is that your proposal would require shipping System Monitor as a default config for all users.

Bear in mind that certain installation types might not handle some System Monitor entities out of the box (CPU temp & battery on a Proxmox install of HA, in my case).

In recent months HA has been moving towards identifying integrations which cause thread exceptions, and this month’s release makes it even easier to identify such instances. Maybe this out-of-the-box functionality, which already exists, is enough to surface such issues for users on lower-spec hardware.
