HA (@Pi4 4GB) dying, all RAM used up

I have HA running on a Pi4, 4GB.

After a recent series of power cuts, HA “freezes”, i.e. I can ping it but cannot ssh in.

If I hard reset, it remains stable for less than 30 minutes.

‘top’ shows all ram being used up. After 30 minutes there was about 40MB free. Right now I am watching it, it’s been about 10 minutes since a hard reset, and Used Mem is: 2.2GB, free is 1.7GB

I’ve been through the logs on the HA website, but don’t see anything obvious.

How do I diagnose and find out what is eating all the RAM?

I have tried pressing M in top…but it appears this version of top is a cut down version? It doesn’t seem to show processes by memory usage.

There is an integration called “Profiler” that may help diagnose this.

Also search your automations and scripts for repeats that may not exit.

Thank you so much, Tom.

Most of my house runs on HA now, including heating, and I live in a cold area, so when HA fails it’s a true problem.

I’m no expert and a bit unsure how to use it, but I’ve just installed Profiler and set up an automation, triggered by sensor.memory_free dropping below 1GB, that runs the profiler.memory and profiler.start_log_objects services. Hopefully that will give me the data required.
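
For anyone wanting to try the same thing, an automation along these lines might look roughly like the sketch below. It is not the poster's actual automation. It assumes sensor.memory_free comes from the System Monitor integration and reports MiB (so 1GB is about 1024), and the alias, scan_interval and seconds values are arbitrary choices for illustration.

```yaml
# Sketch only: entry for automations.yaml (2022-era syntax).
- alias: "Profile memory when free RAM runs low"   # hypothetical name
  trigger:
    - platform: numeric_state
      entity_id: sensor.memory_free
      below: 1024          # assumes the sensor unit is MiB
  action:
    # Start periodic logging of object growth to the HA log.
    - service: profiler.start_log_objects
      data:
        scan_interval: 30  # seconds between scans (illustrative value)
    # Run the memory profiler for a fixed period.
    - service: profiler.memory
      data:
        seconds: 60        # illustrative value
  mode: single
```

Object logging keeps running until profiler.stop_log_objects is called, so it is worth stopping it once enough data has been collected.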

Will check repeats that may not exit, thanks for the pointer.

Unfortunately, I do not understand how to evaluate the results. Could you give some advice, please?

@pedolsky Do you have a similar issue?

So it seems like you’ve found a memory leak. There was one in 2022.4 and balloob posted instructions on how to get it fixed here. Please follow those instructions to collect a memory profile and create a new issue on github with those details so we can get it sorted.

@CentralCommand Thanks Mike, will read and do.

Just FYI: I am running the current version of HA (2022.5), and I have attached a graph. Note the timestamps: you can see that when the memory leak occurs, it occurs very quickly, i.e. not a gradual / slow memory leak over time.

Interesting, that certainly is a different profile than I would expect. But still, if it builds until an OOM crash, it seems very likely to be a memory leak; that should never happen. If it just builds and holds, then that could be different. I would think 4GB is enough, but I guess it might depend on what exactly you’re running.

Even if it’s not a memory leak, the tools in the linked post should help you see where the memory is going. pyspy in particular seems to have a graphical tree view that others have been screenshotting. You can use that to decide whether you think it’s a bug or whether you need to cut some integration or something out of your system for being a memory hog.

Since I have a very basic setup, I find 34% memory usage to be high. My real problem is the intermittent spikes in processor use (from 2 or 3% up to 71%), combined with a steep rise in CPU temperature (from 30 up to 65°C).

Thanks. Yes it’s an issue: HA (quite quickly) becomes slow, slower, then unresponsive. e.g an ssh connection might take a minute (!) - and ssh becomes almost like watching a pre-internet 14.4k modem BBS screen loading…

All automations stop or are extremely delayed, e.g. motion-activated [lights|heating] stop working.

FYI Setup is fairly minimal: no cameras, 2x Zwave devices, 14x zigbee, 2x google home, 1x Roku, Philips Hue.

Found the issue. Notes:

  • Memory leak is 140MB/min
  • CPU jumps from ~1% to 25% during leak, and remains at 25% “forever”
  • Mem leak is caused by a (my) badly programmed Repeat Loop: (no Action Sequence in Loop)
  • The “Until” condition looks fine, loop “should” finish ok, but does not
  • Mem is not freed via turning the automation off/on - reboot required

Description: I have a reed switch on the front door, if the door is left open too long, then the heating system is turned off, and turned back on once the door has been closed for 3 minutes. The problem is the Repeat section, as @tom_l suggested.

I had used a repeat loop, where I should have used a wait for trigger. Likely cause: creating an automation too late at night.

However: I doubt such a severe memory leak, and high CPU usage, is the ideal response to a (badly programmed?) Repeat section - the entire Home Assistant shuts down, and the user can not even restart it via the web interface - pulling the power cable is required.

The automation was intended to “wait” for the door to be closed, before turning the heaters back on. Here is the offending section, with **pseudocode for the other sections for brevity:

**Trigger: front door opened (reed switch)
**Action1: Turn off heaters
repeat:
  until:
    - type: is_not_open
      condition: device
      device_id: 4de50e1b1e70e7e80176a1dc763eb3e5
      entity_id: binary_sensor.reed_front_door
      domain: binary_sensor
      for:
        hours: 0
        minutes: 3
        seconds: 0
  sequence: []
**Action2: Turn heaters back on

As you will note, there is no sequence in the repeat loop. I think my internal logic at the time was simply “just do nothing, until the door has been closed for 3 minutes”. This caused high CPU usage, and a drastic memory leak (8.4GB / hr!)

(FWIW: also the above doesn’t work, i.e even when door is closed, the until condition is never actually satisfied)
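
If a repeat loop is kept at all, one common way to stop it spinning is to give the sequence something that pauses between checks. The sketch below is an illustration, not a tested fix for the behaviour described above: it swaps the device condition for a plain state condition on the same entity, assumes "off" means the door is closed, and uses an arbitrary 10 second delay.

```yaml
# Sketch only: a repeat..until that pauses between checks
# instead of looping with an empty sequence.
- repeat:
    sequence:
      # The delay hands control back to Home Assistant on every pass.
      - delay:
          seconds: 10      # illustrative value
    until:
      - condition: state
        entity_id: binary_sensor.reed_front_door
        state: "off"       # assumed to mean "door closed"
        for:
          minutes: 3
```

With the delay in place the condition is evaluated roughly every 10 seconds rather than continuously.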

“Obviously” a better way to do that is to use wait_for_trigger and not a repeat..until loop:

wait_for_trigger:
  - type: not_opened
    platform: device
    device_id: 4de50e1b1e70e7e80176a1dc763eb3e5
    entity_id: binary_sensor.reed_front_door
    domain: binary_sensor
    for: 00:03:00
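
Putting the pieces together, the whole automation might look something like the sketch below. It uses state triggers rather than the device triggers shown above, and climate.heaters, the alias and the 2 minute "open too long" threshold are hypothetical placeholders, since the original only gives pseudocode for those parts.

```yaml
# Sketch only: door-open heating cutoff using wait_for_trigger.
- alias: "Heating off while front door is open"   # hypothetical name
  trigger:
    - platform: state
      entity_id: binary_sensor.reed_front_door
      to: "on"             # assumed to mean "door open"
      for:
        minutes: 2         # hypothetical "left open too long" threshold
  action:
    - service: climate.turn_off
      target:
        entity_id: climate.heaters   # hypothetical entity
    # Suspends here without looping until the door has been closed for 3 minutes.
    - wait_for_trigger:
        - platform: state
          entity_id: binary_sensor.reed_front_door
          to: "off"
          for:
            minutes: 3
    - service: climate.turn_on
      target:
        entity_id: climate.heaters
  mode: single
```

A state trigger only fires on a change, so this relies on the door still being open when the wait begins, which the 2 minute trigger above makes likely.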

What were those quotes about “try and make something foolproof…and someone just invents a better fool”? or “people will always try and use things in ways you never intended”? Looks like I am guilty of both :slight_smile:

Regardless, HA is hard for newbies, and we should not assume everyone is a programmer. So I don’t feel the behaviour (catastrophic OOM failure) is appropriate. I suggest either simply disallowing an empty sequence or, if the devs wish to allow one, having it actually “do nothing” until the “repeat…until” condition is met.

Thank you!

Should I file a bug report for this?

Probably not. It would be very difficult to check for all the ways a loop could get out of control. It’s up to you to use this powerful action responsibly. See the link I posted earlier.

Thing is, the logic of the loop does seem correct? (albeit definitely not “best practice”).

“do nothing Until door sensor Is Not Open for 3 minutes”. Logic is sound?

The issue here is the missing Sequence, whereas in the link you shared the issue is an Until condition that is never satisfied, thus in that case “justifiably” creating an infinite loop?

I’d like to share a second point: a lot of people struggle with HA, and it can be hard for non-developers. Lots of people own houses, and lots of people are not devs.

If you agree the logic is sound, and therefore that someone else may do it the same way, then having HA fail catastrophically with an 8GB/hr memory leak would hardly seem like good design?

A (construction) nail gun is also a powerful tool that must be wielded responsibly; however, there are still safety mechanisms in place that require it to be pressed to a surface before the trigger can be actuated.

Given that Home Assistant is also crossing over into the Physical Realm, perhaps safety should also be a priority? There is a difference between “my computer crashed and now I can’t load facebook” and “my grandma fell down the stairs and broke her hip, because the automatic lights didn’t come on due to a memory leak”.

No matter how many safety mechanisms you put in place someone will always find a way to override them.

I can’t think of a use case for an empty sequence in a loop. So sure submit an issue for that case.

It won’t stop people creating non empty infinite loops, but it’s something.
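
One defensive habit for non-empty loops is to cap the number of passes with the repeat.index variable, so that a condition that never becomes true cannot keep the loop alive forever. A minimal sketch, where the 10 second delay and the cap of 60 passes (about 10 minutes) are arbitrary illustrative values:

```yaml
# Sketch only: repeat..until with an iteration cap as a safety net.
- repeat:
    sequence:
      - delay:
          seconds: 10      # illustrative value
    until:
      - condition: or
        conditions:
          # Normal exit: the door reports closed.
          - condition: state
            entity_id: binary_sensor.reed_front_door
            state: "off"   # assumed to mean "door closed"
          # Safety exit: give up after 60 passes.
          - condition: template
            value_template: "{{ repeat.index >= 60 }}"
```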

OK, and thank you, Tom.