Hi everyone,
I am running Hass.io on a vanilla Odroid N2+. It ran fine at 2-3% processor usage, 23-30% memory fill and zero swap. However, since updating to 2022.4 the memory kept filling rapidly until swap was exhausted and the system rebooted. I saw that there were a lot of complaints about this behavior and that a new caching scheme was introduced. The moderators answered that this consumes more memory but that memory fill is expected to level off after some hours. But it didn’t in my case.
After the update to 2022.5 the behavior changed. Now the quick memory fill stops 2-3 hours after restarting Home Assistant Core. BUT after that point, the memory keeps filling at a constant rate, slowly filling the entire physical memory and then swap, again.
Some contributions theorized that one of the integrations might be causing the memory to fill. But as it is a live system, I cannot arbitrarily disable and enable integrations to determine whether disabling one of them cures the problem. The custom components I use are dwd-weather, dwd_weather_warnings, ltss, teltonika, and sms77_io.
Would it be possible to either disable the new caching scheme or make it prevent itself from crashing the system by running out of memory?
Can you point me to what you’re referring to here:
I’m not aware of these complaints or this new caching scheme and I don’t see anything in the release notes about it. If you can be more specific about what changed that you believe affected you it could help.
I mean this is going to make it very difficult to figure out what happened. It sounds like something you’re using has a memory leak. Would it not be better to disable and enable to find out what exactly is broken so the rest of your system can be fully stable instead of having an unstable system that crashes for a while until maybe someone else happens to stumble across the same issue?
High memory usage and a new caching scheme are discussed in this thread: Memory leak after updating to 2022.4?
My memory graphs looked pretty much the same after upgrade to 2022.4.
A memory leak in the few custom components I use is not very probable, as they worked perfectly fine until the upgrade to 2022.4. And in the thread cited above it is mentioned that the higher memory usage is intentional but should level off. But it doesn’t in my case.
Is there any way to monitor what is in this cache? Knowing this might be an easier way to hunt down the reason for the high memory usage.
Right, I am remembering this thread now. Bdraco mentioned the cache changes at the beginning because a change in the memory use profile was expected. But steadily increasing memory until a crash is not related to that. It’s a memory leak.
In the issue that spawned from that thread the source of the memory leak was tracked to the stream component. It was unrelated to the new cache. Later a second issue was opened which it appears was not resolved yet. Perhaps you are facing the same.
Balloob detailed how to get these kinds of issues resolved here. You should follow his instructions and collect a pyspy dump and then include that either in the second issue (if you think it’s related) or a new issue.
There is no way to disable the cache, but just an fyi, you aren’t currently presenting any evidence that the cache is in any way involved. It wasn’t in the issue you linked and it probably isn’t here. ~1000 commits went into 2022.4, any one of which could’ve accidentally introduced something that caused a memory leak. More if you also add the commits to the custom components you use. The only way to track down a memory leak is with a memory profile. Or disable things until you find the source, but that still leaves a lot of space to investigate for someone and isn’t an exact science.
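To give an idea of what a memory profile looks like in practice, here is a minimal sketch using Python's standard-library tracemalloc. This is not the pyspy workflow mentioned above and not how Home Assistant itself is profiled. The helper name top_growth and the list standing in for a leak are made up for illustration.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated memory changed most between two snapshots."""
    # compare_to sorts the differences with the largest change first.
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical stand-in for a leak: roughly 10 MB that is never released.
leak = [bytes(1000) for _ in range(10_000)]

after = tracemalloc.take_snapshot()
top = top_growth(before, after)
for stat in top:
    print(stat)
```

Taking snapshots some hours apart and comparing them points at the source lines whose allocations keep growing. It only sees memory allocated by Python code in that one process, so it cannot see memory held by other processes on the host.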
Many thanks for your suggestions.
I just found out that it has a completely different cause. A usb-to-serial adapter (controlling a power distribution unit, PDU) plugged directly into the system running hass.io had died. It must have stopped working within hours of the 2022.4 update. I realized it when I tried to cycle one outlet of this PDU.
The dead adapter caused the command_state command line to stall (echo ... > /dev/ttyUSB0 && od ... < /dev/ttyUSB0), creating thousands of processes, over time. Those were not visible in the Home Assistant command line, just in the hypervisor command line. Replacing the adapter and a reboot solved the problem.
What still bothers me: As far as I understand the documentation of command-line switch, the default timeout of each command is 15 sec. Apparently, in my case the command did not time out.
Is it because I address hardware directly (/dev/ttyUSB0)?
So HA just runs the command with subprocess.check_output. According to that, when the timeout expires it kills the child process and raises an exception.
That being said, some searching turned up this, so it may not be so cut and dry. If you’re comfortable in Python you could probably test this yourself by opening a process and waiting for it with a timeout in the same way, and see if it’s killed. In which case HA could possibly take the approach suggested there for running commands. Although to be honest it seems like a bug in the subprocess module to me.
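A rough sketch of such a test, assuming a POSIX system. The function name run_with_group_timeout is made up for illustration, and this is not what Home Assistant does today. The idea is that with shell=True the timeout only kills the shell itself, so anything the shell started (like od in your case) may keep running. Starting the command in its own session and signalling the whole process group is one commonly suggested workaround:

```python
import os
import signal
import subprocess


def run_with_group_timeout(command: str, timeout: float) -> bytes:
    """Run a shell command; on timeout, kill its whole process group."""
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        start_new_session=True,  # the shell becomes leader of a new process group
    )
    try:
        stdout, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # Signal the group, not just the shell, so its children die too.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group was already gone
        proc.communicate()  # reap the shell and drain the pipe
        raise
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command, output=stdout)
    return stdout


if __name__ == "__main__":
    print(run_with_group_timeout("echo hi", timeout=5))
    try:
        run_with_group_timeout("sleep 30; echo done", timeout=0.5)
    except subprocess.TimeoutExpired:
        print("timed out, process group killed")
```

To compare, run the same command through plain subprocess.check_output(..., shell=True, timeout=...) and check with ps afterwards whether the sleep is still around. Note that even SIGKILL cannot remove a process that is stuck in uninterruptible I/O on a dead device, so this may not cover every case.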