Memory exhausted by python3 process - process/container crash

Several times in the past week (and from my impression it was often not long after triggering an upgrade if that may be related), my recent install of Home Assistant has locked up due to exhausting the memory. RPi3 only has 1 GB RAM, and for some reason the HassOS image only comes with 200 MB swap by default.

Due to the poor logging with this setup I have not been able to see what’s going on when the memory is all gone. Even though I’ve resorted to keeping a separate PC always on to stream certain logs over ssh, it still eludes me. Top shows the python3 process memory utilization spontaneously and quickly grow out of control. dmesg just shows a notice when python3 memory allocation fails. home-assistant.log with default logging enabled shows nothing out of the ordinary. No other log showed any indication of problems. I will try to enable debug logging for homeassistant.core next.

Now I have also added a 4 GB file-based swap using mkswap and swapon. If this is used a lot it may cause the system to grind to a snails pace, but if it’s fleeting then it’s better than locking up. Time will tell if the same high memory utilization will result in a lock-up or just a temporary high-water mark.

At this point, with the lack of logging, HassOS seems stripped almost beyond usefulness. That Home Assistant removing its own log file each restart is also a weird design choice and makes debugging crashes or lock-ups harder than necessary.

  1. Are there any established ways to troubleshoot memory exhaustion problems?

  2. How can I make the swap file automatically get loaded at host OS start-up?

An update with good news and bad news from this morning.

The system once again at around midnight local time decided to consume a bucketload of system memory.
The good news: with the increased swap space, the RPi itself still lives.
The bad news: the python3 process running homeassistant died.

The “homeassistant” container has crashed.

Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.

2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zeroconf/__init__.py", line 1292, in handle_read
MemoryError

Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, the high memory consumption and CPU usage spikes about an hour prior.

At around 13:55:11 UTC is where memory and CPU utilization begins to climb.

With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.

Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.)

It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES).

This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC.
Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC.
15:00:02 UTC we climb again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that MemoryError happens.
The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.

How in the world am I supposed to troubleshoot this further?!

Reported an issue with the home-assistant/core project. Hope to get some traction there for further troubleshooting.

Perhaps this post should be moved to the Home Assistant forum instead of Configuration?

Possibly related topics:

@basnijholt posted Whole Home Assistant + OS periodically becomes irresponsive

@Guyanthalas posted HA dies after several days

@AlmostSerious posted Python3 high CPU Usage

Less clear but still possibly related:

@jimford posted "Update Available" - No thank you very much!

1 Like

I think that this issue home-assistant/core#35840 is also the same.

It’s difficult to say if zeroconf was the cause of the rapidly growing memory utilization an hour before crash, when the core stopped printing debug messages. Also impossible to say whether the zeroconf log entry prior to MemoryError was related or not, or simply a coincidence.

Some more curious details.

I recently added some virtual sensors to see what can be captured during these memory exhaustion events. Turns out, not much.

My sensor for host OS memory usage?
Memory used %: final entry reads 38.3% at 22:53:51 – 70 seconds before memory starts to climb
Memory free: final entry reads 597.1 MB at 22:54:53 – 18 seconds before

Ping Google DNS?
Set to fire off 4 ping every 60 seconds.
Keeps recording data regularly up until 22:54:41 – 30 seconds before

Ping my local router?
Set to fire off 5 ping every 30 seconds.
Keeps recording data regularly up until 22:55:01 – 10 seconds before

During the hour of steady memory growth and no activity, the homeassistant.core (MainThread) debug messages have ceased. The only messages shown in the log still are from pychromecast.socket_client.(Thread-XX, multiple entries), and it only says “Heartbeat timeout, resetting connection” followed by “Connection reestablished!”, for all my Google Cast devices in unison. These show up 4 times, with about 15-25 minute interval.

These are all the logs I got out of the system last time it happened.