Whole Home Assistant + OS periodically becomes irresponsive

basnijholt · June 1, 2020, 6:17pm

I have HA OS running on a NUC on Proxmox.

Very frequently (about every few hours) my system locks up. HA becomes unresponsive; see, e.g., my CPU sensor in Grafana:

Here I see that at some point HA stops registering the sensor and then, after about 15 minutes, it starts again, usually coming down from high CPU usage. This makes me think that some process (maybe an add-on) spirals out of control and then recovers. While HA is unresponsive, ssh also becomes unresponsive, so I cannot look at top, for example.

I have no (simple) idea of how to debug this because the log files (both supervisor and core) show nothing of relevance. Does anyone have any suggestions?

Version info:
Supervisor 225
System HassOS 4.8
HA 0.110.4
My configuration files https://github.com/basnijholt/home-assistant-config/

salonluden · June 1, 2020, 6:27pm

I have been having the same problem as well. Granted, I am running off a RPi 3B+, but it didn’t happen until I updated to 0.110. I hope this gets fixed soon.

basnijholt · June 1, 2020, 6:46pm

Ah, good to hear that it’s not just me.

I am not entirely sure whether this started for me with 0.110 though. I have been having issues for a few weeks and recently moved from Ubuntu to a Proxmox setup because of the (reverted!) deprecation announcement. So I am not sure whether it’s because of that move or some HA bug.

balloob · June 1, 2020, 10:12pm

Home Assistant is built around an event loop. This means that there is always only a single task running at the core of the system. When a task needs to do I/O, it schedules an I/O task and suspends itself until the I/O is done.

I don’t know if it is the cause here, but one reason Home Assistant can stop responding for several seconds is an incorrectly coded integration doing I/O inside the task instead of scheduling an I/O task. The whole event loop then blocks and no other tasks are processed until the I/O is done.

A good first step is to make sure you don’t have any warnings about I/O in the event loop in your logs. If you do, getting those fixed should be step 1.
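To make the difference concrete, here is a minimal sketch using plain asyncio rather than Home Assistant's own helpers. The names blocking_io, heartbeat and longest_gap are made up for this illustration, and time.sleep stands in for any blocking call. A heartbeat task ticks every 20 ms, and the longest pause between ticks shows how long the loop was stuck.

```python
import asyncio
import time


def blocking_io() -> str:
    """Stand-in for a slow blocking call (file read, HTTP request, ...)."""
    time.sleep(0.2)
    return "done"


async def heartbeat(ticks: list) -> None:
    """Record a timestamp roughly every 20 ms; gaps show when the loop was blocked."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def longest_gap(use_executor: bool) -> float:
    """Run blocking_io once and return the longest pause between heartbeats."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let the heartbeat record a few ticks first
    loop = asyncio.get_running_loop()
    if use_executor:
        # Scheduled I/O: the call runs in a worker thread while the loop keeps going.
        await loop.run_in_executor(None, blocking_io)
    else:
        # I/O inside the task: nothing else can run until this returns.
        blocking_io()
    await asyncio.sleep(0.05)  # let the heartbeat record a tick after the call
    beat.cancel()
    return max(b - a for a, b in zip(ticks, ticks[1:]))


blocked = asyncio.run(longest_gap(use_executor=False))
offloaded = asyncio.run(longest_gap(use_executor=True))
print(f"longest heartbeat gap, blocking in loop: {blocked:.2f}s")
print(f"longest heartbeat gap, using executor:  {offloaded:.2f}s")
```

With the blocking call inside the task, the heartbeat pauses for at least the full duration of the call. With the executor, it should keep ticking at close to its normal interval.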

basnijholt · June 2, 2020, 9:14pm

Thanks @balloob! However, I do not think it is merely the event loop getting blocked, because not only does Home Assistant become unresponsive, so do all add-ons.

So ssh doesn’t work, glances stops reporting (preventing me from finding the culprit), and I am even unable to get into the Proxmox image.

When everything unblocks I see that a Python process crashed:

Does anyone know how I can find out who owned that process?

It seems like a Python process has a memory leak which also uses 100% CPU.

Nonetheless, I have fixed all warnings.

AlmostSerious · June 2, 2020, 9:57pm

Might not be related at all, but I had similar issues. For me, two things helped:

1. I disabled zeroconf. That already helped a lot in bringing the Python process’s CPU usage down a notch.
2. I disconnected my VLANs from the host. As soon as my host is connected to more than one subnet, Home Assistant uses a lot more CPU.

After doing 2, I re-enabled zeroconf and had no issues anymore.

logan893 · June 3, 2020, 8:20am

Possibly similar to the issues I’m having.

I raised an issue with the home-assistant core just now.

github.com/home-assistant/core

MemoryError and crash of python and homeassistant docker container when using kef integration

opened 08:00AM - 03 Jun 20 UTC

closed 05:35PM - 08 Sep 20 UTC

logan893

integration: kef

## The problem

The python3 process running homeassistant died. The “homeassistant” container crashed.

Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.

```
2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zeroconf/__init__.py", line 1292, in handle_read
MemoryError
```

Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, the high memory consumption and CPU usage spikes about an hour prior. At around 13:55:11 UTC is where memory and CPU utilization begins to climb.

With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.

Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.) It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES). This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC. Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC. 15:00:02 UTC we climb again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that MemoryError happens.

The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.

https://community.home-assistant.io/t/memory-exhausted-by-python3-process-process-container-crash/201011/2

## Environment

- Home Assistant Core release with the issue: I've had similar memory exhaustion issues with home assistant since first installing. 0.109, and several 0.110.x. Lack of information in logs makes it impossible to determine the cause. Most recently 0.110.4, with HassOS 3.13.
- Last working Home Assistant Core release (if known): N/A
- Operating environment (Home Assistant/Supervised/Docker/venv): Raspberry Pi 3, HassOS based installation from recommended 32-bit image
- Integration causing this issue: N/A
- Link to integration documentation on our website: N/A

## Problem-relevant `configuration.yaml`

First happened with default configuration.yaml, only loaded my Google Home units (4x Home Mini, 1x Nest Mini, 1x Chromecast, 1x Chromecast Audio, 1x JBL speaker) and Philips Hue bridge with one light. Since then I've added more, but with the lack of logs I cannot say if it's related to any specific configuration or the base Home Assistant core system.

```yaml
```

## Traceback/Error logs

DEBUG was active on homeassistant.core and it stops producing any logs about an hour prior to MemoryError being output into the home-assistant.log. See the detailed flow of events above. The memory utilization of python starts to grow at the same time as the homeassistant.core DEBUG logging stops.

```
MemoryError
```

## Additional information

I've tried to collect as many logs as possible but still cannot see what is triggering this runaway memory usage which results in a crash.

basnijholt · June 3, 2020, 9:25am

Great! I am happy that it might not be a one-off problem.

Like you did, I will stream the log file over ssh and see whether I can find that zeroconf warning.
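For anyone wanting to do the same, below is a rough sketch of a filter for such a stream. It assumes the log lines look like the one quoted in the GitHub issue above (timestamp, level, thread in parentheses, logger in square brackets). The regular expression and the find_logger_entries helper are my own illustration, not anything shipped with Home Assistant.

```python
import re
from typing import Iterable, Iterator

# Assumed line format, taken from the single log line quoted in the issue:
# "<date> <time> <LEVEL> (<thread>) [<logger>] <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) \[(?P<logger>[^\]]*)\] (?P<message>.*)$"
)


def find_logger_entries(
    lines: Iterable[str],
    logger: str = "zeroconf",
    levels: tuple = ("WARNING", "ERROR"),
) -> Iterator[dict]:
    """Yield parsed entries written by `logger` at one of the given levels."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Traceback continuation lines do not match the pattern and are skipped.
        if match and match["logger"] == logger and match["level"] in levels:
            yield match.groupdict()


# Sample input: the log line and traceback fragments quoted in the issue.
sample = [
    "2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:",
    "Traceback (most recent call last):",
    "MemoryError",
]
hits = list(find_logger_entries(sample))
print(hits[0]["time"], hits[0]["message"])
```

To use it on a live stream, pass sys.stdin as lines and pipe the streamed log into the script.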

Bigrob8181 · June 3, 2020, 10:00am

I noticed something similar this last week. Maybe related? I’m running Home Assistant Supervised on a NUC. I noticed my NUC’s fan was at full speed following a restart and would not come down, and the temperature was spiking. Upon investigation, the python3 process for the supervisor was using 100% CPU and not going down. Since I’m running in Docker, I restarted the supervisor and everything went back to normal. I have no idea what caused the problem, as I didn’t see anything in the logs, but it seemed to be a pretty good band-aid.

AlmostSerious · June 3, 2020, 10:10am

Just leaving a link to my original post with what I believe to be the same error: Python3 high CPU Usage
What you could do is run py-spy to analyze the Python process.

basnijholt · June 3, 2020, 12:29pm

To summarize (mostly @logan893’s findings):

GitHub issues:

Possibly related topics:

@logan893 posted Memory exhausted by python3 process - process/container crash
@basnijholt posted Whole Home Assistant + OS periodically becomes irresponsive
@Guyanthalas posted HA dies after several days
@AlmostSerious posted Python3 high CPU Usage

Less clear but still possibly related:

@jimford posted “Update Available” - No thank you very much!