Whole Home Assistant + OS periodically becomes unresponsive

I have HA OS running on a NUC on Proxmox.

Very frequently (every few hours or so) my system locks up. HA becomes unresponsive; see, e.g., my CPU sensor in Grafana:

Here I see that at some point HA stops registering the sensor and then, after about 15 minutes, it starts again, usually coming down from a high CPU usage. This makes me think that there is some process (maybe an add-on) that spirals out of control and then recovers. While HA is unresponsive, ssh also becomes unresponsive, so I cannot look at top, for example.

I have no (simple) idea of how to debug this because the log files (both supervisor and core) show nothing of relevance. Does anyone have any suggestions?

Version info:
Supervisor 225
System HassOS 4.8
HA 0.110.4
My configuration files https://github.com/basnijholt/home-assistant-config/

I have been having the same problem as well. Granted, I am running off a RPi 3B+, but it didn’t happen until I updated to 0.110. I hope this gets fixed soon.

Ah, good to hear that it’s not just me.

I am not entirely sure whether this happened for me since 0.110 though. I have been having issues for a few weeks and recently moved from Ubuntu to a Proxmox setup because of the (reverted!) deprecation announcement. So I am not sure whether it’s because of that or some HA bug.

Home Assistant is built around an event loop. This means that there is always only a single task running at the core of the system. When a task needs to do I/O, it schedules an I/O task and suspends itself until the I/O is done.

I don’t know if it is the cause here, but one reason for Home Assistant not acting for several seconds is if an incorrectly coded integration is doing I/O inside the task, instead of scheduling an I/O task. Now the whole event loop blocks and no other tasks are processed until the I/O is done.

A good first step is to make sure you don’t have any warnings about I/O in the event loop in your logs. If you do, getting those fixed should be step 1.
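To make the difference concrete, here is a minimal sketch in plain asyncio. It is not Home Assistant code, and `blocking_io`, `heartbeat` and `count_ticks` are made-up names for illustration. A heartbeat task ticks every 20 ms; it gets starved when the slow call runs directly inside the event loop, and keeps ticking when the same call is handed to an executor thread.

```python
import asyncio
import time


def blocking_io() -> str:
    """Stand-in for slow synchronous I/O (hypothetical helper)."""
    time.sleep(0.2)
    return "done"


async def heartbeat(ticks: list) -> None:
    """Record a timestamp every 20 ms, whenever the loop lets us run."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def count_ticks(use_executor: bool) -> int:
    """Return how many heartbeats happened while the slow call was running."""
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    before = len(ticks)
    if use_executor:
        # Correct: run the blocking call in a worker thread and await it.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_io)
    else:
        # Incorrect: the event loop cannot run anything else meanwhile.
        blocking_io()
    during = len(ticks) - before
    task.cancel()
    return during


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(count_ticks(False)))
    print("ticks while using an executor:", asyncio.run(count_ticks(True)))
```

In the first case the heartbeat does not tick at all during the slow call, which is what a blocked event loop looks like from the outside. In the second case it keeps ticking.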

Thanks @balloob! However, I do not think it is merely the event loop getting blocked, because not only does Home Assistant become unresponsive, so do all add-ons.

So ssh doesn’t work, glances stops reporting (preventing me from finding the culprit), and I am even unable to get into the Proxmox image.

When everything unblocks I see that a Python process crashed:

Does anyone know how I can find out who owned that process?

It seems like a Python process has a memory leak and also uses 100% CPU.
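One possible way to answer the "who owned that process" question, as a rough sketch only: add-ons run as Docker containers, and on Linux the file `/proc/<pid>/cgroup` usually contains the ID of the container a process belongs to. The sketch assumes you can get a shell on the host while the process is still alive, and that the cgroup paths look like `/docker/<id>` or `docker-<id>.scope`. The function names are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: Docker container IDs appear as 64 hex characters in cgroup paths,
# either as ".../docker/<id>" or as ".../docker-<id>.scope".
_CONTAINER_ID = re.compile(r"(?:/docker/|docker-)([0-9a-f]{64})")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a Docker container ID from the contents of a cgroup file."""
    match = _CONTAINER_ID.search(cgroup_text)
    return match.group(1) if match else None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Look up the container ID for a running process, or None if unknown."""
    try:
        text = Path(f"/proc/{pid}/cgroup").read_text()
    except OSError:
        # Process is gone, or /proc is not readable from here.
        return None
    return container_id_from_cgroup(text)


if __name__ == "__main__":
    import sys

    pid = int(sys.argv[1])
    print(container_id_for_pid(pid))
```

The first 12 characters of the result can then be compared with the container IDs that `docker ps` lists. This only helps while the runaway process is still running, so it would have to be caught before it gets killed.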

Nonetheless, I have fixed all warnings.

Might not be related at all, but I had similar issues. For me, two things helped.

  1. I disabled zeroconf. That already helped a lot in bringing the Python process down a notch in CPU usage.
  2. I disconnected my VLANs from the host. As soon as my host is connected to more than one subnet, Home Assistant uses a lot more CPU.
    After doing 2, I re-enabled zeroconf and had no issues anymore.

Possibly similar to the issues I’m having.

I raised an issue with the home-assistant core just now.

Great! I am happy that it might not be a one-off problem.

Like you did, I will stream the log file over ssh and see whether I can find that zeroconf warning.

I noticed something similar this last week. Maybe related? I’m running Home Assistant Supervised on a NUC. I noticed my NUC’s fan was running at full speed following a restart and would not come down, and the temperature was spiking. Upon investigation, the python3 process for the Supervisor was using 100% CPU and not going down. Since I’m running in Docker, I restarted the Supervisor and everything went back to normal. No idea what caused the problem, as I didn’t see anything in the logs, but it seemed to be a pretty good band-aid.

Just leaving a link to my original post with what I believe to be the same error: Python3 high CPU Usage
What you could do is run py-spy to analyze the Python process.

To summarize (mostly @logan893’s findings) :

GitHub issues:

Possibly related topics:

Less clear but still possibly related: