HomeAssistant stops functioning after about one hour

Running the latest 2022.5 version in Docker on an Ubuntu x86 machine.

When (re)starting the Docker container, everything works as expected.
After some time, about an hour or so, various things stop working:

  • Home Assistant is not accessible on port 8123
  • I’ve also added HealthChecks.io support – the health checks stop firing
  • Automations stop working

On the other hand - some other things keep “working”:

  • Docker container is alive and well
  • Logs of background services are still being written (e.g. INFO (Thread-14) [pychromecast.socket_client] [Entryway clock(192.168.1.123):8009] Connection reestablished!)
  • Docker container is accessible, running ps -ef returns 191 root 39:39 python3 -m homeassistant --config /config

This started sometime with 2022.5 (not sure exactly when).

I’ve tried raising all the logs to debug (via `logger: default: debug`) – but no correlated log entries appear at that time; it looks like everything is working fine.

How can I debug the issue?
Anyone experienced something similar?

Some extra data if needed:

Version	core-2022.5.4
Installation Type	Home Assistant Container
Development	false
Supervisor	false
Docker	true
User	root
Virtual Environment	false
Python Version	3.9.9
Operating System Family	Linux
Operating System Version	5.15.0-27-generic
CPU Architecture	x86_64
Timezone	Asia/Jerusalem

How are you checking this issue? In browser, through app, both?

If you’re on Linux, you can try `nmap <ha-server-ip>`.
This will list the open ports and might tell you something about the problem.
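If nmap isn’t installed, the same per-port check can be done from Python. A minimal sketch (the host and port below are examples, not taken from the original post):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical address): port_open("192.168.1.50", 8123)
```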

I use Portainer. It gives access to the live log of a Docker container, which would be useful for checking this problem.

Could this be a memory or storage problem?
I’ve seen a similar issue on a Raspberry Pi when the SD card storage was failing.

Have you rebooted the host server?
I’ve seen host server updates create havoc with the HA Docker container. Rebooting twice (not sure why) resolved this. As a rule I expect host kernel updates to cause this, and I usually notice it immediately.

Have you tried going back to an older HA version? Does that change the issue? It would help verify whether it’s version-related.

I use both a web browser and curl calls from Linux itself – both time out.

Running nmap doesn’t show port 8123 as open.

Running Portainer – the logs correspond with the home-assistant.log file, but another interesting thing: the container stats (screenshot not included) still show activity.

Meaning – something is working in the background.

There’s plenty of memory and storage so this is not a problem.
In addition, I’ve rebooted multiple times and still it doesn’t work.

Didn’t try reverting to 2022.4, for example – I thought it was worth keeping these settings for reproduction until it’s solved.

Any other ideas?

I’ve had the same issue with an integration that was choking.
HA received so many updates from that integration’s sensors that it stopped working after some amount of time.
For me it was about 12–18 hours.
Perhaps you could disable some integrations or automations to see if any of them is causing this issue.

It could be a simple bad logic flow in an automation causing this.
I recently had an issue with an automation for a phone charger that kept trying to switch it on until the device was detected as charging. It ate up all the allocated RAM at a rate of about 1 GB per 10 minutes and rendered my HA inaccessible within an hour, but it seemed to half limp along. Like a zombie :smiley:

(my embarrassing example: Memory leak when using Google Calendar and location in an automation - #11 by callumw)
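The general fix for that kind of runaway loop is to bound the retries. A hypothetical Python sketch of the idea (the real automation was HA YAML; all names here are made up):

```python
MAX_RETRIES = 5  # give up instead of retrying (and allocating) forever

def switch_on_until_charging(turn_on, is_charging):
    """Retry turning a charger on, but stop after MAX_RETRIES attempts.
    Returns the attempt number that succeeded, or None if we gave up."""
    for attempt in range(1, MAX_RETRIES + 1):
        turn_on()
        if is_charging():
            return attempt
    return None  # bail out; better to fire an alert than hang HA
```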

So with that in mind, I’d recommend restarting HA, deactivating all automations, restarting again, and monitoring it. If it’s stable, re-enable them a few at a time until you home in on the one causing the hang-up.
If disabling all the automations doesn’t resolve it, you at least know they aren’t the cause, and you can try disabling integrations to see if a device is the source of the problem.

Have you checked the system logs? `sudo dmesg` and /var/log/syslog

From within the container or on the host?

On the host.

I’ve had some issues with Docker containers on Synology that stop responding (or rather, stop accepting network connections). When it happens, there are some messages being logged about TCP socket memory being depleted.

Sounds to me like a deadlock. I had very similar behavior when writing an integration and improperly using locks.

Actually I’m surprised I didn’t think of it myself.
The issue started after I added a new integration. Disabling it solved the issue. Thanks everyone

That’s interesting, how did you find it’s related to Deadlocks?

@robertklep – looking at `sudo dmesg` and /var/log/syslog, I couldn’t find any event that correlates with the issue

If it’s caused by a deadlock, you wouldn’t :slight_smile:

I was developing a custom integration and was using Python locks to coordinate threads. I got some reports of Home Assistant locking up (e.g. the UI stopped responding and HA had to be restarted). Since this started occurring immediately after that change, I suspected it. I’ve also seen so many deadlocks in my career that I was pretty sure that was the culprit. I reworked the integration to avoid needing a lock at all.

We may want to do a code search on that integration, looking for locks or other multithreading constructs. What integration was it?

It is a custom integration I am responsible for.

Guy reached out, and I fixed the issue.

It was indeed a deadlock, without a clear reason.

I removed the mutex and added `PARALLEL_UPDATES = 1` to let HASS prevent concurrent access to the API.

Will continue investigating this weird case; I am not sure how the HASS infrastructure accesses our API objects, or whether it does so from different threads or different workers.
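For context, `PARALLEL_UPDATES` is a module-level constant in the integration’s platform file (e.g. `sensor.py`) that caps how many entity updates run concurrently. A simplified sketch of the idea – a shared semaphore serializing updates. This mirrors the concept, not Home Assistant’s actual internals:

```python
import asyncio

PARALLEL_UPDATES = 1  # in HA: a module-level constant in the platform file

async def update_entity(name, semaphore, results):
    async with semaphore:        # at most PARALLEL_UPDATES updates at once
        results.append(f"start {name}")
        await asyncio.sleep(0)   # pretend to poll the device API
        results.append(f"end {name}")

async def main():
    semaphore = asyncio.Semaphore(PARALLEL_UPDATES)
    results = []
    await asyncio.gather(update_entity("a", semaphore, results),
                         update_entity("b", semaphore, results))
    return results

# Updates never interleave: "a" finishes before "b" starts.
print(asyncio.run(main()))
```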

It can and does. I’d be glad to look at it. In my case, I was doing callbacks into

schedule_update_ha_state()

while holding a lock, and HA would then call into entity methods/properties in parallel that also used that lock – HA has some internal locks of its own. So an internal HA lock was taken by the method/property call, and the same lock was required by schedule_update_ha_state(). Hence the deadlock. This was a year ago, so the details may not be 100% accurate.
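A minimal, hypothetical sketch of that failure mode (not the actual integration code): a refresh holds a non-reentrant lock, and the update-callback path tries to take the same lock. A non-blocking acquire is used here so the example returns instead of hanging:

```python
import threading

api_lock = threading.Lock()  # hypothetical integration-level mutex

def schedule_update():
    """Stands in for the callback path (schedule_update_ha_state triggering
    entity properties) that also wants the integration's lock."""
    if not api_lock.acquire(blocking=False):
        # With a blocking acquire, this thread would wait forever:
        # the lock is held by its own caller - a classic deadlock.
        return "deadlocked"
    api_lock.release()
    return "ok"

def poll_device():
    with api_lock:                # refresh holds the lock during the API call
        return schedule_update()  # callback re-enters, needs the same lock

print(poll_device())  # -> "deadlocked"
```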