Running latest 2022.5 version on Docker on Ubuntu x86 computer.
When I (re)start the Docker container, everything works fine.
After some time, about an hour or so, various things stop working:
Home Assistant is not accessible on port 8123
I’ve also added HealthChecks.io support - the health checks stop being fired
Automations stop working
On the other hand - some other things keep “working”:
Docker container is alive and well
Logs of background services are still being written (e.g. INFO (Thread-14) [pychromecast.socket_client] [Entryway clock(192.168.1.123):8009] Connection reestablished!)
This started sometime with 2022.5 (not sure exactly when).
I’ve tried changing the log level to debug (for everything, via `logger: default: debug`), but no correlated logs appear at that time; it looks like everything works well.
How can I debug the issue?
Has anyone experienced something similar?
Some extra data if needed:
Version core-2022.5.4
Installation Type Home Assistant Container
Development false
Supervisor false
Docker true
User root
Virtual Environment false
Python Version 3.9.9
Operating System Family Linux
Operating System Version 5.15.0-27-generic
CPU Architecture x86_64
Timezone Asia/Jerusalem
How are you checking this issue? In browser, through app, both?
If you have Linux, you may try nmap haserverip
This will list the open ports it finds and might tell you something about the problem
I use Portainer. Portainer gives access to the live log of a Docker container. This would be good for checking this problem
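If nmap is not at hand, roughly the same check for a single port can be sketched in Python. This is only an illustration: the helper name port_is_open is made up here, and 127.0.0.1 stands in for your HA server's address.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; replace it with the address of the HA host.
    print("8123 open:", port_is_open("127.0.0.1", 8123))
```

Running it once while HA works and once after it hangs shows whether the port itself stops accepting connections or whether the connection opens and HA simply never answers.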
Could this be a memory issue or a storage problem?
I’ve seen a similar issue on a Raspberry Pi when the SD card storage was failing.
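A quick way to look at both on a Linux host is sketched below. It assumes /proc/meminfo is available (Linux only); the function names are invented for this example, and it only reports numbers, it does not diagnose a failing disk.

```python
import shutil


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # most fields are in kB
    return values


def disk_free_fraction(path: str = "/") -> float:
    """Return the fraction of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        print("MemAvailable kB:", info.get("MemAvailable"))
    except FileNotFoundError:
        print("/proc/meminfo not found (not a Linux host?)")
    print(f"free disk on /: {disk_free_fraction('/'):.0%}")
```

Logging these values every few minutes until the hang happens would show whether memory or disk space is steadily running out.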
Have you rebooted the host server?
I’ve seen host server updates create havoc with the HA Docker container. Rebooting twice (not sure why) resolved this. As a rule I expect updates to the host kernel to cause this, and I usually notice it immediately.
Have you tried going back to an older HA version? Does this change the issue? It would help verify whether it’s version related
I’ve had the same issue with an integration that was choking.
HA received so many updates from the sensors of that integration that it stopped working after x time.
For me it was about 12~18 hrs.
Perhaps you could disable some integrations or automations to see if any of them is causing this issue.
It could be a simple bad logic flow in an automation causing this.
I recently had an issue with an automation for a phone charger that tried to switch it on constantly until the device was detected as charging. It would eat up all the allocated RAM at a rate of about 1 GB per 10 minutes and render my HA inaccessible within an hour, though it seemed to half limp along, like a zombie.
So with that in mind, I’d recommend restarting HA, deactivating all automations, restarting again and monitoring it. If it’s stable, then re-enable them 1/2 at a time until you narrow down which one may be causing the hang-up.
If disabling all the automations doesn’t resolve it, then you at least know they aren’t the cause, and you can try disabling the integrations to see if a device is the source of the problem
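One way to organise that narrowing-down is a bisection: enable half of the suspects, see whether the hang returns, and keep halving. The sketch below is only the bookkeeping; hangs_with is a hypothetical callback standing for "enable exactly these, wait, and report whether HA hung", and it assumes a single, reproducible culprit.

```python
from typing import Callable, Sequence


def find_culprit(
    items: Sequence[str],
    hangs_with: Callable[[Sequence[str]], bool],
) -> str:
    """Narrow down one faulty item by repeatedly halving the candidates.

    Assumes items is non-empty and exactly one item causes the hang.
    """
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if hangs_with(half):
            candidates = half  # the culprit is in the enabled half
        else:
            candidates = candidates[len(half):]  # it must be in the rest
    return candidates[0]


if __name__ == "__main__":
    # Simulated run: the automation names are made up for illustration.
    automations = ["lights", "charger", "heating", "doorbell", "backup"]
    print(find_culprit(automations, lambda enabled: "charger" in enabled))
```

With a hang that takes about an hour to show up, halving keeps the number of test rounds small even with a long list of automations or integrations.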
I’ve had some issues with Docker containers on Synology that stop responding (or rather, stop accepting network connections). When it happens, there are some messages being logged about TCP socket memory being depleted.
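To see whether TCP socket memory is getting close to its limit on a Linux host, the counters in /proc/net/sockstat can be compared with the limits in /proc/sys/net/ipv4/tcp_mem (both are in pages). A minimal sketch, assuming those two files exist on your system; the function names are invented for this example.

```python
def parse_sockstat(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/sockstat-style text into {protocol: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        proto, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Counters come as alternating name/value pairs, e.g. "inuse 35 mem 3".
        stats[proto.strip()] = {
            key: int(value) for key, value in zip(fields[0::2], fields[1::2])
        }
    return stats


def tcp_mem_usage(sockstat_text: str, tcp_mem_text: str) -> float:
    """Return TCP pages in use as a fraction of the tcp_mem 'high' limit."""
    used = parse_sockstat(sockstat_text)["TCP"]["mem"]
    _low, _pressure, high = (int(x) for x in tcp_mem_text.split())
    return used / high


if __name__ == "__main__":
    try:
        with open("/proc/net/sockstat") as stat, \
                open("/proc/sys/net/ipv4/tcp_mem") as limits:
            print(f"TCP memory in use: {tcp_mem_usage(stat.read(), limits.read()):.1%}")
    except FileNotFoundError:
        print("sockstat/tcp_mem not available on this system")
```

A value creeping towards 100% before the container stops accepting connections would point in this direction.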
I was developing a custom integration and was using Python locks to try to coordinate threads. I got some reports of Home Assistant locking up (e.g. the UI stopped responding and HA had to be restarted). Since this started occurring immediately after that change, I suspected it. I’ve also seen so many deadlocks in my career that I was pretty sure that was the culprit. I reworked the app to avoid needing a lock at all.
We may want to do a code search on that integration, looking for locks or other multithreading constructs. What integration was it?
I removed the mutex and added ‘PARALLEL_UPDATES=1’ to let HASS prevent multiple access to the API.
I will continue investigating this weird case. I am not sure how the HASS infrastructure accesses our API objects, and whether it’s from different threads or different workers.
It can and does. I’d be glad to look at it. What I saw, for me, was callbacks into schedule_update_ha_state() made with a lock held, and then HA calling into entity methods / properties in parallel that would use the lock, because HA has some internal locks it uses. So the HA lock was held by the method/property call, and the HA lock was also required for schedule_update_ha_state(). Hence the deadlock. This was a year ago, so the details may not be 100%.
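The general pattern, as I understand it, is a lock-order inversion: two threads each hold one lock and wait for the other's. The sketch below reproduces that with plain threading locks. It is not HA's actual code; both locks and the thread names are hypothetical stand-ins, and a timeout is used so the demo reports the problem instead of hanging.

```python
import threading


def demo_lock_order_inversion(timeout: float = 0.5) -> list[str]:
    """Show two threads taking the same two locks in opposite order."""
    integration_lock = threading.Lock()  # hypothetical: the integration's lock
    framework_lock = threading.Lock()    # hypothetical: an internal lock
    barrier = threading.Barrier(2)
    results: list[str] = []
    results_guard = threading.Lock()

    def worker(name: str, first: threading.Lock, second: threading.Lock) -> None:
        with first:
            barrier.wait()  # both threads now hold their first lock
            got = second.acquire(timeout=timeout)
            if got:
                second.release()
            barrier.wait()  # keep holding 'first' until both have tried
        with results_guard:
            results.append(f"{name}: {'ok' if got else 'would deadlock'}")

    threads = [
        threading.Thread(
            target=worker, args=("callback", integration_lock, framework_lock)
        ),
        threading.Thread(
            target=worker, args=("property-read", framework_lock, integration_lock)
        ),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


if __name__ == "__main__":
    for line in demo_lock_order_inversion():
        print(line)
```

Without the timeout, both threads would wait forever, which from the outside looks like a process that is alive but no longer responds.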