Hi all,
I have had this weird issue for roughly two weeks:
After a few hours of runtime, parts of my HA instance (HAOS 2023.01.2 on a 4GB ThinClient) become unresponsive, mainly Shelly and WLED. I cannot open the supervisor logs in the UI; it gives me “Failed to get supervisor logs, 502: Bad Gateway”. When I try to reboot through the settings menu, nothing happens. When I open the “hardware” menu, I no longer see my installation architecture (x64) and I cannot reboot the host.
I can’t access the Add-ons page either.
But the rest works fine. Automations (mainly Zigbee/ZHA) run fine, and I can connect with SSH and see nothing unusual in htop. From the command line I can access the supervisor logs, and there is nothing special in them: no warnings or errors.
I’ve tried downgrading to Core 2022.12.9, but nothing changed.
Any ideas what could be the problem and how to fix it?
There is no such thing as a dumb question.
I have tried several times; it didn’t fix the problem. Actually, this is the only way I can “restart” HA without SSH.
I think you actually confirmed my suspicion at the bottom as well: this really isn’t the supervisor at all, the integration is hosing core.
While in this situation, were you able to SSH in and restart core via the CLI (ha core restart)? And did that fix it temporarily? If so, then the integration is definitely locking up core.
EDIT: @medri could you also check if disabling the integration prevents this from happening? Don’t stop the addon, just disable the integration. Want to try to really narrow it down.
So I’d actually start by reporting it on the repo I linked. Make sure to include the details of what happened and that disabling/removing the integration fixed it. It seems like the integration may be creating a deadlock situation and it should definitely not be doing that.
Given how custom integrations are set up, it’s less of an HA bug and more of a bug in that integration right now. It would be nice if custom integrations were more isolated, but that’s not possible in the current architecture. They run in the same Python process as core, so a bad bug (like this one appears to be) can break far more things than just the integration.
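
To illustrate why sharing one process matters, here is a minimal sketch in plain asyncio. It is not Home Assistant code and not the actual bug in that integration; heartbeat and bad_integration are made-up names, and the timings are arbitrary. It only shows how one blocking call inside a coroutine stalls everything else running on the same event loop.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.05, count=6):
    """Stand-in for well-behaved work: record a timestamp on every tick."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def bad_integration(block_for=0.5):
    """Hypothetical buggy integration: makes a synchronous, blocking call."""
    await asyncio.sleep(0.1)
    # time.sleep() does not yield to the event loop, so every other
    # task on the loop is frozen until it returns.
    time.sleep(block_for)


async def main():
    ticks = []
    await asyncio.gather(heartbeat(ticks), bad_integration())
    # Largest pause between two consecutive heartbeat ticks.
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


def largest_gap():
    return asyncio.run(main())


if __name__ == "__main__":
    print(f"largest gap between heartbeats: {largest_gap():.2f}s")
```

The heartbeat is supposed to tick every 0.05s, but the largest gap comes out at roughly the 0.5s that the blocking call takes. If that call never returned (a real deadlock), the heartbeat would never tick again, which would look a lot like parts of the UI going dead while the process itself is still running.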