Parts of HA get unresponsive after a few hours. Supervisor not reachable

Hi all,
I’ve had this weird issue for roughly 2 weeks:
After a few hours of runtime, parts of my HA instance (HAOS 2023.01.2 on a 4GB ThinClient) become unresponsive, mainly Shelly and WLED. I cannot open the supervisor logs in the UI; it gives me “Failed to get supervisor logs, 502: Bad Gateway”. When I try to reboot through the settings menu, nothing happens. When I open the “hardware” menu, I no longer see my installation architecture (x64) and I cannot reboot the host.
I can’t access the add-ons page either.
But the rest works fine. Automations (mainly zigbee/zha) run fine, I can connect with SSH and see nothing fancy in htop. From the command line I can access the supervisor logs and there is nothing special, no warnings or errors.
I’ve tried downgrading to Core 2022.12.9, but nothing changed.

Any ideas what could be the problem and how to fix it?

This is a dumb question, but have you physically unplugged the host?

There is no such thing as a dumb question.
I have tried several times, but it didn’t fix the problem. Actually, this is the only way I can “restart” HA without SSH.

Do you use the Eufy Security integration from HACS by any chance?

Indeed I do. Could this cause the problem?

Believe so. I’ve got a report here similar to yours:

It seems all the users on it were using that integration, and adding TRUSTED_DEVICE_NAME to the addon resolved it.

I think you actually confirmed my suspicion at the bottom of your post as well: this really isn’t a supervisor problem at all; the integration is hosing core.

While in this situation, were you able to SSH in and restart core via the CLI (ha core restart)? And did that fix it temporarily? If so, then the integration is definitely locking up core.

EDIT: @medri could you also check if disabling the integration prevents this from happening? Don’t stop the addon, just disable the integration. Want to try to really narrow it down.

I could log in via SSH and tried 2 different things that got HA to restart:

  1. ha core rebuild
  2. ha core restart

Restarting core fixed it temporarily.

I was too quick in uninstalling it :smile: I’ll try in the next few days.
@CentralCommand Should I report here or to the github issue you mentioned?

So I’d actually start by reporting it on the repo I linked. Make sure to include the details of what happened and that disabling/removing the integration fixed it. It seems like the integration may be creating a deadlock situation and it should definitely not be doing that.

Given how custom integrations are set up, it’s less of an HA bug and more of a bug with that integration right now. It would be nice if custom integrations were more isolated, but that’s not possible in the current architecture. They run in the same Python process, so a bad bug (like this one appears to be) can break far more things than just the integration.
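To illustrate why one misbehaving component can stall unrelated ones when everything shares a process, here is a minimal sketch using plain asyncio. It is not Home Assistant or Eufy Security code, and it models a simple blocking call rather than whatever the integration actually does; the function names `well_behaved`, `misbehaving`, and `run_demo` are made up for this example.

```python
import asyncio
import time


async def well_behaved(ticks: list[float]) -> None:
    """Record a timestamp about every 0.05 s, yielding to the loop each time."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def misbehaving() -> None:
    """Hypothetical buggy component: a synchronous call inside a coroutine."""
    await asyncio.sleep(0.01)  # let the other task record its first tick
    time.sleep(0.5)  # blocks the shared event loop; no other task can run


async def main() -> float:
    ticks: list[float] = []
    await asyncio.gather(well_behaved(ticks), misbehaving())
    # Largest gap between two consecutive ticks of the well-behaved task.
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


def run_demo() -> float:
    return asyncio.run(main())


if __name__ == "__main__":
    print(f"longest stall seen by the well-behaved task: {run_demo():.2f}s")
```

The well-behaved task expects to tick every 0.05 s, but its longest gap ends up at roughly the length of the blocking call, even though it has nothing to do with the buggy one. A call that never returns would stall it indefinitely, which would match the kind of partial lockup described in this thread.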