Parts of HA get unresponisve after a few hours. Supervisor not reachable

medri · January 12, 2023, 5:45pm

Hi all,
I have this weird issue since roughly 2 weeks:
After a few hours of runtime parts of my HA instance (HAOS 2023.01.2 on a 4GB ThinClient) are unresponsive. Mainly Shelly and WLED. I cannot open the supervisor logs in the UI, it gives me a Failed to get supervisor logs, 502: Bad Gateway. When I try and reboot through the settings menu nothing happens. When I open the “hardware” menu, I don’t see my installation achitecture (x64) anymore and I cannot reboot the host.
I can’t access the Addon’s page either.
But the rest works fine. Automations (mainly zigbee/zha) run fine, I can connect with SSH and see nothing fancy in htop. From the command line I can access the supervisor logs and there is nothing special, no warnings or errors.
I’ve tried downgrading to Core 2022.12.9 but no changes.

Any ideas what could be the problem and how to fix it?

jivgym · January 12, 2023, 8:20pm

This is a dumb question, but have you physically unplugged the host?

medri · January 12, 2023, 9:25pm

There is no such thing as a dumb question.
I have tried several times, didn’t fix the problem. Acutially this is the only way I can “restart” HA without SSH.

CentralCommand · January 12, 2023, 9:28pm

Do you use the Eufy Security integration from HACS by any chance?

medri · January 12, 2023, 9:28pm

Indeed I do. Could this cause the problem?

CentralCommand · January 12, 2023, 9:37pm

Believe so. I’ve got a report here similar to yours:

github.com/home-assistant/supervisor

After a certain time, Supervisor Communication from Core seems to stop working

opened 09:50AM - 06 Jan 23 UTC

closed 11:03PM - 09 Jun 23 UTC

Tscherno

bug stale

### Describe the issue you are experiencing Hi, i'm using HASS OS as OVA on …a Virtual Machine on my Synology. It worked for a few years without issues. Lately it started to have some very strange issue. It could be connected to the latest supervisor update, but i'm not 100% sure. What it does: - After Restart (OS or Core) everything works perfectly - Some time later (probably 30 minutes) the problems begin + Addons are not accessible via ingress anymore: trying to access them gives a timeout or Unable to load the panel source: /api/hassio/app/entrypoint.js. + Addons itself are running (i see it for example for fully working Zigbee2MQTT addon in MQTT) + Trying to access Settings => Addons gives "Could not load the Supervisor panel!" + Restarting from the interface doesn't work anymore The observer-statuspage gives # Home Assistant observer |Supervisor:|Connected| | --- | --- | |Supported:|Supported| |Healthy:|Healthy| Restarting / Reparing the observer doesn't help Restarting HA Core via CLI fixes the issue temporarly again. ### What type of installation are you running? Home Assistant OS ### Which operating system are you running on? Home Assistant Operating System ### Steps to reproduce the issue 1. Restart HA Core 2. Wait some time ### Anything in the Supervisor logs that might be useful for us? ```txt No noticeable errors ``` ### System Health information ## System Information version | core-2023.1.0 -- | -- installation_type | Home Assistant OS dev | false hassio | true docker | true user | root virtualenv | false python_version | 3.10.7 os_name | Linux os_version | 5.15.80 arch | x86_64 timezone | Europe/Berlin config_dir | /config <details><summary>Home Assistant Community Store</summary> GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok GitHub API Calls Remaining | 5000 Installed Version | 1.29.0 Stage | running Available Repositories | 1208 Downloaded Repositories | 46 </details> <details><summary>Home Assistant Cloud</summary> logged_in | true -- | -- subscription_expiration | November 30, 2023 at 1:00 AM relayer_connected | true remote_enabled | true remote_connected | true alexa_enabled | true google_enabled | false remote_server | eu-central-1-4.ui.nabu.casa can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | ok </details> <details><summary>Home Assistant Supervisor</summary> host_os | Home Assistant OS 9.4 -- | -- update_channel | beta supervisor_version | supervisor-2022.12.1 agent_version | 1.4.1 docker_version | 20.10.19 disk_total | 491.4 GB disk_used | 56.8 GB healthy | true supported | true board | ova supervisor_api | ok version_api | ok installed_addons | Samba share (10.0.0), File editor (5.4.2), SSH & Web Terminal (13.0.0), Grafana (8.1.0), MariaDB (2.5.1), Frigate (0.11.1), Node-RED (14.0.1), evcc (0.110.1), Eufy Security Add-on (1.3.0), UniFi Network Application (2.5.0), phpMyAdmin (0.8.3), ESPHome (dev) (dev), RTSP Simple Server Add-on (v0.17.6), Zigbee2MQTT (1.29.0-1) </details> <details><summary>Dashboards</summary> dashboards | 4 -- | -- resources | 32 views | 9 mode | storage </details> <details><summary>Miele</summary> component_version | 0.1.4 -- | -- reach_miele_cloud | ok </details> <details><summary>Recorder</summary> oldest_recorder_run | December 30, 2022 at 8:56 AM -- | -- current_recorder_run | January 6, 2023 at 9:47 AM estimated_db_size | 1002.31 MiB database_engine | mysql database_version | 10.6.8 </details> <details><summary>Sonoff</summary> version | 3.3.1 (b20e33c) -- | -- cloud_online | 0 / 1 local_online | 0 / 0 </details> ### Supervisor diagnostics [config_entry-hassio-a0598a45fdd52b1dd951477d67f04d27.json.txt](https://github.com/home-assistant/supervisor/files/10359167/config_entry-hassio-a0598a45fdd52b1dd951477d67f04d27.json.txt) ### Additional information _No response_

Seems like all the users on it were using that integration and by adding TRUSTED_DEVICE_NAME to the addon it was resolved.

I think you actually confirmed my suspicion at the bottom as well. That really this isn’t supervisor at all, the integration is hosing core.

While in this situation you were able to SSH in and restart core via the CLI (ha core restart)? And that fixed it temporarily? If so then the integration is definitely locking up core.

EDIT: @medri could you also check if disabling the integration prevents this from happening? Don’t stop the addon, just disable the integration. Want to try to really narrow it down.

medri · January 12, 2023, 11:20pm

I could log in via SSH and tried 2 different things that got HA to restart

ha core rebuild
ha core restart
Restarting core fixed it temporarily.

I was too quick in uninstalling it I’ll try in the next few days.
@CentralCommand Should I report here or to the github issue you mentioned?

CentralCommand · January 12, 2023, 11:24pm

So I’d actually start by reporting it on the repo I linked. Make sure to include the details of what happened and that disabling/removing the integration fixed it. It seems like the integration may be creating a deadlock situation and it should definitely not be doing that.

As custom integrations are set up its less of an HA bug and more of a bug with that integration right now. It would be nice if custom integrations were more isolated but that’s not possible in the current architecture. They run in the same python process so a bad bug (like this one appears to be) can break far more things then just the integration.