Improved Stability

The single greatest issue for me is HA being unstable from time to time. Most often I’ll notice that the log file starts to rapidly bloat from repeated errors and if left unchecked that will eventually cause the whole system to stop responding and require me to restart the container within docker to get it going again. Most often for me now it happens overnight, so that when I wake up automations do not work and I can’t even turn off my alarm. An unpleasant awakening, to say the least.

The “official” response I get is usually “it’s a third party integration, fix that” or something similar. Which on the surface is true. But what I’m requesting is deeper than that. I do not believe a single integration or device should be able to take down my entire smart home. It would be great if HA could “pause” a misbehaving integration or device rather than let it run rampant and take down everything.
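
To make the "pause" idea concrete, here is a minimal sketch of a circuit breaker in plain Python. It is not Home Assistant code, and `CircuitBreaker`, `max_failures` and `cooldown` are hypothetical names chosen only for illustration. After a few consecutive failures, the wrapped call is skipped for a cooldown period instead of being retried over and over.

```python
import time


class CircuitBreaker:
    """Pause calls to a component after repeated failures (hypothetical helper)."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before pausing
        self.cooldown = cooldown          # seconds to stay paused
        self.clock = clock                # injectable so the pause can be tested
        self.failures = 0
        self.paused_until = 0.0

    def is_paused(self):
        return self.clock() < self.paused_until

    def call(self, func, *args, **kwargs):
        """Run func unless paused; return None while paused or when func fails."""
        if self.is_paused():
            return None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many failures in a row: stop calling for a while.
                self.paused_until = self.clock() + self.cooldown
                self.failures = 0
            return None
        self.failures = 0  # any success resets the counter
        return result
```

Something along these lines would let one failing device be skipped for a minute while everything else keeps running. Whether that fits HA's architecture is a separate question.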

Some documented examples:
Vague template errors
Crashed by unexpected characters in a calendar

Thanks for your consideration.

I do understand that if your instance crashes that often, you would ask for better stability.
For many of us, though, HA is rock solid, so there is no pressing need to change anything there.

I don’t know whether your request will get much support, but in the meantime I would advise you to look for the root cause, either by checking the logs and asking for help or by eliminating things one at a time.

I’m glad for you, really. But when things go wrong, they go south fast. Believe me. “Works on my machine” doesn’t change the fact that HA is coded optimistically and has little tolerance for error.

Check my first link. You’ll see I have checked the logs and do ask for help, and usually get about what I got there.

Of course I know that it can go south.
It’s unrealistic to expect software that is made by humans, and that is open to so many devices and protocols, to be free of errors.

If you have already integrated it deeply into your home, it can be a PITA when your instance is not reliable.

Have you searched the forum for similar reports?

Of course I have searched. But we’re getting off topic.

All I ask is consideration for some improvements when there are errors. I know it’s not a small ask and I know it isn’t happening overnight. There is always room for improvement.

There are currently a number of custom integrations that are not coded to be thread safe. They do bad things with threads. They got away with it until the latest version of HA, which finished the switch-over to Python 3.12. This switch has been in the making for months, and developers of anything connecting to HA have had all this time to fix it. Most have, some have not.

BDraco has mapped out instructions on how to identify the ones that are bad.

In the meantime, HACS and everything in the HA docs tell you that custom integrations, themes, Python scripts, blueprints, add-ons, and anything else you add that doesn’t come included in core can cause instability in your system, and that if you have problems, you should check those first. In this case, remove all the custom integrations and you will have a stable system.
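
For anyone wondering what "not thread safe" looks like in practice, here is a small sketch in plain asyncio. It is not taken from Home Assistant or from any integration. A worker thread first schedules a callback on the event loop the wrong way (`loop.call_soon`) and then the right way (`loop.call_soon_threadsafe`). With asyncio debug mode on, the wrong call is rejected with a `RuntimeError`.

```python
import asyncio
import threading


async def main():
    loop = asyncio.get_running_loop()
    loop.set_debug(True)  # debug mode makes asyncio check the calling thread
    results = []
    errors = []
    done = asyncio.Event()

    def worker():
        # Wrong: touching the event loop directly from another thread.
        try:
            loop.call_soon(results.append, "unsafe")
        except RuntimeError as err:
            errors.append(str(err))
        # Right: hand the callbacks over in a thread-safe way.
        loop.call_soon_threadsafe(results.append, "safe")
        loop.call_soon_threadsafe(done.set)

    thread = threading.Thread(target=worker)
    thread.start()
    await asyncio.wait_for(done.wait(), timeout=5)
    thread.join()
    return results, errors


results, errors = asyncio.run(main())
print(results)
print(errors)
```

Without debug mode, the wrong call is not checked at all, which is presumably how code like this can appear to work for a long time.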

Thanks, that sounds like movement in the right direction.

BUT if you’re referring to the AssertionError, no, that was happening to me long before 2024.4. Actually, I don’t know that it has happened to me since I upgraded (but that has only been about a week).

And again, I’m not making this about one issue, those are examples. Ideally, no one integration or device should be able to bring the system down.

Edit: also, note example 2, a core integration, still took me down. Those issues will happen, I know that. HA should too.

I am not a dev at all, just a user like you, but I have heard rumors that thread-unsafe actions may be detectable when they happen. If that is the case, I’m sure they will come up with something to lock out the offender as it happens and keep the core from crashing.

This is the nature of this kind of application, though. It is open source, and the custom whatevers are tested not by Nabu Casa but by the code owner/contributor. The switch to the new Python has exposed several problems with contributed code.

This could help you find the issue

My instance locked again this morning, with Profiler running. The UI would not come up, nor were any automations running.

The last hour+ from the log is below, unedited. I can’t say I understand, but it’s almost like it got caught up trying to reconnect to my Sony Bravia Smart TV’s Chromecast and stopped responding to anything else. Does this make sense to anyone else?

It does seem to support what I’ve said…a Core component hung up and took my whole instance down, again.

2024-04-26 05:22:02.317 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:22:02.458 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:22:02.486 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:22:02.748 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:25:16.814 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:25:17.034 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:25:17.045 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:25:17.383 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:28:31.828 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:28:31.988 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:28:32.004 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:28:32.300 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:30:43.830 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:30:44.007 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:30:44.018 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:30:44.446 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:33:56.625 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:33:56.631 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:33:56.647 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:33:56.758 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:37:07.422 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:37:07.438 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:37:07.455 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:37:07.584 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:39:18.041 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:39:18.052 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:39:18.069 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:39:18.170 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:41:25.125 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:41:25.142 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:41:25.159 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:41:25.286 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:43:35.812 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:43:35.829 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:43:35.840 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:43:35.946 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:47:53.317 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:47:53.333 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:47:53.355 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:47:53.512 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:50:03.978 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:50:03.984 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:50:04.006 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:50:04.139 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:53:54.968 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:53:54.999 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:53:55.000 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:53:55.117 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:56:05.965 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:56:05.992 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:56:06.008 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:56:06.171 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 05:59:16.940 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 05:59:16.956 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 05:59:16.967 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 05:59:17.146 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:02:28.672 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:02:28.679 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:02:28.695 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:02:28.826 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:04:39.272 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:04:39.278 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:04:39.284 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:04:39.389 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:07:53.329 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:07:53.350 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:07:53.366 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:07:53.528 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:11:54.362 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Heartbeat timeout, resetting connection
2024-04-26 06:11:54.368 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:11:57.489 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:14:08.376 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:14:08.393 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:14:08.409 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:14:08.570 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:17:25.693 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:17:25.709 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:17:25.715 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:17:25.850 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:21:26.659 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Heartbeat timeout, resetting connection
2024-04-26 06:21:26.665 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:21:26.671 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error writing to socket.
2024-04-26 06:21:26.786 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:23:40.280 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:23:40.296 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:23:40.302 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:23:40.437 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:25:47.349 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:25:47.355 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:25:47.371 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:25:47.503 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:29:04.621 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:29:04.632 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:29:04.633 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:29:04.753 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:31:18.471 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:31:18.487 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:31:18.498 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:31:18.608 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:33:25.517 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:33:25.523 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:33:25.539 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:33:25.693 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
2024-04-26 06:35:36.534 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.
2024-04-26 06:35:36.551 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection
2024-04-26 06:35:36.562 INFO (Thread-11) [pychromecast.controllers] Receiver:channel_disconnected
2024-04-26 06:35:36.713 INFO (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Connection reestablished!
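
A log like the one above is easier to reason about once it is summarized. The sketch below is a standalone helper, not part of Home Assistant. It assumes the line layout shown in this excerpt (timestamp, level, thread, logger, message), counts entries per level and logger, and measures the time between "resetting connection" warnings. The `summarize` name and the three sample lines are just for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2024-04-26 05:22:02.317 ERROR (Thread-11) [pychromecast.socket_client] message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count (level, logger) pairs and collect the gaps between connection resets."""
    counts = Counter()
    reset_times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip tracebacks and anything else that is not a log header
        counts[(match["level"], match["logger"])] += 1
        if "resetting connection" in match["msg"]:
            reset_times.append(datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"))
    # Seconds between consecutive resets
    gaps = [(b - a).total_seconds() for a, b in zip(reset_times, reset_times[1:])]
    return counts, gaps


sample = [
    "2024-04-26 05:22:02.317 ERROR (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error reading from socket.",
    "2024-04-26 05:22:02.458 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection",
    "2024-04-26 05:25:17.034 WARNING (Thread-11) [pychromecast.socket_client] [TV(192.168.2.229):8009] Error communicating with socket, resetting connection",
]
counts, gaps = summarize(sample)
print(counts)
print(gaps)
```

In the excerpt above the resets recur roughly every two to four minutes, and each one reconnects within a few seconds. A summary like this makes that pattern visible without scrolling through the whole file.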

In 2024.5.x we will introduce a debug mode that can catch many non-thread-safe operations. Coupled with turning on asyncio debug mode, this will catch ~90% of threading implementation errors in integrations.

If you have an integration blocking the event loop or doing non-thread-safe operations:

  1. Install the Profiler integration (Link to Integrations: add integration, My Home Assistant)
  2. Enable the asyncio debug service as soon as possible after startup (Profiler, Home Assistant)
  3. Watch the logs for "RuntimeError: Non-thread-safe operation" and long asyncio delays
  4. Download and post the logs with the full trace

For 2024.5.x and later, Home Assistant debug mode can also be enabled in configuration.yaml:

homeassistant:
  debug: true
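
As background on the "long asyncio delays" mentioned in the steps above, here is a small sketch in plain asyncio, independent of Home Assistant and the Profiler integration. In debug mode, asyncio logs a warning whenever a single callback or task step holds the event loop longer than `loop.slow_callback_duration`. The `ListHandler` class is a helper made up for this example so that the warnings can be inspected afterwards.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collect log messages in memory so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


async def blocking_task():
    time.sleep(0.3)  # blocks the event loop; await asyncio.sleep(0.3) would not


async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # seconds; this is also asyncio's default
    await asyncio.create_task(blocking_task())


handler = ListHandler()
logging.getLogger("asyncio").addHandler(handler)
asyncio.run(main(), debug=True)  # debug=True turns on asyncio debug mode

# Debug mode reports the step that held the loop for too long.
slow = [m for m in handler.messages if "took" in m]
print(slow)
```

The warning names the task that was running, which is the kind of clue that points to a blocking integration.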

I checked the entire log from the crash on 4/26, and there was no “RuntimeError” or “thread-safe” anywhere in it (or variations thereof). I also checked my current logs (running since 4/26 with profiler active and no crashes) and came up empty there too.

So perhaps there is something else going on here?

If debug mode was not on, it won’t be looking for or blocking the non-thread-safe operations.

So I need to wait for 2024.5 and enable that debug mode?

asyncio debug mode can already be turned on with the profiler integration today by following the instructions in Improved Stability - #11 by bdraco

For Home Assistant’s debug mode, you’ll need to wait until 2024.5 is released tomorrow.

So we’re going in circles. Yes, I did that, and it is active now (my 4 day 22 MB log file) as well as in the log file from the crash mentioned above.

In that case, I think waiting until tomorrow’s release and enabling Home Assistant debug is the way to go.

All has been well for weeks, and then HA hung on me again this morning. Of course I did not have debugging turned on at the time.

From looking at the log, HA seems to have lost network connectivity entirely around 5 a.m. I restarted the container when I got up and it came up fine, and I have no evidence of anything else having network difficulty (and my HA runs in Docker on Synology DSM, I would usually get an email if the DDNS disconnects). But in addition to Chromecast errors similar to before I have a slew of timeouts, “took longer than scheduled update interval” or flat out “cannot connects” starting around that time, covering local resources (e.g. MQTT) and remote/internet ones. Very strange.

There is an updated guide available with more ways to track down problem integrations:

Thanks. I’m running with debug enabled now. We’ll see what happens…in a few weeks… :)

I did find that it seems to be failing Friday mornings between 5-6, which is when I had the DSM Hyperbackup integrity check running. Maybe it’s a coincidence as I can’t think of why that would impact anything, but I rescheduled that as well, so at least I can see if it starts failing at a different time.