2023.8: Update breaks automations

Sorry, I hadn’t seen that :wink:

Restarting Home Assistant serves to cancel all in-progress automations and reload all of them. That explains why the problem seems to disappear after a restart.

The current theory is that the service call is waiting for a reply (from the entity’s integration) that is never received. Instead of timing out, it waits forever. So there appear to be two problems requiring investigation:

  1. There’s no reply.
  2. There’s no timeout.

Thanks. This may be too complicated for me, and I can’t find Lovelace anywhere.

You can still use the simple technique I mentioned 12 days ago.

Thanks !!!

To the people experiencing the problem, try the following experiment to see if it helps to avoid the problem. Add continue_on_error: true to each service call that causes the automation to wait forever (i.e. get ‘stuck’).

Example

- service: light.turn_on
  continue_on_error: true
  target:
    entity_id: light.kitchen
  data:
    brightness_pct: 80

Reference

Continuing on error

Yes, I’ve been doing that for a few days on a few sensitive automations, but it’s hard to get everything back. I hope a real solution will be found soon…

Are you saying that adding continue_on_error: true to a service call fails to prevent it from waiting forever?

It works, but doing it on all automations is very time-consuming! Especially since I have automations where the devices called do not cause errors…

Glad to hear continue_on_error: true works.

Simply add it to the service calls that experience the problem. If you feel that’s too much work, you’ll have to live with automation failures until the development team identifies the cause and corrects it.
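
As a minimal sketch of what "only the affected calls" could look like, the automation below marks just one action. The alias, trigger, and the entities binary_sensor.kitchen_motion and switch.kitchen_fan are hypothetical placeholders, and which call hangs is an assumption made purely for illustration:

```yaml
# Hypothetical automation: assume only the light call has been seen to hang,
# so only that action gets continue_on_error.
- alias: "Kitchen motion lights (example)"
  trigger:
    - platform: state
      entity_id: binary_sensor.kitchen_motion  # placeholder entity
      to: "on"
  action:
    # The call that gets stuck: marked so the remaining actions can still run.
    - service: light.turn_on
      continue_on_error: true
      target:
        entity_id: light.kitchen
      data:
        brightness_pct: 80
    # A call that has never hung is left unchanged.
    - service: switch.turn_on
      target:
        entity_id: switch.kitchen_fan  # placeholder entity
```

Automations whose service calls have never stalled need no changes at all.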

That was a smart idea.

Will you add this to the GitHub issue?

I got the idea from what allenporter wrote here:

I am reading some of the changes and it seems like before what would happen is the service would wait for a timeout then proceed anyway even if the call timed out. I think what we should do instead is timeout explicitly, and fail, and allow use of continue_on_error to continue anyway.

It appears that continue_on_error: true can already be used to abort waiting (endlessly) for a reply to the service call.

To be clear, I consider this to be a workaround because people are reporting problems for automations that have worked properly in previous versions. Something yet to be identified is now causing endless waiting.

Hi all, apologies for potentially hijacking this thread, but for people experiencing this problem with automations involving Z-Wave devices, I’d like to look into whether the Z-Wave JS integration or driver is somehow contributing to this problem.

I’ve already reviewed the Z-Wave JS PRs introduced in 2023.7, and I don’t think this is newly introduced behavior. Rather, the automation changes introduced in 2023.7 may have exposed an existing issue with zwave-js that was previously hidden from users (and us devs) because HA would stop waiting for the service call to complete after 10 seconds (it no longer does this).

If you’d like to help, please provide the following:

  • Automation YAML definition
  • Automation trace ideally, but if that’s not possible because the automation never finishes, an indication of what step in the definition the automation run is hanging on
  • Debug level zwave_js integration logs
  • Debug level zwave-js-server-python library logs
  • Debug level zwave-js driver logs (this is the addon logs for Z-Wave addon users, the Docker container logs for zwave-js-server or zwave-js-ui for bare Docker users, or zwave-js-server logs for the people running the server on the command line)

While I realize there isn’t much information here, this section of the docs may help you in obtaining the driver logs: Z-Wave - Home Assistant

For the integration and library logs, you can update your HA configuration, or use the services listed here: Logger - Home Assistant
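
A minimal sketch of the configuration route is shown here. The logger names are assumed to be the ones used by the zwave_js integration and the zwave-js-server-python library, so verify them against the linked docs before relying on them:

```yaml
# configuration.yaml
# Assumed logger names; check the Z-Wave and Logger docs linked above.
logger:
  default: info
  logs:
    homeassistant.components.zwave_js: debug  # zwave_js integration
    zwave_js_server: debug                    # zwave-js-server-python library
```

The logger.set_level service takes the same logger-name-to-level pairs as its data, which avoids a restart while you reproduce the problem.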

For any additional help in obtaining the logs, please ask in the Discord #zwave channel.

If you can’t publish this info here, you can open a GitHub issue, or you can DM me on Discord (same username). Thanks!

It’s not just Z-Wave. After the upgrade to any 2023.8 version, various automations stop working. I have one blueprint for a 3-band Opple switch, as mentioned previously. It does not execute the actions for Zigbee lights, blinds, Reolink camera actions, or Hue integration actions (apart from turning off Hue devices), but it still works flawlessly with a Soma Tilt unit, though not with a Soma roller (on the Tilt everything works; on the roller only close works)…

Something deeper is afoot here, as it seems to be an issue with the automations rather than a specific integration.

Changing the mode of an automation or adding any code is not going to fix it. Something is broken, and putting a band-aid on it will just mean more problems down the line. People need to post on GitHub to get it fixed.

What’s going on is that there was a 10-second timeout on service calls that no longer exists. This has exposed the fact that some services would hang indefinitely. Each of these integrations needs to be looked at one by one.

So Hue, MQTT, and Reolink all need to be fixed? (That’s especially weird, as Hue will turn off the lights it controls, but not turn them on.) It seems unlikely (but I am no expert), because all the items that broke in these automations work fine in 2023.7 and earlier, and they are instant, not delayed. Even without this timeout, surely they should all still work as they did previously? My issue here is that it’s only affecting maybe 10% or more of users, who will have to be savvy enough to add the band-aid fix to their automations.

From web coding, I know that band-aiding something now will only lead to further problems down the road.

Listen, you’re more than welcome to question bugs, but at this point we’ve narrowed down the issue to exactly what Raman is talking about. You’re talking with a lead Z-Wave dev who has been talking with the people who made the change in 2023.7. We are 99% sure what Raman described is the cause because we can replicate it.

Here’s my interpretation of the issue (and it may be an oversimplification):

In previous versions, if an integration failed to acknowledge a command, like a service call in an automation, the automation would give up waiting for the integration’s reply after ten seconds.

  • The advantage of having a timeout is that it avoids waiting indefinitely for a reply that may never be received.

  • The disadvantage of the timeout is that it masks potential problems in the integration. By simply giving up and moving on to execute the next action (if any) no one is aware that something in your system failed to work normally.

By eliminating the timeout, those failures are now readily apparent (because the automation waits indefinitely for the reply). The focus now is on correcting the integrations that, on occasion, fail to reply promptly.

As I said, I am no expert in this, but Taras’s line probably says it best: integrations, as in multiple, need fixing, including Hue, Reolink, MQTT… The weird thing, though, is that with the Soma one (which I know is not official but a HACS integration), all buttons still work for the tilt blind, but for the roller only down works (pretty much like the Hue automation, in that only off works)…

Of course, I did band-aid my automations by setting them to single mode as a temporary fix. But I am just suggesting that, with so many integrations not responding, with various errors, and with some off commands working but not on… could there be another issue somewhere else?

Again, I could be 100% wrong.

Can you link to the posts in the community forum, or issues in GitHub, reporting problems with the integrations you mentioned? The majority that I have seen (forum and GitHub) are for Z-Wave and Zigbee.

Here’s one data point: I haven’t experienced any problems with Hue or MQTT in 2023.7.3 or 2023.8.3.