hi all, apologies for potentially hijacking this thread, but for people experiencing this problem with automations involving zwave devices, I’d like to look into whether or not the Z-Wave JS integration or driver is somehow contributing to this problem.
I’ve already reviewed the Z-Wave JS PRs introduced in 2023.7 and I don’t think this is newly introduced behavior, but rather the automation changes introduced in 2023.7 may have exposed an existing issue with zwave-js that was previously hidden from users (and us devs) because HA would stop waiting for the service call to complete after 10 seconds (it no longer does this)
If you’d like to help, please provide the following:
Automation YAML definition
Automation trace ideally, but if that’s not possible because the automation never finishes, an indication of what step in the definition the automation run is hanging on
Debug level zwave_js integration logs
Debug level zwave-js-server-python library logs
Debug level zwave-js driver logs (this is the addon logs for Z-Wave addon users, the Docker container logs for zwave-js-server or zwave-js-ui for bare Docker users, or zwave-js-server logs for the people running the server on the command line)
While I realize there isn’t much information here, this section of the docs may help you in obtaining the driver logs: Z-Wave - Home Assistant
For the integration and library logs, you can update your HA configuration, or use the services listed here: Logger - Home Assistant
For any additional help in obtaining the logs, please ask in the Discord #zwave channel
If you can’t publish this info here, you can open a GitHub issue, or you can DM me on discord (same username). Thanks!
its not just zwave, after the upgrade to any 2023.8 version various automations stop working, I have one blueprint for a 3 band opple switch as mentioned previously, it does not execute the act for zigbee lights, blinds or reolink camera actions, or hue integration actions (apart from turn off on hue stuff), but works flawlessly with soma tilt unit still, but not a soma roller (on tilt everything works, on roller only close works)…
Something is more deep afoot hears as it seems to be a issue with the automations rather than a specific integration.
changing the mode of a automation or adding any code is not going to fix it, something is broken, putting a band aid on it will just mean more problems down the line, people need to post on git to get it fixed.
what’s going on is that there was a 10 second timeout on service calls that no longer exists. This has exposed the fact that some services that would hang indefinitely. Each of these integrations need to be looked at one by one
So hue, mqtt, reolink all need to be fixed? (especially weird as hue will turn off the lights it controls), but not on, it seems unlikely (but i am no expert) because all the items broke in these automations work fine in 2023.7 and before and not delayed they are instant. even without this time out, surely they should all still work as they did previously? My issue here, is that its only effecting maybe 10% or more of users, who will have the savy enough to add the bandaid fix to their automations.
From webcoding i know bandaiding something now will only lead to further problems down the road.
Listen, you’re more than welcome to question bugs, but at this point we’ve narrowed down the issue to exactly what Raman is talking about. You’re talking with a lead Zwave Dev who has been talking with the people who made the change in 2023.7. We are 99% sure what raman described is the cause because we can replicate it.
Here’s my interpretation of the issue (and it may be an oversimplification):
In previous versions, if an integration failed to acknowledge a command, like a service call in an automation, the automation would give up waiting for the integration’s reply after ten seconds.
The advantage of having a timeout is that it avoids waiting indefinitely for a reply that may never be received.
The disadvantage of the timeout is that it masks potential problems in the integration. By simply giving up and moving on to execute the next action (if any) no one is aware that something in your system failed to work normally.
By eliminating the timeout, those failures are now readily apparent (because the automation waits indefinitely for the reply). The focus now is on correcting the integrations that, on occasion, fail to reply promptly.
as I said I am no expert in this, but Taras line probably says it best, integrations as in multiple need fixing, including hue, reolink, mqtt… weird thing is though the someone (which I know its not a official but a hacs) for the tilt blind all buttons will still work, but for roller, only down will work (pretty much like hue automation in that only off will work…
Of course I did band aid my automations to single to temp fix. but I am just suggesting with so many integrations not reponding with various errors and some off commands work but not on… could their be another issue some place else…
Can you link to the posts in the community forum, or Issues in GitHub, reporting problems with the integrations you mentioned? The majority that I have seen (forum and GitHub) are for Zwave and ZigBee.
Here’s one data point: I haven’t experienced any problems with Hue or MQTT in 2023.7.3 or 2023.8.3.
That’s interesting. It’s my understanding that the 10-second timeout was removed in 2023.7.0. Yet you’re not experiencing the ‘stuck automation’ problem in that version and neither is Tinkerer. Perhaps I’m mistaken and the timeout was only removed in 2023.8.0.
Alternatively, the timeout was indeed removed from 2023.7.0 and the fact Z2M misbehaves only in 2023.8.X perhaps serves as another clue for the development team.
Just to clarify, I’m using devices communicating via MQTT over Wifi/Ethernet (in 2023.8.3 without the ‘stuck’ automation issue). In contrast, Z2M does ultimately rely on Zigbee to communicate with the physical device; a delay in acknowledging a ZigBee command will be experienced by Home Assistant regardless if it communicates directly with the device via ZigBee or indirectly via MQTT<->ZigBee.
2023.7 was fine all releases this only occoured for me on 2023.8 release, @Taras you can band aid it by going into those blueprints via studio code server and change them to single instead of restart.
As for comments on the others I still find it strange, my first thought when booting up the first release that it was mqtt related as the switches still worked, and showed they were responding.
but then the hue bulbs turned off but not back on, None of the items tied via mqtt worked,
but then reolink spotlights stop responding… So I thought maybe the switches are just being funny, but know they are all showing actions when pressed, then to my suprise the sometilt blind in office worked fine, open and close via the switch but the soma roller in the kitchen would only close.
Before reverting back to 2023.7 the first time I double checked I could call the services via dev and that was fine, I then went and changed all automations and blue prints that did use restart to single as a band aid fix.
There might be nuances with the “restart” mode, as I only used “single” for all my automation. I’ve moved everything to “restart” from the madness and haven’t seen it hang yet. Just following along here to see if there are similar fixes to the situation.
The first users to report the ‘stuck’ automation problem employed mode: single. In fact, this makes sense because a ‘stuck’ single-mode automation will trigger once and then never again (because only one execution instance is permitted). Others have reported it also occurs with mode: restart because the ‘stuck’ automation refuses to abort the first execution instance in order to start over.
In other words, neither of the two modes is a guaranteed ‘band aid’ nor, for that matter, is using continue_on_error: true (which was reported to avoid the problem for some users but not others).
tl;dr
There’s no single, reliable workaround for all integrations experiencing the problem. It will be solved only when the development team mitigates it within the affected integrations.
In case you’re wondering where I am getting the data for my observations, I have read all reports about the problem posted in the community forum and in GitHub Core (and commented in most of them in an attempt to characterize it).
I tried single/restart modes and even added continue_on_error: true … it did help a little, but i still have automatons getting stuck …
funny enough one gets stuck on checking if the door is open, in theory that should not have an integration call as it should just check the current status (battery operated zwave sensor)…
but it does get stuck there … or at least the UI shows it like it does.
Have you received the data you need here? If not I can try to gather it tomorrow. I have a few automations that keep failing. The issue is they run every x minutes as a single mode, so by the time I catch them usually the traces are no longer available.
I have one now that won’t stop running even after I changed it and saved it
I follow this with interest. I was among the early ones reporting this issue after upgrading to 2023.8 . However for me the issue disappeared after a few days and the only change I consciously made is to run a HA restart at 4 am. I am not sure that this makes a difference and I may have done something else that I don’t remember. Among my well over 100 Automations there is only one or two where I changed single to restart.
Every time you restart, all parts of Home Assistant are reset. That means any automation that may have become ‘stuck’ is reset.
It’s the equivalent of regularly rebooting a slow computer as a means of fixing a memory leak. In other words, it doesn’t cure the actual problem, it only suppresses the symptoms.