I have an excellent Zwave mesh network with some 80 Zwave devices, so I doubt that range has anything to do with my issue. That said, I am no longer sure that it is a 2023.8 issue, as I rolled back to 2023.7 and still have the same problem. The only way to resolve it when it happens is to reboot my Yellow box. Next time it happens I will post the log to see if anyone can identify the cause.
not range, time to respond.
Thanks. Is there any information that I can post that would help identifying the cause ?
Likewise - I have around the same number of Z-Wave devices and a fantastic mesh with no range issues at all; in fact I even have one in my chicken coop in the backyard and one in my letterbox at the front, with no issues. However, this particular Z-Wave device is behind three walls, so range is not the issue but signal strength is, and there can be short delays with this particular one. I should improve the mesh in that area but haven’t got around to it.
I am watching this thread in earnest as well. My automations have been getting stuck for the last few weeks, and a reboot fixes it for a short while. I rolled back through all of the July releases and tried out the latest August release - none of them helped. I’m now on:
Home Assistant 2023.6.0
Supervisor 2023.07.1
Operating System 10.4
Frontend 20230607.0 - latest
And my automations are now back to running normally so far today.
I would see the “already running” message on automations and on some things like MQTT integrations and my Bond integration. It seems like some sort of core async functionality is gumming things up, and automations are the most visible symptom.
Indeed, switching my automation mode from “Single” to “Restart” helps short term - but again - everything was working on my system for the last few years. My environment is pretty static but probably on the large side - 54 Z-wave devices and 51 Zigbee devices.
I use the custom auto-entities card to see running automations:
- type: custom:auto-entities
  card:
    show_name: true
    show_icon: false
    show_state: false
    type: glance
    title: Running automations
    columns: 1
  filter:
    include:
      - domain: automation
    exclude:
      - attributes:
          current: 0
  sort:
    method: last_triggered
  show_empty: true
Interesting… I am not all that technical. How and where do I set up this auto-entities card ? In Overview, in Yaml ?
Is it possible for you to determine if the failing automations exclusively involve entities based on the Zwave integration? Or they fail regardless if the entities are based on Zwave or Zigbee?
For the automations that fail, like the simple one you posted involving turning on two switch entities, list the integrations used by the entities.
EDIT
Earlier you mentioned you had a solid Zwave network with 80 devices. Are all of your failing automations communicating exclusively with Zwave devices? Do you have any automations that don’t fail and do they communicate with Zwave devices or something else?
The goal here is to determine if the problem is limited to automations involving the Zwave integration or if it occurs for other integrations as well (and which ones).
Instructions are here. It seems to have support for GUI editing in dashboards, but that seemed broken to me. You can use YAML mode and paste the above example once you have installed the custom card.
Thanks… I try to stay away from Yaml and I don’t find the auto-entities card as an option. But no worries, this is not a priority for me. Right now, I want to know why now almost every day (or even sometimes several times a day) I need to restart HA to get Automations run properly. In fact I just created an Automation to restart every day at 4 am.
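For anyone wanting the same stopgap, a nightly restart automation could look roughly like the sketch below. It assumes the automation lives in automations.yaml; the alias is made up for illustration, and this only masks the problem rather than fixing it.

```yaml
# Hypothetical stopgap: restart Home Assistant every day at 4 am.
- alias: "Nightly Home Assistant restart"
  mode: single
  trigger:
    - platform: time
      at: "04:00:00"
  action:
    # Restarts Home Assistant Core; running automations are interrupted.
    - service: homeassistant.restart
```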
You can go to Developer Tools > States and enter current: 1 in the Attributes column to list all automations that are currently executing.
Here’s an example showing I currently have no running automations.
FWIW, if you change current: 1 to current: 2 it will list all automations that are currently running and have a second instance waiting in the queue.
In my example above I included all automations but excluded the ones with current: 0, because when you include only the ones with current: 1 you don’t get those with a count above 1. I wouldn’t be surprised if the “hanging” ones racked up a high current value.
You can paste this into the template editor and it’ll show you what’s running
running:
{%- for a in states.automation | selectattr('attributes.current', 'defined') | selectattr('attributes.current', '>', 0) | sort(attribute='attributes.current', reverse=True) %}
{{ a.entity_id }}: {{ a.attributes.current }}
{%- endfor %}
I can confirm that most of my automations with mode: restart hang.
By using that template you provided, I can see that:
automation.some_automation: 3
Not sure if that means 3 of them are running. Which anyhow should not happen as it is mode: restart
The reason for setting the automation to restart is that it acts as a “debounce”. The trigger might fire 1-3 times depending on circumstances that are hard to predict. My automation basically sleeps for 2 seconds, then does what it is supposed to do ONCE, even if it was triggered, say, 3 times within a second.
Now it only hangs and nothing in the logs.
The trace says “still running”, but there is nothing in the automation that waits for 5+ minutes. It does something local that takes 1 second to complete (after waiting 2 seconds before actually doing it).
I mean, it has not even “left” the trigger step in the trace.
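For readers following along, the debounce pattern described above might look something like this. It is only a sketch of the idea, not the poster's actual automation: the alias and entity IDs are hypothetical, and a state trigger is assumed.

```yaml
# Sketch only: entity IDs are hypothetical, not taken from the poster's setup.
- alias: "Debounced action"
  mode: restart              # a new trigger stops the current run and starts over
  trigger:
    - platform: state
      entity_id: binary_sensor.example_sensor
      to: "on"
  action:
    - delay:
        seconds: 2           # begins again each time the automation is re-triggered
    - service: switch.turn_on
      target:
        entity_id: switch.example_switch
```

With this shape, a burst of triggers within the 2 seconds should lead to a single switch.turn_on call, which is why a current value of 3 looks wrong for it.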
Can you share this automation (the yaml) in this issue please
Also, make sure you include information on the hardware/integration being used in the automation
That’s a good description of the problem you’re experiencing. Yes, the 3 means that there are three instances of the automation, one is executing and the other two are queued for execution. Like you said, that seems unusual given that the automation’s mode is restart (i.e. it’s not queued or parallel).
Regarding your statement:
“most of my automations with mode: restart hang”
For the ones that hang, what is the integration (or integrations) of the entities that the failed automations are communicating with? There have been reports that this problem is occurring for entities based on the Zwave integration. However, it’s unclear if that’s the only integration that has exposed the problem or there are others.
If you want it to prevent multiple consecutive operations, restart is not the most logical choice. Why not set it to single, react immediately, wait a few seconds to block others? Because with restart HA stops the running automation to start a new one. But if the action is already in the works and taking some time then multiple executions might pile up anyway. Could it be the actual workload takes a long time, and the following triggers are waiting for the workload to finish?
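A minimal sketch of that single-mode alternative, assuming the same kind of state trigger; the alias and entity IDs are hypothetical. The action runs immediately, and the trailing delay keeps the run alive so that further triggers are ignored.

```yaml
# Sketch only: entity IDs are hypothetical.
- alias: "React once, then block repeats"
  mode: single               # triggers are ignored while a run is in progress
  max_exceeded: silent       # do not log a warning for the ignored triggers
  trigger:
    - platform: state
      entity_id: binary_sensor.example_sensor
      to: "on"
  action:
    - service: switch.turn_on
      target:
        entity_id: switch.example_switch
    - delay:
        seconds: 2           # blocking window after the action has run
```

The trade-off compared with restart is that the first trigger wins instead of the last one.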
And if the point is that repeated motion should delay the light turning off: IMO you should not schedule the light-off based on motion starting, but on motion ending. Prolonged motion should also not turn off the light. Instead of waiting inside an automation, you could consider using a timer. That survives HA reboots, and running automations do not. Renewed motion could then reset the timer. I have almost no automations that use a delay - they all finish in a very short time.
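The timer approach could be sketched roughly as follows, assuming a YAML-defined timer helper. All names here (timer.example_light_off, binary_sensor.example_motion, light.example_light) and the two-minute duration are hypothetical. As far as I know, the timer is only carried across a restart when restore is enabled.

```yaml
# configuration.yaml sketch; every entity name is hypothetical.
timer:
  example_light_off:
    duration: "00:02:00"
    restore: true            # restore an active timer after a restart

automation:
  - alias: "Light on when motion starts"
    trigger:
      - platform: state
        entity_id: binary_sensor.example_motion
        to: "on"
    action:
      # Renewed motion cancels any pending switch-off.
      - service: timer.cancel
        target:
          entity_id: timer.example_light_off
      - service: light.turn_on
        target:
          entity_id: light.example_light

  - alias: "Start off-timer when motion ends"
    trigger:
      - platform: state
        entity_id: binary_sensor.example_motion
        to: "off"
    action:
      - service: timer.start
        target:
          entity_id: timer.example_light_off

  - alias: "Light off when timer finishes"
    trigger:
      - platform: event
        event_type: timer.finished
        event_data:
          entity_id: timer.example_light_off
    action:
      - service: light.turn_off
        target:
          entity_id: light.example_light
```

None of these automations contains a delay, so each run should finish almost immediately and there is nothing left "running" to pile up.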
If it is behaving that way then it isn’t complying with restart mode’s documented behavior.
From Script Modes:
Start a new run after first stopping previous run.
But what if e.g. turning off a device takes a long time due to timeouts communicating? It does not promise to abort singular actions, it will probably wait for that to end and then stop the automation. I would expect it to stop a delay, but not to halfway abort a light_on service call.