2023.8: Update breaks automations

Interesting… I am not all that technical. How and where do I set up this auto-entities card? In Overview, in YAML?

Is it possible for you to determine whether the failing automations exclusively involve entities based on the Zwave integration? Or do they fail regardless of whether the entities are based on Zwave or Zigbee?


For the automations that fail, like the simple one you posted involving turning on two switch entities, list the integrations used by the entities.


EDIT

Earlier you mentioned you had a solid Zwave network with 80 devices. Are all of your failing automations communicating exclusively with Zwave devices? Do you have any automations that don’t fail and do they communicate with Zwave devices or something else?

The goal here is to determine if the problem is limited to automations involving the Zwave integration or if it occurs for other integrations as well (and which ones).

Instructions are here. It seems to have support for GUI editing in dashboards, but that seemed broken to me; you can use YAML mode and paste the above example once you have installed the custom card.

Thanks… I try to stay away from YAML, and I don't find the auto-entities card as an option. But no worries, this is not a priority for me. Right now, I want to know why, almost every day now (sometimes even several times a day), I need to restart HA to get automations running properly. In fact, I just created an automation to restart every day at 4 am.

You can go to Developer Tools > States, enter current: 1 in the Attributes column to list all automations that are currently executing.

Here’s an example showing I currently have no running automations.

FWIW, if you change current: 1 to current: 2 it will list all automations that are currently running and have a second instance waiting in the queue.


In my example above I included all automations but excluded the ones with current: 0, because if you filter only for current: 1 you miss those with a count above 1. I wouldn't be surprised if the "hanging" ones racked up a high current value.

You can paste this into the template editor and it will show you what's running:


running:
{%- for a in states.automation | selectattr('attributes.current', 'defined') | selectattr('attributes.current', '>', 0) | sort(attribute='attributes.current', reverse=True) %}
  {{ a.entity_id }}: {{ a.attributes.current }}
{%- endfor %}

I can confirm that most of my automations with mode: restart hang.

By using that template you provided, I can see that:

automation.some_automation: 3

Not sure if that means three instances of it are running, which in any case should not happen, since it is mode: restart.

The reason for making the automation restart is that it acts as a "debounce". The trigger might fire 1-3 times depending on circumstances that are hard to predict. My automation basically sleeps for 2 seconds, then does what it is supposed to do ONCE, even if it was triggered, say, 3 times within a second.

Now it only hangs and nothing in the logs.

The trace says "still running", but there is nothing in the automation that waits for 5+ minutes. It is doing something local that takes 1 second to complete (plus the 2 seconds of waiting before actually doing it).


I mean, it has not even “left” the trigger step in the trace.

Can you share this automation (the YAML) in this issue, please?

Also, make sure you include information on the hardware/integration being used in the automation

That's a good description of the problem you're experiencing. Yes, the 3 means that there are three instances of the automation: one is executing and the other two are queued for execution. Like you said, that seems unusual given that the automation's mode is restart (i.e. it's not queued or parallel).

Regarding your statement:

most of my automations with mode: restart hangs

For the ones that hang, what is the integration (or integrations) of the entities that the failed automations are communicating with? There have been reports that this problem is occurring for entities based on the Zwave integration. However, it’s unclear if that’s the only integration that has exposed the problem or there are others.

If you want to prevent multiple consecutive operations, restart is not the most logical choice. Why not set it to single: react immediately, then wait a few seconds to block further runs? With restart, HA stops the running automation to start a new one, but if the action is already in progress and taking some time, multiple executions might pile up anyway. Could it be that the actual workload takes a long time, and the subsequent triggers are waiting for it to finish?

And if it is because repeated motion should delay the light turning off: in my opinion you should not schedule the light-off based on motion starting, but on motion ending. Prolonged motion should also not turn off the light. And instead of waiting inside an automation, you could consider using a timer. A timer survives HA restarts; running automations do not. Renewed motion could then reset the timer. I have almost no automations that use a delay; they all finish in a very short time.
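As a rough sketch of that timer approach (all entity names here are hypothetical, not taken from this thread): motion clearing (re)starts a timer, and the light only turns off when the timer finishes.

```yaml
timer:
  hallway_light_off:
    duration: "00:02:00"

automation:
  - alias: Restart off-timer when motion clears
    trigger:
      - platform: state
        entity_id: binary_sensor.hallway_motion  # hypothetical motion sensor
        to: "off"
    action:
      # timer.start on an already-active timer restarts it from the full duration
      - service: timer.start
        target:
          entity_id: timer.hallway_light_off
  - alias: Turn light off when timer finishes
    trigger:
      - platform: event
        event_type: timer.finished
        event_data:
          entity_id: timer.hallway_light_off
    action:
      - service: light.turn_off
        target:
          entity_id: light.hallway  # hypothetical light
```

Because the timer state is restored after a restart, the light still turns off even if HA reboots mid-countdown, which a delay inside an automation cannot guarantee.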


If it is behaving that way then it isn’t complying with restart mode’s documented behavior.

From Script Modes:

Start a new run after first stopping previous run.


But what if, for example, turning off a device takes a long time due to communication timeouts? The documentation does not promise to abort individual actions; it will probably wait for those to end and then stop the automation. I would expect it to cancel a delay, but not to abort a light.turn_on service call halfway through.

Good question.

To comply with the documentation, the automation should not wait for anything and simply abort the automation’s execution and start over (restart). For example, an in-progress delay is not allowed to complete; it’s cancelled immediately.

If aborting the automation requires waiting for an in-progress service call to complete (possibly new behavior since 2023.7.0) then that’s a new wrinkle that’s not covered by the documented behavior. Effectively, it’s making restart behave more like queued where “start a new run after all previous runs complete”.

Background: I have a multi-gang physical button. It has "high", "medium", "low", and "off" settings for the lights in that room. Sometimes the button sends the same command repeatedly (1-3 times, within a few milliseconds) depending on how firmly you press it.

The action of setting the scene is simply a scene.turn_on, but that scene (a Zigbee group scene with a default transition of 1 s) will fade in over one second. If one of those repeated commands is sent during the transition, the lights will flicker. Hence the need to debounce. Also, for some unknown Zigbee network reason, I sometimes need to repeat the scene.turn_on to make it actually happen, so I have to wait a bit over one second before re-applying it.

So, my automation is basically:

  • delay: 100ms
  • scene.turn_on: scene.something
  • delay: 2000ms
  • scene.turn_on: scene.something
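Put together as YAML, the automation described above might look roughly like this; the trigger is a hypothetical placeholder (the real one depends on the button integration), and scene.something is the poster's own placeholder.

```yaml
alias: Scene button debounce
mode: restart
trigger:
  - platform: state
    entity_id: sensor.scene_button  # hypothetical button entity
    to: "high"
action:
  # short debounce window: repeated presses restart the automation here
  - delay:
      milliseconds: 100
  - service: scene.turn_on
    target:
      entity_id: scene.something
  - delay:
      seconds: 2
  # re-apply the scene in case a light missed the first command
  - service: scene.turn_on
    target:
      entity_id: scene.something
```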

But if I change my mind, e.g. press the "high" button by mistake and click the "low" button 0.5 seconds later, I actually want to abort during that second delay. Exactly as mentioned above.

So, with mode: restart on the above automation, the 1-3 repeated commands from the button are consolidated into one, and it fires 100 ms after the press, which is acceptable. Then, if I do nothing more, the same scene is applied once more to make sure all lights received it (sometimes not all lights get the message otherwise). But if I do change my mind, I don't want to wait 2 s before the button can be pressed again; the restart aborts the current execution and starts the cycle over with the new scene setting.

Corner case, perhaps. But anyhow that is how I ended up where I ended up. It has been working fine for over a year.

The actual communication time should be essentially immediate; the scenes go through zigbee2mqtt, so the only thing you need to wait for is the MQTT publish with QoS 0, which typically takes single-digit milliseconds. You don't actually have to wait for the radio, because the broker sits in between.

This is exactly my point, too. If I have mode: restart, then having 3 concurrent executions as shown above:

Then something is seriously wrong.

It should NOT pile up. Every attempt to pile up should abort any current execution and start from the beginning, so any state other than 0 or 1 running instances is invalid. With mode: restart there is no queue to pile things up in.

If I change to mode: queued the problem goes away instantly, and the automation never locks up. If I change back, I need to restart Home Assistant after the first press of that button.

The way you implemented it now, the debounce delay is noticeable, because as I understand it you only execute the real action after the delay, even though it is short. I think that is annoying, and it invites people to press again or press longer because they think it is not responding when it actually is. So the implementation aggravates the need for the debounce, and may also invite even longer delays.

If you create different automations per type of action, they won't need to abort one another. So if you make one for each type of action, make them all mode: single, and put the delay at the end, you won't need to abort or retry. Each action will respond right away, but repeated identical events won't trigger again for the duration of the delay.
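A minimal sketch of that suggestion for one of the button actions, assuming hypothetical button and scene entities: the scene is applied immediately, and the trailing delay keeps the single-mode automation "busy" so repeated triggers within that window are ignored.

```yaml
alias: Scene high (debounced)
mode: single
max_exceeded: silent  # drop repeat triggers quietly instead of logging a warning
trigger:
  - platform: state
    entity_id: sensor.scene_button  # hypothetical button entity
    to: "high"
action:
  # act immediately, then hold the run open so repeats are ignored
  - service: scene.turn_on
    target:
      entity_id: scene.living_room_high  # hypothetical scene
  - delay:
      seconds: 2
```

One such automation per button setting ("high", "medium", "low", "off") means pressing a different button never has to abort anything; only repeats of the same press are suppressed.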

If you read my code, the delay is 0.1 seconds. The repeat then comes 2 s afterwards and is not noticed at all, unless of course a light did not get it the first time, in which case you do not need to press again.

Yes, I do, since I send repeated commands. If I have two parallel runners, one trying to set the scene to "high" every 2 s and another trying to set it to "low" every 2 s, they will blink the lights. In my solution, the other one gets aborted.

And by the way, if zigbee works as it should, then it would retry itself if communication failed. You may have zigbee problems.

I know, and it does. It is actually the lights misbehaving. Re-applying the scenes is a workaround, I admit, but I have not found a better solution.

Anyhow, despite that, the automation should not hang, there should never be more than one concurrent execution of a mode: restart automation, and others have this problem too.