2023.8: Update breaks automations

FWIW, the use of automation.trigger isn’t a best practice in general, let alone using it in an automation to call itself. More often than not, an automation’s design is improved if it avoids using automation.trigger.
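For readers following along, the kind of self-triggering pattern under discussion looks roughly like the sketch below. This is a hypothetical illustration, not the actual automation from this thread: the entity names, the alias, and the five-minute delay are all made up.

# Hypothetical sketch of an automation that re-triggers itself via automation.trigger
alias: "Motion light (self-retrigger sketch)"
description: "Illustration only; entity names are placeholders"
mode: restart
trigger:
  - platform: state
    entity_id: binary_sensor.example_motion
    to: "on"
condition: []
action:
  - service: light.turn_on
    target:
      entity_id: light.example_lamp
  - delay: "00:05:00"
  - choose:
      - conditions:
          - condition: state
            entity_id: binary_sensor.example_motion
            state: "on"
        sequence:
          # keep-alive: while motion persists, the automation re-triggers itself
          - service: automation.trigger
            target:
              entity_id: automation.motion_light_self_retrigger_sketch
    default:
      # terminating branch: motion has stopped, so the light ends up off
      - service: light.turn_off
        target:
          entity_id: light.example_lamp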

Appreciate your input. I wasn’t aware that using automation.trigger was frowned upon. However, these automations have worked very reliably for literally years. This way I could make sure the end result of an automation was to shut off some light or appliance.

So, I guess my point is that having such long-standing automations fail after a core update for no apparent reason (after all, some of the self-triggering automations lived on happily, so something else might be at fault) is a little unfortunate, and somebody ought to look into it.

As for your offer to go through one of my more complex automations, it is appreciated. But please give me some time; I’d like to sanitize it first so you can make heads or tails of the entities involved.

I think one of you should open an issue on github with these findings. It’s possible that the 2023.7 service changes are causing unintended blocks during service calls.


Actually, for me it was even longer (RPi4) - seemingly endless, with me staring at the screen with sweaty hands. Anyway, up until now everything has been in ship-shape condition again running 2023.7.3.

@EndUser, since you experienced the same situation, would you be in a position to report this whole debacle on Home Assistant’s GitHub Core repository? I do not have a GitHub account myself.

I just did.

I believe I characterized it as not being a “best practice”. It may get the job done but, like a “goto” statement, it tends to encourage a suboptimal design.


Home Assistant regularly receives performance optimizations. I recall that a few versions ago one of them changed the sort order of the expand function. The change wasn’t prominently mentioned in the Release Notes and it didn’t affect most templates … unless the template depended on the original sort order.
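As an illustration of how such a dependency can hide in a template (a made-up example, not the actual change from that release): the template sensor below silently relies on the order in which expand() returns the group members, so a change to that order alters its value without producing any error.

# Hypothetical template sensor; the group name is an assumption.
template:
  - sensor:
      - name: "First example sensor state"
        # Depends on expand() returning entities in a particular order:
        # if the sort order changes, "first" picks a different entity.
        state: "{{ expand('group.example_sensors') | map(attribute='state') | list | first }}"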

Speculation: Perhaps a similar optimization was made in 2023.8.0 and it doesn’t negatively impact automations … except in the rare case where the automation executes itself.

Is it this one?

Because if it is, it’s missing 80% of the information that is normally expected from an Issue (like an example of a failing automation, its trace, log messages, etc.). The problem must be documented in the Issue, not elsewhere (such as this topic).

I don’t think so. The two automations that stopped working (there may have been others, but I did a Restore without checking other automations) were simple, like: if A is ON then B, C and D are ON, and also if A is OFF then B, C and D are OFF.


Post them in YAML format (instead of pseudo code).

alias: "#15-11 Bath All ON"
description: ""
trigger:
  - platform: state
    entity_id:
      - switch.15_1_bath_front
    to: "on"
condition: []
action:
  - service: switch.turn_on
    data: {}
    target:
      entity_id:
        - switch.15_2_bath_mini
        - switch.15_3_bath_back
mode: single

You might very well have found a bug, but to me it makes sense (at least logically) that an automation triggering itself with mode restart can cause trouble: at the point where you retrigger the automation, you’re not actually stopping any subsequent actions (there are none), only the currently executing call, and that is where the bug might be coming in. If there’s a timing issue where the currently running code needs to interrupt itself in order to run itself again, something may very well get stuck. It may even be impossible to solve for the specific combination of mode restart and retriggering the running automation from within itself.

That said, I agree with Taras that the pattern you’re following is not ideal (even if you say it’s been working, which of course isn’t great for you). My simplest suggestion would be to first change the mode to queued. Let currently running instances complete before the next starts.
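As a minimal sketch, only the mode section of the automation would change; the max value below is an assumption, and everything else stays as it is:

# Change only the mode; 'max' caps how many runs may queue up (assumed value)
mode: queued
max: 10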

If you look carefully at what you’re really doing here, you have a while loop that turns that switch on repeatedly, every 5 minutes while the motion sensor is on, with the sensor’s off state as the terminating condition. You can just write this as an actual repeat in the automation which terminates when the motion sensor is no longer on (see the sketch below); it will be a lot less opaque. That said, since you have this specific automation as an issue, but also other ones, tackle them one by one. I’ve read and reread your original code several times, and I think for this specific case Taras’ version really does solve it in the simplest way. If you still have flaky sensors, one could also employ a wait_for_trigger (it seems like that originally was an issue but maybe not anymore).
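Here is a hedged sketch of that repeat pattern, with placeholder entity names and timings. This is not Taras’ version, just an illustration of the loop described above.

alias: "Motion light (repeat sketch)"
description: "Hypothetical rewrite of the self-triggering loop as a repeat"
mode: single
trigger:
  - platform: state
    entity_id: binary_sensor.example_motion
    to: "on"
condition: []
action:
  - repeat:
      while:
        # terminating condition: stop looping once motion is no longer reported
        - condition: state
          entity_id: binary_sensor.example_motion
          state: "on"
      sequence:
        - service: switch.turn_on
          target:
            entity_id: switch.example_light
        - delay: "00:05:00"
  - service: switch.turn_off
    target:
      entity_id: switch.example_light

If the sensor drops out briefly, the fixed delay could be replaced by a wait_for_trigger with a timeout, but that goes beyond this sketch.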

My suggestion would be: Implement his version and then post the next problematic automation to be looked at. There might well be another pattern to spot when we see more automations.

Thanks for adding to this topic. However, allow me to comment.

Well, it really shouldn’t. After all, clearing out one job in order to restart it is a well-established method in coding. It seems like it is the clearing-out itself which is at fault: triggering an automation several times (no matter what the automation may look like) leads to it getting stuck. This is demonstrated by the several completely different automations which have been posted by the other participants in this discussion.

Again, that would be a different use case. I need to interrupt proceedings with my other automations. In any case, this may not be related to our present problem at all.

Could you perhaps add a link to this discussion to the GitHub issue?

How is it a different use case? You’re doing nothing meaningful in the automation after the point of retriggering it. You should try it at least as a test, as it can confirm whether it’s the restart behaviour causing it to hang.

The problem isn’t clearing a job. The problem is the (potential) self-reference. The previous run of the code may not be able to complete, or if it’s in a thread, that thread could become orphaned if you start a new execution. It’s like terminating a thread within itself when you don’t allow it to terminate normally. You’ve implied that you can code, so you should then understand this.

You said that is an example of a simple automation that “stopped working”.

What do you mean by “stopped working”?

  • Did it produce any traces?
  • Are there error messages in the Log for this automation?
  • Does it no longer trigger?
  • Does it trigger but fail to turn on the two switches?
  • Did Home Assistant disable it?

The Issue already contains a link to this topic. It was present when I read the Issue yesterday and commented here that it was incomplete.


As a side note, you should put as much information as possible in the Issue itself instead of linking to this thread. This thread has a lot of back and forth unrelated to the issue at hand.


The details, which are not relevant to our problem: the sensor is extremely wonky. It will report motion for 90 seconds straight and then just sleep for about a minute or two afterwards, not reporting anything. Using the example given, I’ll end up sitting in the dark for up to 2.5 minutes sooner or later. Queueing the job, on the other hand, might keep the light on for much longer than needed.

Anyway, I reverted back to 2023.7.3 where everything is working well. This is essential since there is a person suffering from dementia living here.

That’s what I meant: clearing out orphaned threads might be a potential source of that problem. However, the problem also occurred with single automations (see above). As mentioned before, focusing on self-triggering is presumably a red herring.

I know you were not asking me. However, the problem manifested itself in the automation getting stuck, i.e. it could not be terminated, restarted, or even disabled. For me, only a complete restart would reset it.

BTW, sorry for missing the link in that GitHub issue. My bad.

It remains to be seen whether the problem with the posted simple automation is related to what you’re experiencing.

It’s most likely the change that was made for service calls that return data. Changes were made to blocking/non-blocking service calls. Some policies for timeouts on service calls (when the action doesn’t complete within 10 seconds) were removed, and they probably need to be reinvestigated. There is definitely a problem here.

Setting that aside, if your device has a robust connection to HA, you shouldn’t see this issue at all, because the service calls won’t fail to complete within the timeout.

This is the current speculation on the problem at hand. I’m not sure if it’s the actual issue.
