With 2023.7, more of my scenes started throwing errors and more of my automations started failing. There have been a few reports and a few defects logged, and some changes were made to ZHA which might improve things, but I wanted to find out what the HA philosophy is on error handling.
According to this defect, which was closed as working as expected, a scene should fail when it can't set any one of its devices. Since a lot of smart devices are a bit inconsistent, this means some automations will fail more often. I also have scenarios where something like a Hue bulb is unavailable because a light switch is off, or where holiday lights that are automatically shut off at night are not available in the summer. It was bad enough that I rolled back to 2023.6.
To me it seems like the old way, continuing when a service such as turn on fails for one device, has a better Home Approval Factor. I understand that we can add continue_on_error: true to scripts. Apparently it is also allowed for automations and scenes, but I haven't seen that documented, and it would mean that many of these things could no longer be edited in the visual editor.
Does anyone else have thoughts on this subject? What is the future of error handling in Home Assistant?
Not sure what you mean by “the old way”. In older versions of HA, automations failed and the actions stopped immediately if a service call produced an exception. Continue on error was added to allow the automation to continue if a service call produces an exception (a failure did not disable the automation, it just stopped the current run).
Is this what your post is about, or are you referring to some other functionality?
Also, the only change that occurred from 2023.6 to 2023.7 is that when an automation references an invalid service (note: this does not mean an exception from the service, it means the service just doesn’t exist) or a missing entity, the automation STAYS in the automation list. In 2023.6, the automation was completely removed (unknown to the user); now it stays and shows you that you need to fix it.
If an entity was missing in 2023.6 and I activated a scene that referenced that entity, it ran successfully. There was an error in my log but things worked. For example I have a decorative light in my office that went offline and I had to delete it. There are still some scenes and automations that reference it that I haven’t updated yet. I can still activate those scenes and still call the light.turn_off service for that entity and the other lights in my office still behave. There is an error in my log but my automations are not broken.
Another user had a saturated Zigbee network. In 2023.6 it would log the error message and maybe skip activating a light. In 2023.7, the automation would fail and stop.
When I say “the old way”, I am referring to scenes and automations continuing if a device failed to respond to turning on/off when invoked through the turn on service or an activate scene service.
In 2023.7, those automations fail. My night time automation fails when it gets to a holiday light that is offline and does not continue. I understand that I can clean these things up, or add continue_on_error to some things by editing YAML, and I’ll get the 2023.6 behavior. But it’s not documented except for scripts, it isn’t in the UI, and adding it to everything will be pretty tedious.
But aside from all that, what I intended this post to be about is: what is HA’s philosophy on an automation encountering an error? Some platforms, such as almost all low-level programming languages, would just throw an error. Other platforms would continue and attempt to do as much of the automation, scene, flow, etc. as possible. What is the HA community’s position on this?
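For anyone following along, here is a minimal sketch of what adding continue_on_error to a single action looks like in YAML. The alias, trigger time, and entity IDs are made up for illustration and are not from anyone's actual config:

```yaml
# Hypothetical night-time automation; all names below are placeholders.
automation:
  - alias: "Night time shutdown"
    trigger:
      - platform: time
        at: "23:00:00"
    action:
      # If this call raises a Home Assistant error, the run moves on
      # to the next action instead of stopping here.
      - service: light.turn_off
        continue_on_error: true
        target:
          entity_id: light.holiday_lights
      # No flag on this one: an error here still stops the run.
      - service: light.turn_off
        target:
          entity_id: light.living_room
```

The flag is set per action, which is why adding it everywhere gets tedious.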
It has always been to: Immediately fail and stop the current run. If you experienced behavior where it continued, then that was a bug.
Yes, this is what continue_on_error is for.
Not sure what you mean by this. Typically the community is divided on every single topic that crosses its path. So I’m sure you’ll have people who want it one way vs the other.
I’d like to have continue_on_error as the default.
Now I have to remember to add it to every service call,
and things get especially tricky if you rename an entity.
But enabling this as the default would probably be a major breaking change for some people, because if A doesn’t happen, B should never happen.
So maybe it could be added at the automation level.
I think the biggest problem with having this as the default behaviour is that it would encourage sloppy configuration and hide your mistakes so you don’t learn from them.
True, however,
I’ve had automations stop in scenarios where I didn’t want that.
One time it was an error on an html5 notification, and one time a notify service where I had a typo in the repeat.
In both scenarios I’d rather the automation had finished. Spook informs me about sloppy config, so I learn from the mistakes and can fix them.
I completely disagree with this. I have an automation that fails 20% of the time because it calls a scene that relies on one ZHA switch that likes to drop off the network.
That’s not bad code I wrote, that’s a bad decision by HA to completely abort on one unreliable device.
Absolutely not. Continue on error by default would be an inversion of the universal standard. The programmer (user in this case) and only he is responsible for ensuring errors do not occur. And if they do and the user has not provided a way to handle them, execution stops.
The script syntax does support continue_on_error, but note that this only applies to exceptions built into HA.
This could explain my issues. Do you have more details on this? Is this info documented? What does “built into HA” mean? E.g. if I have an error in the HomeKit or Sonos integration, is it “in HA”?
And then it says ‘oh, by the way, this won’t work for non-Home Assistant errors’. I once dug up that code and there’s a comment saying ‘We don’t want to ignore errors that are not from Home Assistant’, or something to that effect. Meaning that only native Home Assistant exceptions will be ignored. Custom exceptions won’t.
The precise explanation requires getting into programming. Only errors that inherit from native HA exceptions will be ignored. The integration developer chooses whether to use HA exceptions, his own, or whatever the imported code brings.
I believe it’s designed that way to prevent a system catastrophe due to ignoring potentially critical errors outside HA.
Thank you … this explains some things I’m facing …
Since “continue_on_error” doesn’t help me in some of these cases, I’ll probably create a “sub-script” for each call I suspect can have issues that are not “covered” by “continue_on_error”, and then just execute all the sub-scripts, so that my main execution process doesn’t break if one sub-script fails …
I also agree with having an option to always “continue_on_error: true”.
I don’t agree that it enables sloppy automations/configurations. It’s a failsafe, if one desires. Manually inserting “continue_on_error: true” after each service is cumbersome. At the very least, this option should be integrated into the GUI and allowed to be enabled for each automation and/or service call.
I recently “upgraded” to a Zooz 700 USB stick, and there are a couple of Z-Wave products that like to drop off the network. When they have dropped and one of my automations calls upon them, the automation stops… why should it? It should be the user’s choice to always allow automations to keep going.
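A rough sketch of that sub-script idea, assuming the flaky call is moved into its own script and started with script.turn_on. The script names and entity IDs are hypothetical:

```yaml
script:
  # Hypothetical sub-script wrapping the one call that tends to fail.
  flaky_sonos_volume:
    sequence:
      - service: media_player.volume_set
        target:
          entity_id: media_player.kitchen_sonos
        data:
          volume_level: 0.3

  morning_routine:
    sequence:
      # script.turn_on starts the sub-script and moves on without waiting,
      # so an error inside it does not abort this sequence.
      - service: script.turn_on
        target:
          entity_id: script.flaky_sonos_volume
      - service: light.turn_on
        target:
          entity_id: light.kitchen
```

The trade-off is that the main script no longer waits for the sub-script to finish, so later steps can't rely on it having completed. Calling the script directly as a service (script.flaky_sonos_volume) does wait, but an error in it would then abort the caller as well.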
Another example is with Eight Sleep, the “turn off side” service sometimes just doesn’t work. I would like my automations to continue even if that service call fails, it doesn’t affect the rest of the automation.
These are just a couple examples of many where sometimes integrations are just not 100%.
Which is why I’d like to have an automatic “ignore the error, but alert me” option.
Today: My morning automations did not complete, because I had unplugged my air purifier and forgot to turn it on.
Yes it’s useful to realize that, but it would have been more useful for the automation to run the rest of the things, and then notify me that that piece had failed.
I can understand the logic behind not wanting it as a default, but I still feel like you should let advanced users choose how they want to interact with their hardware.
At least a gui option to turn it on, even if there isn’t some global default.
Or a way to tell HA: “Yes this integration is flaky, always continue on error”
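Until something like that exists, the “run the rest, then notify me” idea can be approximated by hand. This is only a sketch under assumptions: the entity IDs are placeholders, notify.notify assumes some notifier is configured, and the check looks at whether the device is unavailable rather than catching the error itself:

```yaml
# Hypothetical action list for a morning automation.
action:
  # Try the flaky device, but don't let a failure stop the run.
  - service: fan.turn_on
    continue_on_error: true
    target:
      entity_id: fan.air_purifier
  # If the device is unavailable, send an alert and carry on.
  - if:
      - condition: state
        entity_id: fan.air_purifier
        state: "unavailable"
    then:
      - service: notify.notify
        data:
          message: "Morning routine: air purifier is unavailable, skipped it."
  # The rest of the routine runs either way.
  - service: light.turn_on
    target:
      entity_id: light.bedroom
```

It is still per-device boilerplate, so it doesn't replace a real GUI or integration-level option.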