Continue on Error as the default

AJolly · June 5, 2024, 11:08pm

Which is why I’d like to have automatic ignore error but alert.

Today: My morning automations did not complete, because I had unplugged my air purifier and forgot to turn it on.

Yes it’s useful to realize that, but it would have been more useful for the automation to run the rest of the things, and then notify me that that piece had failed.

I can understand the logic behind not wanting it as a default, but I still feel like you should let advanced users choose how they want to interact with their hardware.

At least a gui option to turn it on, even if there isn’t some global default.

Or a way to tell HA: “Yes this integration is flaky, always continue on error”

tom_l · June 6, 2024, 2:52am

You can do that.

Use an automation to monitor your system log for errors and send a notification if one occurs.

AJolly · June 6, 2024, 6:27pm

Hmm good idea, right now I have watchdogs that check if a boolean gets successfully set. I’d still like a way to set continue on error on as the default for a certain automation or integration.

nathanCFD · June 7, 2024, 3:46pm

Hi Tom, that’s true for sure. But, what I can echo from my own recent experience is it’d be super helpful to have a way to log and notify on “more trivial” errors instead of stopping the script entirely.

Similar to PHP, for example: The difference between “warnings” and “fatal errors”. One tells you something’s not right and you should look at it while the other stops the script from continuing to execute. You can specify the default behavior for warnings to “log and continue” instead of hard-fail.

Having a way to do that either globally or on the “whole script” level would be nice.

I’ve never experienced this issue until last night. Went downstairs and everything was dark when it should not have been. Checked the logs and my light schedule had stopped because an unreliable Sengled bulb had dropped off the network. There’s 20 other lights in the script; it would have been better to have those others go ahead and turn on as expected. Assume it’s going to continue to do this until I find, order and replace those flaky bulbs. (Aside: if anyone has a suggestion for a non-proprietary-hub zigbee bulb, I’d appreciate it)

Strange thing is: Those Sengled bulbs drop off all the time; and the scripts never stopped before - odd that it decided to do it for the first time last night; haven’t updated HA or any integrations.

Just my two cents on the issue; I’ll add the continue on error for those bulbs on the scripts that run daily until I can get them replaced.

AJolly · June 11, 2024, 3:45pm

I’ve been pondering this, and what I’d really love is to be able to add some error handling within an automation itself.

I’d like a try/catch block: specify that if a step takes too long or doesn’t return successfully, an alternate code path runs.

The only way I seem to successfully get automations to always complete is to wrap them in a parallel block and add continue_on_error everywhere.

ampersandru · June 18, 2024, 6:40pm

I don’t think anyone is asking to hide the problem. I have hundreds of devices using multiple protocols (cloud, Z-Wave, Zigbee, Bluetooth, etc.) and it’s a pain to make sure EVERYTHING is working for automations to execute fully.

If the HA team wants HA to become more mainstream, things like this need options and proper alerts so people can know what to do and not kill the automation dead in its tracks because of one cloud device that stopped responding or timed out.

matt_heinrichs · June 26, 2024, 6:49pm

As an “easy” next step, I would suggest putting any automation that fails with an exception into the “Repair” section: “My script is broken…”
I sometimes notice something didn’t happen, dig into Traces, only to find something has stopped working for the last few days. Generally something going offline.
I understand I’m responsible for better exception handling in my scripts, but it would help if the automations supported some standard mechanisms for exception handling (try-catch?).

ldf · June 27, 2024, 8:24am

+1 to this. This setting should be available in the UI, and there should be an automation-wide on/off default setting in both the UI and YAML.

ampersandru · July 20, 2024, 3:51pm

Since updating to 2024.7, continue on error seems to be ignored

One of my locks sporadically works (thanks zwave 700 issues) and as you see in this trace screenshot, it completely stopped at that point

KE55ARD · August 23, 2024, 5:10pm

Throwing my 2 cents in on this. This feature is absolutely necessary and the best thing to do in my opinion. Especially if you don’t actually want to spend your entire life obsessing over which device is causing loads of things not to work.

It would not hide the problem when something doesn’t work, it would actually make it much clearer! The thing I hate about the current behaviour is that when 1 thing breaks, it’s not obvious which thing broke my script/automation, because maybe 5 things didn’t happen.

If I triggered a script and everything worked perfectly every time except 1 light or plug, it’d be DAMN obvious which device was at fault! And I could choose to leave that device working as well as it does, or eventually fix/replace it.

But now I’m basically held hostage to spend time/money on fixing an issue or at least unnecessary effort restructuring a script to work around one device that sometimes doesn’t work right for seemingly no good reason. And in my experience, the moment I do that, something else chooses to go on strike, and it’s a never ending merry go round of moving things around when in reality I couldn’t care if one thing occasionally failed…

MaloW · August 29, 2024, 9:43am

+1, out of my roughly 100 automations I would really only want 1 or 2 to stop running if it encounters an error, continue_on_error should absolutely be the default behavior, with a stop_on_error option instead.

dzerovibe · September 12, 2024, 3:13am

+1 for Continue on Error. I just got bitten by this again last night when an automation failed due to a single Zigbee device being unavailable. I wish there was a way to allow continuing on errors by default. In my case, I would probably want this for 99% of my automations.

jurgenweber · October 21, 2024, 10:21pm

I agree with this, but the problem I face is that sometimes my internet goes down… and I basically have to put ‘continue_on_error’ on anything that involves a cloud integration.

An “ignore this step when there is no internet because my cloud integration won’t work” option would be rad.

SphtKr · October 25, 2024, 12:23pm

Understand not wanting to make this the global default–or even not wanting to make it possible to change it to be globally on.

But, being able to ignore errors for a whole script is definitely worth doing. And being able to set it via the GUI per script block is definitely needed.

And here’s another middle-ground suggestion: a “wrapper” block similar to “if else” that works like “try catch”, so anything within the “try” block is “continue_on_error: true”. This could have an option to jump to the “catch” block on the first error sequentially, or treat every block as “continue_on_error: true” and only execute the “catch” block if any errors occurred (perhaps some variable provided to the “catch” block would contain the first error). Traces would ideally show any errors that occurred.
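
The two modes described above can be sketched in plain Python. This is only an illustration of the proposed semantics, not Home Assistant code: `run_try_block`, `stop_on_first_error`, the `catch` callback, and the step names are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of this is Home Assistant code.
Step = Tuple[str, Callable[[], None]]
Failure = Tuple[str, Exception]


def run_try_block(
    steps: List[Step],
    catch: Callable[[List[Failure]], None],
    stop_on_first_error: bool = False,
) -> List[Failure]:
    """Run steps in order, collect failures, then call catch if any occurred."""
    failures: List[Failure] = []
    for name, step in steps:
        try:
            step()
        except Exception as err:  # a real implementation would likely be narrower
            failures.append((name, err))
            if stop_on_first_error:
                break  # mode 1: jump to "catch" at the first error
            # mode 2: keep going, as if continue_on_error were true
    if failures:
        catch(failures)
    return failures


def flaky_bulb() -> None:
    """Stand-in for a device that does not answer."""
    raise TimeoutError("device did not respond")


done: List[str] = []
steps: List[Step] = [
    ("hall light", lambda: done.append("hall light")),
    ("flaky bulb", flaky_bulb),
    ("kitchen light", lambda: done.append("kitchen light")),
]

run_try_block(steps, catch=lambda failures: print("failed:", [n for n, _ in failures]))
print("completed:", done)
```

In the default mode the kitchen light still runs after the bulb fails, and the “catch” callback receives the full list of failures, so it is clear which device was at fault.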

My use-case: I have almost 100 Z-wave devices, and I have a few scripts that turn everything in the house off or down at night or when leaving the house. Statistically, the odds of any device missing the message grows very high, even with a fairly healthy network. At the moment something in zwave-js’s retry strategy seems to have changed so I am more often getting ZW202 failures with only 1 try (I can’t find where to change this)… so my script that shuts everything off and arms the alarm fails quite often. Complicating matters, this script is actually turning off several group helpers, so it is not at all obvious which device is failing from the Traces.
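
To put rough numbers on the “statistically” point: if each device independently acts on a command with probability p, the chance that at least one of n devices misses it is 1 - p^n. The per-device rates below are illustrative assumptions, not measurements from this network, and real failures are probably not fully independent.

```python
def p_any_failure(n_devices: int, p_success: float) -> float:
    """Probability that at least one of n independent commands fails."""
    return 1.0 - p_success ** n_devices


# Assumed per-device success rates, chosen only for illustration.
for p in (0.999, 0.99):
    chance = p_any_failure(100, p)
    print(f"{p:.1%} per-device success, 100 devices: {chance:.1%} chance of a failure")
```

Under these assumptions, even 99.9% per-device reliability gives roughly a 9.5% chance that a 100-device script hits at least one error, and 99% reliability pushes that to about 63%.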

EDIT: Forgot to mention that the fact that adding “continue_on_error: true” to the YAML breaks GUI editing for every block it’s applied to is a BIG downer and big step backward in usability…hence why I say GUI support for that is “definitely needed”. Looks like this is maybe in the back of the devs’ minds, because a little icon appears in the editor when it’s enabled… still.

AJolly · October 27, 2024, 7:12pm

I can’t even get continue on error to work reliably for me.

That being said, if you want continue on error everywhere:
python_scripts/update_continue_on_error.py

import yaml
from datetime import datetime
from shutil import copy2
from typing import Any

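def add_continue_on_error(data: Any) -> Any:
    """Add continue_on_error: true to each step of every 'action' list it reaches."""
    if isinstance(data, dict):
        if 'action' in data:
            if isinstance(data['action'], list):
                # A list of steps: flag every step that does not already set the option.
                for action in data['action']:
                    if isinstance(action, dict) and 'continue_on_error' not in action:
                        action['continue_on_error'] = True
            elif isinstance(data['action'], dict) and 'continue_on_error' not in data['action']:
                # A single step given as a mapping.
                data['action']['continue_on_error'] = True
            # Note: steps nested inside these steps (choose, if/then, ...) are not visited.
        else:
            # No 'action' key at this level: keep searching the nested values.
            for key, value in data.items():
                data[key] = add_continue_on_error(value)
    elif isinstance(data, list):
        return [add_continue_on_error(item) for item in data]
    return data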

def represent_none(self, _):
    return self.represent_scalar('tag:yaml.org,2002:null', '')

yaml.add_representer(type(None), represent_none)

def create_backup(file_path: str) -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"{file_path}.{timestamp}.backup"
    copy2(file_path, backup_path)
    return backup_path

def process_yaml(input_file: str) -> None:
    # Create a backup
    backup_file = create_backup(input_file)
    print(f"Backup created: {backup_file}")

    # Read and modify the YAML
    with open(input_file, 'r') as file:
        data = yaml.safe_load(file)

    modified_data = add_continue_on_error(data)

    # Write the modified data back to the original file
    with open(input_file, 'w') as file:
        yaml.dump(modified_data, file, default_flow_style=False, sort_keys=False)

if __name__ == "__main__":
    input_file = '/config/automations.yaml'
    process_yaml(input_file)
    print(f"Modified YAML has been written back to {input_file}")
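
The function above only flags the steps directly inside a top-level `action:` list. Below is a minimal sketch of a variant that also descends into nested blocks and the plural `actions:` key. The key sets are my assumptions about where steps can be nested, the entity names are made up, and it has not been tested against a real Home Assistant install, so keep the backup step if you try it.

```python
from typing import Any

# Assumption: keys whose list values hold action steps in automation/script YAML.
STEP_LIST_KEYS = {"action", "actions", "sequence", "then", "else", "default", "parallel"}
# Assumption: payload keys that should never be searched for steps.
SKIP_KEYS = {"data", "target", "variables"}


def mark_steps(node: Any, is_step: bool = False) -> int:
    """Add continue_on_error to every step dict under node; return the number added."""
    added = 0
    if isinstance(node, list):
        for item in node:
            added += mark_steps(item, is_step)
    elif isinstance(node, dict):
        if is_step and "continue_on_error" not in node:
            node["continue_on_error"] = True
            added += 1
        for key, value in list(node.items()):
            if key in SKIP_KEYS:
                continue
            # Only a list under a step-holding key contains steps.
            holds_steps = key in STEP_LIST_KEYS and isinstance(value, list)
            added += mark_steps(value, holds_steps)
    return added


# Hypothetical automation, as yaml.safe_load would return it.
automation = {
    "alias": "Morning",
    "actions": [
        {"action": "light.turn_on", "target": {"entity_id": "light.hall"}},
        {
            "if": [{"condition": "state", "entity_id": "sun.sun", "state": "below_horizon"}],
            "then": [{"action": "switch.turn_on", "target": {"entity_id": "switch.purifier"}}],
        },
    ],
}
print("steps flagged:", mark_steps([automation]))
```

Returning a count makes it easy to log how many steps a run changed; a second run over the same data should report zero.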

configuration.yaml

shell_command:
  update_continue_on_error: "python3 python_scripts/update_continue_on_error.py"

automation:

alias: Update Continue On Error
description: ""
trigger:
  - platform: time
    at: "07:16:00"
condition: []
action:
  - metadata: {}
    data: {}
    action: shell_command.update_continue_on_error
    continue_on_error: true
mode: restart

ultrazero · November 16, 2024, 2:00pm

I must say, I’m in camp “make this a feature” as well. The ability to enable at the automation level seems like a fair compromise.

@tom_l would you mind sharing how I would set up an automation to monitor system logs for errors on automations? Is there a template you might point me to? To your point, addressing issues with hardware is ideal.

Thank you.

dmcentire · November 20, 2024, 5:29pm

I’m running into this as well, just like many folks. It’s true - no matter how well you plan, execute, and follow procedures things will either break, go offline because you accidentally turned off / unplugged something, or whatever.

Letting an automation continue past a failed step may hide the problem, but it sure keeps everything else running when you have automations that do things automatically, like turning lighting on and off.

Just yesterday I had a zigbee outlet fail (hardware failure, making noises) and it ended up breaking all the time triggered automations because it included a small LED lamp in a hallway as part of the automation.

Because it was one of the first devices the automation acted on, the rest of the actions failed when this device became unreachable. I knew about it when it did not turn off at night, and testing the device manually failed, so I knew it needed to be replaced. I did not want to dive into all that work last night since it was a wall outlet, but also I didn’t want to delete the zigbee device as I didn’t know how bad the automations would break when the device was non-existent vs. unreachable.

Anyway, I changed the outlet this morning and ensured that I used the same entity names for the outlets & usb port so my automations fell into place without any need to edit them. And things are back to normal now, but again I’d like to have this feature easily set to on by default (the continue_on_error flag).

I have edited my time-triggered automations to include this in every action now, but having an easily changed setting in HA would be a nice feature in my opinion.

enp6s0 · November 26, 2024, 3:04pm

Sweet Jesus please add this at a per automation level.

I do not care how sloppy it is. It is basic usability for a normie.

matrover · December 9, 2024, 4:14pm

Please vote to have this (maybe) solved in the WTH of Dec ’24: WTH can’t continue on error be set for entire automation

Rod_Poplarchick · December 26, 2024, 4:04pm

The big problem with this option is that it does not work if a cloud-based action fails, because the main code (not easily editable) only allows errors from native HA integrations to be ignored.
My automation errors only happen when a cloud-based integration fails, such as being unable to connect to Spotify.

    # Only Home Assistant errors can be ignored.