Improving automation reliability

The Need

Smart homes include a network of devices. A command can fail due to temporary connectivity issues or invalid device states. The cost of such a failure can be high, especially for background automations. For example, failing to shut down a watering system which should run for 20 minutes can have severe consequences.
Any machine or device can break down, so it’s not possible to have a full guarantee without redundancy (which is most likely less relevant for smart homes). But it is possible to identify failures and try to mitigate them, which can significantly increase the overall reliability of the automation.

The Solution: retry.call

The custom integration adds a single service - retry.call. This service wraps an inner service call with background retries on failures.

Our Results & Experience

We have 56 automation rules with a total of 75 service calls in them. 51 of the service calls (~70%) were migrated last month to retry.call. The other 24 service calls are either not relevant or not suitable for this type of solution, as it has its own limitations. 34 of the 51 retry.call calls (~66%) also pass the expected_state parameter (see more about this parameter below).
The result is a significant increase in the reliability of the automation. It’s not so rare to see in the log file that retries were used, but we have never seen all the retries fail (there is a limit on the number of retries).

Usage

Instead of:

service: homeassistant.turn_on
target:
  entity_id: light.kitchen

The following should be used:

service: retry.call
data:
  service: homeassistant.turn_on
target:
  entity_id: light.kitchen

It’s possible to add any other data parameters needed by the inner service call.
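
For example, here is a minimal sketch that forwards brightness_pct (a standard light.turn_on parameter; the entity name is only illustrative) to the inner call:

service: retry.call
data:
  service: light.turn_on
  brightness_pct: 80
target:
  entity_id: light.kitchen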

Logic

The inner service call will get called again if one of the following happens:

  1. The inner service call raised an exception.
  2. One of the target entities is unavailable. Note that this is important since HA silently skips unavailable entities (here).

The service implements an exponential backoff mechanism. These are the delay times (in seconds) of the first 7 attempts: [0, 1, 2, 4, 8, 16, 32] (after the first retry, each delay is twice the previous one). The corresponding offsets from the initial call are [0, 1, 3, 7, 15, 31, 63].

Optional Parameters

By default there are 7 retries. This can be changed by passing the optional retries parameter:

service: retry.call
data:
  service: homeassistant.turn_on
  retries: 10
target:
  entity_id: light.kitchen

The retries parameter is not passed to the inner service call.

expected_state is another optional parameter which can be used to validate the new state of the entities after the inner service call:

service: retry.call
data:
  service: homeassistant.turn_on
  expected_state: "on"
target:
  entity_id: light.kitchen

If the new state is different from the expected one, the attempt is considered a failure and the retry loop continues. The expected_state parameter is not passed to the inner service call.
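
For reference, a sketch combining both optional parameters (assuming they can be used together, which follows from the above):

service: retry.call
data:
  service: homeassistant.turn_on
  retries: 10
  expected_state: "on"
target:
  entity_id: light.kitchen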

Notes

  1. The service does not propagate inner service failures (exceptions) since the retries are done in the background. However, the service logs a warning whenever the inner service call fails (on every attempt). It also logs an error when the maximum number of retries is reached.
  2. This service can be used for absolute state changes (like turning on the lights). But it has limitations by nature. For example, it shouldn’t be used for a sequence of actions where the order matters.

Install

HACS is the preferred and easiest way to install the component, and it can be done by using this My button:

Open your Home Assistant instance and open a repository inside the Home Assistant Community Store.

Otherwise, download retry.zip from the latest release, extract it, and copy the content under the custom_components directory.

A Home Assistant restart is required once the integration files have been copied (either by HACS or manually).

Adding the Retry integration to your Home Assistant instance can be done via the user interface by using this My button:

Open your Home Assistant instance and start setting up a new integration.

It’s also possible to add the integration via configuration.yaml by adding the single line retry:.
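
That is, the following snippet in configuration.yaml:

# configuration.yaml
retry: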

Feedback

Feedback, suggestions, and thoughts are more than welcome!


I, and others, use templates to detect unavailable entities so that we can use automations to act on this before it becomes an issue. I am notified immediately when there are any unavailable entities - being notified by a failed service call could be too late.

Otherwise, an interesting concept and implementation.


Completely agree about proactive monitoring of unavailable entities. We have a similar automation with (email) notifications.
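
For illustration, here is a minimal sketch of one way to count unavailable entities with a template sensor (the sensor name is arbitrary; actual setups may differ):

template:
  - sensor:
      - name: "Unavailable Entities"
        state: >
          {{ states | selectattr('state', 'eq', 'unavailable')
                    | map(attribute='entity_id') | list | count }}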

Here is the flow which was the reason behind the availability check:

  • We have many Shelly Gen1 devices. When there is a temporary connectivity issue, the HTTP failure of the command (e.g. on / off) causes the entity to become unavailable (here and here).
  • The next retry of the inner service call skips the entity (since it’s unavailable), and retry.call is considered a success (although the operation never actually succeeded on the entity).
  • The entity availability check ensures the retry loop doesn’t end unless the service call was indeed executed on the entity.

Hey, where can the problem be? I’m running HA on Docker and installed it via HACS, but when I try to use it I get the error “Unable to find service retry.call”.

Did you add it to the configuration?

Hi!

I have Shellies and the “unavailable” issue as well. To this day I have mitigated this with a wait_template.

sequence:
  - wait_template: "{{ not is_state('switch.pool_heater', 'unavailable') }}"
    continue_on_timeout: true
  - type: turn_on
    device_id: c6d83e9536c7d12ee2bfe51f477eaaa1
    entity_id: switch.pool_heater
    domain: switch

This works nicely when I have only one entity in the automation. But when I want to switch on all the outdoor lighting, it’s five entities in total, and then the wait_template becomes a mess: because it’s a sequence, if the first entity is “unavailable” it will wait for it to become available before switching on the others.

How does retry.call handle this situation? For example, if I have:

service: light.turn_off
data: {}
target:
  entity_id:
    - light.front_entrance_outdoor_lights
    - light.front_yard_lights
    - light.back_yard_outdoor_lightning
    - light.balcony_lights
    - light.pergola_lightning

retry.call executes each entity individually. For example, if there are 5 entities and 4 are available, only the unavailable one will be called again.

Here is the relevant paragraph from the integration’s page, which explains that:

Service calls support a list of entities either by providing an explicit list or by targeting areas and devices. The call to the inner service is done individually per entity to isolate failures.


I have an issue.

I have a PIR in the sauna washing room and an automation that turns on the light when the PIR detects movement; after the PIR detection stops there is a 5-minute delay and then the light is switched off.

I added your integration to the automation to switch the light on and off. My son was showering and the light kept switching off all the time, even though he moved and the PIR was active.

Configuration:

- service: retry.call
  data:
    service: light.turn_off
    expected_state:
      - "off"
  target:
    entity_id: light.sauna_wahing_room_lights

Here is the log:

Looks like retry.call was heavily trying to switch the light off.

The only reason I need this integration is to mitigate my Shellys’ network issues, where entities are unavailable for brief moments.

So what I pretty much need is a retry function that checks whether the entity is unavailable and retries until it’s available. Checking if it’s in the expected state is also good, but it seems that something is not working as it should. I used to use this in my automation sequence:

- wait_template: "{{ not is_state('switch.pool_heater', 'unavailable') }}"
  continue_on_timeout: true

I don’t think I fully understand the automation rules, but the problem seems to be in the triggers of the rule, which shouldn’t be running at all (the lights should be kept on), and not in its actions section.

Regarding the expected_state parameter: it is completely optional. You can remove it, and then a retry will happen only when the entity is unavailable (which is your primary goal) or when there is an exception (which doesn’t seem to happen here). However, I don’t think that keeping it should make a difference. It’s advised to set it when possible.


Here is my full automation. I took out the expected state. Let’s see tomorrow how it behaves.


alias: Sauna Lights PIR
description: ""
trigger:
  - type: motion
    platform: device
    device_id: 131101f1c80e1ac3e6c3eb3af1f922db
    entity_id: binary_sensor.sauna_pir_motion_detection
    domain: binary_sensor
    id: PIR ON
  - type: no_motion
    platform: device
    device_id: 131101f1c80e1ac3e6c3eb3af1f922db
    entity_id: binary_sensor.sauna_pir_motion_detection
    domain: binary_sensor
    id: PIR OFF
condition: []
action:
  - choose:
      - conditions:
          - condition: trigger
            id: PIR ON
          - condition: state
            entity_id: light.sauna_wahing_room_lights
            state: "off"
        sequence:
          - service: retry.call
            data:
              service: light.turn_on
            target:
              entity_id: light.sauna_wahing_room_lights
      - conditions:
          - condition: trigger
            id: PIR OFF
          - condition: state
            entity_id: light.sauna_wahing_room_lights
            state: "on"
        sequence:
          - delay:
              hours: 0
              minutes: 5
              seconds: 0
              milliseconds: 0
          - service: retry.call
            data:
              service: light.turn_off
            target:
              entity_id: light.sauna_wahing_room_lights
    default: []
mode: restart

Hey Amit,
I love what you have created, as it is almost the solution for my window blinds, which sometimes don’t close because they get stuck on their way down.

So, repeating the command to close is close to the perfect solution for me.

But the tricky part is the “expected state”:

As soon as I try to close, the state is “closing”; if it gets stuck, it will open again.

If I set the expected state to “close”, it will retry several times, as closing the blinds 100% takes maybe 20 seconds.

So, what I’d love to have is that “Retry” waits for 20 seconds (in my case) to check if the expected state (“close”) has been reached,
and will only do the second, third, … retry when, after 20 seconds (or whatever time I define), the expected state of “close” has not been reached.

Do you think this might be possible?
Take care
Frank

@frafro , can you elaborate on the problem with the “closed” expected state? Is there an impact of sending an additional “close” command while the blind is closing?
(BTW, not sure if and what is the physical impact of insisting to close a blind after it got stuck.)
Anyway, I’ve got several requests to parameterize GRACE_PERIOD_FOR_STATE_UPDATE, so perhaps it’s about time to do so.

v2.5.0 adds a state_grace parameter to control the grace period of the expected state check.
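
A minimal sketch of how this might look for the blinds scenario, assuming state_grace is passed in data like the other optional parameters and the value is in seconds (the cover entity is just a placeholder):

service: retry.call
data:
  service: cover.close_cover
  expected_state: "closed"
  state_grace: 20
target:
  entity_id: cover.bedroom_blinds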

Hi Amit,
well, it is not a problem issuing the service call several times while the blinds are closing.
But when I want to use the new “repair” feature as a notification of a final failure (when the blind is really stuck),
I get 2-3 “repairs” until the blind is closed, even if there are no problems, as it takes some time to reach the final state “closed”.
I will try the new grace_period parameter tonight and report tomorrow.
Thanks
Frank

Hi Frank,
A repair ticket is issued only after the failure of the last attempt.
(Commenting about it here to make sure other people get the accurate data.)

I will be happy to know if state_grace (or anything else) has solved your scenario.

Hi Amit,
well, yesterday I had 50+ repairs for my 5 blinds going down at sunset.
And none of them ended up actually stuck.
I will try to figure out what is going wrong on my side.

By the way, I found a minor typo in the new version:
“Grace period (seconds) for expected state (defualt is 0.2).”

OK, this is what happens to one of the blinds if I close it (with the grace period).

15:04:43 - Rollo AZ was closed
15:04:40 - Rollo AZ was opened …?
15:04:27 - Rollo AZ closing, triggered by service Retry: Actions …first retry?
15:04:27 - Rollo AZ opening, triggered by service Retry: Actions …?
15:04:25 - Rollo AZ closing, triggered by service Retry: Actions

So, it takes around 18 seconds from the first “closing” at 15:04:25 to the final “closed” at 15:04:43.

No idea where the intermediate messages are coming from.

Strange but true: no “Repairs” with or without the grace_period today.

But this was only for one blind; let’s see what happens if the automation closes all of them at the same time at sunset…

Take care

Frank

Hey Amit,

grace_period has solved my issue.

It takes some time for repairs to appear; it is not “real-time”, I think.

But anyway, your integration has made my day.

Take care
Frank

This is just what I needed! Opening/closing the blinds can be unreliable.

Is there a way to check the state of another entity?

The cover reports its state prematurely. It can take ~10 seconds to open or close the blinds, but HA reports the “open” or “closed” state immediately. However, I have another entity which is the percent-open state of the blinds.

I’d like to call cover.open on the blinds, but then verify the state 15 seconds later, based on the percent open entity. Is there a way to achieve this?

[screenshot]
with latest HACS 1.33.0 and Retry 2.5.0

I have an interesting failure in an automation, using Tuya devices with the official Tuya integration.

If I use the Retry service the automation fails. However, if I do not use the Retry service the automation succeeds.

I am not certain if this is a device issue, a Tuya integration issue, or something else.

Trace when it succeeds:
[trace screenshot]

Trace when it fails:
[trace screenshots]

v2.6.0 adds a validation parameter for providing a template with a boolean expression. It’s possible to refer to any entity in the expression, so it should address the scenario. The release notes contain additional important information, so please read them.
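
To illustrate, a rough sketch only (it assumes validation accepts a template string; the sensor and cover names are placeholders):

service: retry.call
data:
  service: cover.open_cover
  validation: "{{ states('sensor.blinds_percent_open') | int(0) > 90 }}"
target:
  entity_id: cover.living_room_blinds
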
@danhiking , please let me know if it’s indeed working for you. (The release is still in beta, so beta releases should be enabled for the component).