Make actions reliable

AxelBerger · November 9, 2024, 10:20am

All networks are more or less unreliable and can drop packets. For sensors that is immaterial: if one value is missed, the next one will come. But actions are invoked only once. For example, I have a dimmable Shelly bulb “Arbeitsplatz” that serves as a soft automated nightlight and is turned off in the morning. That Shelly drops its connection more or less regularly for about half a minute. Usually the dropouts are several hours apart; the 15 minutes in the log below are an exception. As the logfile shows, the lamp missed being turned off (Licht aus) because it was unavailable at that moment.

action:
  - metadata: {}
    data: {}
    target:
      entity_id:
        - light.arbeitsplatz_2
    action: light.turn_off
mode: single

Arbeitsplatz turned on
07:44:41 - 3 hours ago
Licht aus sunrise with offset
07:44:18 - 3 hours ago - Traces
Arbeitsplatz became unavailable
07:44:08 - 3 hours ago
Sun Next rising changed to 10 November 2024 at 07:41
07:39:18 - 3 hours ago
Sun rose
07:39:18 - 3 hours ago
Arbeitsplatz turned on
07:32:25 - 3 hours ago
Arbeitsplatz became unavailable
07:31:52 - 3 hours ago
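
A retry can also be built into the automation itself. The following is only a sketch, assuming the 2024.10+ automation syntax; the 30-second delay and the limit of 10 attempts are arbitrary assumptions. `continue_on_error` is set so that a failed call to the unavailable lamp does not abort the loop.

```yaml
actions:
  - repeat:
      sequence:
        # Try to switch the lamp off; ignore errors while it is unreachable
        - action: light.turn_off
          target:
            entity_id: light.arbeitsplatz_2
          continue_on_error: true
        # Give the lamp time to reconnect and report its state
        - delay:
            seconds: 30
      until:
        - condition: or
          conditions:
            # Stop once the lamp really reports "off" ...
            - condition: state
              entity_id: light.arbeitsplatz_2
              state: "off"
            # ... or give up after 10 attempts (assumed limit)
            - condition: template
              value_template: "{{ repeat.index >= 10 }}"
mode: single
```

While the lamp is unavailable its state is neither "on" nor "off", so the loop keeps trying until the lamp is back and confirms that it is off.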

For critical settings (heating and other high-power units) I do not switch the appliance directly but set a variable. A time-triggered automation then sends the desired state to the appliance every minute or so. This restores the state even after a power outage or any other disturbance. It would be nice if something like that could be implemented for actions as a background service.

- id: '1722081581503'
  alias: Relais-Setzen
  description: ''
  trigger:
  - platform: time_pattern
    minutes: /1
  condition: []
  action:
  - service: modbus.write_coil
    data:
      hub: modbus_hub
      address: 0
      slave: 1
      state: '{{states(''input_boolean.mbr1'')}}'
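
The same pattern could be applied to the lamp from the first example. This is a sketch under assumptions, not a tested configuration: `input_boolean.arbeitsplatz_soll` is a hypothetical helper holding the desired state, the ten-minute interval is arbitrary, and the 2024.10+ syntax is used.

```yaml
- alias: Arbeitsplatz-Setzen
  description: 'Sketch: resend the desired state to the lamp'
  triggers:
    # Resend periodically ...
    - trigger: time_pattern
      minutes: /10
    # ... and as soon as the lamp comes back from being unavailable
    - trigger: state
      entity_id: light.arbeitsplatz_2
      from: unavailable
  conditions:
    # Only act when the (hypothetical) helper has a usable value
    - condition: template
      value_template: "{{ states('input_boolean.arbeitsplatz_soll') in ['on', 'off'] }}"
  actions:
    # Resolves to light.turn_on or light.turn_off
    - action: "light.turn_{{ states('input_boolean.arbeitsplatz_soll') }}"
      target:
        entity_id: light.arbeitsplatz_2
  mode: single
```

The automation that used to switch the lamp would then only set the helper with `input_boolean.turn_on` or `input_boolean.turn_off`.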

fleskefjes · November 9, 2024, 10:22am

Your network is not healthy if this happens regularly. You don’t state what kind of network this is, but in general I would recommend fixing the issue rather than making a workaround.

tom_l · November 9, 2024, 10:23am

That is not normal. Fix your wifi coverage and you will have no issue.

WallyR · November 9, 2024, 10:52am

Also make sure that your Shelly is not running firmware 1.4.0

AxelBerger · November 9, 2024, 4:43pm

Current version: 20230913-111821/v1.14.0-gcb84623

MrEbbinghaus · November 9, 2024, 7:05pm

The problem you have is a real one, not just “fix your WiFi”. Unexpected things happen in the real world and systems should be resilient.

A state-based model with a constraint-solving system would be very nice indeed, and could enable very simple yet powerful automations that have self-healing capabilities.

Imagine you define a constraint: when it’s evening, the lights’ color temperature should be 2500K. Then it doesn’t matter how the light got into its current state: whether it was turned on via an external switch (not controlled by HA) or by an automation (controlled by HA), or whether it was already on and just needs adjustment, in which case there is no turn-on event at all.
(There are integrations like Adaptive Lighting, that work like this.)
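
Until something like that exists, such a constraint can be approximated with an ordinary automation. A minimal sketch, assuming a hypothetical `light.living_room` and taking "evening" to mean "sun below the horizon":

```yaml
alias: "Constraint: evening light at 2500K"
triggers:
  # The light turns on, no matter who or what turned it on
  - trigger: state
    entity_id: light.living_room   # hypothetical entity
    to: "on"
  # Evening begins while the light is already on
  - trigger: sun
    event: sunset
conditions:
  - condition: state
    entity_id: sun.sun
    state: below_horizon
  - condition: state
    entity_id: light.living_room
    state: "on"
actions:
  - action: light.turn_on
    target:
      entity_id: light.living_room
    data:
      color_temp_kelvin: 2500
```

The conditions describe the constraint; the triggers only decide when it gets checked.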

The problem with such a system is that it’s much harder to implement.
It would be a terrible waste of resources if the system had to check every constraint in a loop as fast as possible. Instead, the system would have to work out when to do the checks (e.g. when the lights go on or off, when HA finishes booting, when the light reconnects, …).

For practical advice that you can follow with any automation:
Don’t use the automation triggers for your logic.

Trigger: Sun sets
Actions: Turn on the light

Bad: Nothing actually checks whether the sun has set. This breaks if you run the automation with automation.trigger or add another trigger in the future.

Better:

Triggers:
  - Sun sets
  - Someone comes home
Conditions:
  - Sun below horizon
  - Someone at home
Actions: Turn on the light
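
In YAML the "Better" version could look roughly like this. The entities `person.someone` and `light.living_room` are placeholders, not taken from a real setup:

```yaml
triggers:
  - trigger: sun
    event: sunset
  - trigger: state
    entity_id: person.someone      # placeholder
    to: home
conditions:
  # The logic lives here, independent of which trigger fired
  - condition: state
    entity_id: sun.sun
    state: below_horizon
  - condition: state
    entity_id: person.someone
    state: home
actions:
  - action: light.turn_on
    target:
      entity_id: light.living_room # placeholder
```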

If you want the conditions to be checked even when you run the automation by hand, then move the conditions to the actions section.
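
A sketch of that variant, again with placeholder entities. A condition inside the actions list stops the sequence when it is not met, so it also applies to manual runs:

```yaml
actions:
  # Checked on every run, including manual runs
  - condition: state
    entity_id: sun.sun
    state: below_horizon
  - condition: state
    entity_id: person.someone      # placeholder
    state: home
  - action: light.turn_on
    target:
      entity_id: light.living_room # placeholder
```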

For such “constraint automations”, consider adding a special label and then adding an automation that triggers on major events and coordinates the checks. You can use this as a strategy to dynamically decide how often to run the checks, saving resources.

triggers:
  - trigger: homeassistant
    event: start
  - trigger: state
    entity_id:
      - binary_sensor.connected_to_the_internet
  - trigger: time_pattern
    minutes: /10
actions:
  - action: automation.trigger
    data:
      skip_condition: false
    target:
      label_id: constraint

A central coordinator could also be useful if, for example, you have an automation A that needs to run to provide the condition for automation B.

You can catch call_service events, fix the situation and send the event again.

For example, I have a network switch connected to a smart plug.
For a wake-on-lan event to reach my PC, the switch and therefore the plug must be turned on.

So the following automation triggers when the WoL button is pressed and the plug is off, then turns the plug on, waits until the network switch is on and then tries to turn the PC on again.

alias: When computer WoL and outlet off -> turn on outlet and re-emit
description: ""
triggers:
  - alias: When Bob WoL is called
    trigger: event
    event_type: call_service
    event_data:
      domain: button
      service: press
      service_data:
        entity_id: button.bob_wol
conditions:
  - condition: state
    entity_id: switch.outlet
    state: "off"
actions:
  - action: switch.turn_on
    target:
      entity_id: switch.outlet
  - alias: Wait for network switch to be available
    wait_for_trigger:
      - trigger: state
        entity_id:
          - device_tracker.switch_wohnzimmer
        to: home
    timeout:
      minutes: 1
  # Press the same WoL button again now that the network switch is up
  - action: button.press
    target:
      entity_id:
        - button.bob_wol

Or re-emit the event as is:

  - event: call_service
    event_data:
      domain: "{{ trigger.event.data.domain }}"
      service: "{{ trigger.event.data.service }}"
      data: "{{ trigger.event.data.service_data }}"

There is a possibility of creating an endless loop, so be careful with this type of automation.
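
If firing a call_service event does not actually run the action on your installation, a possible alternative is to call the original action directly, using the data from the trigger. This is a sketch, assuming the automation was triggered by a call_service event as in the example above:

```yaml
  # Rebuild "domain.service" from the caught event and call it directly
  - action: "{{ trigger.event.data.domain }}.{{ trigger.event.data.service }}"
    # service_data from the event already contains the entity_id
    data: "{{ trigger.event.data.service_data }}"
```

This call produces a new call_service event itself, so the automation's condition (here: the outlet is off) has to be false by then, otherwise it will run again.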

tom_l · November 9, 2024, 9:49pm

Agreed, and I heard you the first time but they did say:

Which is not normal at all. So in this case fixing their wifi coverage is a very good first step rather than covering up the issue with a software fix.

MrEbbinghaus · November 10, 2024, 1:49am

Now that you mentioned it, I have to leave it.

The unreliable lamp is only an example. The thread itself is about the general issue.

Fixing the nightlight’s Wi-Fi won’t give you peace of mind that your heater will be turned on in the morning in winter, even if something failed temporarily.

tom_l · November 10, 2024, 2:59am

With proper network coverage (wifi and zigbee) my automations never fail due to undelivered commands.

Maybe for something critical like a pool filling pump that would be warranted.

Cake1468 · November 10, 2024, 3:22am

There is a HACS integration called Retry.

I had some light bulbs that did not tolerate being firewalled from the internet. They would reset themselves every 10 minutes and be down for around 30 seconds. I used Retry as a bandaid.

In the end I gave in and let them phone home. I don’t think I’ll buy Kasa bulbs again unless they fix that problem.

Perhaps Retry may be a solution for you.

AxelBerger · November 11, 2024, 12:01am

Thank you for getting my point. After tom_l’s insistence I checked the logfiles over several weeks. It is only ever that one lamp, never any of the many other units, and never for longer than about half a minute. But as you say, my point is more general and valid even if that were not the case. Sticking with lamps: they usually come on at full brightness after any power outage, intentional or otherwise. Some even start flashing, which is most annoying. HA won’t know about any of this. My suggestion is not to perform (some) actions just once but to save a desired state. Resending a “you should be in this state” every ten minutes or so would add negligible load. Apart from very exceptional cases it never needs to be more frequent than that. I gave a working solution above. A background service provided by HA as one kind of action would be nice to have, but that’s just a suggestion. Doing something like my solution in cases where a malfunction would be a major annoyance is just a suggestion too.