ESPHome commands are not guaranteed - what workarounds?

I was fiddling with my new ESP32s and exposed LED D2 as a switch.
I wrote a Home Assistant automation to toggle all the devices in an area every 5 seconds, then put all my ESP32s in that area. What I see is that, after starting in sync, the LEDs are out of sync after a few tens of minutes.

So it appears commands don’t always get actioned, and I can’t see any error messages.
Is this expected and accepted as normal, and is there any simple way to cope with it?
For example, I want to trigger my watering system. I have coded it to turn on for five minutes and then turn off independently, but if my scheduler in Home Assistant sends the start and it doesn’t happen, how do I know?
Programming in YAML to check 10 seconds later whether the solenoid is on, and if not try again, is probably possible - but

Some people believe saying, “failure is not an option” means everything will be okay. Others know that even if failure isn’t an option it will find a way.

So, you need to understand the consequences of failure and then what you can do to know when it happens or how to prevent it.

In your example, toggle is generally a bad idea for consistency. A long time ago, when I started down the automation rabbit hole, most of the readily available remote control (RF) switches were toggle. I quickly realized this was a bad idea for automation, but fine for human control. I found ones that had distinct on and off buttons and then reverse-engineered the protocol so I could control them with automation. I then sent each command 3 times. That provided really good reliability for me and was never an issue. There were still lots of ways for it to fail; I just never saw those failures. I did see failures with the sensors.

With the ESP8266 and ESP32 based devices, I generally have them fail in a “safe” way, which means turning off power. Of course, for a freeze protection system that is still a failure if there are freezing temperatures. For heaters, I put in high temperature cut-outs so that if a relay fails with its contacts shorted nothing gets too hot. It all depends on the consequences. An LED being off when it should be on is annoying; only you can decide what it is worth to make that not happen.

Toggle is just what I used to demonstrate that it’s unreliable.
The LED is also just what I used, and it demonstrates that it’s unreliable.

Without watering, things die; watering too much is fine… With no support from the OS it’s a hard call to program this so it fails “safe”.

Sending the command three times is OK, but I only want it to happen once, as my buttons add 5 minutes to the solenoid on time. If it’s hot I press three times, which will now be 9… Still, everything is solvable. I just didn’t realise the limitations of ESPHome and that its commands were not reliable (in the network TCP sense).

I can’t see a reliable way to program around failures with no indication from the OS.

You are likely barking up the wrong tree. If ESPHome receives a command, it executes it. Or you have f…ed up the configuration.
Obviously, if ESPHome doesn’t receive the command, it’s not aware of it. So the problem has to be resolved somewhere else.

Feel free to post your code and logs.

I would suspect that the issue is more likely with the WiFi latency/connection than with the ESPHome devices themselves. I had some TP-Link Deco mesh devices that would constantly drop connections. Consumer WiFi gear is built to a lower standard than professional gear. Upgrading to Unifi gear solved a lot of issues for me, but not everyone can afford this or has the skills.

Turn on and check the WiFi signal strength of the devices having issues. Cheap smart devices with low quality WiFi chips or poorly written drivers can cause issues for all devices on the network.

If you need a reliable action, do it locally (on the ESP device).
If it is simply an action to toggle every 5 seconds, just add an automation in the ESP YAML to make the action occur. On the HA side you should simply have it log whether the action did or did not occur, and maybe add an automation to notify you when a certain number of misses occurs within some time period (determine a failure rate and set the automation to trigger on that).
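
As a minimal sketch of doing the toggle on the device itself, assuming the `d2` switch id from the configuration posted in this thread, an ESPHome `interval` block could look something like the following. It only illustrates the idea and has not been tested against this setup.

```yaml
# ESPHome device YAML (sketch): toggle the on-board LED locally,
# with no dependency on Home Assistant or the network.
switch:
  - platform: gpio
    pin: GPIO02
    name: "On board D2"
    id: d2

interval:
  - interval: 5s
    then:
      - switch.toggle: d2
```

Home Assistant still sees every state change of the switch, so it can be used purely as a monitor.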

I have Ubiquiti AC Pro access points in the room, about 5 metres away.

But regardless ,
the command does not get actioned
3 other devices in the same room do work
5 seconds later the command works
I could not find an error logged

How do I log that the action did not occur?

The device configuration for the relevant section

switch:
  - platform: gpio
    pin: GPIO02
    name: "On board D2"
    id: d2

The YAML automation

  • I go and edit the area of each device to move them into the target area
alias: Every Minute toggle D2
description: ""
triggers:
  - trigger: time_pattern
    seconds: /5
    minutes: "*"
    hours: "*"
conditions: []
actions:
  - action: switch.toggle
    metadata: {}
    target:
      area_id: espdevelop
    data: {}
mode: single

One of the complete ESPHome definitions (the substitutions are what changes between them)

substitutions:
  mac_suffix: "60cd54"
  friendly_name: Temp-$mac_suffix
esphome:
  name: esphome-web-$mac_suffix
  friendly_name: ${friendly_name}
packages:
  esp32oom: !include
    file: .esp32oom.yaml
    vars:
      mac_suffix : ${mac_suffix}

and the full .esp32oom.yaml file

logger:
    level: debug
api:
    encryption:
        key: !secret api_encryptionkey

esp32:
  board: esp32dev 
  framework:
    type: esp-idf
    advanced:
      minimum_chip_revision: "3.1"
ota:
  - platform: esphome
    password: !secret ota_password

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "RJWESP_${mac_suffix}"
    password: !secret wifi_password
captive_portal:

time:
  platform: homeassistant
  id: homeassistant_time
  on_time_sync:
    then:
      - logger.log:
          format: "Time Sync"
          level: INFO

switch:
  - platform: gpio
    pin: GPIO02
    name: "On board D2"
    id: d2
binary_sensor:
  - platform: gpio
    pin: GPIO0
    name: "On board Boot"
    id: boot

Did you check the HA automation traces? I would expect this to log an error somewhere. HA does know whether the command was received and executed or not.

Personally I’m also missing some more powerful options or methods to adjust how HA should react to such errors, but it’s possible the methods are there and I just don’t know them. In fact, I find it annoying that some errors may stop execution of the rest of the automation.

One thing that I found very suspicious about this example is that I find it unlikely the errors would happen so easily in these circumstances. WiFi protocol is built around the fact that some packets might be lost. With your setup I would expect it to be very reliable. I use quite a lot of WiFi devices and for local devices it just does not happen that a command is not executed (cloud shit is of course completely unreliable and the source of errors I mentioned). But, well, I haven’t done tests like this.

When you say the leds were out of sync, do you mean they would still switch at the same time, but some would at that moment turn on and some turn off, or they were not switching at the same moment at all?

I would suggest a new test with an extra automation that checks, a couple of seconds after the first one, whether all the devices’ LEDs have the same status, and switches a helper boolean if not. Then you will have the exact time when this happened and can check the logs more easily for that time. You’ll probably also need to increase the number of stored traces, as the default is 5.
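
A rough sketch of such a check automation is below. The entity ids (`switch.temp_60cd54_on_board_d2`, `switch.temp_2fd404_on_board_d2`) and the helper `input_boolean.d2_out_of_sync` are hypothetical names made up for illustration; substitute the real ones. It assumes the toggle automation fires on the same `/5` second pattern.

```yaml
# Home Assistant automation (sketch): two seconds after each toggle,
# flag the helper if some LEDs are on while others are not.
alias: Check D2 sync
trace:
  stored_traces: 50          # keep more than the default 5 traces
triggers:
  - trigger: time_pattern
    seconds: "/5"
actions:
  - delay: "00:00:02"
  - variables:
      leds:                  # hypothetical entity ids
        - switch.temp_60cd54_on_board_d2
        - switch.temp_2fd404_on_board_d2
      on_count: >-
        {{ expand(leds) | selectattr('state', 'eq', 'on') | list | count }}
  - if:
      - condition: template
        value_template: "{{ 0 < on_count < (leds | count) }}"
    then:
      - action: input_boolean.turn_on
        target:
          entity_id: input_boolean.d2_out_of_sync   # hypothetical helper
mode: single
```

The helper's history then gives the time of the first mismatch, which narrows down which traces and log lines to read.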

As I said, I would expect automation trace to log the error. Now the question is, how to use that error logging in the automation.

Judging by this thread, there is currently no event fired when automation fails:

Well, you seem to be on the same page as what I am talking about. I start with all LEDs off and enable the automation: all LEDs turn on, 5 seconds later off, and so on. I leave, and when I come back some time later, say 10 minutes, some of the LEDs are on and some are off, but all are still toggling every 5 seconds. With only four ESP32s this happened in maybe 20 minutes; with 14 ESP32s it happened within 10 minutes.
Over time many different ESP32s go out of sync; it’s not always the same ones. At one stage I had 8 on and 6 off, but all still toggling every 5 seconds.

I would say I seldom see the miss happen; that is, I see each individual ESP32’s LED toggle every 5 seconds.

Overall I’m not surprised/concerned (not sure of the exact word); I can program around it.
I’ve not found anything in the system log. I have turned logging down on the ESP32s to info. But this is an HA error, and HA should log errors to the end user when they occur; a missed command is an error.
I do get errors in the system log:

Logger: aioesphomeapi.connection
Source: /usr/src/homeassistant/homeassistant/components/esphome/manager.py:513
First occurred: February 2, 2026 at 12:44:09 PM (38 occurrences)
Last logged: February 2, 2026 at 6:01:59 PM

esphome-web-2fd404 @ 192.168.67.22: disconnect request failed
esphome-web-2e9ec0 @ 192.168.67.32: disconnect request failed
esphome-web-a7cd44 @ 192.168.67.153: disconnect request failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/asyncio/selector_events.py", line 1005, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "aioesphomeapi/connection.py", line 1106, in aioesphomeapi.connection.APIConnection.disconnect
  File "aioesphomeapi/connection.py", line 904, in send_message_await_response
  File "aioesphomeapi/connection.py", line 886, in aioesphomeapi.connection.APIConnection.send_messages_await_response_complex
aioesphomeapi.core.ReadFailedAPIError: [Errno 104] Connection reset by peer

I assume this is it, but as you say, what do I do with that?
I’ll go and search for what you refer to as “automation traces”.

Thanks

I have never had such connection errors from esphome in my system logs. Not sure if this is expected to happen occasionally or maybe you just stumbled upon some error in esphome.

Maybe the issue is indeed there’s a lot of packet loss, because the devices are close to each other and the command is sent at the same time to all of them? But that’s just my guess.

Either way, would be nice if there was a way to catch such errors within the automation and react on them, but I’m not aware of any. I suspect the automation stops and fails when this happens, to the point that if you had any other actions after this in the automation, they would not be done.

OK, I knew there was at least a little bit more about this somewhere. The scripts documentation mentions an option (which can be used in automations as well, as scripts are just the action part of an automation) that allows ignoring errors:

So at least this could be used with some extra check within the automation to see if the device reacted.
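
For illustration, here is a hedged sketch of how that could look, on the assumption that the option meant is `continue_on_error`, combined with a follow-up check that writes to the HA log via `system_log.write`. The entity id is a hypothetical name, and the one-minute trigger is only there to make the example complete.

```yaml
# Home Assistant automation (sketch): do not abort on a failed call,
# then verify the result and log a warning if the switch is not on.
alias: Turn on D2 and log a miss
triggers:
  - trigger: time_pattern
    minutes: "/1"
actions:
  - action: switch.turn_on
    continue_on_error: true
    target:
      entity_id: switch.temp_60cd54_on_board_d2   # hypothetical entity id
  - delay: "00:00:05"
  - if:
      - condition: template
        value_template: >-
          {{ not is_state('switch.temp_60cd54_on_board_d2', 'on') }}
    then:
      - action: system_log.write
        data:
          level: warning
          message: "D2 on temp_60cd54 was not on 5 seconds after turn_on"
mode: single
```

Using an explicit `switch.turn_on` rather than `switch.toggle` is what makes the check meaningful, since the expected end state is known.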

Thanks very much - I am still coming to terms with the way the documentation is organised, so this looks like what I am after -
It looks like I need to make the automation a bit more specific. From what I can see it should fail by default, but maybe toggling a whole lot of items when only one fails is not caught as a failure. More to try.

I think it should fail even in such case. I think it even fails if the target is a group entity and one of the elements of the group errors.

HA can still see the state of the switch, or whatever you set up in the ESP device YAML.

Say you have an automation in the ESP device YAML that turns the switch on/off at intervals. In HA this switch will turn on/off as the ESP device switches it.

You can use that change in HA as a trigger or condition for an automation.

Now let’s presume there is an issue preventing HA from seeing this change reliably. There are a few options but as I’m writing I would

Create sensor in esp device yaml
Increment sensor 1 time for each ON/OFF
HA will have this sensor data as well and this may be used as a secondary trigger condition

If not on in XX minutes and count < xx, do the action.

The sensor would reset daily.
Even though it resets, HA’s historical data will show the increments, so you can review them and possibly add a trigger to notify you when it’s not reaching the xx count daily.
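
One way this counter idea might be sketched in ESPHome is shown below. The names `d2_change_count` and "D2 change count" are made up for the example; the `d2` switch and `homeassistant_time` ids are taken from the configuration posted earlier in the thread. Treat it as an untested outline.

```yaml
# ESPHome device YAML (sketch): count every on/off change locally
# and expose the count to Home Assistant; reset at midnight.
globals:
  - id: d2_change_count          # hypothetical name
    type: int
    restore_value: no
    initial_value: '0'

switch:
  - platform: gpio
    pin: GPIO02
    name: "On board D2"
    id: d2
    on_turn_on:
      - lambda: 'id(d2_change_count) += 1;'
    on_turn_off:
      - lambda: 'id(d2_change_count) += 1;'

sensor:
  - platform: template
    name: "D2 change count"
    accuracy_decimals: 0
    update_interval: 60s
    lambda: 'return (float) id(d2_change_count);'

time:
  - platform: homeassistant
    id: homeassistant_time
    on_time:
      - seconds: 0
        minutes: 0
        hours: 0
        then:
          - lambda: 'id(d2_change_count) = 0;'
```

Comparing this count across devices shows which ones missed commands, independent of what HA believes it sent.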

Just quick idea with no context of other monitor methods for your needs.

Point is, HA not needed for main function and better to use it as monitor for critical function vs main command/control

I don’t think the point is to trigger switch every X seconds, that was just a test to see reliability. The actual automation may need to be done within HA for various reasons.

Are you locking the ESP32 to a specific AP in Unifi? Are you using ESPHome 2026.1.x?

There is a new feature in ESPHome in the 2026.1.x release called post-connect roaming. This can cause connection issues when using Unifi’s Lock AP feature. Connecting to a specific AP is ultimately up to the client and not the AP, so if the ESP32 doesn’t like the AP it is locked to for some reason, it will try to connect to a different one but Unifi will kick it off and try to get it to connect to the AP it is locked to. Bit of an ouroboros situation, but it is easily remedied by either disabling the Unifi Lock AP on the client or by disabling post-connect roaming in the config.

Even if the device isn’t locked to an AP in Unifi, you may try disabling the post-connect roaming to see if that improves your device’s connection to the network.
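
As a sketch only: my understanding is that the option lives under the `wifi:` component and is called `post_connect_roaming`, but that name is an assumption here and should be checked against the ESPHome WiFi documentation for your release before use.

```yaml
# ESPHome device YAML (sketch). The option name below is assumed,
# not confirmed in this thread; verify it in the WiFi component docs.
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  post_connect_roaming: false
```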

Good point. I thought the test was the desired function.
The test is not good.

They are out of sync because they don’t have a common clock, and of course they cannot be triggered at the same time, since it is not a parallel trigger and network delay differs between devices (which is why time sync matters).

You could still create a sensor that increments on the device and compare the counts at the end of the day to determine whether all devices received the same number of triggers.

example

button:
  - platform: template
    name: ${name} deepsleep
    on_press:
      - logger.log: Sleep Pressed

This adds info to the ESP device logs.

Fascinating experiment. Just out of curiosity, what if you write that automation using parallel and list the individual entities? And use a toggle helper to track what state to set – so you are explicitly turning them on or off every other interval.
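
Roughly, that variant might look like the sketch below. The helper `input_boolean.d2_target` and the two switch entity ids are hypothetical; the point is that each interval sends an explicit `turn_on` or `turn_off`, so a missed command is corrected on the next matching interval instead of inverting the LED permanently.

```yaml
# Home Assistant automation (sketch): flip a helper, then drive every
# switch explicitly to the helper's state, in parallel.
alias: Every 5 seconds set D2 explicitly
triggers:
  - trigger: time_pattern
    seconds: "/5"
actions:
  - action: input_boolean.toggle
    target:
      entity_id: input_boolean.d2_target          # hypothetical helper
  - parallel:
      - action: "switch.turn_{{ states('input_boolean.d2_target') }}"
        continue_on_error: true
        target:
          entity_id: switch.temp_60cd54_on_board_d2   # hypothetical
      - action: "switch.turn_{{ states('input_boolean.d2_target') }}"
        continue_on_error: true
        target:
          entity_id: switch.temp_2fd404_on_board_d2   # hypothetical
mode: single
```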

It’s not clear to me whether what you are seeing is latency or dropped API calls. If you power off one of the ESP32s, does HA log the connection error? It’s a TCP connection.

My irrigation system is run by a 24 relay ESP32 board. I assume things can fail. I have the ESP turn off the relay after some amount of time to avoid using too much water. That’s the best fail-safe I have as it doesn’t depend on HA being up.
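
A minimal sketch of that kind of device-side auto-off in ESPHome follows. The pin, the ids `valve_1` and `valve_1_auto_off`, and the 5 minute limit are placeholders chosen for the example, not details of the setup described above.

```yaml
# ESPHome device YAML (sketch): the relay turns itself off after a
# fixed time, whether or not Home Assistant is reachable.
script:
  - id: valve_1_auto_off
    mode: restart              # a new "on" restarts the countdown
    then:
      - delay: 5min
      - switch.turn_off: valve_1

switch:
  - platform: gpio
    pin: GPIO04                # placeholder pin
    name: "Valve 1"
    id: valve_1
    restore_mode: ALWAYS_OFF   # come up closed after a reboot
    on_turn_on:
      - script.execute: valve_1_auto_off
    on_turn_off:
      - script.stop: valve_1_auto_off
```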

I also have a smart switch as a master kill-switch that runs off the power to the ESP32/relay board. I turn that on and it has its own auto-off. Then I have a flow meter that will alert me if water is being used when it shouldn’t or if water is not running at the expected flow rate when a given valve is open. And when I turn a valve on I wait for the ESP to report that the valve is in an on state. Of course, you wouldn’t use a toggle for controlling irrigation valves.
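
The "turn on, then wait for the ESP to report on" step could be sketched as a Home Assistant script like the one below. The entity id `switch.irrigation_valve_1`, the 10 second timeout and the limit of 3 attempts are all assumptions for illustration. It relies on `switch.turn_on` being safe to repeat, which would not hold for a button that adds time on each press.

```yaml
# Home Assistant script (sketch): send turn_on, wait for the device
# to confirm, retry up to 3 times, and notify if it never confirms.
script:
  open_valve_confirmed:
    alias: Open valve and confirm
    mode: single
    sequence:
      - repeat:
          sequence:
            - action: switch.turn_on
              continue_on_error: true
              target:
                entity_id: switch.irrigation_valve_1   # hypothetical
            - wait_template: "{{ is_state('switch.irrigation_valve_1', 'on') }}"
              timeout: "00:00:10"
              continue_on_timeout: true
          until:
            - condition: template
              value_template: >-
                {{ is_state('switch.irrigation_valve_1', 'on')
                   or repeat.index >= 3 }}
      - if:
          - condition: template
            value_template: "{{ not is_state('switch.irrigation_valve_1', 'on') }}"
        then:
          - action: persistent_notification.create
            data:
              title: Irrigation
              message: "Valve 1 did not report on after 3 attempts"
```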