Just to add to my earlier comment: I went back to stock UniFi firmware this afternoon and enabled “Optimize IoT WiFi Connectivity” in the settings, and the problem went away for all of my devices except one, which was also fixed once I added wifi: power_save_mode: light.
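For anyone searching later, the setting I mean goes in the device’s ESPHome YAML (the ssid/password secrets here are placeholders):

```yaml
wifi:
  ssid: !secret wifi_ssid          # placeholder
  password: !secret wifi_password  # placeholder
  # "light" lets the radio sleep between DTIM beacons;
  # ESPHome's default on the ESP8266 is "none"
  power_save_mode: light
```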
AFAIK “Optimize IoT WiFi Connectivity” just sets the “Delivery Traffic Indication Message” (DTIM) interval to 1, which I tried setting on OpenWrt as well, but it didn’t make any difference to reliability.
So far my guess is that ESPHome + OpenWrt on ath79k just doesn’t play nice.
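For anyone wanting to replicate that DTIM test on OpenWrt, I believe the hostapd setting is exposed as dtim_period in /etc/config/wireless (the interface section name here is just an example), applied with the wifi command afterwards:

```
# /etc/config/wireless
config wifi-iface 'iot'
        option dtim_period '1'   # DTIM interval of 1, like the UniFi toggle
```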
Just a reminder, there’s a known problem.
(I have my own suspicions as to the cause, but who am I to say it’s the timing-intolerance of protobuf as a transport?)
I added ESPHOME_DASHBOARD_USE_PING=true and restarted ESPHome to bypass the use of mDNS. All my nodes stay online. It appears that something has changed with mDNS since previous versions.
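In case it helps others: if you run the dashboard standalone rather than as the Home Assistant add-on, the variable can be set in the environment before launching (the config path here is just an example):

```shell
# Use ICMP ping for the online/offline status instead of mDNS
export ESPHOME_DASHBOARD_USE_PING=true
esphome dashboard /config/esphome
```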
My nodes drop off the network like dead flies after the last few updates. What is going on? Flashing takes more and more time, the binaries are getting bigger and eating into free memory, and, as I said, the nodes are terribly unstable at the moment. Secret passwords and credentials don’t seem to be the way to go either; you even need to hardcode your network settings to get them at least somewhat stable.
And does anyone know why we have had so many updates lately, several in a single week? What is going on?
There’s lots of discussion here looking at routers or network configurations. I have five ESPHome devices that experience dropouts and cannot be updated, and two more that just experience dropouts. All of these devices were great with Tasmota and had no issues.
I’m sure it’s not WiFi related for me. The controller and all my devices are on the same 192.168.2.0/24 subnet with a dedicated WiFi SSID. Sometimes, though, both Home Assistant (on a dedicated Raspberry Pi) and its ESPHome plugin show my device as “unavailable” or “offline”, respectively.
However, I can ping the unit from elsewhere on the network (including the 192.168.1.* subnet) and even have the ESPHome UI connect to the device to show logs. Oh, and the automation does work, with the switch/lamp turning on and off according to its schedule.
One thing I found was that /etc/resolv.conf had bad settings: both a wrong default domain and the resolver set to a localhost address (127.0.0.11). Changing that allowed the HA device to ping my device without specifying an FQDN, but it still shows as “unavailable”.
Rebooting the plug unit didn’t help. Restarting HA did, and it then showed the device as “available”, though the ESPHome plugin did not.
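For context, this is the shape of the /etc/resolv.conf fix I mean (the domain and nameserver are placeholders; use your own values):

```
# /etc/resolv.conf
search home.lan            # placeholder: your local DNS domain
nameserver 192.168.2.1     # placeholder: your LAN's DNS server
```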
After watching innumerable threads on here follow the same path this one does (try this, try that), with each person seeing what appears to be success, I’ve come to believe that we are chasing something similar to a race condition in threaded logic. It can appear to be fixed, but will fail again when conditions become just right (or wrong).
That is to say, each of the things suggested is a good, valid possible cause, but I have come to believe that none of them is the true root cause here. Each one improves conditions just enough for that environment to begin to seem stable. But correlation not being causation, it’s easy to think it was the true solution.
Then step over to another of a dozen threads and notice how many times that same tweak solved nothing.
Personally, I was able to evade the problem and achieve a stable link by getting rid of the ath10k-driven WiFi AP I was using (a TP-Link A7 v5 under OpenWrt); the replacement (a Cudy WR1200, same OpenWrt but a different chipset and drivers) has delivered rock-solid stability from the ESP devices ever since.
Again, correlation not proving causation, but looking temptingly like it did.
On many problem-solving threads we fail to first get a clear inventory of what devices (and driver software versions) are involved, and start conjecturing before having all the pieces.
It stands to reason that if we had those details, and compared them with others experiencing the same or similar problem, we’d start to clearly see commonalities in the problem reports, and then be able to statistically justify a correlation as the causative element.
It could save some time, at least.
Hi all,
I am suffering from this issue too. I have both ESPHome- and Tasmota-flashed devices. The Tasmota ones never become unavailable. The ESPHome ones drop randomly: sometimes it affects several, sometimes only one, not always the same ones, but in most cases the same ones… a nightmare.
Thank you for all the valuable analysis of possible causes, but this has been happening for months now, and it is still not resolved.
I think we could come up with a workaround to at least reduce the impact of the issue.
In my case (and for a lot of people in this thread as well), the device becomes available again if I manually reload it under Settings >> Devices >> ESPHome >> [esphome device name], then the three dots and Reload.
It is fast, and the device comes back to life immediately. So the workaround I am proposing is to perform that reload operation for all ESPHome devices on a regular basis (crontab maybe? or an automation that periodically triggers that reload action).
The problem is that I cannot find any service to reload an entity. Does anybody know how this could be accomplished?
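In case it helps, one candidate (a sketch I haven’t battle-tested, so please verify) is Home Assistant’s homeassistant.reload_config_entry service, which can be targeted at any entity belonging to the device. A periodic automation could look like this (the entity name is a placeholder):

```yaml
automation:
  - alias: "Reload flaky ESPHome device when unavailable"
    trigger:
      - platform: time_pattern
        minutes: "/30"                        # check every 30 minutes
    condition:
      - condition: state
        entity_id: switch.my_esphome_plug     # placeholder entity
        state: unavailable
    action:
      - service: homeassistant.reload_config_entry
        target:
          entity_id: switch.my_esphome_plug   # reloads the device's whole config entry
```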
Thank you in advance,
Miguel
ESPHome (and not Tasmota) talks to Home Assistant over a native API whose messages are encoded with “protocol buffers”, or “protobuf” (strictly speaking, protobuf is just the serialization format; the timeouts live in the API layer built on top of it).
Having worked with a wide variety of protocols and stacks over 3 decades, I feel qualified to opine (and it’s just a guess and opinion, I would accept correction happily) that protobuf is simply bad at tolerating latency or lost packets.
Both those things happen when a wifi signal is weak.
Rather than allowing the TCP layer to handle these problems (which is what it exists to do), that API layer appears to time out quickly and thus decides the connection is lost.
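To illustrate the distinction with a toy sketch (this is not ESPHome code, just the general pattern): TCP will keep retransmitting through a glitch for a long time, but an application that sets its own short read timeout gives up much sooner and declares the peer gone:

```python
import socket
from typing import Optional

def read_with_app_timeout(sock: socket.socket, timeout_s: float) -> Optional[bytes]:
    """Return received data, or None if the application-layer timeout fires.

    TCP itself would keep retransmitting for minutes on a lossy link; this
    gives up after timeout_s seconds, roughly what an aggressive
    keepalive/timeout in an API layer does.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None  # the app layer declares the connection "dead" here

# Demo with a local socket pair: silence trips the short timeout even
# though the underlying connection is perfectly healthy.
a, b = socket.socketpair()
print(read_with_app_timeout(a, 0.2))   # nothing sent yet -> None
b.sendall(b"ping")
print(read_with_app_timeout(a, 1.0))   # -> b'ping'
```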
It can be very difficult to convince someone that their preferred protocol stack is flawed. I would also be difficult to convince.
As often as this has been reported, one might think a trend is visible enough to warrant attention.
Except that in many cases some random other action, like moving the device, appears to fix the problem, so the thread dies because the correlation is taken as causation. The underlying cause is never truly found and fixed; the surrounding factors were merely changed.
Sorry to rant. I fixed mine by replacing the wifi AP. But I remain certain that if I re-added some entropy to the signal path the ESPs would start randomly disconnecting without logging errors.
I also have this problem in my setup. I have commercial-grade WiFi gear from Ubiquiti, a dedicated NUC for Home Assistant, and several ESP8266 boards lying around. All my ESPHome devices drop and become unavailable several times a day, while all the others, running Tasmota or custom firmware, have NEVER had a single drop and have uptimes of months on my network. I can live with some unavailability from ESPHome, but the Home Assistant history for those devices gets really messed up… Is it possible to increase the timeout or the number of retries before a node is declared dead to Home Assistant?
It would be great if someone could solve this problem definitively, because it has been on the shelf for more than a year…
It is possible that your ESPs have a bad voltage regulator. A lot of cheap (and not-so-cheap) boards use voltage regulators that cannot stably supply the current peaks the WiFi module generates, and that results in unstable connections.
You could take a look at your boards, find the voltage regulator, and check its datasheet. Another way to test is to provide the 3.3V directly.
In theory the ESP32 can peak at 500mA (in practice it should be lower). The ESP8266 seems to require 320mA tops. Sadly, the voltage regulators often deliver less (some of my boards get brown-out boot loops).
Or higher in practice: I think an esp82xx can reach up to 700-800mA.
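Those peak currents also translate into heat in the regulator. As a rough worked example (the numbers are illustrative assumptions; check your own board’s datasheet), a linear regulator dropping the usual 5V USB rail to 3.3V during a 500mA burst has to dissipate:

```python
# Linear (LDO) regulator dissipation during a WiFi TX burst.
# All values are illustrative assumptions, not measurements.
v_in = 5.0    # V, typical USB supply
v_out = 3.3   # V, the ESP's rail
i_peak = 0.5  # A, the worst-case ESP32 burst mentioned above

p_loss = (v_in - v_out) * i_peak   # heat dissipated in the regulator itself
print(f"{p_loss:.2f} W")           # -> 0.85 W
```

That is quite a lot for a small regulator package with little copper around it, which is one way a marginal board ends up sagging or overheating under load.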
I own over 100 ESP modules and boards, probably all grey-market stuff; I don’t think I have a single “brand” one, only the cheap stuff.
That said, I have never experienced brown-out boot loops on any of them. What I do to avoid “overloading” the onboard LDO is simply to drive all peripherals (that support it) directly from 5V. That’s no problem because the ESPs’ GPIOs are tolerant of it, and the integrated snap-back circuit should only trigger above 6V on the GPIOs. The 3.3V input pin obviously isn’t tolerant at all and expects 3.3V (±0.3V), if I’m not mistaken.
I’m curious why, when the post reports that “it worked until I changed from Tasmota to ESPHome”, there would be any reason to suspect the power supply.
The hardware/power didn’t change, but the firmware did.
Which reminds me of a very old axiom (i.e. from the early 80’s or before) that went something like “nothing scarier than a programmer who carries a screwdriver.”
This of course means that hardware engineers will tend to blame the software/firmware, and the reverse is likewise true. It’s been my experience in the ensuing decades that it is true (and I, a programmer, also carry a screwdriver, which only means that it’s always my problem to fix).
It was good wisdom then. Is it still?
It might be due to the fact that Tasmota uses light power saving on the esp82xx, while ESPHome by default uses no power saving on the esp82xx (for improved range and stability).
I remember one case on the ESPHome Discord where a person had a ready-made device (I think it was some kind of smart socket), and after flashing ESPHome the device wasn’t “stable” anymore and “spontaneous” reboots happened. After switching power_save_mode to light, the instabilities went away.
Something to keep in mind when one can’t “influence” the PSU and is limited to a rather “undersized” one.