Device going unavailable

pjn77 · October 17, 2021, 4:27pm

I set up an esp32 based ‘home assistant glow’ and it’s been working solidly for several weeks now until the last couple of days when it seems to be dropping off the network. I don’t think it’s the network itself as nothing else has changed, and no other connected device has problems.

It’s becoming frequently unavailable - with the web service on it becoming unresponsive. It’s still ‘on’ as the LED activity light is flashing as if it’s fully working. Powering off and on again brings it straight back up.

I’ve never had to investigate a problem with one of these. What’s the best way to troubleshoot this and find the cause? Is this a known problem?

glyndon · October 17, 2021, 4:36pm

You’re not alone. There have been several reports of random ‘unavailable’ moments for ESPHome devices on here. I’m currently suffering a bout of them myself. Try a search for “became unavailable esphome” and you’ll find several.
Seems like (for me at least) despite trying all the fixes, the problem still surfaces, and is very hard to isolate. When I think something fixed it, it’ll be fine for a while (hours, days, weeks…) then suddenly it’s happening again.
My current theory is that the WiFi timing parameters in [the Arduino WiFi stack used by] ESPHome are marginally incompatible with those in at least one of my APs (I have two different kinds, one running OpenWRT, one running FreshTomato). When I reboot the OpenWRT AP, the problem goes away for a while. If I switch off its radio and kick all the devices to the FreshTomato AP, they also seem to have a more stable link to HA. But that’s just anecdotal and I would not call it a solid causal conclusion yet. Just mentioning it here because it’s an approach not often suggested for this syndrome.

vincen · October 17, 2021, 4:46pm

I would check the power supply first (these are the usual symptoms of an underpowered supply).

pjn77 · October 17, 2021, 5:17pm

Thanks @glyndon for letting me know I’m not alone - though not sure if that’s better or worse.

@vincen I’ll check that - though that wouldn’t explain why it’s been running fine for weeks!

fantangelo · October 17, 2021, 11:54pm

Same here. After many months of solid wifi connection on 1 of my ESP32 esphome devices I am now getting regular disconnects. The only change has been an update to HA core 2021.10.5

pjn77 · October 18, 2021, 7:39pm

Well I’ve got a fix … for now - I have two esp8266 powered from one usb charger (which is supposed to supply enough power for both) plugged into a smart switch which now switches off and on should either go unavailable.

glyndon · October 18, 2021, 8:58pm

This issue, and the fact that we don’t seem to be seeing problems with all WiFi devices like tv streams, etc. leaves me wondering if the HA-ESPhome protocol is just way too fragile to tolerate random but normal levels of latency and loss on a wifi link.
(I’ve not looked closely to see if it’s over TCP or UDP, or if they wrote their own layer4 protocol.)

fantangelo · October 18, 2021, 10:02pm

I also think that it is the HA-ESPhome protocol that is the problem and not wifi.

glyndon · October 18, 2021, 11:08pm

Seems that it runs over tcp/6053, which should make it pretty immune to random packet drops and the like.
Which tells me that perhaps the HA-ESPHome protocol has some built-in time-dependencies that are too tight or should not exist at all (i.e. they should be letting layer 4 handle those).
(Or it’s a flaw in the Arduino IP stack that’s flapping when there’s too much lower-layer latency.)

TCP will (usually) always deliver, or [eventually] report session breakdown. TCP might deliver in-session data late (e.g. if a retry or two was needed at layer 4 or below), so whatever rides TCP should [usually] pretend time is not a factor and just let the TCP session/pipe do its job.

I’m no expert on the OSI model nor the IP protocol, but I do get the idea that a session, like a phone call, is either open or closed and if open you might just have to wait for the person at the other end to say something. But it’s not the phone’s fault if that delay is frustrating.
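
One way to separate "the TCP session cannot be opened at all" from "something riding on top of TCP is timing out" is to time a bare TCP connect to port 6053 from the HA host. Below is a minimal Python sketch of that idea, not anything taken from ESPHome or HA; the function name `probe_tcp` and the 5-second default timeout are assumptions chosen for illustration.

```python
import socket
import time


def probe_tcp(host, port=6053, timeout=5.0):
    """Try to open a plain TCP connection and report how it went.

    Returns (ok, detail). This only tests layer 4 reachability; it does
    not speak the ESPHome API protocol at all.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") if the session cannot be opened.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return True, f"connected in {elapsed:.3f}s"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

For example, calling `probe_tcp("192.168.1.50")` (a placeholder address) once a minute and logging the result would show whether the device still accepts connections during the moments HA marks it unavailable. If the connect succeeds while HA still reports a timeout, that would point above layer 4 rather than at the WiFi link.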

pjn77 · October 19, 2021, 11:31am

It’s not been 24 hours yet but since shuffling round my chargers everything seems stable. One of my nodes connected to a certain charger was failing constantly; on another it’s solid. But another node connected to that same problematic charger is running fine.

I think maybe some combinations of esp and sensors are just more sensitive to power issues than others?

edit: spoke too soon, one of them dropped off

glyndon · October 19, 2021, 1:40pm

Weird new evidence here: After watching one device (only one - not always the same one - is usually affected out of about 20, the rest of which are fine) having the random ‘unavailability’ problem for a day or two, I power-cycled it (soft-restarting it didn’t affect anything).
Now, the problem has jumped to a different device, and the power-cycled one is no longer acting up.
Then I remembered the similarity of this syndrome to cases where a LAN has multiple DHCP servers - a very bad thing.
Multiple DHCP servers will also create ‘ghost’ problems that seem to jump from system to system.
So, I’m looking at mDNS (avahi, aka Bonjour) and DHCP data to see if the device suffering the unavailability problem is defined differently in different hosts’ avahi databases.
e.g. depending on how a mDNS query by HomeAssistant gets resolved, it may be that HA finds itself looking for a device at the wrong IP Address momentarily, declares it ‘unavailable’, and then tries again but this time gets the proper IP address from mDNS and voila! The device is magically available again.
This fits the common description where logs (on the device, the WiFi, the HA host) show no link-layer problems.
I’ll let you know if I come up with anything…
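
A rough way to check the "wrong IP address momentarily" idea is to resolve the device name repeatedly and flag any moment where the answer changes. The Python sketch below assumes the host's normal resolver can answer `.local` names (for example via avahi and nss-mdns); HA does its own lookups, so this only approximates what HA sees. The helper names `resolve_ipv4` and `find_changes` are made up for this example.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the OS resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def find_changes(samples):
    """Given [(timestamp, addresses), ...], return the points where the
    resolved addresses differ from the previous sample."""
    changes = []
    previous = None
    for stamp, addrs in samples:
        if previous is not None and addrs != previous:
            changes.append((stamp, previous, addrs))
        previous = addrs
    return changes
```

Collecting `(time.time(), resolve_ipv4("mynode.local"))` every few seconds into a list (the node name is a placeholder) and passing it to `find_changes` would show whether the name ever resolves to a different address around the time of an 'unavailable' event. An empty result over a period with dropouts would count against the mDNS theory.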

pjn77 · October 19, 2021, 5:20pm

I like your thinking with this. A bit of googling shows that you can set a config in the esphome addon to use ping for status instead of mdns queries. I’ve enabled it and just need to wait and see if it makes any difference.

glyndon · October 19, 2021, 5:57pm

The device that’s being flaky today (for about the past 16 hours at least) has stopped being randomly unavailable after unplugging it for about 5 minutes or more. Just unplugging for a few seconds didn’t cure it, but the longer duration seems to have.
Now, rather than wonder “what might be persisting inside the device?”, I’m drawn even more strongly to wondering “did that longer duration cause it to age out of the mDNS cache, so that when it reappeared on the net, mDNS was ‘happier’ with it?”

Mikefila · October 19, 2021, 6:12pm

Do you have static IPs set on the nodes, specifically inside the esphome sketch? DHCP has long been known to cause random connection problems with some nodes.

glyndon · October 19, 2021, 6:43pm

Presuming you’re asking the OP, but I’ll contribute that all of mine are supplied their addresses by DHCP.
However, for the moment all seems happy, even the one that was a problem earlier has now stabilized.
So my next thing is to see about periodically clearing the HA system’s avahi mDNS cache.
I’m looking into the idea that mDNS gets slightly corrupted at times and misresponds for a device’s name at random times. When that device ages out of its cache, another one becomes flaky as the corruption shifts to affecting a different cache record.
It’s a pretty wild hypothesis, but I have time to chase down this one, since none of the other solutions have worked.
Right now, my test is to restart avahi-daemon (I’m running HA Core on a Python venv on a RasPi4).
So far, no mysterious ‘unavailabilities’ have popped up. Time will tell…
If the glitches recur, I’ll try something like having cron restart avahi-daemon every night, and see.

mwolter · October 20, 2021, 4:14am

My 2 cents, might be all it’s worth, is that Home Assistant and ESPHome rely far too much on mDNS and WiFi. mDNS causes far too much “chatter” on the network and that gets compounded with WiFi’s low bandwidth and susceptibility to interference and collisions. Especially when vlans aren’t used to separate broadcast domains.

I’ve disabled all “auto discovery” mechanisms in home assistant, disabled mDNS in ESPHome and all ESPs, and moved as many devices as possible to some other protocol (Ethernet, zwave, rf433, rfm69 915mhz) that doesn’t use 2.4ghz or 5ghz. The only devices broadcasting on WiFi are Apple products so they can discover each other. Everything has a static dhcp reservation in the router so there’s less of a chance of an address changing.

I’ve tried to dedicate WiFi to phones, tablets and a select few IOT devices where another form of connectivity wasn’t available. My devices have never been more stable or faster and I live in a somewhat densely populated area with at least 20 neighbor WiFi access points.

Edit:
20 access points seemed like too many and I was curious. Scanned and found 85 neighboring access points with an RSSI less than -95db. Making sure your own network is working properly and not congested is key.

glyndon · October 21, 2021, 12:29am

FWIW, I removed ‘discovery:’ from HA’s configuration.yaml and restarted.
About the same time, I shortened my LAN’s DHCP lease time from 4 hours to 4 minutes.
Several hours later, I noticed… nothing had changed. There were still a couple inexplicable ‘unavailable’ events in HA, but it’s been that way for at least 48 hours now.
So, given that short experiment, I’m inclined to think this syndrome has nothing to do with DHCP lease duration or ‘discovery’.
Still a short experiment, and without proper scientific controls, making this merely an anecdote. But still it’s something to share, if it helps anyone.

fantangelo · October 21, 2021, 10:44am

fyi, I am having the same ‘unavailable’ error and I use static IPs. Actually, here is my error log.

Can’t connect to ESPHome API for None (192.168.1.xx): Hello timed out

Can’t connect to ESPHome API for None (192.168.1.xx): Timeout while connecting to (‘192.168.1.xx’, 6053)

Can’t connect to ESPHome API for None (192.168.1.xx): Error connecting to (‘192.168.1.xx’, 6053): [Errno 113] Connect call failed (‘192.168.1.xx’, 6053)

Can’t connect to ESPHome API for None (192.168.1.xx): Timeout waiting for response for
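
The four messages above fail at different stages, so tallying them by kind over a day can hint at where the trouble sits. Here is a small Python sketch that does this for lines in the format shown; the category labels and the `classify` helper are invented for illustration, and the only outside fact relied on is that errno 113 on Linux is EHOSTUNREACH ("No route to host").

```python
from collections import Counter

# (substring to look for, label) pairs; labels are illustrative only.
RULES = [
    ("Hello timed out", "hello_timeout"),
    ("Timeout while connecting", "connect_timeout"),
    ("Errno 113", "host_unreachable"),  # EHOSTUNREACH on Linux
    ("Timeout waiting for response", "response_timeout"),
]


def classify(line):
    """Map one 'Can't connect to ESPHome API ...' line to a label."""
    # The reason text follows the first "): " after the device address.
    _, sep, reason = line.partition("): ")
    if not sep:
        return "unrecognised"
    for needle, label in RULES:
        if needle in reason:
            return label
    return "other"


def tally(lines):
    """Count how often each failure kind appears."""
    return Counter(classify(line) for line in lines)
```

Feeding the relevant lines of the HA log to `tally` gives a count per kind. Mostly `host_unreachable` would suggest the device is off the network entirely, while mostly `hello_timeout` or `response_timeout` would suggest it is reachable but not answering the API in time.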

pjn77 · October 22, 2021, 7:24am

Could this change in the just released esphome update be relevant? My problematic devices are esp8266.

glyndon · October 22, 2021, 11:45am

I wondered the same, and promptly updated all my ESP8266 devices to 2021.10.1.
Seems to have made no difference in the random unavailabilities.
And they still jump from device to device as before. One or two devices will be dropping at random times for a few hours, then they’ll stabilize and after a while another device or two will start to do it.