Can't stop ESPHome restarting with weak wifi

FredTheFrog · August 25, 2022, 3:51pm

What ESP8266/ESP32 device are you using? Does it have a u.fl antenna connector? If so, adding an external antenna may help. If not, consider using a Wemos D1 Mini Pro with external antenna.

orange-assistant · August 25, 2022, 4:21pm

Do you have any logs? They would be very important so we aren’t poking around in the dark.

Also you could use the history panel to show the uptime sensor beside the wifi signal sensor for better visualization, something like this

Do you also have an idea why you have periods with poor wifi coverage? Humans in the way?

dbalzan · August 26, 2022, 10:31am

I’m sure I can improve the wifi coverage but that’s not really my question.

Is it actually possible for ESPHome to operate with poor wifi without restarting? And if so, what do I need to configure in addition to reboot_timeout: 0s on wifi and api?
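For reference, a minimal sketch of what that setting looks like in an ESPHome YAML file. Only the two reboot_timeout options are taken from the question; the rest of the node config (board, logger, and so on) is assumed to live elsewhere in the file, and the !secret names are placeholders.

```yaml
wifi:
  ssid: !secret wifi_ssid          # placeholder secret names
  password: !secret wifi_password
  # 0s disables the automatic reboot when no wifi connection can be made
  reboot_timeout: 0s

api:
  # 0s disables the automatic reboot when no API client (e.g. HA) is connected
  reboot_timeout: 0s
```

If the mqtt component is in use, it has its own reboot_timeout option that would need the same treatment.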

orange-assistant · August 27, 2022, 11:28am

I’m doing that a lot and don’t experience the problems you have:

We don’t even know at this point what your “not great” wifi signal is or if that is even the cause of your reboots (and not a completely different reason like a failed component/platform etc.)

That’s why I asked:

And as an additional complementary read:

dbalzan · August 27, 2022, 2:11pm

Thanks for the help @orange-assistant. I don’t have a wifi signal sensor but will add one.
For now, I can compare the uptime sensor to periods of unavailability, which I believe is due to wifi drop outs. Adding the wifi signal sensor should prove this either way I guess.
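A minimal sketch of the two sensors being compared here, assuming the standard wifi_signal and uptime sensor platforms. The entity names and the 60s interval are illustrative assumptions, not taken from the actual device config.

```yaml
sensor:
  # Reports the RSSI of the current wifi connection in dBm
  - platform: wifi_signal
    name: "Pool Pump Wifi Signal Sensor"   # illustrative name
    update_interval: 60s

  # Seconds since boot; a drop back towards zero means the node restarted
  - platform: uptime
    name: "Pool Pump Uptime Sensor"        # illustrative name
    update_interval: 60s
```

Plotting both in the same history view makes it easier to see whether an "unavailable" gap also coincides with the uptime counter resetting.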

Thanks also for the ‘good question’ link. I have read all of the ESPHome documentation, searched for others with a similar issue and have tried to be clear on my goal. I do appreciate that we are all hobbyists and no one is being paid for their time here.

Given your last post, am I correct in assuming that your ESPHome devices are not restarting during wifi dropouts? From the docs, this does seem to be the intended behaviour when reboot_timeout is 0s.

orange-assistant · August 27, 2022, 2:57pm

We really need logs to be sure that’s the case, otherwise it’s really only a (wild) guess.

The “unavailable” times actually only tell you that the api wasn’t connected, but nothing about whether the esphome node was connected to wifi or not. If you look at the beginning of your graph you have big blocks where the esphome node is not connected to HA (via api) but at the same time it didn’t restart (uptime going up).

It will put more weight on it but you will not have any “proof” until the logs tell you it’s a planned restart because there was no wifi connection.

I’m not really certain that I actually have a lot of clients dropping the wifi connection (regularly). Certainly I have esphome nodes which are kind of on the “edge” (like -85/-90dBm) in terms of signal strength, but they all perform fine. I actually didn’t change the defaults (15min) for a restart if there is no wifi/api connection.

I have read a lot of the esphome docs but certainly not all - there is just too much of it

Not sure if you came across this entry in the FAQ:

Frequently Asked Questions # My node keeps reconnecting randomly

Especially this part:

ESPHome reboots on purpose when something is not going right, e.g. wifi connection cannot be made or api connection is lost or mqtt connection is lost. So if you are facing this problem you’ll need to explicitly set the reboot_timeout option to 0s on the components being used.

Still, what we/you need now to move forward is logs captured while the disconnects from HA and the reboots happen

dbalzan · July 20, 2023, 2:55pm

Resurrecting this thread as I never managed to get to the bottom of this last year and ended up closing the pool for winter

Now we are in the middle of the pool season and the issue has started to occur again regularly this week.

I tried to capture logs using the esphome logs abc.yaml > log_file command. But I’m not sure how useful it is as I seem to miss the restart (due to lack of connection?)

[14:32:38][D][sensor:093]: 'Pool Pump Uptime Sensor': Sending state 7902.55518 s with 0 decimals of accuracy
[14:34:46][D][sensor:093]: 'Pool Pump Wifi Signal Sensor': Sending state -65.00000 dBm with 0 decimals of accuracy
[14:34:46][D][sensor:093]: 'Pool Pump Uptime Sensor': Sending state 7962.56104 s with 0 decimals of accuracy
[14:34:46][D][sensor:093]: 'Pool Pump Wifi Signal Sensor': Sending state -65.00000 dBm with 0 decimals of accuracy
[14:34:46][D][sensor:093]: 'Pool Pump Uptime Sensor': Sending state 8022.55713 s with 0 decimals of accuracy
[14:36:46][D][sensor:093]: 'Pool Pump Wifi Signal Sensor': Sending state -65.00000 dBm with 0 decimals of accuracy
[14:46:40][D][sensor:093]: 'Pool Pump Uptime Sensor': Sending state 533.29999 s with 0 decimals of accuracy
[14:47:36][D][sensor:093]: 'Pool Pump Wifi Signal Sensor': Sending state -64.00000 dBm with 0 decimals of accuracy
[14:47:40][D][sensor:093]: 'Pool Pump Uptime Sensor': Sending state 593.29498 s with 0 decimals of accuracy

Is there another method to capture logs so I can try to diagnose these restarts once and for all? ESPHome is running on a Sonoff 4CH device.

orange-assistant · July 20, 2023, 3:48pm

Yes, for this kind of problem OTA logs are not helpful. You need to get the local serial logs to get a clue about what’s causing these unwanted restarts.

Also, you might want to change (increase) the log level in case the default (DEBUG) isn’t verbose enough.

As a side note, you might also try to narrow down the problem by deploying some debug sensors

Especially the reset_reason might be helpful for you, as it tells you after a restart/reset what caused it
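A minimal sketch of both suggestions, assuming the ESPHome logger and debug components. The sensor names and the 5s interval are illustrative choices, and VERBOSE is just one step above the default level.

```yaml
logger:
  level: VERBOSE        # one step more detailed than the default DEBUG

debug:
  update_interval: 5s   # illustrative interval

text_sensor:
  - platform: debug
    # General device/firmware information as a single string
    device:
      name: "Device Info"
    # Why the chip last reset (e.g. power on, exception, software restart)
    reset_reason:
      name: "Reset Reason"
```

Because the reset reason is published as a text sensor, it is available in HA after the node reconnects, even when the log stream missed the restart itself.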

dbalzan · July 25, 2023, 10:24am

Thanks @orange-assistant for the continued support. I did a bit more investigation and added the debug and restart config as suggested. Since this is a Sonoff device, it is not straightforward to connect a serial logger but I will do that as a next step if there is still not sufficient info here to diagnose the cause.

The entities show i) device info inc. restart reason ii) uptime and iii) wifi signal strength. The gaps in the graphs correlate to periods where HA is reporting that the device is unavailable. I previously assumed that these dropouts were due to poor wifi signal strength but I can now see that in this specific example the wifi signal strength was actually stronger during the unstable periods than the stable ones. So perhaps the issue is causing the wifi dropouts rather than the other way around.

From the graphs I see a long period of stability followed by a period of instability during which the device rebooted several times. The restart reasons reported from the OTA logs are sometimes exception 28 and sometimes exception 9.

 Reset Info: Fatal exception:28 flag:2 (Exception) epc1:0x40236a5b epc2:0x00000000 epc3:0x00000000 excvaddr:0x0000000e depc:0x00000000
[02:13:27][D][text_sensor:064]: 'Device Info': Sending state '2023.6.4|Flash: 1024kB Speed:40MHz Mode:DOUT|Chip: 0x00c9238b|SDK: 2.2.2-dev(38a443e)|Core: 3.0.2|Boot: 31|Mode: 1|CPU: 80|Flash: 0x00144051|Reset: Exception|Fatal exception:28 flag:2 (Exception) epc1:0x40236a5b epc2:0x00000000 epc3:0x00000000 excvaddr:0x'
[02:13:27][D][text_sensor:064]: 'Reset Reason': Sending state 'Exception'

[06:06:13][D][debug:254]: Reset Reason: Exception
[06:06:13][D][debug:255]: Reset Info: Fatal exception:9 flag:2 (Exception) epc1:0x4023b407 epc2:0x00000000 epc3:0x00000000 excvaddr:0x696817ad depc:0x00000000
[06:06:13][D][text_sensor:064]: 'Device Info': Sending state '2023.6.4|Flash: 1024kB Speed:40MHz Mode:DOUT|Chip: 0x00c9238b|SDK: 2.2.2-dev(38a443e)|Core: 3.0.2|Boot: 31|Mode: 1|CPU: 80|Flash: 0x00144051|Reset: Exception|Fatal exception:9 flag:2 (Exception) epc1:0x4023b407 epc2:0x00000000 epc3:0x00000000 excvaddr:0x6'

Does this shed any light on the issue? Or do I need to solder on some header pins to read the serial logs?

orange-assistant · July 25, 2023, 11:25am

Do you have more than one AP in range/installed?

Looks like something fatal.

Have you tried setting output_power to a lower value to see if you get more stable behaviour?

output_power (Optional, string): The amount of TX power for the WiFi interface from 8.5dB to 20.5dB. Default for ESP8266 is 20dB, 20.5dB might cause unexpected restarts.
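A minimal sketch of lowering the TX power, based on the option quoted above. The value 17dB is an arbitrary assumption inside the documented 8.5dB to 20.5dB range, not a recommended setting, and the !secret names are placeholders.

```yaml
wifi:
  ssid: !secret wifi_ssid          # placeholder secret names
  password: !secret wifi_password
  # Assumed test value; anything from 8.5dB to 20.5dB is accepted
  output_power: 17dB
```

Lowering the value in steps and watching the uptime sensor after each change is one way to see whether TX power has any effect on the restarts.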

dbalzan · July 30, 2023, 12:32pm

Yes, I have Google Wifi and it seems that ESPHome had connected to a weaker access point. To remove this as a factor I have now set up a new, single wifi access point in the garage with a different SSID.

Since doing this I have had 5 days of trouble-free running but unfortunately today the resets started again. My observations are:

At 12.44 the reported wifi signal strength to the new access point dropped from around -50 to -60.
First outage occurred at 12.51 for ~40 seconds
Device reconnected at 12.52 for around 5 minutes.
Second outage occurred at 12.57 to 13.29 (33 minutes!)
Device crashes with fatal exception 9 two mins later at 13.31
Device is reported “unavailable” until it reconnects at 13.39
Device crashes again at 13.44 with fatal exception 28
Device back up and running at 13.45

Interestingly, I have set up another ESPHome device next to the troublesome device and so far it’s been stable.

Thanks, I’ll play with that setting next and see if it changes anything.

orange-assistant · July 30, 2023, 2:57pm

Interesting. So with the same yaml configs (obviously different names) you get a different mileage?

Are they the same esp’s (dev boards)? Maybe even same batch? Or different boards/devices?

Often most crucial is the antenna design - and that was messed up on the esp-12e modules:

Source: WiFi module-The Difference Between ESP-12E and ESP-12F

generic_bios · July 31, 2023, 5:30pm

Hi,
I have the same problem with MQTT: it reboots at random times even with good WiFi coverage. My config file:

Spiro · July 31, 2023, 6:17pm

Showing your esphome yaml may help and perhaps a picture of the circuit. What about the power supply?

generic_bios · July 31, 2023, 8:15pm

The board is this one https://www.aliexpress.com/item/4000026433011.html
I’ve flashed ESPHome on it using their doc. It’s powered by the 24V output from a 5KWh solar inverter, so I guess it should be a stable supply. The problem is that I can reproduce this if I reboot the router or a nearby wifi repeater.

Spiro · August 1, 2023, 6:27am

If you are using this with HA, you probably don’t need the web server. It uses a lot of memory and the 8266 doesn’t have much. Try removing it:

web_server:
  port: 80

I’d also suggest setting a static IP, gateway and subnet.
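A minimal sketch of a static address setup, assuming the manual_ip option of the wifi component. All three addresses are made-up placeholders and need to match the actual network; the !secret names are placeholders too.

```yaml
wifi:
  ssid: !secret wifi_ssid          # placeholder secret names
  password: !secret wifi_password
  manual_ip:
    static_ip: 192.168.1.50        # placeholder address for the node
    gateway: 192.168.1.1           # placeholder router address
    subnet: 255.255.255.0
```

With a fixed address the node does not depend on DHCP when it reconnects, which removes one variable when the router or a repeater is rebooted.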

generic_bios · August 1, 2023, 12:38pm

That makes sense. Does the same apply to captive_portal? It would be nice to keep it for easy upgrades.

Spiro · August 1, 2023, 2:39pm

The ESPHome documentation warns you about the web server but not the captive portal. Leaving out bits of YAML is just part of the process of elimination for these sorts of problems.

dbalzan · August 1, 2023, 7:52pm

I thought so but nope. The other ESP device has now started resetting with the same fatal exception when wifi is poor. Today it (pool-doser) experienced two restarts whilst the original device (pool-pump) experienced none… but more on that below…

They are both Sonoff devices and interestingly both seem to have the same ESP8285 chip. I found these photos online which match my models and revisions:

Anyway, since yesterday, I think I may have made a breakthrough. On both devices I had the fallback AP and captive portal enabled (although I don’t recall ever using them).

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

  ap:
    ssid: "Pool-Pump Fallback Hotspot"
    password: "my-secret-password"

captive_portal:

I had an idea that the crashes were occurring when the wifi dropouts enabled the ESPHome AP/Captive Portal. Yesterday morning I removed the AP config from one device (pool-pump) and left it on the other (pool-doser). Today I have had two reboots on the doser device and none on the pump… so far.

I will need another week of running to test my hypothesis properly. But if I am correct, I wonder if there is a bug/issue with the AP/Captive Portal when running on ESP8285 chips in at least two Sonoff models

Spiro · August 2, 2023, 11:28am

On your spare Sonoff you could run Tasmota at the same time to see if it is more stable.