Shelly 2.5 gets stuck in an "overheated" state until restarted

I have a number of Shelly 2.5 units in my home, all running ESPHome. I have the IRQ pin on the sensor set according to the cookbook, and I mostly get the expected temperatures (40–50 °C when idle, 50–70 °C under load, depending on how much power the connected devices draw and on the position in the wall).
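For reference, the relevant part of the cookbook-style config looks roughly like this (pin numbers and NTC calibration values are the ones from the commonly shared Shelly 2.5 examples, so treat them as illustrative rather than exact):

```yaml
# Illustrative Shelly 2.5 snippet; pins/values taken from the common examples
i2c:
  sda: GPIO12
  scl: GPIO14

sensor:
  # Power metering chip; without irq_pin set, the ADE7953 keeps the ESP8266
  # busy and the module runs noticeably hotter.
  - platform: ade7953_i2c   # `ade7953` on older ESPHome releases
    irq_pin: GPIO16
    voltage:
      name: "Voltage"
    active_power_a:
      name: "Power A"
    active_power_b:
      name: "Power B"

  # Internal NTC that produces the temperature values discussed here.
  - platform: ntc
    sensor: temp_resistance
    name: "Device Temperature"
    calibration:
      b_constant: 3350
      reference_temperature: 298.15K
      reference_resistance: 10kOhm
  - platform: resistance
    id: temp_resistance
    sensor: temp_adc
    configuration: DOWNSTREAM
    resistor: 32kOhm
  - platform: adc
    id: temp_adc
    pin: A0
    update_interval: 30s
```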

However, it sometimes happens that one or more devices start to overheat and then sit roughly 20 °C above their usual temperature until I restart them.

This often happens when they (temporarily) lose WiFi, but I believe it has also happened in other scenarios.

A graph of such an encounter would look something like this:
[temperature graph of the affected devices]

They lost WiFi yesterday evening, and at 9:20 I restarted all of the Shellys.

What could cause the excess heat, and how can I find out more about what is causing it? Preferably via OTA, because the devices are all installed behind wall switches.

Your description all but eliminates the following potential causes:

  • Poor connections causing resistance at the terminals. (If this were the cause, the excess heat would correlate with load, and it would be inconsistent between devices.)
  • IRQ pin not set. (For reference, see Shelly 2.5 + ESPHome: potential fire hazard + fix | Savjee.be)
  • Faulty hardware. (Again, unlikely as you’re seeing consistent behaviour over multiple devices.)

Given that the trigger appears to be WiFi state changes, the most likely cause is a bug in ESPHome that causes unnecessarily high CPU load. My advice would be to try disabling WiFi failover (AP) mode and the captive portal, as those are the code paths that presumably kick in when your WiFi fails.

My standard procedure with ESPHome devices is to disable (or, more precisely, never enable) all non-essential components: in particular the web server, the captive portal, serial logging, and WiFi access point mode. I don't use MQTT at all, and I only use the time component when absolutely necessary.

In theory, this increases the risk that a device ends up in a state it can't recover from without physical access. In practice, this has never been an issue for me. In my estimation, there simply isn't a plausible scenario where WiFi failover could solve a problem that couldn't also be solved by changing my WiFi environment in whatever way is needed for the device to connect as a client. (And quite frankly, the idea of twenty recovery APs popping up every time my WiFi goes down is ridiculous. If the WiFi goes down, the only correct solution is to bring the WiFi back up again.)
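As a rough sketch of what that stripped-down approach looks like in YAML (placeholder names and secrets; exact keys depend on your ESPHome version):

```yaml
esphome:
  name: shelly-25-example   # placeholder

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  # No `ap:` block, so no fallback access point is ever started.

# captive_portal:   # deliberately omitted
# web_server:       # deliberately omitted
# mqtt:             # not used at all

logger:
  baud_rate: 0      # disable serial/UART logging; logs remain available via the native API

api:                # Home Assistant native API

ota:                # keep OTA so the devices stay reachable without physical access
  password: !secret ota_password
```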

Thanks, this is very valuable input. I have to admit, it never crossed my mind to disable the AP, even though I’ve often been annoyed by all the SSIDs showing up when I reboot my main router.

I will try that and report back.

As for the other causes you mentioned, I believe I can safely rule them out, for the reasons you described.

So I removed the fallback APs and captive portal from all nodes, as well as the time component (and the total daily energy sensors that depended on it). From the hottest of my nodes (which keeps both relays always on), I also removed the web server and the logger.
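(For completeness: the total daily energy platform needs a time source, which is why those sensors had to go together with time. Roughly, the dependency looks like this, with placeholder ids:)

```yaml
# If `time:` goes away, anything like this has to go with it:
time:
  - platform: sntp            # or `homeassistant`; total_daily_energy needs a time source
    id: my_time

sensor:
  - platform: total_daily_energy
    name: "Daily Energy A"
    power_id: power_a         # placeholder id of the ADE7953 active power sensor
```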

Here is what I can report:

  • Idle/regular temperature does not seem to have changed at all.
  • If WiFi is lost, the problem still kicks in on all nodes. However, the temperature increase no longer seems as drastic:

These are the 3-hour curves from the outage in the first post:


And these are the 3-hour curves from the (very short) outages today:

So perhaps it's not entirely fair to compare the two, but then again, the devices no longer have to bring up the AP, so it makes sense that the temperature rises less.

However, that also means the problem does not seem to be related to the fallback AP.

I've added a Loop Time debug sensor, and it doesn't look like the excess heat is caused by higher CPU load (do these CPUs even have low-power states?).

Apart from a small spike when I rebooted the router, loop time doesn’t seem to increase significantly.
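For anyone who wants to watch this themselves, the loop-time sensor is roughly the following sketch of ESPHome's debug component (names and update interval are just examples):

```yaml
# Expose how long each pass through the main loop takes.
debug:
  update_interval: 5s

sensor:
  - platform: debug
    loop_time:
      name: "Loop Time"
    free:
      name: "Heap Free"   # not required, but useful alongside loop time
```

If CPU load were the culprit, I'd expect the loop time to climb and stay elevated for the whole outage rather than just spike briefly.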