Shelly devices slow to respond or update or fail with DeviceConnectionError

Hoping someone can help me out. After having problems with my Shelly Plus 1s and my Shelly 2PMs periodically going offline/becoming unavailable, I went ahead and made sure that:

  1. I had set the CoIoT settings, including changing the client from mcast to unicast
  2. I had opened the 5683/udp port in the docker compose file
  3. I had enabled access point roaming on all the Shelly devices

And I’m still periodically getting the error:

shelly <switch id> failed, state: {'turn': 'on'}, error: DeviceConnectionError()

Or sometimes the switch does turn on, but only after a delay of approximately 3-5 seconds. I have captured the debug output during one of these delays. See below. Does anything look out of the ordinary there? Does anyone have any other ideas? Googling this suggests it is a widespread issue with this integration; however, some people have been able to resolve it by following one or several of the steps I outlined above.

2023-12-18 06:27:30.223 DEBUG (MainThread) [aioshelly.block_device.coap] Calling CoAP message update for device id E0C074
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Push update failures for shellyrgbw2-E0C074: 0
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Manually updated shellyrgbw2-E0C074 data
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Skipping non-input event block light_0
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Skipping non-input event block light_1
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Skipping non-input event block light_2
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Skipping non-input event block light_3
2023-12-18 06:27:30.223 DEBUG (MainThread) [homeassistant.components.shelly] Skipping block event
2023-12-18 06:27:30.754 DEBUG (MainThread) [aioshelly.block_device.device] aiohttp response: {'ison': True, 'has_timer': False, 'timer_started': 0, 'timer_duration': 0, 'timer_remaining': 0, 'overpower': False, 'overtemperature': False, 'is_valid': True, 'source': 'http'}
2023-12-18 06:27:30.762 DEBUG (MainThread) [aioshelly.block_device.device] aiohttp response: {'ison': True, 'has_timer': False, 'timer_started': 0, 'timer_duration': 0, 'timer_remaining': 0, 'overpower': False, 'overtemperature': False, 'is_valid': True, 'source': 'http'}
2023-12-18 06:27:31.098 DEBUG (MainThread) [aioshelly.block_device.coap] CoapMessage: ip=<private>, type=CoapType.PERIODIC(30), options={11: b's', 3332: b'SHSW-25#3494546F111C#2', 3412: b'\x96\x00', 3420: b'\xb2\n'}, payload={'G': [[0, 9103, 1], [0, 1101, 1], [0, 1201, 1], [0, 2101, 0], [0, 2102, ''], [0, 2103, 0], [0, 2201, 0], [0, 2202, ''], [0, 2203, 0], [0, 4101, 12.86], [0, 4103, 5095], [0, 6102, 0], [0, 4201, 2.39], [0, 4203, 8889], [0, 6202, 0], [0, 3104, 55.64], [0, 6101, 0], [0, 9101, 'relay'], [0, 4108, 242.4]]}
2023-12-18 06:27:31.098 DEBUG (MainThread) [aioshelly.block_device.coap] Calling CoAP message update for device id 6F111C
2023-12-18 06:27:31.099 DEBUG (MainThread) [homeassistant.components.shelly] Push update failures for shellyswitch25-3494546F111C: 0
2023-12-18 06:27:31.099 DEBUG (MainThread) [homeassistant.components.shelly] Manually updated shellyswitch25-3494546F111C data
2023-12-18 06:27:31.099 DEBUG (MainThread) [homeassistant.components.shelly] Skipping block event
2023-12-18 06:27:31.099 DEBUG (MainThread) [homeassistant.components.shelly] Skipping block event
2023-12-18 06:27:31.099 DEBUG (MainThread) [homeassistant.components.shelly] Skipping non-input event block device
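
For anyone checking steps 1 and 2 from the list above, one rough way to confirm that unicast CoIoT packets actually reach the host (or the container) is to listen on the UDP port for a moment. This is only a minimal sketch: the function name `wait_for_udp_packet` is made up for illustration, it assumes the devices are pointed at this machine's IP, and it has to be run while Home Assistant is stopped (or on a spare port), because two processes normally cannot bind the same UDP port.

```python
import socket


def wait_for_udp_packet(port, timeout=30.0, host="0.0.0.0"):
    """Wait for one UDP packet on `port`.

    Returns (sender_ip, payload_length), or None if nothing arrives
    within `timeout` seconds. Hypothetical helper, not part of Home
    Assistant or aioshelly.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(timeout)
        try:
            data, (sender_ip, _sender_port) = sock.recvfrom(4096)
        except socket.timeout:
            # Nothing arrived: the port mapping, firewall, or the
            # device's unicast peer setting may be worth re-checking.
            return None
        return sender_ip, len(data)
```

Calling `wait_for_udp_packet(5683)` (5683 being the default CoAP port) should return the IP of a Gen1 device within its reporting interval if the packets are getting through; `None` means nothing arrived before the timeout.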

The Shelly Plus (Gen2) devices don’t have this setting.
Please see the documentation.

Why did you activate the AP on all the Shelly devices?
Are they not connected to a router?

Yes, I’m aware that the CoIoT setting isn’t available on the Gen2 devices. However, the problem I have appears on both the Gen1 and Gen2 devices, regardless of whether CoIoT or RPC over UDP is being used. I only mentioned the CoIoT setting because it is a setting that is typically forgotten about.

Regarding your enquiry about the AP setting for the Shelly devices:
You misunderstand. There is a setting on the Shelly Gen2 devices called "WIFI CLIENT AP ROAMING", which enables the device to scan for a better access point.

I’m seeing the same with two Shelly TRV devices, while my other Shelly devices (Shelly 1 relays) are working normally.

After I had this set up and working fine last night (with intermittent connection errors / delays), today HA asked me to reconfigure it, asking for username & password. After supplying them an error showed, saying to remove the integration and set it up again.

Latest firmware on the trvs and CoIoT enabled.

I’m not familiar with how logging works with HA but will try to get logs later for these devices and post up.

FWIW one of the TRVs does seem to be buggy (the restrict login option is broken, and it doesn’t want to connect to my network), so I’m going to reinstall the latest firmware on that one once I get to grips with how the HTTP API works, as apparently you can just tell it to do an OTA update via HTTP requests.

Will update when I find more info.

I reset the problem TRV & it’s now working normally.
I had to delete both the devices from HA then re-add them and after that, so far it appears to be working.

I enabled debug logging on the Shelly integration so if I have another instance of this I’ll post back the logs.

TLDR: my NAS had incorrect network settings and was on a different subnet to the Shelly devices. My recommendation to anyone else experiencing this issue is to try a traceroute and a ping from the terminal of the machine that your HA instance runs on to one of the Shelly devices.
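
If you want something more repeatable than eyeballing the ping output, here is a rough sketch of the same check. It assumes a Linux or macOS style `ping` that accepts `-c` and prints `icmp_seq=N` for each reply (BusyBox and Windows print a different format); the function names are made up for this example.

```python
import re
import subprocess


def missing_sequence_numbers(ping_output):
    """Return the icmp_seq numbers that never got a reply.

    Only gaps between the first and last reply are detected, so
    packets lost at the very end of a run will not show up here.
    """
    seen = sorted({int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)})
    if not seen:
        return []
    return [n for n in range(seen[0], seen[-1] + 1) if n not in seen]


def ping(host, count=20):
    """Run the system ping and return its raw text output."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code just means some packets were lost
    )
    return result.stdout
```

Run it from the machine (or container) that hosts HA, for example `missing_sequence_numbers(ping("<ip of a Shelly device>"))`. An empty list together with low round-trip times is what a healthy local connection should look like.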

Details:
After opening up an issue with the team that built the shelly integration and going through numerous troubleshooting steps, they replied saying that there was nothing wrong with the integration but with the connection between the HA instance and the Shelly devices.

I was skeptical of this answer because my HA instance was running on my NAS in a docker container. And the NAS was connected to my local network via ethernet (so presumably very stable). And furthermore, if I tried pinging one of the shelly devices from my Mac, there was a great connection.

Buuuut… the breakthrough came when I tried pinging one of the Shelly devices from the terminal of my NAS. The ping was slow and there were missing/skipped icmp_seq numbers in the output.

So I tried a traceroute from the Mac → looked good, single hop only
Then I tried a traceroute from the terminal of the NAS → instantly terrible. Over 25 hops in some cases. OK, so here was the issue.

Upon closer inspection of the NAS network config, for some reason unknown to me, I had set the NAS to a static IP and changed the subnet mask from 255.255.255.0 to 255.255.0.0.

As soon as I removed this, all the issues went away (for obvious reasons).
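
To sanity-check a static IP and subnet mask like the ones described above, Python's standard `ipaddress` module can show which network a host believes it is on and whether another address falls inside it. This is just a sketch: the helper names are hypothetical and the addresses in the note below are made-up examples, not the ones from my network.

```python
import ipaddress


def local_network(host_ip, netmask):
    """Return the network (in CIDR form) implied by an IP and netmask."""
    # strict=False lets us pass a host address rather than a network address.
    return str(ipaddress.ip_network(f"{host_ip}/{netmask}", strict=False))


def same_subnet(host_ip, netmask, other_ip):
    """True if other_ip lies inside the network implied by host_ip/netmask."""
    network = ipaddress.ip_network(f"{host_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(other_ip) in network
```

With example values, `same_subnet("192.168.2.10", "255.255.255.0", "192.168.1.50")` is false, while the same call with `"255.255.0.0"` is true, so the mask alone changes which devices the host treats as directly reachable. Comparing the result against what the router and the Shelly devices are configured with is a quick way to spot a mismatch.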