Then I would turn on debug logging for mqtt, which will spit out a lot of information. To find the relevant entries, create an automation that raises a persistent notification with a timestamp when the light goes unavailable.
The issue is getting worse over time. It seems that the more entities there are, the more likely it is that one disappears without reason.
Here is an example with a battery-powered sensor, which makes even less sense, since this device is expected to be offline most of the time.
Also note that it is one of ten similarly configured H&T sensors.
In the image above, the sensor reported a temperature at 14:52 and then went unavailable about 30 minutes later. That leads me to the conclusion that it’s neither a network nor an MQTT issue, but rather the MQTT integration or HA core.
Because the device provides several sensors (temperature, humidity, battery; all manually configured), I’ve checked the other ones, and all of them have an outage at the same time. Here is the humidity sensor:
It’s even stranger that the whole device is out (all 3 sensors), while other MQTT devices are OK.
At the time of this outage I had debug logging enabled for homeassistant.components.mqtt, but there is no indication of the issue. There are some warnings about wrongly evaluated template sensors. I’m going to clean those up, just in case there is some unexpected feedback to the source MQTT sensors.
I’ve also had MQTT Explorer running the whole day, so I can see changes. But nothing strange was logged at the time of the device outage.
Anyway, the question is: what else should I do to narrow down the root cause? Debug logging for homeassistant.core? Or something else?
What do you mean by a user? The role that devices authenticate against when connecting to MQTT?
Then yes: all devices, HA and MQTT Explorer use the same login/password to connect to MQTT (actually it’s an HA user, since I’m using the Mosquitto integration).
Should I use more MQTT connection roles?
If so, I’m curious why. Are there any technical limitations?
Well, you run into issues if MQTT Explorer logs in with the same user that HA uses. Make separate users. I vaguely remember having this problem years ago, and that was the solution IIRC.
It seems that wasn’t the reason. And it is getting more and more painful.
This morning 4 Shelly sensors went unavailable.
While I was writing this post, one of them reappeared on its own. I confirmed with a second one that the sensor value reappears once the sensor sends a new value to MQTT.
There is no exact information about when those sensors became unavailable. I can only guess it from the graphs.
I would expect this to be the point when the device went unavailable (but I still don’t understand the value change at this point). Other devices look the same. For some, the graph is interrupted in the middle of the line (without a value change); others introduce a nonexistent reading that breaks the graph line at the same time. BTW, all 4 sensors disappeared at different times (8:11, 8:22, 9:20, 10:20).
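Instead of reading the moment off the graph, the transitions could be pulled from the recorded state history. A minimal sketch, assuming the history has been exported as rows with `entity_id`, `state` and `last_changed` fields (the field names, entity names and the date below are assumptions for illustration, not taken from my setup):

```python
from datetime import datetime


def unavailable_transitions(history):
    """Return (entity_id, timestamp) for every change INTO 'unavailable'.

    `history` is a list of dicts with 'entity_id', 'state' and
    'last_changed' (ISO 8601 string), ordered by time. This row shape
    is an assumption about the export format.
    """
    previous = {}      # entity_id -> last state seen for that entity
    transitions = []
    for row in history:
        entity = row["entity_id"]
        state = row["state"]
        # Record only the first row of each unavailable stretch.
        if state == "unavailable" and previous.get(entity) != "unavailable":
            when = datetime.fromisoformat(row["last_changed"])
            transitions.append((entity, when))
        previous[entity] = state
    return transitions


# Hypothetical sample rows; entity names and the date are made up.
sample_history = [
    {"entity_id": "sensor.demo_ht_1", "state": "21.4", "last_changed": "2023-01-10T07:40:00"},
    {"entity_id": "sensor.demo_ht_2", "state": "19.8", "last_changed": "2023-01-10T07:55:00"},
    {"entity_id": "sensor.demo_ht_1", "state": "unavailable", "last_changed": "2023-01-10T08:11:00"},
    {"entity_id": "sensor.demo_ht_2", "state": "unavailable", "last_changed": "2023-01-10T08:22:00"},
    {"entity_id": "sensor.demo_ht_1", "state": "21.6", "last_changed": "2023-01-10T11:02:00"},
]

for entity, when in unavailable_transitions(sample_history):
    print(entity, when.isoformat())
```

The timestamps found this way can then be compared against the MQTT debug log for the same minute.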
At this point it can be anything. A broken expire_after? It is currently set to 24h, but you can see that the data changes more often than that. Some memory leak (see the unlogged temperature change)? I really have no idea.
I will probably enable debug logging for core, but I’m not sure it will help.
As already mentioned, values come back with new data reported by the sensors, or by reloading “manually configured mqtt entities”.
Since then I have had MQTT logging set to debug, but recently I had to restart HA and forgot to change the logging level again.
Today I lost one sensor (out of three coming from a single device).
You can clearly see events at 14:49 and 16:54 (the latter recovered the sensor from the unavailable state). But there is nothing logged for this topic at 15:19 (the only messages around 15:19:55 are other MQTT ones related to zigbee2mqtt and octoprint, nothing system/core related).
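To double-check that nothing for the topic is hiding in the log around the drop, the log can be filtered by time window and substring. A rough sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; the sample lines, logger name and topic names are placeholders, not real log output:

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "2023-01-10 15:19:55.123 LEVEL ..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def lines_around(log_lines, moment, minutes=2, needle=None):
    """Return log lines within +/- `minutes` of `moment`.

    If `needle` is given, keep only lines containing that substring
    (for example an MQTT topic or an entity id).
    """
    earliest = moment - timedelta(minutes=minutes)
    latest = moment + timedelta(minutes=minutes)
    hits = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation lines (e.g. tracebacks) have no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if earliest <= stamp <= latest and (needle is None or needle in line):
            hits.append(line.rstrip("\n"))
    return hits


# Placeholder lines that only mimic the assumed timestamp prefix.
sample_log = [
    "2023-01-10 14:49:02.101 DEBUG [example.logger] message on demo/ht-1/temperature",
    "2023-01-10 15:19:55.300 DEBUG [example.logger] message on demo/other-device/state",
    "2023-01-10 16:54:10.007 DEBUG [example.logger] message on demo/ht-1/temperature",
]

drop = datetime(2023, 1, 10, 15, 19, 55)  # hypothetical date, time from the post
print(lines_around(sample_log, drop, needle="demo/ht-1"))  # nothing for the topic
print(lines_around(sample_log, drop))                      # everything in the window
```

With a real log file, pass `open(path, encoding="utf-8")` as `log_lines`; widening `minutes` helps if the state change and its cause are logged some time apart.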
The same goes for another sensor, but at different hours:
MQTT events at 8:28 and 16:39. Unavailable between 13:31 and 16:39.
Interestingly, all values from a single device turn unavailable at the same time (with no MQTT event logged for this). They do, however, use the same MQTT topic and the same expire_after. An issue with the expiration evaluation could potentially be the reason (it would manifest as all 3 entities being shot down at the same time).
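One way to sanity-check the expiration theory is to compare the observed drop with the moment expire_after should have fired. A small sketch, assuming expire_after counts seconds from the last received message; it uses the 8:28 and 13:31 times from the second sensor, and the date is made up:

```python
from datetime import datetime, timedelta


def expected_expiry(last_message, expire_after_s):
    """Moment the state should expire if no newer message arrives."""
    return last_message + timedelta(seconds=expire_after_s)


def dropped_early(last_message, went_unavailable, expire_after_s):
    """True if the entity went unavailable before expire_after elapsed."""
    return went_unavailable < expected_expiry(last_message, expire_after_s)


EXPIRE_AFTER_S = 24 * 3600           # 24h, as configured
day = datetime(2023, 1, 10)          # hypothetical date
last_msg = day.replace(hour=8, minute=28)
unavailable = day.replace(hour=13, minute=31)

print(expected_expiry(last_msg, EXPIRE_AFTER_S))             # 2023-01-11 08:28:00
print(dropped_early(last_msg, unavailable, EXPIRE_AFTER_S))  # True
```

If the drop comes well before the expected expiry, as here, a correctly evaluated 24h expire_after cannot explain it on its own; either the timer is being evaluated wrongly or something else marks the entities unavailable.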
Here is the configuration of a set of them:
So… something shuts down all those 4 sensors at the same time.
Different devices encounter the same issue at different times.
HA sensors recover when the devices report their next values to MQTT.
Any idea how to debug why a sensor turns into the unavailable state?