MQTT light going offline without reason

maxym · August 28, 2021, 9:28am

This is the second time one of my Shelly 2.5 devices has gone offline for no known reason.
By “no reason” I mean:

HA reported the device went offline today at about 3:30 am
while

The UniFi controller reports 12 days of uptime for the connection with the device
I can ping the device and connect to its web GUI
I can remotely control the light using the web GUI
MQTT reports every change in the light state when I toggle it via the web GUI

But none of this is reflected in what HA shows.

I have defined this light manually in the lights configuration:

- platform: mqtt
  name: "Entrance"
  command_topic: "shellies/shelly25-entrance/relay/0/command"
  state_topic: "shellies/shelly25-entrance/relay/0"
  availability_topic: "shellies/shelly25-entrance/online"
  qos: 1
  retain: false
  payload_on: "on"
  payload_off: "off"
  payload_available: "true"
  payload_not_available: "false"
  optimistic: false

I also have two sensors from this device:

- platform: mqtt
  name: "Front Motion Light Power"
  state_topic: "shellies/shelly25-entrance/relay/1/power"
  availability_topic: "shellies/shelly25-entrance/online"
  icon: "mdi:motion-sensor"
  unit_of_measurement: "W"
  payload_available: "true"
  payload_not_available: "false"

- platform: mqtt
  name: "Entrance Light Temp"
  state_topic: "shellies/shelly25-entrance/temperature"
  availability_topic: "shellies/shelly25-entrance/online"
  device_class: temperature
  unit_of_measurement: "°C"
  payload_available: "true"
  payload_not_available: "false"

Do you have any idea why this device occasionally becomes unavailable while still being on the network?
There are a few ways to recover:

1. reload all manually configured MQTT entities
2. restart the device
3. command the device to send an MQTT announce (virtually the same as #2)

petro · August 28, 2021, 10:07am

What does this report as?

availability_topic: "shellies/shelly25-entrance/online"

That’s the topic that determines whether the device is available for use in Home Assistant.

petro · August 28, 2021, 10:12am

FYI, your availability_mode is the default, meaning it’s set to latest. This indicates that:

maxym · August 28, 2021, 10:37am

The availability topic is set to true, and it was already true before I reloaded the entities to recover this light (I can see it in MQTT Explorer).

Also, please note my remark that entity availability can be restored just by reloading the MQTT entities.

I’ve also just checked how the entity reacts to various availability values (valid and invalid ones), and it works properly.

I’m using the Mosquitto broker add-on.

petro · August 28, 2021, 3:20pm

Then I would turn on debug logging for MQTT, which will spit out a lot of information. To find the relevant entries, create an automation that creates a persistent notification with a timestamp when the light goes unavailable.
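A minimal sketch of both pieces, assuming the light’s entity ID is light.entrance (guessed from the name "Entrance"; check yours in Developer Tools) and that neither key already exists in configuration.yaml. The alias and notification title are made up for illustration.

```yaml
# configuration.yaml
# Raise only the MQTT integration to debug; everything else stays at warning.
logger:
  default: warning
  logs:
    homeassistant.components.mqtt: debug

# If you already have an `automation:` key (e.g. an include of automations.yaml),
# put the list item below into that file instead of repeating the key.
automation:
  - alias: "Notify when Entrance light goes unavailable"  # hypothetical alias
    trigger:
      - platform: state
        entity_id: light.entrance   # assumed entity ID
        to: "unavailable"
    action:
      - service: persistent_notification.create
        data:
          title: "Entrance light unavailable"
          # now() is rendered when the automation fires, giving the timestamp
          message: "light.entrance went unavailable at {{ now() }}"
```

The timestamp in the notification tells you which part of the debug log to look at.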

maxym · August 28, 2021, 3:27pm

Yeah, it makes sense. Thanks.

petro · August 28, 2021, 3:28pm

You could also create an automation that logs state changes for that device only. But it might not give you any good info.

maxym · August 28, 2021, 3:51pm

What do you mean by logging device state? Do you mean a virtual device that all the sensors are members of? I never configured one, so I don’t know how to log it.

petro · August 28, 2021, 6:13pm

- service: system_log.write
  data:
    message: "blah"
    level: info

You can write an automation to just log stuff
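Fleshed out into a complete automation, that might look like the sketch below. The entity IDs are assumptions derived from the entity names earlier in the thread, and the alias is made up. The level is set to warning rather than info on the assumption that the logger default would otherwise filter info messages out.

```yaml
# automations.yaml (list item)
- alias: "Log Entrance device state changes"   # hypothetical alias
  trigger:
    - platform: state
      entity_id:                               # assumed entity IDs
        - light.entrance
        - sensor.front_motion_light_power
        - sensor.entrance_light_temp
  action:
    - service: system_log.write
      data:
        level: warning
        # Guard against missing from/to states (e.g. entity just created)
        message: >-
          {{ trigger.entity_id }}:
          {{ trigger.from_state.state if trigger.from_state else 'none' }}
          -> {{ trigger.to_state.state if trigger.to_state else 'none' }}
```

Each transition, including the one to unavailable, then ends up in the log with its own timestamp.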

maxym · October 16, 2021, 3:46pm

The issue is getting worse over time. It seems that the more entities there are, the more likely it is that one disappears without reason.
Here is an example with a battery-powered sensor, which makes even less sense, since this device is expected to be offline most of the time.
Also note that it is one of ten similarly configured H&T sensors.

From the image above, the sensor reported a temperature at 14:52 and then went unavailable about 30 minutes later. This leads to the conclusion that it’s not a network or MQTT issue, but an issue in the MQTT integration or HA core.

Because the device provides several sensors (temperature, humidity, battery; all manually configured), I’ve checked the other ones, and all of them have an outage at this time. Here is the humidity sensor:

It’s even stranger that the whole device is out (all 3 sensors), while other MQTT devices are OK.

At the time of this outage I had debug logging enabled for homeassistant.components.mqtt, but there is no indication of the issue. There are some warnings about wrongly evaluated template sensors. I’m going to clean those up, just in case there is some unexpected feedback to the source MQTT sensors.

I’ve also had MQTT Explorer running the whole day, so I can see changes. But nothing strange was logged at the time of the device outage.

Anyway, the question is: what else should I do to narrow down the root cause? Debug logging for homeassistant.core? Or something else?

petro · October 16, 2021, 3:55pm

Are you using more than 1 user to do all these connections? If not, that’s your problem.

maxym · October 16, 2021, 4:06pm

Thank you, Petro, for the instant reply.

What do you mean by a user? The account that devices authenticate with when connecting to MQTT?
Then yes, everything shares one: all devices, HA and MQTT Explorer use the same login/password to connect to MQTT (actually it’s the HA user, since I’m using the Mosquitto integration).

Should I use more MQTT users?
If so, I’m curious why. Are there any technical limitations?

petro · October 16, 2021, 4:08pm

Well, you run into issues if you use the same user to log into MQTT explorer and what HA uses to log into it as well. Make separate users. I vaguely remember having this problem years ago and that was the solution IIRC

petro · October 16, 2021, 4:09pm

So I have a petro login and a hass login, basically

petro → mqtt explorer
hass → well you know
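For the Mosquitto broker add-on, separate users could be declared roughly like this. It is a sketch that assumes the official add-on, where local users are listed under logins in the add-on configuration; the passwords are placeholders.

```yaml
# Mosquitto broker add-on configuration
logins:
  - username: hass            # used by Home Assistant's MQTT integration
    password: "CHANGE_ME_1"   # placeholder
  - username: petro           # used by MQTT Explorer
    password: "CHANGE_ME_2"   # placeholder
```

After restarting the add-on, point the MQTT integration and MQTT Explorer at their own credentials.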

maxym · October 16, 2021, 4:21pm

Will definitely try it. Thanks.

maxym · October 17, 2021, 10:38am

It seems that wasn’t the reason, and it’s getting more and more painful.
This morning 4 Shelly sensors went unavailable.

While I was writing this post, one of them reappeared on its own. I confirmed with a second one that the sensor value reappears when the sensor sends a new value to MQTT.

There is no exact information about when those sensors became unavailable. I can only guess it from the graphs.

You can see that the data ends at 8:11:46 with a temperature of 22.41. However, I don’t know where this value came from. There is no such value in the logs.

I would expect this to be the point when the device went unavailable (but I still don’t understand the value change at this point). The other devices look the same. For some, the graph is interrupted in the middle of the line (without a value change); others introduce a nonexistent reading that breaks the graph line at the same time. BTW, all 4 sensors disappeared at different times (8:11, 8:22, 9:20, 10:20).

Here is the config of the entities:

######
# KITCHEN SENSORS
# - temperature
# - humidity
# - battery
######
- platform: mqtt
  name: "Kitchen Temperature"
  state_topic: "shellies/temp-kitchen/sensor/temperature"
  json_attributes_topic: "shellies/temp-kitchen/sensor/temperature"
  unit_of_measurement: "°C"
  force_update: true
  device_class: temperature
  expire_after: 86400
- platform: mqtt
  name: "Kitchen Humidity"
  state_topic: "shellies/temp-kitchen/sensor/humidity"
  json_attributes_topic: "shellies/temp-kitchen/sensor/humidity"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: humidity
  expire_after: 86400
- platform: mqtt
  name: "Kitchen bttry"
  state_topic: "shellies/temp-kitchen/sensor/battery"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: battery
  expire_after: 86400

At this point it could be anything. A broken expire_after? It is currently set to 24 h, but you can see that the data changes more often than that. Some memory leak (see the unlogged temperature change)? I really have no idea.

I will probably enable debug logging for core, but I’m not sure it will help.

As already mentioned, the values come back when the sensors report new data or when I reload the “manually configured MQTT entities”.

francisp · October 17, 2021, 10:42am

Install MQTT Explorer and let it run. See if you have gaps there too.

maxym · October 17, 2021, 10:44am

Thanks, I had it running all day yesterday. It shows no anomalies for the mentioned sensors.

maxym · November 7, 2021, 1:58pm

Since then I had MQTT logging set to debug, but recently I had to restart HA and forgot to change the logging level again.
Today I lost one sensor (out of three coming from a single device).

As you can see, only battery_mqtt (which is an MQTT sensor) and battery (which is a template sensor derived from the first one) are unavailable.

How can I retrieve the exact time at which a sensor turned unavailable? History doesn’t show that.
All I know is that it happened about 5 hours ago. But when exactly?

maxym · November 16, 2021, 4:32pm

I’m starting to think it’s some serious but hidden problem with HA stability.

I managed to debug MQTT.
Here is an occurrence of the unavailable state recorded today:

As you can see, the sensor was in the unavailable state between 15:19 and 16:54.
Here is the log from core (please don’t mind the typo):

You can clearly see events at 14:49 and 16:54 (the last one recovered the sensor from the unavailable state). But there is nothing logged for this topic at 15:19 (the only messages around 15:19:55 are other MQTT ones related to zigbee2mqtt and octoprint, nothing system/core related).

The same goes for another sensor, but at different hours:

MQTT events at 8:28 and 16:39. Unavailable between 13:31 and 16:39.

Interestingly, all values from a single device turn unavailable at the same time (with no MQTT event logged for this). However, they use the same MQTT topic prefix and the same expire_after. An issue with the expiration evaluation could potentially be the reason (it would manifest as all 3 entities being shot down at the same time).
Here is the configuration of one set of them:

######
# BARUS ROOM SENSORS
# - temperature
# - humidity
# - battery
######
- platform: mqtt
  name: "BarusRoom Temperature"
  state_topic: "shellies/temp-barusroom/sensor/temperature"
  json_attributes_topic: "shellies/temp-barusroom/sensor/temperature"
  unit_of_measurement: "°C"
  force_update: true
  device_class: temperature
  expire_after: 86400
- platform: mqtt
  name: "BarusRoom Humidity"
  state_topic: "shellies/temp-barusroom/sensor/humidity"
  json_attributes_topic: "shellies/temp-barusroom/sensor/humidity"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: humidity
  expire_after: 86400
- platform: mqtt
  name: "BarusRoom bttry"
  state_topic: "shellies/temp-barusroom/sensor/battery"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: battery
  expire_after: 86400

- platform: template
  sensors:
    barusroom_temp_battery:
      friendly_name: "BarusRoom Sensor Battery"
      device_class: battery
      unit_of_measurement: "%"
      value_template: "{{ states('sensor.barusroom_bttry') }}"

So… Something shuts down all 4 of those sensors at the same time.
Different devices encounter the same issue at different times.
The HA sensors recover when the devices report their next values to MQTT.

Any idea how to debug why a sensor turns to the unavailable state?
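One way to test the expiration theory might be to drop expire_after from just one entity of a device and leave the others untouched. This is only a trial sketch reusing the BarusRoom entries above, not a fix:

```yaml
# Trial: temperature has no expire_after, humidity keeps it
- platform: mqtt
  name: "BarusRoom Temperature"
  state_topic: "shellies/temp-barusroom/sensor/temperature"
  json_attributes_topic: "shellies/temp-barusroom/sensor/temperature"
  unit_of_measurement: "°C"
  force_update: true
  device_class: temperature
  # expire_after removed for the test, so this entity cannot expire
- platform: mqtt
  name: "BarusRoom Humidity"
  state_topic: "shellies/temp-barusroom/sensor/humidity"
  json_attributes_topic: "shellies/temp-barusroom/sensor/humidity"
  unit_of_measurement: "%"
  force_update: true
  device_class: humidity
  expire_after: 86400   # unchanged, acts as the control
```

If the humidity entity still drops to unavailable while the temperature entity keeps its last value, the expiration handling is the likely culprit. If both drop, the cause is somewhere else.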