MQTT light going offline without reason

This is the second time one of my Shelly 2.5 devices has gone offline for no known reason.
When I say “for no reason” I mean:

HA reported that the device went offline today at about 3:30 am,
while:

  1. The UniFi controller reports 12 days of uptime for the connection with the device
  2. I can ping the device and open its web GUI
  3. I can control the light remotely using the web GUI
  4. MQTT reports all changes in the light state when I toggle it using the web GUI

But none of this is reflected in what HA shows.

I have defined this light manually in the lights configuration:

- platform: mqtt
  name: "Entrance"
  command_topic: "shellies/shelly25-entrance/relay/0/command"
  state_topic: "shellies/shelly25-entrance/relay/0"
  availability_topic: "shellies/shelly25-entrance/online"
  qos: 1
  retain: false
  payload_on: "on"
  payload_off: "off"
  payload_available: "true"
  payload_not_available: "false"
  optimistic: false

I also have two sensors from this device:

- platform: mqtt
  name: "Front Motion Light Power"
  state_topic: "shellies/shelly25-entrance/relay/1/power"
  availability_topic: "shellies/shelly25-entrance/online"
  icon: "mdi:motion-sensor"
  unit_of_measurement: "W"
  payload_available: "true"
  payload_not_available: "false"

- platform: mqtt
  name: "Entrance Light Temp"
  state_topic: "shellies/shelly25-entrance/temperature"
  availability_topic: "shellies/shelly25-entrance/online"
  device_class: temperature
  unit_of_measurement: "°C"
  payload_available: "true"
  payload_not_available: "false"

Do you have any idea why this device occasionally becomes unavailable while still being on the network?
There are a few ways to recover:

  1. reload all manually configured mqtt entities
  2. restart device
  3. command the device to send mqtt announce (virtually the same as #2)

What does the availability topic report?

That’s the topic that determines whether the device is available for use in Home Assistant.

FYI, your availability_mode is the default, meaning it’s set to latest. This indicates that:

The availability topic is set to true, and it was already set to true before I reloaded the entities to recover this light (I can see it in MQTT Explorer).

Also, please note my remark that the entity’s availability can be restored just by reloading the MQTT entities.

I’ve also just checked how the entity reacts to various values of the availability topic (valid and invalid ones), and it works properly.

I’m using the Mosquitto broker add-on.

Then I would turn on debug logging for MQTT, which will spit out a lot of information. To find the relevant entries, create an automation that creates a persistent notification with a timestamp when the light goes unavailable.
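
A minimal sketch of that suggestion, not a definitive setup. It assumes the light’s entity ID is light.entrance (guessed from the name "Entrance" in the config above, so check the real ID) and that the alias and notification title are arbitrary. If your configuration.yaml already has an automation: key, put the list item into your existing automations instead of adding a second key:

```yaml
# configuration.yaml: raise only the MQTT integration to debug level
logger:
  default: warning
  logs:
    homeassistant.components.mqtt: debug

# Leave a timestamped marker whenever the light goes unavailable.
# light.entrance is an assumed entity ID; adjust it to your setup.
automation:
  - alias: "Notify when Entrance light goes unavailable"
    trigger:
      - platform: state
        entity_id: light.entrance
        to: "unavailable"
    action:
      - service: persistent_notification.create
        data:
          title: "Entrance light unavailable"
          message: "Went unavailable at {{ now().isoformat() }}"
```

The timestamp in the notification tells you which part of the debug log to look at.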

Yeah, it makes sense. Thanks.

You could also create an automation that logs state changes for that device only too. But it might not give you any good info.

What do you mean by logging the device state? Do you mean a virtual device that all the sensors are members of? I’ve never configured one, so I don’t know how to log it.

- service: system_log.write
  data:
    message: "blah"
    level: info

You can write an automation to just log stuff
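
For illustration, a complete automation built around that service call might look like the sketch below. The entity ID light.entrance and the logger name custom.entrance_debug are assumptions made up for this example. The level is set to warning here only so the lines show up even when the default log level is warning:

```yaml
# One item for your automations list: writes every state change of the
# (assumed) entity light.entrance to the Home Assistant log.
- alias: "Log Entrance light state changes"
  trigger:
    - platform: state
      entity_id: light.entrance
  action:
    - service: system_log.write
      data:
        level: warning
        logger: custom.entrance_debug   # hypothetical logger name
        message: >-
          {{ trigger.entity_id }} changed from
          {{ trigger.from_state.state if trigger.from_state else 'none' }} to
          {{ trigger.to_state.state if trigger.to_state else 'none' }}
```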

The issue is getting worse over time. It seems that the more entities there are, the higher the probability that one disappears without reason.
Here is an example with a battery-powered sensor, which makes even less sense, since this device is expected to be offline most of the time.
Also note that it is one of ten similarly configured H&T sensors.

From the image above, the sensor reported a temperature at 14:52 and then went unavailable about 30 minutes later. This leads to the conclusion that it’s not a network or MQTT issue but an issue in the MQTT integration or HA core.

Because the device provides several sensors (temp, hum, battery, all manually configured), I’ve checked the other ones, and all of them have an outage at this time. Here is the humidity sensor:

It’s even stranger that the whole device is out (all 3 sensors), while other MQTT devices are OK.

At the time of this outage I had debug logging enabled for homeassistant.components.mqtt, but there is no indication of the issue. There are some warnings about wrongly evaluated template sensors. I’m going to clean those up, just in case there is some unexpected feedback to the source MQTT sensors.

I’ve also had MQTT Explorer running the whole day, so I can see changes. But nothing strange was logged at the time of the device outage.

Anyway, the question is: what else should I do to narrow down the root cause? Debug homeassistant.core? Or something else?

Are you using more than 1 user to do all these connections? If not, that’s your problem.

Thank you, Petro, for the instant reply.

What do you mean by a user? The account that devices authenticate with when connecting to MQTT?
Then no: all devices, HA and MQTT Explorer use the same login/password to connect to MQTT (it’s actually an HA user, since I’m using the Mosquitto integration).

Should I use more MQTT accounts?
If so, I’m curious why. Are there any technical limitations?

Well, you run into issues if you use the same user to log into MQTT explorer and what HA uses to log into it as well. Make separate users. I vaguely remember having this problem years ago and that was the solution IIRC

So I have a petro login and a hass login, basically:

petro → mqtt explorer
hass → well you know
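
As a rough sketch of what separate accounts could look like in the Mosquitto broker add-on configuration: the usernames and passwords below are placeholders invented for this example, and the layout assumes the add-on’s logins option.

```yaml
# Mosquitto broker add-on configuration (YAML view).
# Usernames and passwords are placeholders, not values from this thread.
logins:
  - username: hass            # used only by the Home Assistant MQTT integration
    password: "change-me-1"
  - username: mqtt-explorer   # used only by MQTT Explorer
    password: "change-me-2"
```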

Will definitely try it. Thanks.

It seems that wasn’t the reason. And it’s getting more and more painful.
This morning 4 Shelly sensors went unavailable.

While I was writing this post, one of them reappeared on its own. I confirmed with a second one that the sensor value can reappear when the sensor sends a new value to MQTT.

There is no exact information about when those sensors became unavailable. I can only guess it from the graphs.


You can see that the data ends at 8:11:46 with a temperature of 22.41. However, I don’t know where this value came from; there is no such value in the logs.

I would guess that this is the point when the device went unavailable (but I still don’t understand the value change at this point). The other devices look the same. For some, the graph is interrupted in the middle of the line (without a value change); others show a nonexistent reading that breaks the graph line at the same moment. BTW, all 4 sensors disappeared at different times (8:11, 8:22, 9:20, 10:20).

Here is the config of the entities:

######
# KITCHEN SENSORS
# - temperature
# - humidity
# - battery
######
- platform: mqtt
  name: "Kitchen Temperature"
  state_topic: "shellies/temp-kitchen/sensor/temperature"
  json_attributes_topic: "shellies/temp-kitchen/sensor/temperature"
  unit_of_measurement: "°C"
  force_update: true
  device_class: temperature
  expire_after: 86400
- platform: mqtt
  name: "Kitchen Humidity"
  state_topic: "shellies/temp-kitchen/sensor/humidity"
  json_attributes_topic: "shellies/temp-kitchen/sensor/humidity"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: humidity
  expire_after: 86400
- platform: mqtt
  name: "Kitchen bttry"
  state_topic: "shellies/temp-kitchen/sensor/battery"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: battery
  expire_after: 86400

At this point it could be anything. A broken expire_after? It’s currently set to 24 h, but you can see that the data changes more often than that. Some memory leak (see the unlogged temperature change)? I really have no idea.
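
One way to test the expire_after suspicion in isolation is a throwaway sensor with a short expiry. This is only a sketch: the name and the topic test/expire are made up for illustration. It relies on the documented behaviour that an MQTT sensor turns unavailable once expire_after seconds pass without a new message:

```yaml
# Throwaway test sensor (hypothetical name and topic).
- platform: mqtt
  name: "Expire Test"
  state_topic: "test/expire"   # publish to this topic by hand
  force_update: true           # same setting as the real sensors
  expire_after: 60             # seconds; short so the result shows up quickly
```

Publish any value to test/expire (for example from MQTT Explorer) and watch whether the entity goes unavailable about 60 seconds after the last message, and not earlier.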

I will probably enable debug logging for core, but I’m not sure it will help.

As already mentioned, the values come back when the sensors report new data or when I reload the “manually configured MQTT entities”.

Install MQTT Explorer and let it run. See if you have gaps there too.

Thanks, I had it running all day yesterday. It showed no anomalies for the sensors mentioned.

Back then I had MQTT logging set to debug, but I recently had to restart HA and forgot to change the logging level again.
Today I lost one sensor (out of three coming from a single device).

As you can see, only battery_mqtt (which is an MQTT sensor) and battery (which is a template sensor derived from the first one) are unavailable.

How can I find the exact time the sensor became unavailable? History doesn’t show that.
All I know is that it happened about 5 hours ago. But when exactly?
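
One possible way to surface that moment is a template sensor exposing the last_changed attribute of the state object, sketched below. The entity ID sensor.barusroom_bttry is borrowed from the BarusRoom config in this thread and the sensor name is made up, so substitute your own. The value will move on again as soon as the source sensor changes state, so read it while the sensor is still unavailable:

```yaml
# Helper template sensor (hypothetical name) showing when the source
# sensor last changed state, e.g. when it flipped to "unavailable".
- platform: template
  sensors:
    barusroom_bttry_last_changed:
      friendly_name: "BarusRoom bttry last changed"
      value_template: >-
        {% set s = states.sensor.barusroom_bttry %}
        {{ s.last_changed.isoformat() if s else 'unknown' }}
```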

I’m starting to think it’s a serious but hidden problem with HA stability.

I managed to get MQTT debug logging working.
Here is how the unavailable state appeared today:

As you can see, the sensor was in the unavailable state between 15:19 and 16:54.
Here is the log from core (please don’t mind the typo):


You can clearly see events at 14:49 and 16:54 (the last one recovered the sensor from the unavailable state). But nothing is logged for this topic at 15:19; the only messages around 15:19:55 are other MQTT ones related to zigbee2mqtt and octoprint, nothing system/core related.

The same case for another sensor, but different hours:


MQTT events at 8:28 and 16:39. Unavailable between 13:31 and 16:39.

Interestingly, all values from a single device turn unavailable at the same time (with no MQTT event logged for it). They do, however, use the same MQTT topic prefix and the same expire_after. An issue with the expiration evaluation could potentially be the reason (it would manifest as all 3 entities going down at the same time).
Here is the configuration of one set of them:

######
# BARUS ROOM SENSORS
# - temperature
# - humidity
# - battery
######
- platform: mqtt
  name: "BarusRoom Temperature"
  state_topic: "shellies/temp-barusroom/sensor/temperature"
  json_attributes_topic: "shellies/temp-barusroom/sensor/temperature"
  unit_of_measurement: "°C"
  force_update: true
  device_class: temperature
  expire_after: 86400
- platform: mqtt
  name: "BarusRoom Humidity"
  state_topic: "shellies/temp-barusroom/sensor/humidity"
  json_attributes_topic: "shellies/temp-barusroom/sensor/humidity"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: humidity
  expire_after: 86400
- platform: mqtt
  name: "BarusRoom bttry"
  state_topic: "shellies/temp-barusroom/sensor/battery"
  unit_of_measurement: "%"
  payload_available: "true"
  payload_not_available: "false"
  force_update: true
  device_class: battery
  expire_after: 86400

- platform: template
  sensors:
    barusroom_temp_battery:
      friendly_name: "BarusRoom Sensor Battery"
      device_class: battery
      unit_of_measurement: "%"
      value_template: "{{ states('sensor.barusroom_bttry') }}"

So… something shuts down all 4 of those sensors at the same time.
Different devices encounter the same issue at different times.
The HA sensors recover when the devices report their next values to MQTT.

Any idea how to debug why a sensor goes into the unavailable state?