All my hue lights randomly becoming unavailable

Thanks, Marius. Appreciated.

Check your sensors; they will be unavailable too. This is 100% certain to be HA losing its connection to the Hub.

Yeah - I think that’s different than what others in the thread have been seeing.

After reading this and other threads, I was thinking it was load related on my Rpi3, so I went ahead and bought a Mini PC. I migrated a snapshot over to it last night and we’ll see if I notice any disconnects.

You’re right, that does sound like a LAN connection issue. To confirm, I would ping the Hue Bridge (from the Home Assistant Server) when the issue occurs. If ping fails then it points to a LAN disconnection. If ping succeeds then there’s a Zigbee issue.
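For anyone who wants to script that check, here is a minimal sketch of it in Python. It assumes a Linux-style `ping` (`-c` for count, `-W` for a timeout in seconds); the function names are made up for illustration, and the verdict strings simply restate the reasoning above.

```python
import subprocess


def bridge_responds(host: str, timeout_s: int = 2) -> bool:
    """Send a single ICMP echo request and report whether the host answered.

    Assumes a Linux-style ping binary: -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def interpret(ping_ok: bool) -> str:
    """Turn the ping result into the verdict described in the post above."""
    if ping_ok:
        return "ping succeeded: LAN looks fine, suspect a Zigbee issue"
    return "ping failed: points to a LAN disconnection"


# Example call, to be run while the lights show as unavailable
# (replace the address with your own bridge's IP):
# print(interpret(bridge_responds("192.168.1.212")))
```

The check is only meaningful at the moment the devices are reported unavailable, so it would have to be triggered by hand or from an automation at that time.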

Nope, it’s not a LAN issue, it’s a Home Assistant (- Hue) issue. HA polls the Hub and only has a limited time frame for each poll; if that is exceeded (because of timing issues on the HA instance), it reports unavailable. It’s a long-standing problem, caused by the fact that Hue doesn’t push changes but needs to be polled by HA.

As said, it should be much better now. On my RPi4 and a fully maxed-out Hue Hub, I almost never see these unavailable states anymore.

I did take extreme measures though, across the board. Guard all templates, and tighten up logger, history, etc. Take out almost all REST sensors, and set a long update interval on the ones left. Move Z-Wave and MQTT onto separate RPis so that traffic can’t cause lockups, and so on.

What you described sounds plausible if the RPi is at 100% CPU load. If not, I don’t see how the integration fails to get its allotted share of execution time.

well, all I can say is ‘join the club’ :wink:

Btw, it’s not that the integration doesn’t get its allotted execution time; there is a hard limit set in the HA Hue core code within which the polling should return. If it doesn’t, the result is ‘unavailable’. It’s all in the source code, and heavily debated.
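To make the mechanism concrete, here is a rough sketch of "poll with a hard time limit, else unavailable" using `asyncio.wait_for`. This is not the integration's actual code: `fetch_from_bridge`, `poll_once`, the entity name and the timeout value are all hypothetical, and the limit is shortened so the demo finishes quickly.

```python
import asyncio

# Hypothetical limit, shortened for the demo; not the integration's real value.
POLL_TIMEOUT_S = 0.2


async def fetch_from_bridge(delay_s: float) -> dict:
    """Stand-in for one poll of the bridge; delay_s simulates a slow reply."""
    await asyncio.sleep(delay_s)
    return {"light.kitchen": "on"}


async def poll_once(delay_s: float) -> str:
    """Return the light state, or 'unavailable' if the poll misses the limit."""
    try:
        data = await asyncio.wait_for(
            fetch_from_bridge(delay_s), timeout=POLL_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        # The bridge may be perfectly healthy; the answer just came too late.
        return "unavailable"
    return data["light.kitchen"]


def demo() -> tuple:
    """Run one fast poll and one slow poll and return both results."""
    fast = asyncio.run(poll_once(0.0))
    slow = asyncio.run(poll_once(1.0))
    return fast, slow
```

The point of the sketch is that a late answer and no answer look the same to the caller, which is why a busy HA instance can produce "unavailable" even when the network is fine.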

Set the logger (under logs:) to

homeassistant.components.hue: debug

and you’ll see when it’s happening.


I’ve asked to split the logic between the API attribute ‘reachable’ (Hue Hub communicating with the light) and ‘unavailable’ (HA communicating with the Hue Hub), but Paulus didn’t want that. Nor was it accepted to add the attribute itself, even though it would be very handy on lights that don’t have power: it would show a true unreachable (Hub can’t see the light) without having to show ‘Unavailable’ (HA/Hue comm error)…

We can only hope Hue will one day change its API to push, so HA won’t need to poll the Hub any longer. Since that is very unlikely (we’ve all posted in the Hue developers community and asked for it), all we can do is optimize the HA instance as much as possible.
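As an illustration of the split being asked for (and explicitly not how HA behaves, since the request was declined), the two failure modes could be kept apart roughly like this. The function and parameter names are hypothetical; only the ‘reachable’ flag comes from the Hue API.

```python
def light_state(bridge_poll_ok: bool, reachable: bool, is_on: bool) -> str:
    """Hypothetical state mapping that separates the two kinds of failure."""
    if not bridge_poll_ok:
        # HA got no (timely) answer from the Hub: HA/Hue comm error.
        return "unavailable"
    if not reachable:
        # The Hub answered but cannot see the light, e.g. it has no power.
        return "unreachable"
    return "on" if is_on else "off"
```

With a mapping like this, a bulb switched off at the wall would read "unreachable", while a missed poll would still read "unavailable".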

Heck, I’ve even created a sensor to check for the Hub not being overloaded.

As said, it has been very reliable lately, and fingers crossed that will stay.

FWIW, I occasionally read another community forum for a home automation product developed by one person for over ten years. This individual is a very experienced software developer and I now recall he was dumbfounded by how a company with all of Philips’ resources chose to create a polling-based API.

Is there any way to pinpoint the top 2 or 3 of these that might have helped you? Watching my new migrated instance on a Mini PC that has almost no load, and it’s still happening.

The mind-boggling thing is that this setup has been fine for YEARS, even on an RPi2. I can’t really remember when it started, but it’s within the last month or two.

I might start removing integrations one at a time just to isolate it, though I don’t really have a whole lot of them to begin with.

FWIW, I have a test instance of Home Assistant OS on an RPi3. Using the SSH Add-on, when I ping my Philips Hue Bridge it returns sub-second responses.

Unless your DNS has an entry for the Hue Bridge, you’ll have to ping it using its IP address.

The response times from my production server (running on an old laptop) are slightly faster (under a half-second). All devices are connected to the same 10/100 switch.

Only my production server communicates with the Hue Bridge and I haven’t noticed incidents where all Hue devices become unavailable (I currently only have a half-dozen Hue Color bulbs).

Anything else you added or changed in the HA setup?

Start with the recorder and logbook. Simply include only what you need and exclude everything else, including events.

exclude:
  event_types:
    - service_removed
    - service_executed
    - platform_discovered
    - homeassistant_start
    - homeassistant_stop
    - feedreader
    - service_registered
    - call_service
    - component_loaded
    - logbook_entry
    - system_log_event
    - automation_triggered
    - script_started
    - timer_out_of_sync

  domains:
    - alert
    - automation
    - binary_sensor
    - camera
    - climate
    - counter
    - customizer
    - device_tracker
    - group
    - input_boolean
    - input_datetime
    - input_number
    - input_select
    - input_text
    - light
    - media_player
    - proximity
    - scene
    - sensor
    - script
    - sun
    - switch
    - persistent_notification
    - person
    - remote
    - timer
    - updater
    - variable
    - weather
    - zone

Again, the ping results won’t help here. Still, to compare:

~ $ ping 192.168.1.212
PING 192.168.1.212 (192.168.1.212): 56 data bytes
64 bytes from 192.168.1.212: seq=0 ttl=63 time=0.622 ms
64 bytes from 192.168.1.212: seq=1 ttl=63 time=0.531 ms
64 bytes from 192.168.1.212: seq=2 ttl=63 time=0.560 ms
64 bytes from 192.168.1.212: seq=3 ttl=63 time=0.497 ms
64 bytes from 192.168.1.212: seq=4 ttl=63 time=0.418 ms
64 bytes from 192.168.1.212: seq=5 ttl=63 time=0.510 ms
64 bytes from 192.168.1.212: seq=6 ttl=63 time=0.478 ms
64 bytes from 192.168.1.212: seq=7 ttl=63 time=0.508 ms
64 bytes from 192.168.1.212: seq=8 ttl=63 time=0.454 ms
64 bytes from 192.168.1.212: seq=9 ttl=63 time=0.710 ms
64 bytes from 192.168.1.212: seq=10 ttl=63 time=0.526 ms
64 bytes from 192.168.1.212: seq=11 ttl=63 time=0.536 ms
64 bytes from 192.168.1.212: seq=12 ttl=63 time=0.506 ms
64 bytes from 192.168.1.212: seq=13 ttl=63 time=0.481 ms
64 bytes from 192.168.1.212: seq=14 ttl=63 time=0.456 ms
64 bytes from 192.168.1.212: seq=15 ttl=63 time=0.454 ms

and this is what’s registered on the Hub:


Add the Hue debug setting and check for fetch events that take 5 seconds or more. If there are none, all is well; if there are, those are the moments all Hue entities go unavailable.
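Scanning the debug log by eye gets tedious, so here is a small sketch that filters it. It assumes the debug lines contain a message of the form "Finished fetching <kind> data in <n> seconds"; treat that wording as an assumption and adjust the regex to whatever your log actually shows. The sample lines are illustrative, not copied from a real log.

```python
import re

SLOW_FETCH_S = 5.0  # threshold taken from the "+5 seconds" advice above

# Assumed message format; adjust if your debug log is worded differently.
FETCH_RE = re.compile(
    r"Finished fetching (?P<kind>\w+) data in (?P<secs>\d+(?:\.\d+)?) seconds"
)


def slow_fetches(lines, threshold_s=SLOW_FETCH_S):
    """Return (kind, seconds) for every fetch at or above the threshold."""
    hits = []
    for line in lines:
        match = FETCH_RE.search(line)
        if match and float(match.group("secs")) >= threshold_s:
            hits.append((match.group("kind"), float(match.group("secs"))))
    return hits


# Illustrative lines only (made up for this example).
SAMPLE = [
    "Finished fetching light data in 0.312 seconds",
    "Finished fetching sensor data in 5.004 seconds",
    "Finished fetching group data in 0.120 seconds",
]
```

To run it against a real log, pass an open file object, for example `slow_fetches(open("home-assistant.log"))`, where the file name depends on your installation.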

I mean, you know how it goes - You add little things here and there and then one random day your wife shouts from another room “Alexa won’t turn the Family Room lights off!”

Good starting list - Thank you! I was just going through one of my old RPi2 configuration.yaml files and found that I had a bunch of things excluded from recorder and logbook, so I’ll start there and then expand to what you’ve shared if the problem doesn’t go away.

You may have misunderstood my suggestion. Use ping to check connectivity when the devices are reported to be unavailable.

Pinging when they are all available is not likely to reveal much unless response times are unusually long even under normal circumstances (unlikely). I only posted my results for comparative “normal” conditions. As mentioned, I’ve never experienced the unavailable situation so I can’t provide ping times for that condition.

Anyway, if you say it’s a glitch with the integration then so be it.

@roddie, check this “How to reduce your database size and extend the life of your SD card” by @tom_l for detailed info on settings for recorder, logbook and history (which I forgot all about…)

Thanks, @Mariusthvdb! This is excellent. I went through an exercise like this when I was still on my RPi2, but I didn’t bother when I rebuilt on an RPi3 or migrated to the Mini PC.

That said, I’m thinking it might be related to the Integration itself because that’s really the only thing that I can think of that changed between stable and my wife yelling at me. I’ve removed the integration and gone back to the manual entry in my YAML file.

I’ll keep working at it and let everyone know what I come up with

So, an update - I removed the Hue integration and went back to the manual YAML method:

hue:
  bridges:
    - host: 10.x.x.202

For whatever reason, HA didn’t see the hub/lights at all anymore - IIRC, I had to manually trigger a discovery when I had it configured this way before, but I was short on time, so I didn’t bother trying to troubleshoot for very long.

I added the Hue integration back while leaving the YAML intact, and I’ve had only a fraction of the “unavailable” messages in my Logbook since, and only with one bulb. Prior to this, I was getting them a lot more often with all of my bulbs.

It could just be a coincidence, and I’ll keep monitoring, but so far, so good.

I have exactly the same issues since the last update. Every 1-2 minutes the Hue lights go to off/unavailable in the log. No changes to the config, and it worked for years before. Is there any real fix/solution? It sounds like a trial-and-error approach so far.

I see in the log file:
Timeout fetching sensor data
Logger: homeassistant.components.hue.sensor_base
Source: helpers/update_coordinator.py:140
Integration: Philips Hue (documentation, issues)
First occurred: July 17, 2020, 11:21:47 AM (46 occurrences)
Last logged: 7:07:23 AM

and

Timeout fetching light data
Timeout fetching group data

Log Details (ERROR)

Logger: homeassistant.components.hue.light
Source: helpers/update_coordinator.py:140
Integration: Philips Hue (documentation, issues)
First occurred: July 16, 2020, 9:03:50 PM (99 occurrences)
Last logged: 7:25:14 AM

My log file is showing three different errors in fact. I wonder if all of them are coming from different issues that people may be having here:

  • Request failed 3 times, giving up (x114),
  • Timeout fetching group data (x38),
  • Timeout fetching sensor data (x19).

I have made sure that my bridge has a fixed IP. Zigbee is operating on channel 11 and WiFi, I think, on channel 6, so there should be no interference(?). Other than that, I’m running a little low on ideas.

I may have made some progress with my issue. I have been continuously pinging my Pi-based HA server every 2s from another Pi. Both are connected to the same stock Virgin router. Every other minute, I can see the host becoming unreachable:

Mon 10 Aug 08:22:58 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=405 ttl=64 time=0.475 ms
Mon 10 Aug 08:23:00 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=406 ttl=64 time=0.624 ms
Mon 10 Aug 08:23:13 BST 2020: From 192.168.0.50 icmp_seq=411 Destination Host Unreachable
Mon 10 Aug 08:23:13 BST 2020: From 192.168.0.50 icmp_seq=412 Destination Host Unreachable
Mon 10 Aug 08:23:17 BST 2020: From 192.168.0.50 icmp_seq=413 Destination Host Unreachable
Mon 10 Aug 08:23:17 BST 2020: From 192.168.0.50 icmp_seq=414 Destination Host Unreachable
Mon 10 Aug 08:23:22 BST 2020: From 192.168.0.50 icmp_seq=415 Destination Host Unreachable
Mon 10 Aug 08:23:22 BST 2020: From 192.168.0.50 icmp_seq=416 Destination Host Unreachable
Mon 10 Aug 08:23:25 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=417 ttl=64 time=2079 ms
Mon 10 Aug 08:23:25 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=418 ttl=64 time=0.648 ms

and when I refresh the logbook in HA, these are clearly correlated with my bridge going down. So this is now looking like a network connectivity issue on the server side.
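To line these gaps up against the logbook, the timestamped ping output can be reduced to a list of outage windows. A minimal sketch, assuming each line is prefixed with a `date`-style timestamp as in the excerpt above; the function name is made up, and it only looks at the time of day, so a log that crosses midnight would need the date parsed too.

```python
import re
from datetime import datetime

# Matches the "HH:MM:SS TZ YYYY: " part of the prefix and captures the rest.
LINE_RE = re.compile(r"(?P<clock>\d{2}:\d{2}:\d{2}) \w+ \d{4}: (?P<rest>.*)")


def find_outages(lines):
    """Return (last_ok, next_ok, gap_seconds) for each unreachable stretch."""
    outages = []
    last_ok = None
    in_outage = False
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group("clock"), "%H:%M:%S")
        rest = match.group("rest")
        if "bytes from" in rest:  # a normal echo reply
            if in_outage and last_ok is not None:
                gap = int((stamp - last_ok).total_seconds())
                outages.append(
                    (last_ok.strftime("%H:%M:%S"), stamp.strftime("%H:%M:%S"), gap)
                )
            in_outage = False
            last_ok = stamp
        elif "Unreachable" in rest:
            in_outage = True
    return outages


# Lines taken from the ping excerpt above.
SAMPLE = [
    "Mon 10 Aug 08:23:00 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=406 ttl=64 time=0.624 ms",
    "Mon 10 Aug 08:23:13 BST 2020: From 192.168.0.50 icmp_seq=411 Destination Host Unreachable",
    "Mon 10 Aug 08:23:22 BST 2020: From 192.168.0.50 icmp_seq=416 Destination Host Unreachable",
    "Mon 10 Aug 08:23:25 BST 2020: 64 bytes from 192.168.0.60: icmp_seq=417 ttl=64 time=2079 ms",
]
```

Each tuple gives the last good reply, the first good reply after the gap, and the seconds in between, which can then be compared with the "unavailable" entries in the logbook.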

I am trying to evaluate whether this is happening because of the router or because of my Pi4.