Troubleshooting deCONZ unavailability and high CPU loads

Hi All,
It feels like since v0.114.x I have had an issue that first showed up as unavailability of all bulbs connected to my deCONZ. Digging a little revealed relatively high CPU loads. Since I noticed high memory consumption (80%+) caused by the UniFi add-on, I decided to first replace the RPi3+ with an RPi4b-8GB (I missed the fact that at most 4GB is currently supported, so I am running a development build as mentioned here). While doing that, I thought it would be good to use a brand new Samsung EVO 32GB SD card, and I use the official 15W power adapter.

Besides the pleasantly low memory usage, this unfortunately was no solution. So the issues, which I think might be related to each other, are basically:

  1. Unclear high CPU loads
    I first noticed a trend of CPU spikes at a frequency close to once an hour, and sometimes a spike in between. At the moment it is constantly higher, but that might be caused by Glances. I don’t know.
    The high CPU load annoys me because HA gets slow at turning on lights in response to motion detectors for example. I am not 100% sure issue two relates, but there is some coincidence at least.

  2. Random unavailability of bulbs and sensors from brands like IKEA, Philips Hue and Xiaomi
    I started monitoring this by using a trigger on the unavailability of the deCONZ ALL-group, so I can be sure that all bulbs are unavailable. Last evening/night this happened 5 times (23:22, 00:42, 01:53, 02:39, 03:20) and during the daytime 4 times until now. The message sent contains this data:

CPU loads: 1.1/0.8/0.81
CPU perc : 5%
RAM perc : 17.3%.

CPU and memory are always about the same, and the load was at most 2.77. Today (during daytime, so with activity in the house) the load was at most 4.77 when stuff became unavailable.
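
For anyone who wants to set up similar monitoring, here is a rough sketch of an automation that sends such a message when the ALL-group becomes unavailable. It is not my exact configuration: the entity ID `light.all`, the `notify.notify` service and the `sensor.*` names (assumed to come from the System Monitor integration) are placeholders that will differ per installation.

```yaml
automation:
  - alias: "deCONZ ALL-group unavailable"
    trigger:
      - platform: state
        entity_id: light.all        # hypothetical entity ID of the deCONZ ALL-group
        to: "unavailable"
    action:
      - service: notify.notify      # assumption: replace with your own notify service
        data:
          # Sensor names below are assumed System Monitor entities
          message: >-
            CPU loads: {{ states('sensor.load_1m') }}/{{ states('sensor.load_5m') }}/{{ states('sensor.load_15m') }}
            | CPU perc : {{ states('sensor.processor_use') }}%
            | RAM perc : {{ states('sensor.memory_use_percent') }}%
```

Triggering on the group rather than on a single bulb filters out individual devices that drop off on their own.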

So besides reading a ton of topics, this is what I’ve done and where I currently stand:

  1. Updated the deCONZ firmware (to 2.05.79 / 26580700 at the moment)
  2. Changed the channel from 15 to 25. I never had an issue with 15, but perhaps the WiFi in the area changed, so I did an analysis and 25 is pretty quiet.
  3. Double checked and where needed updated the bulb firmware
  4. Restarted and rebooted everything multiple times, and switched hardware (from Pi3 to Pi4)
  5. Relocated the Pi
  6. Removed the HA database and set ‘purge_keep_days: 1’, this was only 5 in my case by the way. DB is very small, couple MB.
  7. Checked the mesh environment, and as far as I understand everything is fine.
  8. Because I wanted to know whether disk IO was causing the problems, I installed Glances using the add-on. As you can see in the image:
    – the read/write to the disk isn’t very high when the CPU load is higher;
    – critical CPU_IOWAITs feel relatively frequent, for example a very long one at 09:15, when the Conbee nodes became unavailable. In my alert it said: CPU loads: 4.24/3.05/2.09 | CPU perc : 11% | RAM perc : 17.6%.
    That’s why I think there is a relation and decided to write this post. Next to that, the unavailability issue worsened while running Glances (on screen), which of course increased the CPU usage.

I hope I didn’t forget anything I tried, but right now I have no idea what to do to solve this issue. I am happy to buy an SSD (for example), but I do want to know that it will solve the CPU issue. Looking at Glances, I am not convinced. If issue 2 remains after solving issue 1, then that is a different thing to tackle.

But one thing is sure, I need your help to troubleshoot.


Meanwhile:
I changed the recorder ‘commit_interval:’ to 600 to see if it has a positive effect on the stability by reducing the writes. I recalled that I had this set to 60 before, but had it commented out at the moment.
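
As a sketch, the recorder settings mentioned so far would look roughly like this in `configuration.yaml`. Both options are standard recorder options; the values are simply the ones I am trying, not a recommendation.

```yaml
recorder:
  purge_keep_days: 1     # keep only one day of history in the database
  commit_interval: 600   # commit to the database every 600 seconds to reduce writes
```

The trade-off of a long commit interval is that recent events take longer to be written to disk.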

I created a CPU IO Wait sensor to keep track of the IO and report it in my deCONZ monitor:

command_line:
  - sensor:
      name: CPU IO Wait
      command: top -b -n1 | grep ^CPU | awk '{printf("%.0f"), $10}'
      scan_interval: 3
      unit_of_measurement: "%"
      value_template: '{{ value }}'

Thanks to the command_line sensor the pattern is pretty obvious. Perhaps it is interesting to have cpu_io_wait added to the system monitor integration. With all the RPis in use, it seems like a useful sensor for troubleshooting.
[Image: HA_Load-vs-IO]
The custom sensor seems to create some additional load by itself.
So far the 600-second commit_interval has resulted in a ~50% reduction.
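
Since the sensor itself adds some load, one option is to poll less often. Below is a sketch of the same sensor with a longer interval. The value of 30 seconds is an assumption for illustration, not something measured here; short IO wait spikes between polls would be missed.

```yaml
command_line:
  - sensor:
      name: CPU IO Wait
      # Same command as the original sensor, only polled less frequently
      command: top -b -n1 | grep ^CPU | awk '{printf("%.0f"), $10}'
      scan_interval: 30   # assumption: 30 s instead of 3 s to lower the overhead
      unit_of_measurement: "%"
      value_template: '{{ value }}'
```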

The IO sensor, however, made me decide to buy an SSD to boot the Pi from. According to this guide, it’s as easy as installing on an SD card.

I hope this topic is of any help to anyone and I will report on the results of the SSD probably next weekend or so.

edit: Updated the sensor code to the format required in 2023.8


The SSD has been operational since yesterday evening and so far everything is great (again). There were a couple of challenges that I want to share though:

  1. I am running development build 5.2 64bit.
    It seems important, so now you know :grin:
  2. Booting from USB is not (always) as easy as said in the guide I linked in the previous post. Besides walking through these steps at Tom’s Hardware, I also had to downgrade the EEPROM to this version (v2020.07.16-138a1).
    After that, it booted from the SSD without an SD card. Everything I did in step 2 was done using an SD card with the latest version of Raspberry Pi OS.

There is one very important thing to be aware of, though!
In case you didn’t know: USB 3.0* Radio Frequency Interference on 2.4 GHz Devices

I use a Conbee 2 stick and this interference resulted in all entities not responding to the given instructions. If you don’t have a USB extension cable available, just plug the disk into a USB2 port and it should work just fine.

With or without using your USB3 ports, using an extension cable is advised and can help in case of Zigbee signal reliability issues.

I hope I helped or will help someone with this topic and the collection of links.

Cheers!


This post saved my day today.

I didn’t know that USB 3 interferes with the 2.4 GHz band too … I recently migrated from running Home Assistant Supervisor on an RPi4 from an SD card to running it from an SSD via the USB3 port. I didn’t observe any CPU issue. What I noticed is that custom groups set up in deCONZ with custom switch/button triggers were failing most of the time. I was getting the error “error apsde-data.confirm 0xE1”, which was related to writing to the SSD over USB3. After moving my SSD to a normal USB port, it is good now. (Note that I was using an extension cable for the ConBee II, but that didn’t help.)