How to find which integration is causing high CPU usage?

Problem

There is one Python thread that is using a very high amount of CPU. As consequence, my RPi is running constantly hot, and the overall system is noticeably slower. While HA is starting up, the CPU loadavg gets over 5 or even over 8. After the system running “idle” for a while, the loadavg goes down to between 1.0 and 1.1.

2022-06-18-Home-Assistant-temperature

Question/Objective

How can I figure out which integration (or which code) is causing this high CPU usage?

I have dozens of integrations enabled, and I’d love to be able to track down this issue without having to manually disable each one of them.

Once I know what code causes this issue, I can take action (either disable the integration, or change some parameter, or report the bug to the authors).

System

HAOS running on RPi 3.

Right now it’s running HA 2022.3.8, but I noticed it since I upgraded from 2021.12.10 to 2022.2.9.

Context/History

I had a HA installation that was getting very outdated (about one year without being updated), so I decided to gradually update it, either version-by-version or skipping one version at most. Things were going smoothly, but when I upgraded from 2021.12.10 to 2022.2.9 I started facing high CPU usage. I don’t know if it was caused by the update, or if it was caused by some new integration that I configured. And that’s the point: I don’t know, and I want to know.

I have already tried…

I tried looking at HA core logs (ha core logs, or through the /config/logs web interface). I couldn’t find anything that looked obviously wrong. I found, however, many warnings of “Setup of … is taking over 10 seconds.”, which makes sense, since there is a thread eating up precious CPU cycles.

I also found several mentions of “E0618 … fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies”, but I don’t know where that is coming from. (See also: this forum thread.)

I’ve managed to ssh into port 22222. I tried killing that specific thread, hoping that would cause the stack trace to be logged somewhere. Unfortunately, that just caused the whole HA core to shut down, and I found nothing on the logs. (Or I failed to look in the right place.)

I thought about using strace to observe the syscalls from that thread, but there is no strace binary on HAOS.

Final words…

Huh… Help!

Have you tried turning off your integrations?
Turn off half of them. If the CPU usage is still high you have just eliminated half of the integrations as the cause. Then turn off half of the remaining integrations and check again. This way you only need four checks to find one of 16 integrations causing problems.

I also found these two issues in GitHub:

I also tried running strace. But, hey, there is no such binary in that HAOS system! Well, I found a repository of statically linked binaries, including some for arm architecture. I got the strace binary from there, then the next challenge was to save it onto the running HAOS. It has neither curl nor wget, and sftp to port 22222 fails for some reason. I managed to upload it anyway (and that’s left as an exercise to the reader), saving to /tmp (because /root is read-only).

So, after finally running strace -o fubar -p 1234, and then downloading that logfile to another computer (also left as an exercise), I managed to analyse the trace of a few seconds of the process. It’s not useful at all. Here, take a look at the entire logfile, after passing through | sort | uniq -c:

      1 epoll_pwait(50, {}, 100, 0, NULL, 8)    = 0
    826 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671658, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
   1464 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671659, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
   1455 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671660, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
      1 futex(0x6611d230, FUTEX_WAIT_PRIVATE, 0, NULL <detached ...>
      1 futex(0x6611d830, FUTEX_WAKE_PRIVATE, 1) = 1
      2 futex(0x6611f230, FUTEX_WAKE_PRIVATE, 1) = 1
      1 syscall_403(0, 0x6097da40, 0, 0x6097da40, 0x193, 0) = 0
   3745 syscall_403(0, 0x6097db20, 0, 0x6097db20, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097da20, 0x1, 0x6097da20, 0x193, 0x691f) = 0
      1 syscall_403(0x1, 0x6097da40, 0x1, 0x6097da40, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097da80, 0x1, 0x6097da80, 0x193, 0x6097dbcc) = 0
      1 syscall_403(0x1, 0x6097db10, 0x1, 0x6097db10, 0x193, 0x43ce6bb) = 0
   3745 syscall_403(0x1, 0x6097db20, 0x1, 0x6097db20, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097db30, 0x1, 0x6097db30, 0x193, 0) = 0
   3746 syscall_403(0x1, 0x6097db30, 0x1, 0x6097db30, 0x193, 0x6097dcb0) = 0
   3746 syscall_403(0x1, 0x6097db60, 0x1, 0x6097db60, 0x193, 0x68136424) = 0

Which makes me conclude strace is a dead-end for solving my issue.

Meanwhile, the system is having loadavg of 7. This is bad.


I guess I’ll have to manually disable each integration until I track it down. I wish I had better debugging tools in this case.

I found a better tool to help me diagnose the issue: py-spy

ssh [email protected] -p 22222
login
docker exec -it homeassistant /bin/bash
apk add py-spy --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted

Then I can use top to find the PID of the Python process. Then I can run py-spy.

From that tool it seems like the Chromecast integration is very heavy. After I disabled it (and restarted HA), the loadavg is usually under 1, and the temperature is down to 55°C. Still not as low as before, but maybe it needs more time to cool down. I’ll keep investigating.


And from looking at many forum threads I also discovered the Profiler integration. That’s something I’m gonna try later.


Here’s a list of forum threads that I found, so far (in no particular order):

1 Like

I did now. And now it doesn’t want to turn on again. There is a bug that the user can’t use the UI to re-enable any disabled integration. With this terrible UX, I kinda don’t want to disable any more integrations. :disappointed:

Never make system changes without a recent backup in your pocket. This applies to everything- not just Home Assistant.

Did you “disable” and not “delete”?

disable

After you have checked the performance hit, simply re enable it:
show

enable

Oh, don’t disable Supervisor. I don’t think you can re-enable it as easily as all other integrations.

Yes, I have daily backups on a SMB shared folder, plus a few extra backups I made manually. And yes, I clicked “Disable”. As you can see on a screen recording on that GitHub issue, clicking on either “Enable” or “⋮” does absolutely nothing.

However, it might be a good idea to try a fresh install followed by restoring from the backup. Who knows, maybe that helps. I’m also gonna try moving the recorder database to a separate MariaDB server. Unfortunately, all of this trial-and-error and debugging takes quite some time.

That’s really strange because I have done this a few times troubleshooting a different puzzle. Which integration won’t restart?

Honestly, I don’t remember anymore which one. I ended up manually editing files in /config/.storage/, which also required at least three rounds of trial-and-error to fully (re-)enable the integration(s), the devices and the entities.

@denilsonsa

If you want to send me a few py-spys I’m happy to look through them.