How to find which integration is causing high CPU usage?

denilsonsa · June 18, 2022, 10:07pm

Problem

There is one Python thread that is using a very high amount of CPU. As consequence, my RPi is running constantly hot, and the overall system is noticeably slower. While HA is starting up, the CPU loadavg gets over 5 or even over 8. After the system running “idle” for a while, the loadavg goes down to between 1.0 and 1.1.

2022-06-18-Home-Assistant-temperature

Question/Objective

How can I figure out which integration (or which code) is causing this high CPU usage?

I have dozens of integrations enabled, and I’d love to be able to track down this issue without having to manually disable each one of them.

Once I know what code causes this issue, I can take action (either disable the integration, or change some parameter, or report the bug to the authors).

System

HAOS running on RPi 3.

Right now it’s running HA 2022.3.8, but I noticed it since I upgraded from 2021.12.10 to 2022.2.9.

Context/History

I had a HA installation that was getting very outdated (about one year without being updated), so I decided to gradually update it, either version-by-version or skipping one version at most. Things were going smoothly, but when I upgraded from 2021.12.10 to 2022.2.9 I started facing high CPU usage. I don’t know if it was caused by the update, or if it was caused by some new integration that I configured. And that’s the point: I don’t know, and I want to know.

I have already tried…

I tried looking at HA core logs (ha core logs, or through the /config/logs web interface). I couldn’t find anything that looked obviously wrong. I found, however, many warnings of “Setup of … is taking over 10 seconds.”, which makes sense, since there is a thread eating up precious CPU cycles.

I also found several mentions of “E0618 … fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies”, but I don’t know where that is coming from. (See also: this forum thread.)

I’ve managed to ssh into port 22222. I tried killing that specific thread, hoping that would cause the stack trace to be logged somewhere. Unfortunately, that just caused the whole HA core to shut down, and I found nothing on the logs. (Or I failed to look in the right place.)

I thought about using strace to observe the syscalls from that thread, but there is no strace binary on HAOS.

Final words…

Huh… Help!

stevemann · June 18, 2022, 10:44pm

Have you tried turning off your integrations?
Turn off half of them. If the CPU usage is still high you have just eliminated half of the integrations as the cause. Then turn off half of the remaining integrations and check again. This way you only need four checks to find one of 16 integrations causing problems.

denilsonsa · June 19, 2022, 9:02pm

I also found these two issues in GitHub:

I also tried running strace. But, hey, there is no such binary in that HAOS system! Well, I found a repository of statically linked binaries, including some for arm architecture. I got the strace binary from there, then the next challenge was to save it onto the running HAOS. It has neither curl nor wget, and sftp to port 22222 fails for some reason. I managed to upload it anyway (and that’s left as an exercise to the reader), saving to /tmp (because /root is read-only).

So, after finally running strace -o fubar -p 1234, and then downloading that logfile to another computer (also left as an exercise), I managed to analyse the trace of a few seconds of the process. It’s not useful at all. Here, take a look at the entire logfile, after passing through | sort | uniq -c:

      1 epoll_pwait(50, {}, 100, 0, NULL, 8)    = 0
    826 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671658, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
   1464 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671659, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
   1455 futex(0x6611d230, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1655671660, 0}, ffffffff) = -1 ETIMEDOUT (Operation timed out)
      1 futex(0x6611d230, FUTEX_WAIT_PRIVATE, 0, NULL <detached ...>
      1 futex(0x6611d830, FUTEX_WAKE_PRIVATE, 1) = 1
      2 futex(0x6611f230, FUTEX_WAKE_PRIVATE, 1) = 1
      1 syscall_403(0, 0x6097da40, 0, 0x6097da40, 0x193, 0) = 0
   3745 syscall_403(0, 0x6097db20, 0, 0x6097db20, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097da20, 0x1, 0x6097da20, 0x193, 0x691f) = 0
      1 syscall_403(0x1, 0x6097da40, 0x1, 0x6097da40, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097da80, 0x1, 0x6097da80, 0x193, 0x6097dbcc) = 0
      1 syscall_403(0x1, 0x6097db10, 0x1, 0x6097db10, 0x193, 0x43ce6bb) = 0
   3745 syscall_403(0x1, 0x6097db20, 0x1, 0x6097db20, 0x193, 0) = 0
      1 syscall_403(0x1, 0x6097db30, 0x1, 0x6097db30, 0x193, 0) = 0
   3746 syscall_403(0x1, 0x6097db30, 0x1, 0x6097db30, 0x193, 0x6097dcb0) = 0
   3746 syscall_403(0x1, 0x6097db60, 0x1, 0x6097db60, 0x193, 0x68136424) = 0

Which makes me conclude strace is a dead-end for solving my issue.

Meanwhile, the system is having loadavg of 7. This is bad.

I guess I’ll have to manually disable each integration until I track it down. I wish I had better debugging tools in this case.

denilsonsa · June 19, 2022, 10:44pm

I found a better tool to help me diagnose the issue: py-spy

ssh [email protected] -p 22222
login
docker exec -it homeassistant /bin/bash
apk add py-spy --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted

Then I can use top to find the PID of the Python process. Then I can run py-spy.

From that tool it seems like the Chromecast integration is very heavy. After I disabled it (and restarted HA), the loadavg is usually under 1, and the temperature is down to 55°C. Still not as low as before, but maybe it needs more time to cool down. I’ll keep investigating.

And from looking at many forum threads I also discovered the Profiler integration. That’s something I’m gonna try later.

Here’s a list of forum threads that I found, so far (in no particular order):

Search results for 'py-spy' - Home Assistant Community
How to install py-spy? (2021-10-18 to 2022-02-12, installs using Alpine’s apk)
Help with installing Py-Spy to identify process pushing the CPU to 100% 24/7 (2020-11-26, tries installing using pip)
Python3 high CPU Usage (from 2020-01-03 until 10 days ago)
- _worker can be ignored as it’s idle time (2022-01-15)
WTH: Isn’t Py-Spy integrated in Home-Assistant? (2020-09-17)
Py-spy in Home Assistant running on RPi 3 with HassOS-based image (from 2020-06-03 until 2022-09-20)
Home Assistant - High Memory Usage (from 2020-03-23 until one month ago)
- py-spy command-line examples (2020-10-09)
High processor usage of python3, drives processor to high temps (2020-05-13)
High CPU usage after 0.113 (from 2020-07-28 to 2020-08-26)
Sharp permanent increases in CPU usage (2022-02-15)
- GitHub issue #66389: CPU usage is high after 2022.2.0 (still open as of now)
HA CPU usage normal? (from 2021-06-15 to 2022-05-10)

denilsonsa · June 20, 2022, 8:48pm

I did now. And now it doesn’t want to turn on again. There is a bug that the user can’t use the UI to re-enable any disabled integration. With this terrible UX, I kinda don’t want to disable any more integrations.

stevemann · June 20, 2022, 10:35pm

Never make system changes without a recent backup in your pocket. This applies to everything- not just Home Assistant.

Did you “disable” and not “delete”?

disable

After you have checked the performance hit, simply re enable it:
show

enable

Oh, don’t disable Supervisor. I don’t think you can re-enable it as easily as all other integrations.

denilsonsa · June 21, 2022, 6:04am

Yes, I have daily backups on a SMB shared folder, plus a few extra backups I made manually. And yes, I clicked “Disable”. As you can see on a screen recording on that GitHub issue, clicking on either “Enable” or “⋮” does absolutely nothing.

However, it might be a good idea to try a fresh install followed by restoring from the backup. Who knows, maybe that helps. I’m also gonna try moving the recorder database to a separate MariaDB server. Unfortunately, all of this trial-and-error and debugging takes quite some time.

stevemann · June 21, 2022, 4:36pm

That’s really strange because I have done this a few times troubleshooting a different puzzle. Which integration won’t restart?

denilsonsa · July 4, 2022, 10:27pm

Honestly, I don’t remember anymore which one. I ended up manually editing files in /config/.storage/, which also required at least three rounds of trial-and-error to fully (re-)enable the integration(s), the devices and the entities.

bdraco · July 5, 2022, 7:25pm

@denilsonsa

If you want to send me a few py-spys I’m happy to look through them.

joaopequeno · February 19, 2024, 2:04pm

Hello,

I was having the same issue, and the solution was reinstall all.
Always dificult to find.

How did you manage this?
imagem
the max and min is very usefull…

denilsonsa · February 19, 2024, 2:33pm

Using custom:mini-graph-card.

Tismo · February 28, 2024, 2:02pm

sorry for raising an old thread, Would you be willing to share the card config please?

denilsonsa · February 28, 2024, 4:16pm

As documented in available show options.

aggregate_func: max
entities:
  - entity: sensor.cpu_temperature
    show_indicator: true
    show_state: true
hour24: true
hours_to_show: 48
name: Raspberry Pi CPU temperature
points_per_hour: 2
show:
  average: true
  extrema: true
  icon: false
  labels: true
  legend: true
  name: true
smoothing: false
type: custom:mini-graph-card

Tismo · February 28, 2024, 6:42pm

Thank you sir.