Any advice on tracking down high CPU utilization?

luma · August 6, 2017, 12:19pm

I’m running Hass 0.50.2 on an x86 Ubuntu 17.04 VM with a pair of vCPUs assigned to it. For a while now (so, back a few versions) I’ve been experiencing an issue where my CPU utilization pegs at 100% some time after I open an active browser window to Hass. If I close all my browser windows, the CPU goes back to normal.

The PID of the impacted process doesn’t change, so it’s not spawning a new thread and doing something over and over. Further, the utilization is perfectly normal for some time after opening a browser window, so I can’t immediately reproduce the issue just by launching a new browser session. I haven’t yet been able to identify what exactly triggers the utilization, apart from the observation that if I close all browser sessions it goes away.

Here’s what htop looks like while the issue is happening:

Here’s the last week of utilization. On Friday I went from one vCPU to two, which is why the utilization went from flat 100% to 75%-ish while the issue was happening.

I’m not sure what other logging approach I can take to dig further into this issue. Does anyone have any ideas for how I might be able to pull more information about this behavior?

zarthan · August 6, 2017, 1:50pm

What HA logging are you doing? Log level? Size of the database? What other VMs are running? What are you hosting it on?
I have a Debian 8 VM with HA 50.2 on ESXi 6.5 and I have an embarrassingly flat usage graph. I have 7 VMs running on an Intel Skull Canyon NUC but other than a MythTV DVR most are idling most of the time.

luma · August 6, 2017, 1:55pm

My vSphere farm has 5 hosts and the behavior doesn’t appear to be linked to a specific host as the workload is moved around with DRS. Currently the logger service is commented out which means logging is at whatever the default value is, and there’s no sign of excessive disk IO or prolonged disk response times which would be my primary sign of database performance issues. From the hypervisor’s perspective everything is normal outside of one core being pegged, within the VM it’s the single PID that’s hitting CPU but not RAM or disk IO.

zarthan · August 6, 2017, 2:47pm

My farm is now silent. Every month with our energy bill we got a notice were consuming 3 times as much electricity as the average neighbors and 5 times more than the most efficient. I now only power up the servers if I absolutely need to.
What do the HA logs show? Errors? The HA syslogs? Error?? Try setting the HA logger to debug, to see if anything shows. Use tail -f on syslog and the home-assistant.log and filter through grep to see if anything shows?

zarthan · August 6, 2017, 2:51pm

You might also scan the VM logs and the host logs as well. Run them through grep to filter on errors and then on things that you find in the error search.

luma · August 6, 2017, 2:54pm

Host and vSphere logs report no issues outside of the CPU utilization, which have gone silent since I added the additional vCPU (as 100% on one CPU mean 50% CPU utilization when assigned two vCPUs which falls underneath the alerting thresholds). This strongly indicates the issue is within the VM, and htop appears to confirm that - the issue is 100% the result of the single PID running hass.

The default HA logs show nothing that I wouldn’t expect (occasional errors with Sonos which have always been there). I’ll light up the debug logger and see if I can’t get the system back into a problem state.

luma · August 6, 2017, 3:13pm

Allright, the setup:

configuration.yaml

logger:
  default: debug

Test process #1

Two shells open with the following commands being run:
tail -f /home/homeassistant/.homeassistant/home-assistant.log | grep -i error
tail -f /var/log/syslog | grep -i error

And finally a third session running htop.

Then opened several browser sessions which nicely has triggered the high utilization so I can watch things.

htop output

log output

administrator@hass:~$ tail -f /home/homeassistant/.homeassistant/home-assistant.log | grep -i error
2017-08-06 11:03:47 ERROR (MainThread) [aiohttp.server] Unhandled exception
concurrent.futures._base.CancelledError
2017-08-06 11:08:46 ERROR (MainThread) [aiohttp.server] Unhandled exception
concurrent.futures._base.CancelledError
2017-08-06 11:08:54 ERROR (MainThread) [aiohttp.server] Unhandled exception
concurrent.futures._base.CancelledError
2017-08-06 11:09:14 ERROR (MainThread) [aiohttp.server] Unhandled exception
concurrent.futures._base.CancelledError

administrator@hass:~$ tail -f /var/log/syslog | grep -i error
Aug  6 11:00:02 hass hass[28686]: AttributeError: 'DidlMusicTrack' object has no attribute 'creator'
Aug  6 11:01:24 hass hass[28686]: AttributeError: 'DidlMusicTrack' object has no attribute 'creator'
Aug  6 11:03:47 hass hass[28686]: #033[31m2017-08-06 11:03:47 ERROR (MainThread) [aiohttp.server] Unhandled exception#033[0m
Aug  6 11:03:47 hass hass[28686]: concurrent.futures._base.CancelledError#033[0m
Aug  6 11:04:04 hass hass[28686]: AttributeError: 'DidlMusicTrack' object has no attribute 'creator'
Aug  6 11:08:46 hass hass[28686]: #033[31m2017-08-06 11:08:46 ERROR (MainThread) [aiohttp.server] Unhandled exception#033[0m
Aug  6 11:08:46 hass hass[28686]: concurrent.futures._base.CancelledError#033[0m
Aug  6 11:08:54 hass hass[28686]: #033[31m2017-08-06 11:08:54 ERROR (MainThread) [aiohttp.server] Unhandled exception#033[0m
Aug  6 11:08:54 hass hass[28686]: concurrent.futures._base.CancelledError#033[0m
Aug  6 11:09:14 hass hass[28686]: #033[31m2017-08-06 11:09:14 ERROR (MainThread) [aiohttp.server] Unhandled exception#033[0m
Aug  6 11:09:14 hass hass[28686]: concurrent.futures._base.CancelledError#033[0m

Then closed all browser windows and CPU utilization went back to idle.

Observations

The concurrent.futures._base.CancelledError#033[0m shares the same PID as the process consuming CPU
That error stops occurring when I close the browser windows and CPU returns to normal
The AttributeError: 'DidlMusicTrack' object has no attribute 'creator' message occurs regardless and I think is related to this annoying problem.
The #033[0m bit isn’t a copy/paste error but literal text from the log (looks like VT100 color codes)

luma · August 6, 2017, 3:27pm

Test process #2

Left log settings unchanged and tail | grep sessions running from previous test
Re-opened several browser sessions to hass
Observed high CPU utilization, same PID 28686
No errors observed in either log. If the errors above are related, they aren’t thrown every time the process is experiencing high utilization

Test process #3

After several minutes running at 100% CPU with browser windows open and no errors logged, I then closed all the browser windows
CPU immediately returns to normal
home-assistant.log records 2 lines containing “error” at the moment I close the pages:

2017-08-06 11:32:36 ERROR (MainThread) [aiohttp.server] Unhandled exception
concurrent.futures._base.CancelledError

syslog records similar errors as before at the same moment:

Aug  6 11:32:36 hass hass[28686]: #033[31m2017-08-06 11:32:36 ERROR (MainThread) [aiohttp.server] Unhandled exception#033[0m
Aug  6 11:32:36 hass hass[28686]: concurrent.futures._base.CancelledError#033[0m

zarthan · August 6, 2017, 3:46pm

You now have errors that you can check on the HA site in github and report. A Google search turned up

You are not alone.

luma · August 6, 2017, 4:03pm

Neither issues appear to be experiencing high CPU utilization (or if they are, the submitters aren’t reporting it). Not sure what to do with this information…

edit: I take that back, there’s talk of issues related to cameras. Commenting out all cameras in my config and let’s see if I can’t trigger the issue again.

zarthan · August 6, 2017, 5:10pm

Whether this is cameras or not, I think you should report this on github. Unless issues are reported, nothing gets fixed. It may not necessarily be cameras, but rather cameras in combination with some other component. Check in the github issues to see what other component that may have been mentioned and comment that out but leave cameras enabled.

luma · August 6, 2017, 6:55pm

With cameras disabled I can still trigger the high CPU utilization. I’ll try removing other components to see if I can’t get a better sense of what in my configuration might be causing the issue.

@zarthan: thanks for all your guidance so far, it has helped me clarify my thinking and approach to the problem.

zarthan · August 6, 2017, 7:05pm

Most welcome. As you must know, troubleshooting is a bitch sometimes. I personally use a modified Sherlock method. Eliminate everything that it isn’t. Whatever is left is “sometimes” the guilty party. Going the other way, start with the most basic and gradually add things. Remove the first thing that causes problems, and add the next. If the next causes problems, remove it and etc etc. Hope you get to some resolution. If you can find something this will help the developers.

luma · August 8, 2017, 12:34pm

Well there has been some degree of progress. I’m reasonably confident that the issue is not the contents of my configuration.yaml because I can re-create the issue with the following config:

homeassistant:
  name: Home
frontend:
http:
  server_port: 8123

This is making me nuts. As before, only on PID is pegged at 100% CPU, and closing all browser windows resolves the issue immediately.

zarthan · August 8, 2017, 1:29pm

Try migrating the VM to different storage.
I would really start looking deeply at you host logs and the guest logs.
Depending on how you installed it, it shouldn’t take more than an hour to reinstall from scratch. Move your config directory and reassign the USB stick.

luma · August 8, 2017, 2:40pm

I’m having a hard time putting together how host performance issues would cause a single process to run at 100% CPU and only while a web browser session is active. To help rule that out I’ve migrated the workload to a reasonably beefy host (2 x E5-2695v2s/128GB of RAM/local SSD datastore) and the issue persists. I’m reasonably confident the issue here isn’t the underlying virtualization platform and none of the performance indicators or logs on that platform are showing errors or any indication of stress (memory pressure, storage latency, or cpu utilization).

I’ll keep digging…

luma · August 8, 2017, 3:25pm

And I think I’ve found the culprit! Turns out the issue only happens when I’m accessing the environment through my reverse proxy (Netscaler). Hitting the host directly doesn’t appear to trigger the issue at all. Now to break out wireshark to figure out WTF my netscalers are doing that Hass doesn’t like.

zarthan · August 8, 2017, 4:29pm

Good luck.

zarthan · August 8, 2017, 4:31pm

That is the going basic and working outward

D85rNxd · November 11, 2018, 6:16pm

Sorry for the thread resurrection. Im seeing this on a recent install too, Im new to HA so wasn’t sure if this was just a demanding program. I’d assume given people run this on Raspberry PI’s that 2 vcores on a Xeon-d 1541 and 2GB RAM would be enough but does seem so as I had to bump to 4GB. Its a small system with a Aeotec USB Gen5, one smart plug, a HEM and a mains relay switch and a dimmer - 5 devices total.
Did you ever trace it back to anything?

EDIT: After checking debug logs Im seeing entries like this…

Nov 11 10:32:11 homeassistant hass[1005]: 2018-11-11 10:32:11 DEBUG (Thread-6) [homeassistant.components.recorder.util] converting 52743 rows to native objects took 42.854780s
    Nov 11 10:32:11 homeassistant hass[1005]: 2018-11-11 10:32:11 DEBUG (Thread-6) [homeassistant.components.history] get_significant_states took 42.891263s
    Nov 11 10:32:11 homeassistant hass[1005]: 2018-11-11 10:32:11 DEBUG (Thread-6) [homeassistant.components.recorder.util] converting 39 rows to native objects took 0.094964s
    Nov 11 10:32:12 homeassistant hass[1005]: 2018-11-11 10:32:12 DEBUG (Thread-6) [homeassistant.components.history] getting 38 first datapoints took 0.592362s