High CPU usage after 0.113

bdraco · August 14, 2020, 2:40pm

There wasn’t anything apparent in the py-spy recording. A thread dump from py-spy might reveal if there is an integration creating a lot of threads.

tht · August 14, 2020, 7:11pm

This is the output of the dump command. I executed it at the moment when CPU usage was at about 80%.

About a minute later the CPU usage went down again to about 10%. Looks like I get about one spike of high CPU usage every 2 minutes or so. At the same time I see an increase in network traffic (incoming and outgoing).

On MQTT I do have an OpenMQTTGateway, one Zigbee2MQTT Gateway and 5 Tasmota devices. Not much going on there.

Process 5995: python3 -m homeassistant --config /config
Python v3.8.3 (/usr/local/bin/python3.8)

Thread 0x560CE284B8B0 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1823)
    run_forever (asyncio/base_events.py:570)
    run_until_complete (asyncio/base_events.py:603)
    run (asyncio/runners.py:43)
    run (homeassistant/runner.py:119)
    main (homeassistant/__main__.py:312)
    <module> (homeassistant/__main__.py:320)
    _run_code (runpy.py:87)
    _run_module_as_main (runpy.py:194)
Thread 0x7FEA28243B20 (active): "Thread-1"
    dequeue (logging/handlers.py:1427)
    _monitor (logging/handlers.py:1478)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA2813FB20 (active): "SyncWorker_0"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA272FFB20 (active): "SyncWorker_1"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA269BFB20 (active): "SyncWorker_2"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA263FFB20 (active): "SyncWorker_3"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA25DFFB20 (active): "SyncWorker_4"
    scan (netdisco/ssdp.py:259)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA257FFB20 (active): "SyncWorker_5"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA251FFB20 (active): "SyncWorker_6"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA24BFFB20 (active): "SyncWorker_7"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA23EBFB20 (idle): "Recorder"
    wait (threading.py:302)
    get (queue.py:170)
    run (homeassistant/components/recorder/__init__.py:338)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA237FFB20 (active): "Thread-2"
    dequeue (logging/handlers.py:1427)
    _monitor (logging/handlers.py:1478)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA216D9B20 (active): "zeroconf-Engine-275"
    update_record (zeroconf/__init__.py:1612)
    update_record (netdisco/mdns.py:21)
    update_record (zeroconf/__init__.py:2622)
    handle_response (zeroconf/__init__.py:2662)
    handle_read (zeroconf/__init__.py:1405)
    run (zeroconf/__init__.py:1314)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA215D5B20 (idle): "zeroconf-Reaper_276"
    wait (threading.py:306)
    wait (zeroconf/__init__.py:2360)
    run (zeroconf/__init__.py:1422)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA1FDD5B20 (active): "Thread-6"
    loop (paho/mqtt/client.py:1163)
    loop_forever (paho/mqtt/client.py:1782)
    _thread_main (paho/mqtt/client.py:3428)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FEA1A4EDB20 (active): "Thread-7"
    loop (paho/mqtt/client.py:1163)
    loop_forever (paho/mqtt/client.py:1782)
    _thread_main (paho/mqtt/client.py:3428)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

bdraco · August 15, 2020, 1:03pm

In this case it would be best to get a py-spy recording with a duration of at least 120 seconds

Vlad · August 15, 2020, 1:28pm

I have:
Home Assistant 0.114.1

arch	aarch64
chassis	embedded
dev	false
docker	true
docker_version	19.03.8
hassio	true
host_os	HassOS 5.0
installation_type	Home Assistant OS
os_name	Linux
os_version	4.19.127-v8
python_version	3.8.3
supervisor	234

Using System Monitor platform:
pi4

tom_l · August 16, 2020, 8:03am

@bdraco one thing I have noticed is that the CPU utilisation climbs a fraction higher with each and every event from any platform. Could this be a result of the recent listener optimisations?

bdraco · August 16, 2020, 12:28pm

That code is very straight forward, so it’s not that. I wish it were as then we wouldn’t be waiting on upstream merges to fix the known ones.

bdraco · August 16, 2020, 12:49pm

Also, here is the latest list of known issues I’ve been working though https://github.com/home-assistant/core/issues/38862#issuecomment-674275786

franciszeks · August 18, 2020, 4:24pm

Unfortunatley since 0.113 I’m having the same issue of CPU usage constantly going up (only reboot helps). The same happens after updating to 0.114.2.

arch	armv7l
chassis	embedded
dev	false
docker	true
docker_version	19.03.11
hassio	true
host_os	HassOS 4.12
installation_type	Home Assistant OS
os_name	Linux
os_version	4.19.127-v7
python_version	3.8.3
supervisor	234
timezone	Europe/Warsaw
version	0.114.2
virtualenv	false

Running on Raspberry Pi 3 B+.

tom_l · August 18, 2020, 4:28pm

Restart is not the same as reboot. I hope you mean restart.

Could you please wait until the CPU use is high and run the py-spy tool mentioned further up in this topic then paste the results here. This will help with diagnosis of the issue.

chocomega · August 18, 2020, 4:35pm

It doesn’t seem to go higher than ~30% after a few days, the load sensor (15m) indicates a value around 1 though.
I hope the fix on zeroconf 0.28.1 will solve my issue. I’ll let you know when 0.114.3 is released. Thank you

franciszeks · August 18, 2020, 6:00pm

Yeah, I meant restart. Unfortunately I did it once again so I will need to wait for CPU usage to rise… which probably will take few days. Will get back with py-spy results. Thanks!

bdraco · August 18, 2020, 10:30pm

That makes sense if it is zeroconf since the thread it is running it is probably maxing out the cpu core.

shred · August 19, 2020, 1:23am

Just wanted to add I’m experiencing the same issue since 0.113 on an RPi3 B+. You can see after a restart it’s normal but it slowly continues to increase. It seems to stop around 20-25%. Will keep an eye out for 0.114.3 and update as soon as it’s released.

bdraco · August 19, 2020, 3:07am

Please post py-spy recordings & dumps if you can.

tht · August 20, 2020, 6:07am

Sorry for the delay. I did a py-spy recording for about 4 minutes. During the whole time CPU usage was above 60% on an i7 CPU. btw… I’m now on 0.114.2

This is how it looked for the last 6 hours:

I will now disable all automations just to make sure it’s not related to them.

I’m still not sure where the network traffic comes from. I’m also investigating this one.

Update: Network traffic is mostly from 224.0.0.251 (multicast DNS).

tht · August 20, 2020, 6:14am

I just realized that restarting Home-Assistant throws this in the log:

2020-08-20 08:12:22 ERROR (zeroconf-ServiceBrowser__nut._tcp.local.-_ipps._tcp.local.-_api._udp.local.-_plugwise._tcp.local.-_Volumio._tcp.local.-_googlecast._tcp.local.-_dkapi._tcp.local.-_miio._udp.local.-_axis-video._tcp.local.-_spotify-connect._tcp.local.-_daap._tcp.local.-_hap._tcp.local.-_elg._tcp.local.-_wled._tcp.local.-_viziocast._tcp.local.-_bond._tcp.local.-_esphomelib._tcp.local.-_ipp._tcp.local.-_printer._tcp.local._362) [homeassistant.components.zeroconf] Failed to get info for device
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zeroconf/__init__.py", line 213, in service_update
    service_info = zeroconf.get_service_info(service_type, name)
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 2371, in get_service_info
    info = ServiceInfo(type_, name)
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 1732, in __init__
    if not type_.endswith(service_type_name(name, allow_underscores=True)):
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 268, in service_type_name
    raise BadTypeInNameException("Type '%s' must end with '._tcp.local.' or '._udp.local.'" % type_)
zeroconf.BadTypeInNameException: Type '' must end with '._tcp.local.' or '._udp.local.'

It only appears once after starting up. Probably not related. Just wanted to mention it.

Update: the message repeats. It is not only once.

tht · August 20, 2020, 1:03pm

Sorry for spamming here… just wanted to let you know it is not related to the automations. Nothing changed while all automations have been disabled.

Something happened at around 13:30 and CPU usage is way up there again.

py-spy does show about the same as last time.

I had the zeroconf error again at 13:25. Could be related. Exact same error as I’ve posted above.

I also see these kind of messages every few minutes:

2020-08-20 13:25:55 INFO (MainThread) [homeassistant.components.mqtt.discovery] Component has already been discovered: binary_sensor 240AC48074A0, sending update
2020-08-20 13:25:55 INFO (MainThread) [homeassistant.components.mqtt] Got update for entity with hash: ('binary_sensor', '240AC48074A0') '{'state_topic': 'opengw/OpenMQTTGatew
ay_ESP32_BLE/LWT', 'name': 'OpenMQTTGateway_ESP32_BLE', 'unique_id': '240AC48074A0', 'device_class': 'connectivity', 'payload_on': 'online', 'payload_off': 'offline', 'payload
_available': 'online', 'payload_not_available': 'offline', 'platform': 'mqtt'}'
2020-08-20 13:25:55 INFO (MainThread) [homeassistant.components.mqtt] Ignoring unchanged update for: binary_sensor.openmqttgateway_esp32_ble_2

So I will re-enable all automations but get rid of the OpenMQTTGateway for testing.

bdraco · August 20, 2020, 1:04pm

You have a device on your network that is broadcasting an invalid mdns name. We recently added a workaround for this issue to prevent the ServiceBrowser from collapsing. It shouldn’t cause any issue anymore besides an annoying log message. The solution is for the vendor to fix the device firmware if they are interested in doing so (most of these were caused by older HomeKit stacks that have been fixed with device firmware updates).

bdraco · August 20, 2020, 1:05pm

Thank you for the py-spy. 0.114.3 is out. Can you give it a spin?

tht · August 20, 2020, 1:22pm

This sounds promising, thanks @bdraco. There are many devices on the network and I assume a few of them could have a bad firmware.

I’m installing the update right now. Will report back in a few hours. I’ve also removed the OpenMQTTGateway from the network