How to monitor a single component

As you can see in the screenshot below, something on my HA instance is running wild, eventually crashing it. After a restart it runs stably for a few days. Right now I can see the first symptoms of it heating up again.

It’s an RPi4 8 GB w/SSD running HAOS 6.x and HA 2024.8, with a number of add-ons as well as HACS-installed extensions.

Primarily I would be happy with per-add-on (Docker container) CPU usage. Glances doesn’t expose that information as sensors, AFAIK.

Thank you in advance

Do you use the CPU Speed integration?

If so, try removing it.

Thank you for attempting to help. Appreciate that.
No, I don’t use the CPU Speed integration.

Here is a list of the integrations and add-ons installed.

BTW, the issues started during the last few months, after updating to 2024.7 or 2024.8.
The system was rock stable for about 4 years.


Check out this post, it will likely help you with everything you need to find the issue.

Thank you
I will try that.

But it doesn’t address add-ons (Docker-based). Is there any easy-to-use tool to monitor Docker container performance? With sensors exposing their CPU usage, I could correlate it with total CPU usage.

Glances shows CPU usage per container, but I cannot find this information published to sensors (or their attributes).
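One workaround I’m considering in the meantime: running `docker stats` myself (e.g. from the SSH add-on, which needs protection mode disabled to reach the Docker socket) and parsing the output. A minimal sketch, assuming the `--format '{{.Name}} {{.CPUPerc}}'` output; the helper names are my own:

```python
import subprocess


def parse_stats(text):
    """Parse `docker stats --format '{{.Name}} {{.CPUPerc}}'` output
    into a {container_name: cpu_percent} dict."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, cpu = line.rsplit(None, 1)
        usage[name] = float(cpu.rstrip("%"))
    return usage


def addon_cpu_usage():
    """Return per-container CPU usage, assuming `docker` is reachable.

    Container names are whatever Supervisor assigned
    (homeassistant, addon_core_mosquitto, ...).
    """
    out = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}} {{.CPUPerc}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_stats(out)
```

The parsed dict could then be fed into HA via a command_line sensor or MQTT, but that part depends on your setup.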

I think I’ve found the component causing the load. It’s a node process. Is that Node.js?
How can I find where it comes from?
It’s shown by Glances but invisible in the HA terminal, so I assume it’s running at the OS level?

A Home Assistant restart didn’t affect the load of the node process.

The only component I found referencing Node.js is HACS 2.0. But then I’d assume an HA restart should improve the situation.

Node-RED uses Node.js, and it looks like it’s your primary automation engine. A runaway automation? I’ve seen that happen before. Although that one is also in your list.

Studio Code Server has had issues with its extensions. I have it installed but keep it off at all times; I only use it when I’m away from home.

Thank you for your answer

On Saturday I reinstalled my HA from scratch, updating the OS to 13.1 and then restoring its config from a backup. I found that the average CPU temperature dropped by 10 degrees. So far so good.

But the system crashed again this morning. This time it seemed to be caused by running out of memory.
The interesting fact is that only Home Assistant and ha_observer were restarted; their restart caused memory usage to go back to normal.
Since then it has been growing again.

That doesn’t mean other components aren’t responsible for it (e.g. the Node-RED you mentioned).

The screenshot below shows two events:

  • Saturday’s CPU overload/overheating, followed by the system reinstall
  • today’s out-of-memory event, followed by an HA restart.

At first glance, it looks like different root causes.
Previously I wrote that it was caused by Node.js, which is used by Z2M AFAIK; I found no other components using it. But this time it looks like a memory leak.

TBH I would be happy to find some add-on causing that. But if these problems are triggered by the HA process itself, it will be extremely hard to find the single component responsible. I can remove integrations from the system one by one, checking whether the situation improves.
Maybe debugging tools would help, but I’m not familiar with them, and I hope this way of experimenting is less time-consuming.

bdraco’s post that I linked above should help you find offending Python packages. That would point to integrations (if the issue is in core).


OK,
the first thing I tried was checking templates.
Guided by the link you provided, I ran profiler.dump_log_objects. It displayed a lot of messages like:

2024-09-30 19:51:34.803 CRITICAL (SyncWorker_58) [homeassistant.components.profiler] RenderInfo object in memory: <RenderInfo Template<template=({{ (states('sensor.wattsonic_home_consumption_now')|float(0)*1000)|round(0) }}) renders=15134> all_states=False all_states_lifecycle=False domains=frozenset() domains_lifecycle=frozenset() entities=frozenset({'sensor.wattsonic_home_consumption_now'}) rate_limit=None has_time=False exception=None is_static=False>
2024-09-30 19:51:34.803 CRITICAL (SyncWorker_58) [homeassistant.components.profiler] RenderInfo object in memory: <RenderInfo Template<template=({{ (states('sensor.wattsonic_battery_p')|float(0)*1000)|round(0) }}) renders=5158> all_states=False all_states_lifecycle=False domains=frozenset() domains_lifecycle=frozenset() entities=frozenset({'sensor.wattsonic_battery_p'}) rate_limit=None has_time=False exception=None is_static=False>
2024-09-30 19:51:34.803 CRITICAL (SyncWorker_58) [homeassistant.components.profiler] RenderInfo object in memory: <RenderInfo Template<template=({{ ((states('sensor.shelly_3em_phase_1_power')|float + states('sensor.shelly_3em_phase_2_power')|float + states('sensor.shelly_3em_phase_3_power')|float + states('sensor.pg_cube_power')|float)) |round(2) }}) renders=20170> all_states=False all_states_lifecycle=False domains=frozenset() domains_lifecycle=frozenset() entities=frozenset({'sensor.shelly_3em_phase_3_power', 'sensor.pg_cube_power', 'sensor.shelly_3em_phase_1_power', 'sensor.shelly_3em_phase_2_power'}) rate_limit=None has_time=False exception=None is_static=False>

These have a reasonably high number of renders. But isn’t that expected, considering they transform frequently changing real-time data?

  • The shelly_3em_phase_1_power template has existed in my system for several years.
  • The wattsonic* ones, since August.
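For anyone following along, the calls I used (from Developer Tools → Services, following the linked post) were roughly these; as far as I can tell these are the stock Profiler integration services, but double-check the names against your HA version:

```yaml
# Start periodically logging object growth ("Memory Growth" lines).
service: profiler.start_log_objects
data:
  scan_interval: 30
---
# Dump every object of a given type currently held in memory
# (writes CRITICAL lines like the ones above to the HA log).
service: profiler.dump_log_objects
data:
  type: RenderInfo
---
# Stop the periodic logging again.
service: profiler.stop_log_objects
```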

For those using HAOS/Supervisor instead of a Docker install: the Home Assistant Supervisor integration has (disabled by default) entities for the CPU and memory of all add-ons. If you enable them, you can monitor them. While creating the screenshot, I noticed Studio Code Server taking way more CPU than usual :slight_smile:
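Once enabled, those sensors can drive a normal automation. A sketch of what that could look like; the entity id is a guess on my part (check the names on your own Supervisor device page), not something the integration guarantees:

```yaml
automation:
  - alias: "Warn when an add-on hogs the CPU"
    trigger:
      - platform: numeric_state
        # Hypothetical entity id; yours may differ.
        entity_id: sensor.studio_code_server_cpu_percent
        above: 50
        for: "00:10:00"
    action:
      - service: persistent_notification.create
        data:
          title: "Add-on CPU alert"
          message: "Studio Code Server has been above 50% CPU for 10 minutes."
```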

Thank you.
Didn’t know about that. Very useful.

I’m not sure how I should decode the Memory Growth log records:


2024-09-30 22:02:44.403 CRITICAL (SyncWorker_14) [homeassistant.components.profiler] Memory Growth: [('dict', 2761773, 1516), ('Context', 831349, 306), ('Event', 451234, 235), ('State', 452525, 235), ('coroutine', 396, 4), ('builtin_function_or_method', 8789, 2), ('Task', 96, 1), ('Future', 129, 1), ('FutureIter', 79, 1)]
2024-09-30 22:03:13.889 CRITICAL (SyncWorker_26) [homeassistant.components.profiler] Memory Growth: [('dict', 2763092, 1319), ('Context', 831844, 495), ('State', 452787, 262), ('Event', 451495, 261)]

The third number in each 3-tuple is the delta of the increase. I assume a small number is OK and a high number might indicate a problem. But what indicates a real problem?
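As I understand it (I believe these lines come from the objgraph library that the Profiler integration uses), each entry is `(type_name, total_objects, growth_since_last_scan)`. A single large delta is often just churn; the suspicious pattern is a type that grows on every scan and never shrinks. A small helper to pick those out, with the two log lines above as sample data; the threshold is my own guess, not an official rule:

```python
def persistent_growers(scans, min_delta=100):
    """Given a list of scans, each a list of (name, total, delta) tuples,
    return the type names whose delta exceeded min_delta in EVERY scan.
    Those are the memory-leak candidates."""
    candidates = None
    for scan in scans:
        grew = {name for name, _total, delta in scan if delta >= min_delta}
        candidates = grew if candidates is None else candidates & grew
    return sorted(candidates or set())


# The two Memory Growth lines from the log above, as Python data:
scans = [
    [("dict", 2761773, 1516), ("Context", 831349, 306),
     ("Event", 451234, 235), ("State", 452525, 235), ("coroutine", 396, 4)],
    [("dict", 2763092, 1319), ("Context", 831844, 495),
     ("State", 452787, 262), ("Event", 451495, 261)],
]
print(persistent_growers(scans))  # ['Context', 'Event', 'State', 'dict']
```

Growing Context/Event/State counts line up with the recorder/state machine holding on to history, which fits a leak somewhere in an integration that keeps references to states.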

BTW, the memory consumed by the HA container is still growing, tracking the total-used-memory graph very accurately.

Because I was not able to find anything helpful while HA’s memory usage went critical, I disabled/removed a lot of integrations and add-ons.
I left the Modbus integration (the one I added in August, when the problems started to happen).

Removed Custom Components:

  • browser_mod
  • circadian_lighting
  • cz_energy_spot_prices
  • dreame_vacuum
  • fordpass
  • hacs
  • ltss
  • monitor_docker
  • powercalc
  • rpi_power

Disabled add-ons:

  • Samba Share
  • NodeRed :frowning:
  • Grafana
  • SQLite Web
  • Zigbee2MQTT :frowning:
  • SVC

Remaining add-ons:

  • SSH
  • FileEditor
  • Mosquitto Broker
  • Log Viewer
  • Unifi Network Controller
  • Home Assistant Google Drive Backup
  • Glances

I will monitor the state for a day or two; then, if there are no issues, I will start adding extensions back one by one.

After 2 days, HA has proven stable. Since the start, its memory usage slowly went from 6 up to 8%, slightly oscillating around that value.

Yesterday I added Zigbee2MQTT, Node-RED and HACS back to the party.

The pink line at about 10% is the Unifi Controller.
The pink one at about 4% is Node-RED.
Zigbee2MQTT sits at about 1.25%.

I bet this setup will be stable too.

I have a first candidate: ltss.
This is how the memory usage of HA looks after 2 days:

I will do one more round, first without and then with ltss, before I report it.

So it’s almost certain that the LTSS custom component causes the problem.
I found this LTSS record in the HA logs:

ValueError: A string literal cannot contain NUL (0x00) characters.

corresponding with the time at which memory usage starts to grow. It might happen at any time, but right now it happens at component/HA start.

I don’t know where this “broken” data is read from (API, recorder, SQLite). It’s not from the destination database, since it doesn’t even attempt to connect to Postgres.
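Not a fix for the leak itself, but for anyone hitting the same ValueError: Postgres text columns reject NUL (0x00) bytes, so any state or attribute string has to be sanitized before insert. A minimal sketch of the kind of cleanup a workaround patch would need; the function is mine, not part of ltss:

```python
def strip_nul(value):
    """Recursively remove NUL (0x00) characters from strings, which
    Postgres rejects with 'A string literal cannot contain NUL'.
    Handles nested dicts/lists such as state attributes."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    return value  # numbers, None, etc. pass through unchanged
```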

Obviously, without this component, the memory leak is not observed.

I’ve reported the issue: Innit error and possible memory leak · Issue #213 · freol35241/ltss · GitHub


This may be worth writing up as an issue against core, so that that version of LTSS gets blocked.

How should I understand that? Is core somehow responsible for custom components? Or is it desirable to block components known for serious issues?

On top of that, I can’t say since which ltss version the issue has been present. It might go back more versions.

Of course, I’m ready to file the issue, but I need to understand that dependency first.

If a custom integration takes down Home Assistant, it’s typically added to the block list so that it can no longer do so.
