2024.5+: Tracking down instability issues caused by integrations

arganto · November 15, 2024, 2:06pm

So, here we go

I do not understand how to read it to find where the memory is allocated.

I started directly after a system reboot (the device, not only Core). At that point it was around 45-50% (as always in the past). I left the system alone with the profiler on. At the end I clicked through all pages and areas in the HA UI to see changes. Most of the time it stayed around 55%. I ended at 80-90% (I have a gut feeling about when it jumped rapidly, but I have to double-check) before I stopped the profile log.

Can you or anyone else see from the logs where all the additional memory is going?

petro · November 15, 2024, 3:19pm

It’s kind of a pain in the ass to read, but each memory growth entry is a list of the following: name of the object type, total current count, increase since the last entry. The data is sorted by what had the largest growth. So we are simply looking for large values in the first or second elements after the phrase Memory Growth. The first entry is from when the profiler starts; you can ignore that one. It tells you everything that’s loaded at start, though.
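A minimal sketch of reading one such entry programmatically. It assumes each log line contains the phrase `Memory Growth:` followed by a Python-style list of `(name, total, increase)` tuples; the function names and the sample line in the usage comment are made up for illustration and are not taken from the logs in this thread.

```python
import ast


def parse_growth(line):
    """Return the (name, total, increase) tuples found on one log line.

    Assumes the payload after "Memory Growth:" is a Python literal list.
    Returns an empty list when the marker is not on the line.
    """
    _, found, payload = line.partition("Memory Growth:")
    if not found:
        return []
    # literal_eval only accepts literals, so it is safe on log text,
    # but it raises ValueError/SyntaxError if the payload is truncated.
    return [tuple(entry) for entry in ast.literal_eval(payload.strip())]


def largest(entries, limit=10):
    """Sort entries by their increase (third element), biggest first."""
    return sorted(entries, key=lambda e: e[2], reverse=True)[:limit]


# Usage (hypothetical line):
# entries = parse_growth("... Memory Growth: [('dict', 232627, 120), ('set', 120957, 2)]")
# for name, total, increase in largest(entries):
#     print(f"{name:30} total={total:>8} +{increase}")
```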

At 12:21:41, there’s a huge jump in a lot of data. Same thing at 14:43:11. I’m not seeing anything that stands out. Maybe Nick has some insight.

bdraco · November 15, 2024, 3:29pm

Nothing stands out in the data. Usually when there is an obvious leak, each cycle will show more of the same object types over and over. `'set', 120957, 2` is a bit high, but that’s not a specific cause for concern.
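The "same object types over and over" pattern can be checked mechanically. This is only a sketch: it assumes the growth entries have already been collected as one list of `(name, total, increase)` tuples per cycle, with the startup entry left out, and the function name and the 80% cut-off are arbitrary choices for illustration.

```python
from collections import Counter


def recurring_growth(cycles, min_fraction=0.8):
    """Return type names that grew in at least min_fraction of the cycles.

    cycles: list of cycles, each a list of (name, total, increase) tuples.
    """
    seen = Counter()
    for cycle in cycles:
        # Count each type at most once per cycle.
        grown = {name for name, _total, increase in cycle if increase > 0}
        seen.update(grown)
    threshold = min_fraction * len(cycles)
    return sorted(name for name, count in seen.items() if count >= threshold)
```

A type that appears in nearly every cycle is a candidate for a leak; types that show up once are more likely one-off allocations.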

I’d do a profiler.start next.

Sadly this one looks like one of the harder ones to track down. It may require figuring out what the specific event is that causes the memory use to increase so it can be replicated.

arganto · November 15, 2024, 3:39pm

Currently my gut feeling is that it perhaps comes from there, and perhaps from an integration or the UI.

I have now enabled all memory and CPU sensors and will track them in parallel.

arganto · November 18, 2024, 6:55am

Here we go: VSC. The add-on now takes more and more memory, never releases it, and adds even more on re-opening it. Only an add-on restart brings the memory back. Otherwise the whole system gets more and more unresponsive, most probably because of memory swapping, etc., including automatic hard reboots of the system, …

Here is a start of the VSC UI: 40%. Core (blue) and Supervisor (green) got swapped to disk, I think. Orange is the total, which goes up to over 95%, which is of course not healthy for stability.

It seems to be only the vscode add-on and, surprisingly, it seems to be a known problem here, here or here.

A question, because I don’t see any reaction there at all: is Frenck aware of it, or is he perhaps on vacation, or has someone else taken over?

I know this is OT if this thread is only about integrations causing problems. But I wanted to let you and others who are seeing something similar know, so they can stop digging with the profiler. Thanks for your help anyway, Petro and bdraco.

bdraco · November 18, 2024, 3:51pm

That seems like it’s a problem with VSCode plugins or VSCode itself. I usually have to close it and restart it a few times a day when using it locally, depending on how much I’ve used it. I’d guess it’s the same for running it on a server. It’s probably something that has to be fixed upstream and likely nothing that can be done with the addon itself.

arganto · November 18, 2024, 6:25pm

Could be, but I’m not sure, because the VSC addon had already been up to date for the last few weeks. The problem only started (at least here) with updating Core and HAOS. So perhaps it is new memory handling in HAOS/Supervisor as well.

DaveOBarbaro · November 20, 2024, 7:00pm

I did try to change the templates, but I keep experiencing the memory leaks.

Here is what I captured.
I didn’t include the items from the first entry; I’m not sure if it is important to view those (e.g., dict had 232,627; list 105,698).

Anything that I should look for?

Object	Total Instances Last Entry	Sum of new over 2h	Log Entries
`NodeStrClass`	19,192	2,956	1
`NodeDictClass`	4,042	753	1
`tuple`	156,527	317	10
`NodeListClass`	863	272	1
`cell`	77,540	215	12
`Context`	11,197	194	10
`Input`	278	161	1
`HassJob`	7,496	129	16
`ReferenceType`	34,956	128	18
`builtin_function_or_method`	14,645	122	7
`frozenset`	7,482	122	5
`State`	6,799	119	9
`function`	134,007	119	12
`partial`	7,600	118	10
`method`	17,333	113	12
`set`	20,348	103	6
`BinaryExpression`	477	92	1
`TimerHandle`	1,377	77	5
`BindParameter`	604	62	4
`traceback`	201	52	7

bdraco · November 20, 2024, 7:42pm

That looks like something is holding on to parsed YAML for longer than it should.

DaveOBarbaro · November 21, 2024, 2:50pm

I noticed an error in the logs occurring before the memory began to increase, specifically an issue with the assist pipeline. I recalled that I had previously set up Piper but had disabled it, leaving Atom Echo active. After I disabled it, the RAM consumption ceased to grow. I will continue to investigate to determine if this is the root cause or merely a coincidence.

petro · November 21, 2024, 3:02pm

It definitely is the vscode addon, specifically the custom HA extension in vscode traversing your files. I keep the vscode addon stopped unless I’m using it. It’s been like this for a long time too, not just recently.

RickDangerous · December 10, 2024, 8:02pm

Would it be possible to get some help understanding my profiler logs, as I’m trying to hunt down a big CPU increase [avg 30% on a RPi4, from ~10% beforehand] which has been ongoing for some weeks now (see this post for background)?

I’ve opened the Callgrind file in KCacheGrind but am not sure what to look for, and if I try to run `gprof2dot -f callgrind -e 0.05 -n 0.25 callgrind.out.1733847020831614 -o callgrind.dot` as per the Profiler page, I run into an AssertionError (`assert abs(call_ratio*total - partial) <= 0.001*call_ratio*total`)…

The callgrind log can be found here … the cprof file is a binary, though, so I’m not sure what the preferred way to share it would be?
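One way to get a shareable text summary out of a binary profile is Python's standard `pstats` module. This is a sketch that assumes the `.cprof` file is a regular `cProfile` stats dump; the function name and the file name in the usage comment are placeholders.

```python
import io
import pstats


def top_functions(path, limit=15):
    """Return a text report of the functions with the highest cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(path, stream=out)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent
    # in a function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


# Usage (placeholder file name):
# print(top_functions("profile.cprof"))
```

The resulting text can be pasted into a post directly, which avoids having to upload the binary file.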

Here’s how it looks in SnakeViz:

I let Profiler run for the default 60s btw, maybe I should let it run longer?

Thanks in advance,
RD

bdraco · December 10, 2024, 10:43pm

It does look like there is a lot of time spent in matching mqtt callbacks.

I’d probably do a py-spy as well to compare

RickDangerous · December 11, 2024, 8:31am

Thank you for the reply. Unfortunately, py-spy fails to run and outputs this error:

$ ./py-spy top  --pid 67
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: PosOverflow }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/proc-maps-0.2.1/src/linux_maps.rs:81:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: receiving on a closed channel

I followed the guide found here; the second-to-last post is from another person who ran into the same problem a while ago, but unfortunately there is no solution.

RickDangerous · December 11, 2024, 4:21pm

Now looking at MQTT, it definitely seems problematic, as it shows an incredible number of devices, far too many…:

Thanks for looking at the profiler data and helping to find this.

(Maybe I should “leave” this topic and create a new one to ask for help with this issue now…?)

Krieger2690 · January 6, 2025, 9:48am

Hello, I am investigating a memory leak on my own system right now. Could you tell me where you got the above list from (the one where it states that you have sensor (6637 topics))? Thank you.

Krieger2690 · January 6, 2025, 4:39pm

I’ve been noticing RAM usage increase on my system as well. I have run the diagnostics recommended in the Tracking down a memory leak of python objects section of this post and ran the profiler over a period of time (longer than 1 hour). Then I stopped the profiler. Here is the result from the logs: dpaste: GREVZR9LF

However, I am unable to determine what might lead to the memory usage increase based on this data. This output does not tell me anything. I don’t know where to look anymore.

RickDangerous · January 7, 2025, 6:01am

Hello,

That list comes from MQTT Explorer (Addon MQTT Explorer new Version).

Good luck finding your memory leak! Unfortunately I don’t have any pointers, other than to mention that mine (a CPU increase, not memory, though) was related to a LilyGo32 LoRa USB device which for some reason added all these MQTT entries. I deleted them all (through MQTT Explorer) and the CPU usage decreased significantly. I have since stopped using this device.
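For anyone who has a plain list of topic names rather than the MQTT Explorer tree view, per-branch counts of the kind shown there can be approximated with a few lines. This is a sketch only: how the topic list is obtained is left open, and the function name and the topics in the usage comment are made up for illustration.

```python
from collections import Counter


def count_by_prefix(topics, depth=2):
    """Count topics under each prefix made of the first `depth` levels.

    Returns (prefix, count) pairs, largest branch first.
    """
    counts = Counter()
    for topic in topics:
        prefix = "/".join(topic.split("/")[:depth])
        counts[prefix] += 1
    return counts.most_common()


# Usage (made-up topics):
# topics = ["homeassistant/sensor/a/config", "homeassistant/switch/b/config"]
# for prefix, count in count_by_prefix(topics):
#     print(f"{prefix} ({count} topics)")
```

A branch with thousands of topics stands out immediately in the output.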

MS27HA · January 8, 2025, 11:26pm

Having similar issues and I’m trying to get a hold of it.
I’ve pasted my logs in ChatGPT but was not much smarter afterwards…

We’re not alone here: Home assistant memory leak · Issue #123831 · home-assistant/core

Krieger2690 · January 11, 2025, 2:56pm

Thank you very much for the heads up. I went over and voted for the issue on GitHub.