2024.5+: Tracking down instability issues caused by integrations

I was pointed here because my memory usage has increased since installing 2024.11.


Petro suggested trying one of the procedures above to find the integration (or whatever else) that is causing this.

Unfortunately, and sorry for that, I can’t tell which of the procedures would give me that information. Going by the descriptions, neither debug mode (for system crashes, restarts, …) nor the memory leak analysis fits: there is no gradually growing leak here, the higher usage is there right from system start.

But most probably I missed the right one or don’t understand it yet.

Can someone give me a small hint on which procedure to follow to get more info about the doubling (and more) of memory usage on my HA Blue?

Use the procedure for tracking down a memory leak of Python objects.

Then try this comment

That will at least give you logs that you can post here for further help

So, here we go

I do not understand how to read it to find where the memory is allocated.

I started directly after a system reboot (the device, not only Core). At that point memory was around 45-50% (as always in the past). I left the system alone with the profiler on. At the end I clicked through all pages and areas in the HA UI to see whether anything changed. Most of the time it stayed around 55%. I ended up at 80-90% (I have a gut feeling about when it jumped rapidly, but have to double-check) before I stopped the profiler log.

Can you or anyone else see from the logs where all the additional memory is going?


It’s kind of a pain in the ass to read, but each memory growth entry is a list of the following: name of the object type, total current count, increase since the last entry. The data is sorted by what had the largest growth. So, we are simply looking for large values in the first or second elements after the phrase Memory Growth. The first entry is logged when the profiler starts, so you can ignore that one. It does tell you everything that’s loaded at start, though.
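To make the reading steps above concrete, here is a minimal sketch that pulls the tuples out of one log line and prints them largest growth first. It assumes each line carries a Python-style list of `(name, total, increase)` tuples after the phrase `Memory Growth:`; the sample line, its timestamp and its numbers are made up for illustration and are not taken from the logs in this thread.

```python
import ast


def parse_growth(line):
    """Return the (type_name, total, increase) tuples found in one log line."""
    marker = "Memory Growth:"
    _, found, payload = line.partition(marker)
    if not found:
        # Not a memory growth line, nothing to parse.
        return []
    return ast.literal_eval(payload.strip())


# Hypothetical log line, shaped like the entries described above.
sample = (
    "2024-11-20 12:21:41 CRITICAL Memory Growth: "
    "[('dict', 232700, 73), ('set', 120957, 2)]"
)

# Largest increase first, which is what matters when hunting a leak.
for name, total, increase in sorted(
    parse_growth(sample), key=lambda entry: entry[2], reverse=True
):
    print(f"{name:<10} total={total:>8} growth=+{increase}")
```

`ast.literal_eval` is used instead of `eval` so that only plain literals from the log are accepted.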

At 12:21:41, there’s a huge jump in a lot of data. Same thing at 14:43:11. I’m not seeing anything that stands out. Maybe Nick has some insight.


Nothing stands out in the data. Usually when there is an obvious leak, each cycle will show more of the same object types over and over. 'set', 120957, 2 is a bit high, but that’s not a specific cause for concern.
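The "same object types over and over" check can be sketched roughly like this: collect the growth entries per cycle (with the startup entry dropped) and keep only the types that grew in several cycles. The function name, the `min_cycles` threshold and the sample cycles are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def recurring_types(cycles, min_cycles=3):
    """Return {type_name: summed growth} for types that grew in at least min_cycles cycles."""
    seen = Counter()    # in how many cycles each type showed up
    growth = Counter()  # total increase per type across all cycles
    for cycle in cycles:
        for name, _total, increase in cycle:
            seen[name] += 1
            growth[name] += increase
    return {name: growth[name] for name, count in seen.items() if count >= min_cycles}


# Hypothetical cycles, each a list of (name, total, increase) tuples.
cycles = [
    [("Context", 11000, 20), ("set", 120957, 2)],
    [("Context", 11020, 20), ("tuple", 156000, 5)],
    [("Context", 11040, 20)],
]

print(recurring_types(cycles))
```

A type that appears once with a large jump is usually just normal activity; one that keeps coming back every cycle is the more likely leak candidate.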

I’d do a profiler.start next.
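As far as I know, profiler.start writes a cProfile dump (a `.cprof` file) into the config directory. Assuming that is the case, a file like that can be inspected with Python’s standard `pstats` module. This is only a sketch: the helper name `print_top` and the file name in the comment are hypothetical.

```python
import pstats


def print_top(path, limit=20):
    """Print the `limit` functions with the highest cumulative time from a cProfile dump."""
    stats = pstats.Stats(path)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


# Usage, with a hypothetical file name copied from the config directory:
# print_top("profile.1234567890.cprof")
```

Sorting by cumulative time tends to surface the integration or helper that dominates the run, which is a different view from the object counts in the memory growth log.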

Sadly this one looks like one of the harder ones to track down. It may require figuring out what the specific event is that causes the memory use to increase so it can be replicated.

My current gut feeling is that it perhaps comes from there, or perhaps from an integration or the UI.

I have now enabled all memory and CPU sensors and will track them in parallel.

Here we go. VSC. The add-on now takes more and more memory, never releases it, and adds even more each time it is reopened. Only an add-on restart brings the memory back. Otherwise the whole system gets more and more unresponsive, most probably because of memory swapping etc., up to and including automatic hard reboots of the system.

Here is what happens when the VSC UI starts: 40%. Core (blue) and Supervisor (green) got swapped to disk, I think. Orange is the total, which goes up to over 95%, and that is ofc not healthy for stability.


It seems to be only the VSCode add-on and, surprisingly, it seems to be a known problem: here, here or here.

Question, because I don’t see any reaction there at all: is Frenck aware of it, or is he perhaps on vacation, or has someone else taken over?

I know this is OT if this thread is only about integrations causing problems. But I wanted to let you and others who are seeing something similar know, so they can stop digging with the profiler. Thanks for your help anyway, Petro and bdraco.

That seems like it’s a problem with VSCode plugins or VSCode itself. I usually have to close it and restart it a few times a day when using it locally, depending on how much I’ve used it. I’d guess it’s the same for running it on a server. It’s probably something that has to be fixed upstream and likely nothing that can be done with the add-on itself.

Could be, but I’m not sure, because the VSC add-on had already been up to date for the last few weeks. The problem only started (at least here) after updating Core and HAOS. So perhaps it is also down to new memory handling in HAOS/Supervisor.

I did try changing the templates, but I keep experiencing the memory leaks.

Here is what I captured.
I didn’t include the items from the first entry; I’m not sure whether it is important to view those (e.g., dict had 232,627; list 105,698).

Anything that I should look for?

| Object | Total Instances Last Entry | Sum of new over 2h | Log Entries |
|---|---|---|---|
| NodeStrClass | 19,192 | 2,956 | 1 |
| NodeDictClass | 4,042 | 753 | 1 |
| tuple | 156,527 | 317 | 10 |
| NodeListClass | 863 | 272 | 1 |
| cell | 77,540 | 215 | 12 |
| Context | 11,197 | 194 | 10 |
| Input | 278 | 161 | 1 |
| HassJob | 7,496 | 129 | 16 |
| ReferenceType | 34,956 | 128 | 18 |
| builtin_function_or_method | 14,645 | 122 | 7 |
| frozenset | 7,482 | 122 | 5 |
| State | 6,799 | 119 | 9 |
| function | 134,007 | 119 | 12 |
| partial | 7,600 | 118 | 10 |
| method | 17,333 | 113 | 12 |
| set | 20,348 | 103 | 6 |
| BinaryExpression | 477 | 92 | 1 |
| TimerHandle | 1,377 | 77 | 5 |
| BindParameter | 604 | 62 | 4 |
| traceback | 201 | 52 | 7 |

That looks like something is holding on to parsed YAML for longer than it should.
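For anyone wondering what "holding on to" means in practice: counts like the ones in the table only drop once nothing references the objects any more. Below is a generic sketch of that effect using only the standard `gc` module. The `NodeDictClass` here is a stand-in I defined for the example, not Home Assistant’s own class, and this is not how the profiler itself collects its numbers.

```python
import gc
from collections import Counter


def count_by_type(names):
    """Count live, GC-tracked objects whose type name is in `names`."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return {name: counts.get(name, 0) for name in names}


class NodeDictClass(dict):
    """Stand-in for a parsed YAML node; hypothetical, for illustration only."""


# Something keeps a reference to the parsed nodes, so they stay alive.
held = [NodeDictClass(key=i) for i in range(100)]
print(count_by_type(["NodeDictClass"]))

# Once the reference is dropped, the objects can be freed and the count falls.
del held
gc.collect()
print(count_by_type(["NodeDictClass"]))
```

If the count for the YAML node types stays high long after the configuration was loaded, something is still keeping a reference like `held` around.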