Debugging random slowdowns and system freezes on an RPi3B

Hi,

I have a little problem with my HassOS installation on an RPi3B. Every now and then, the system freezes or becomes extremely slow. I have often seen it around updates, which then fail because the update process complains that it has no internet connection.

The observability of HassOS really sucks, unfortunately. Luckily, I just found a Telegraf addon, and thanks to that I could see that last night the system hung from shortly after midnight until shortly before 8 in the morning. I don’t see anything suspicious, though, like slowly increasing RAM usage in one of the containers. When it came back, the homeassistant container seemed to be hogging the CPU, but not for long.

So I checked the host logs and found that the OOM killer has been killing python3. I can’t tell which python3 that was, though. Home Assistant? Supervisor? Oh, and apparently it produced a core dump… That would certainly explain the lockup… Producing a core dump slows down even my Ryzen 7 desktop; I can see how it would lock up a poor little Pi for hours…

[75880.681731] telegraf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[75880.681755] CPU: 2 PID: 13176 Comm: telegraf Tainted: G         C        5.10.92-v8 #1
[75880.681761] Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
[75880.681767] Call trace:
[75880.681783]  dump_backtrace+0x0/0x1b0
[75880.681791]  show_stack+0x20/0x30
[75880.681800]  dump_stack+0xec/0x154
[75880.681807]  dump_header+0x50/0x20c
[75880.681816]  oom_kill_process+0x208/0x210
[75880.681823]  out_of_memory+0xec/0x330
[75880.681833]  __alloc_pages_slowpath.constprop.0+0x824/0xba0
[75880.681841]  __alloc_pages_nodemask+0x2a4/0x320
[75880.681847]  pagecache_get_page+0x13c/0x2e0
[75880.681853]  filemap_fault+0x6c8/0xa60
[75880.681862]  ext4_filemap_fault+0x3c/0xa00
[75880.681871]  __do_fault+0x44/0x110
[75880.681878]  handle_mm_fault+0x6b4/0xd90
[75880.681885]  do_page_fault+0x148/0x3e0
[75880.681891]  do_translation_fault+0x60/0x78
[75880.681900]  do_mem_abort+0x48/0xb0
[75880.681908]  el0_ia+0x68/0xd0
[75880.681914]  el0_sync_handler+0x98/0xc0
[75880.681923]  el0_sync+0x180/0x1c0
[75880.681981] Mem-Info:
[75880.682000] active_anon:80133 inactive_anon:86000 isolated_anon:0
[75880.682000]  active_file:245 inactive_file:1154 isolated_file:47
[75880.682000]  unevictable:0 dirty:0 writeback:0
[75880.682000]  slab_reclaimable:10355 slab_unreclaimable:12195
[75880.682000]  mapped:157 shmem:4 pagetables:3816 bounce:0
[75880.682000]  free:4513 free_pcp:0 free_cma:0
[75880.682014] Node 0 active_anon:320532kB inactive_anon:344000kB active_file:980kB inactive_file:4616kB unevictable:0kB isolated(anon):0kB isolated(file):188kB mapped:628kB dirty:0kB writeback:0kB shmem:16kB writeback_tmp:0kB kernel_stack:12160kB all_unreclaimable? no
[75880.682032] DMA free:18052kB min:53248kB low:57344kB high:61440kB reserved_highatomic:0KB active_anon:320532kB inactive_anon:344000kB active_file:1024kB inactive_file:4356kB unevictable:0kB writepending:0kB present:970752kB managed:931500kB mlocked:0kB pagetables:15264kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[75880.682040] lowmem_reserve[]: 0 0 0 0
[75880.682082] DMA: 1220*4kB (UMEC) 457*8kB (UME) 337*16kB (UE) 119*32kB (UME) 19*64kB (ME) 2*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19208kB
[75880.682218] 1814 total pagecache pages
[75880.682260] 310 pages in swap cache
[75880.682269] Swap cache stats: add 636466, delete 636156, find 47616/567183
[75880.682277] Free swap  = 0kB
[75880.682284] Total swap = 232872kB
[75880.682299] 242688 pages RAM
[75880.682306] 0 pages HighMem/MovableOnly
[75880.682314] 9813 pages reserved
[75880.682321] 16384 pages cma reserved
[75880.682329] Tasks state (memory values in pages):
[75880.682337] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[75880.682371] [    111]  1001   111     2035      229    40960      115          -900 dbus-daemon
[75880.682384] [    112]     0   112   269780       96   126976      455             0 os-agent
[75880.682398] [    117]     0   117    33742      104   204800      131          -250 systemd-journal
[75880.682411] [    121]     0   121    97108      349   122880      324             0 udisksd
[75880.682423] [    138]     0   138     3581       21    65536      386         -1000 systemd-udevd
[75880.682443] [    340]  1005   340    21524       33    61440      148             0 systemd-timesyn
[75880.682456] [    343]     0   343   247903      656   167936      352             0 NetworkManager
[75880.682468] [    347]     0   347    58384        0    73728      261             0 rauc
[75880.682480] [    353]     0   353    75250        0    65536      123             0 rngd
[75880.682492] [    355]     0   355     2938       36    61440      144             0 systemd-logind
[75880.682505] [    356]     0   356     2352       62    49152      108             0 wpa_supplicant
[75880.682518] [    400]     0   400      530        7    32768       21             0 hciattach
[75880.682571] [    402]     0   402     1881       39    40960       62             0 bluetoothd
[75880.682583] [    437]     0   437   789067    10724   630784     3279             0 dockerd
[75880.682596] [    445]     0   445   387017     1499   258048     1067             0 containerd
[75880.682608] [    855]     0   855   287102        0   122880      221             0 docker-proxy
[75880.682621] [    862]     0   862   287038        0   122880      182             0 docker-proxy
[75880.682633] [    876]     0   876   418327      506   217088      367             1 containerd-shim
[75880.682646] [    895]     0   895       49        0    28672        5             0 s6-svscan
[75880.682659] [    987]     0   987       49        0    28672        4             0 s6-supervise
[75880.682673] [   1146]     0  1146       49        0    28672        3             0 s6-supervise
[75880.682686] [   1150]     0  1150   177333      181    86016      865             0 observer
[75880.682699] [   1176]     0  1176   315210     1357   221184      518             0 docker
[75880.682711] [   1177]     0  1177      827        1    40960       31             0 hassos-cli
[75880.682724] [   1227]     0  1227   418391      568   217088      364             1 containerd-shim
[75880.682736] [   1246]     0  1246       49        0    28672        4             0 s6-svscan
[75880.682749] [   1345]     0  1345       49        0    28672        3             0 s6-supervise
[75880.682761] [   1536]     0  1536       49        0    28672        4             0 s6-supervise
[75880.682773] [   1537]     0  1537       49        0    28672        3             0 s6-supervise
[75880.682786] [   1540]     0  1540    43941     9734   372736     7723             0 python3
[75880.682798] [   1541]     0  1541     1094      123    40960      397             0 bash
[75880.682810] [   1705]     0  1705   418391      555   212992      366             1 containerd-shim
[75880.682823] [   1727]     0  1727       49        0    28672        3             0 s6-svscan
[75880.682836] [   1811]     0  1811   370842     1379   249856      516             0 docker
[75880.682848] [   1823]     0  1823       45        0    16384        4             0 foreground
[75880.682860] [   1824]     0  1824       49        0    28672        3             0 s6-supervise
[75880.682872] [   1836]     0  1836       44        0    16384        3             0 foreground
[75880.682884] [   1893]     0  1893      653        1    36864       87             0 cli.sh
[75880.682897] [   1999]     0  1999      411        0    32768       11             0 sleep
[75880.682909] [   2015]     0  2015   418391      540   212992      370             1 containerd-shim
[75880.682928] [   2035]     0  2035       49        0    28672        5             0 s6-svscan
[75880.682942] [   2114]     0  2114       49        0    28672        3             0 s6-supervise
[75880.682955] [   2272]     0  2272   418391      526   217088      391             1 containerd-shim
[75880.682968] [   2294]     0  2294       49        0    28672        6             0 s6-svscan
[75880.682980] [   2333]     0  2333       49        0    28672        3             0 s6-supervise
[75880.682992] [   2337]     0  2337   180002     4654   159744      317             0 coredns
[75880.683004] [   2423]     0  2423       49        0    28672        3             0 s6-supervise
[75880.683017] [   2605]     0  2605   418391      532   208896      346             1 containerd-shim
[75880.683029] [   2657]     0  2657       49        0    28672        4             0 s6-svscan
[75880.683042] [   2718]     0  2718       49        0    28672        4             0 s6-supervise
[75880.683054] [   2978]     0  2978       49        0    28672        3             0 s6-supervise
[75880.683066] [   2983]     0  2983      218        8    28672        3             0 mdns-repeater
[75880.683114] [   3085]     0  3085       49        0    28672        3             0 s6-supervise
[75880.683126] [   3086]     0  3086       49        0    28672        3             0 s6-supervise
[75880.683139] [   3089]     0  3089     1080        1    36864      504             0 bash
[75880.683151] [   3090]     0  3090    23668      158    86016      504             0 pulseaudio
[75880.683163] [   3116]     0  3116     1081        0    36864      502             0 bash
[75880.683176] [   3117]     0  3117     1256        1    49152       80             0 udevadm
[75880.683188] [   3124]     0  3124      501        4    36864       98             0 rlwrap
[75880.683200] [   3125]     0  3125      427        0    36864       11             0 cat
[75880.683216] [   3375]     0  3375   287102        0   118784      178             0 docker-proxy
[75880.683258] [   3382]     0  3382   305614        0   126976      213             0 docker-proxy
[75880.683270] [   3397]     0  3397   418455      545   208896      356             1 containerd-shim
[75880.683282] [   3417]     0  3417       49        0    28672        4             0 s6-svscan
[75880.683295] [   3505]     0  3505       49        0    28672        4             0 s6-supervise
[75880.683307] [   3990]     0  3990       49        0    28672        3             0 s6-supervise
[75880.683319] [   3991]     0  3991       49        0    28672        4             0 s6-supervise
[75880.683332] [   3993]     0  3993     5383       14    73728     4171             0 ttyd
[75880.683344] [   3995]     0  3995     1079        0    36864      118             0 sshd
[75880.683356] [   4579]     0  4579   418391      545   208896      350             1 containerd-shim
[75880.683369] [   4629]     0  4629       49        0    28672        5             0 s6-svscan
[75880.683381] [   4743]     0  4743       49        0    28672        4             0 s6-supervise
[75880.683394] [   4899]     0  4899       49        0    28672        3             0 s6-supervise
[75880.683406] [   4902]     0  4902     6306       81    86016     3665             0 hass-configurat
[75880.683425] [   8202]     0  8202   418391      553   217088      344             1 containerd-shim
[75880.683437] [   8222]     0  8222       49        0    28672        4             0 s6-svscan
[75880.683449] [   8265]     0  8265       49        0    28672        3             0 s6-supervise
[75880.683462] [   8414]     0  8414       49        0    28672        3             0 s6-supervise
[75880.683474] [   8417]     0  8417   208494    98743  1642496    13090             0 python3
[75880.683487] [  12672]     0 12672   418327      524   225280      338             1 containerd-shim
[75880.683499] [  12691]     0 12691       49        0    28672        4             0 s6-svscan
[75880.683512] [  12733]     0 12733       49        0    28672        4             0 s6-supervise
[75880.683525] [  13143]     0 13143       49        0    28672        3             0 s6-supervise
[75880.683538] [  13146]     0 13146  1242507     3755   307200      986             0 telegraf
[75880.683551] [  15272]     0 15272   418327      571   212992      333             1 containerd-shim
[75880.683563] [  15292]     0 15292       49        0    28672        7             0 s6-svscan
[75880.683575] [  15335]     0 15335       49        0    28672        4             0 s6-supervise
[75880.683589] [  15774]     0 15774       49        0    28672        3             0 s6-supervise
[75880.683602] [  15775]     0 15775       49        0    28672        3             0 s6-supervise
[75880.683614] [  15779]     0 15779     1449        1    40960      188             0 nginx
[75880.683627] [  15778]     0 15778    66716        1   385024     2987             0 npm start --set
[75880.683675] [  15826]     0 15826    89771    13675  1089536     8363             0 node-red
[75880.683688] [  15947]     0 15947     1507       16    40960      234             0 nginx
[75880.683702] [  75231]     0 75231   289090      317   147456       15             0 runc
[75880.683714] [  75232]     0 75232   288738      333   143360        3             0 runc
[75880.683726] [  75233]     0 75233   288674      389   135168       22             0 runc
[75880.683738] [  75234]     0 75234   270290      338   131072        2             0 runc
[75880.683751] [  75235]     0 75235   270578      208   135168       22             0 runc
[75880.683763] [  75236]     0 75236   288674      357   143360        9             0 runc
[75880.683775] [  75237]     0 75237   288674      328   143360        3             0 runc
[75880.683788] [  75238]     0 75238   288674      368   139264       22             0 runc
[75880.683800] [  75239]     0 75239   288738      347   143360        4             0 runc
[75880.683813] [  75240]     0 75240   270226      338   131072        4             0 runc
[75880.683825] [  75292]     0 75292   270290      325   131072       35             0 runc
[75880.683837] [  75293]     0 75293   270642      385   139264       25             0 runc
[75880.683850] [  75295]     0 75295   270226      321   135168       11             0 runc
[75880.683863] [  75296]     0 75296   270226      317   135168        7             0 runc
[75880.683875] [  75297]     0 75297   270354      321   135168       11             0 runc
[75880.683887] [  75298]     0 75298   288738      322   143360       18             0 runc
[75880.683934] [  75299]     0 75299   270290      332   131072        7             0 runc
[75880.683947] [  75300]     0 75300   288738      314   143360       23             0 runc
[75880.683959] [  75304]     0 75304   270578      207   126976       35             0 runc
[75880.683972] [  75305]     0 75305   270642      312   135168       18             0 runc
[75880.683984] [  75308]     0 75308   288674      160   139264       32             0 runc
[75880.683997] [  75309]     0 75309   270642      305   135168       21             0 runc
[75880.684009] [  75310]     0 75310   270642      318   139264       16             0 runc
[75880.684021] [  75312]     0 75312   288738      316   143360       10             0 runc
[75880.684034] [  75313]     0 75313   289090      306   135168       43             0 runc
[75880.684049] [  75315]     0 75315   288738      301   139264       29             0 runc
[75880.684062] [  75316]     0 75316   270226      325   126976        8             0 runc
[75880.684074] [  75319]     0 75319   270290      303   131072       24             0 runc
[75880.684086] [  75323]     0 75323   288802      308   139264       22             0 runc
[75880.684099] [  75327]     0 75327   270226      317   131072       33             0 runc
[75880.684111] [  75331]     0 75331   270290      313   131072       16             0 runc
[75880.684124] [  76220]     0 76220     2822      103    45056        0             0 systemd-coredum
[75880.684136] [  76342]     0 76342     1094      125    40960      395             0 bash
[75880.684149] [  76343]     0 76343     1358      126    40960        0             0 curl
[75880.684164] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=158eefef5b9d1fd8d17343bd719d3c1a07d228d598b19d7e65b433f6d4fc9757,mems_allowed=0,global_oom,task_memcg=/docker/42cd56a53fb2a0d53362bb21d2cc9d994767270e27cd17e673cca4fd7a96b243,task=python3,pid=8417,uid=0
[75880.684661] Out of memory: Killed process 8417 (python3) total-vm:833976kB, anon-rss:394972kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1604kB oom_score_adj:0
[75881.073776] oom_reaper: reaped process 8417 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[75889.596406] systemd-coredump[76220]: Failed to get COMM: No such process
[75890.408397] audit: type=1701 audit(1645335256.608:223): auid=4294967295 uid=0 gid=0 ses=4294967295 subj==unconfined pid=117 comm="systemd-journal" exe="/usr/lib/systemd/systemd-journald" sig=6 res=1
[75890.834466] systemd-coredump[76514]: Process 117 (systemd-journal) of user 0 dumped core.

Side note: I know this dump says telegraf invoked the OOM killer, but the problem existed before that; I installed Telegraf because of the problem.

As for the observability… The “observer” just tells me:

Supervisor:	Connected
Supported:	Unsupported
Healthy:	Unhealthy

Right… Unsupported? Probably because of the Telegraf addon? Unhealthy? Uh, yeah, I know, but WHY and HOW?

Active addons, by the way, are File Editor, NodeRED, Telegraf and Terminal. I don’t even use Samba.
So far it’s obvious that this is a memory issue, but lacking insight into the architecture of HassOS, and lacking observability, I have no idea how to track it down.

How did you install Home Assistant OS?

Did you use the pi image?

Affirmative.

So, after watching the system for a little while and going over some logs, I scrapped two things:

  1. I had a sensor for my ADSB receiver, created by NodeRED via the NodeRED addon from another machine. That sensor contained the number of planes in sight and an array of those planes as attributes. Usually there weren’t more than 16, but I thought: let’s scrap it anyway, just for testing.
  2. I was using feedreader and feedparser, both pointed at the same 7 feeds and updating every 5 minutes (a rough sketch of the feedreader part is below). I remembered that another Python project I use, which also works heavily with XML data, was blowing up in memory due to a memory leak in python-lxml2. After scrapping those two, the average CPU usage of the HA container dropped from 27% to 12% and the memory usage from 330 MB to 250 MB.
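
For reference, the built-in feedreader part of that setup is just a few lines of YAML; a minimal sketch along these lines (the URLs are placeholders, and feedparser is a custom component with its own separate configuration that I’m not reproducing here):

    feedreader:
      urls:                 # placeholder feeds - replace with your own
        - https://example.com/news/rss.xml
        - https://example.org/blog/feed.xml
      scan_interval:        # matches the 5-minute polling mentioned above
        minutes: 5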

So far, that looks pretty promising. For all the feed stuff, by the way, I spun up a Huginn container on an x86-64 server. Huginn is much better suited for that kind of thing anyway.

Another freeze… According to Telegraf, the HA container was trying to get 2.67 exabytes of RAM…!!! Uh, what?

I have now disabled the last suspect, which was a geo_location entity fed from a GeoJSON feed that could turn out rather big - not 2.67 exabytes, though. Let’s see…
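
For context, a geo_location entity like that is defined by a platform entry roughly like the sketch below - the URL and radius are placeholders, not the actual feed:

    geo_location:
      - platform: geo_json_events
        url: https://example.com/my-feed.geojson   # placeholder feed URL
        radius: 50                                 # km around home to include events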

OK, now the system is healthy and supported? I suppose I have to read up on what “supported” means…

More freezing. This time it was just swap thrashing, with no OOM killer involved. I kicked out Blitzortung and the stock weather app. Average memory usage is now around 70%. Let’s see how long this takes to fill up.

If memory is an issue, increasing the swap file size might help:

GitHub issue: how can i increase my swap file size? #968

There’s also a thread on how to make the change survive system reboots.

Incidentally, I should add that you can monitor some of this stuff via the System Monitor platform. No need for an additional addon that might also be contributing to the memory issues!
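
Something along these lines in configuration.yaml should cover the basics (the resource list here is just an example; on newer HA versions the System Monitor integration may be set up via the UI instead):

    sensor:
      - platform: systemmonitor
        resources:
          - type: processor_use
          - type: memory_use_percent
          - type: swap_use_percent
          - type: disk_use_percent
            arg: /              # mount point to monitor
          - type: load_1m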

Oh cool! Very hacky, but cool! Thanks! I scored an RPi4 8GB today thanks to rpilocator.com, so I hope the debugging will come to an end soon. It was an interesting hunt, though.
It might be time to remove the RPi3 from the list of recommended or supported boards for HA OS…

I have seen the System Monitor platform, but as far as I can tell it cannot show you the metrics per container, and I wanted to find out which container is using how much RAM and whether one of them is misbehaving.

Honestly, you seem to know your way around Linux, so just dump HAOS and install HA Container or HA Core on a stock RPi OS or Debian install. I’m running Core on an RPi3, both CPU and RAM usage are tiny, and I have never experienced any stability issues. Observability is obviously much better than on HAOS, too.

FWIW, I recently tried to install HAOS on a spare RPi3 for development and for testing a custom card of mine. I encountered countless weird problems: the system would sometimes start and sometimes not, with some supervisor error, and it would randomly stop responding. I thought: screw that, slapped on an RPi OS image, installed HA Core, and things were instantly stable.

Not sure if swapping onto an SD card is such a good idea…

Yeah, I should have checked whether the OP was running from an SSD (which would be another path well worth investigating in terms of improving overall stability - it certainly helped with my RPi3 HAOS installation).

Funny thing is that I had it running perfectly like that… I manage my containers with Portainer, and I had Ubuntu Server installed on exactly that Pi, running the Portainer edge agent, telegraf as an edge stack, HA and NodeRED. I thought I’d give HAOS a try because I like - among other things - that NodeRED is integrated into the HA UI and auth, so I don’t have to maintain separate auth and Traefik profiles. I’m starting to wonder whether that was smart. Besides the observability issue, I really don’t like that you can’t easily get SSH access to the actual system level. I understand why it was done like that, but as a Linux admin of 27 years and a Linux desktop user of 26, I just don’t feel happy if I can’t get root access to the bare metal :smiley:
But the problems will hopefully be gone once I swap the Pi 3 for an 8 GB Pi 4. And then I’ll probably connect the HAOS Pi to my console server, which should give me remote access to the bare metal, because I don’t want to start putting HDMI switches and monitors into my home rack and run over there every time I want to do anything.

Yeah, I mean, I get why they have HAOS and why they present it as the primary recommended way to run HA. Having an appliance-like approach is certainly the best way for new, non-technical users, and also for pros who just don’t want to bother with yet another device they would have to maintain.

But the HAOS devs clearly aren’t experienced distro builders and maintainers, and it shows. I find it hard to believe that a stripped-down buildroot install running a few containers - and that’s what HAOS ultimately is - can be so inefficient and such a resource hog. I’m running a full desktop environment with an X stack and LXDE on my HA RPi3, along with dev tools and tons of other, unrelated stuff, and it still uses only a fraction of the resources a stock HAOS install does.

Upgrading to a Pi 4 will obviously help (a lot). I personally refuse to upgrade the hardware simply because of an inefficient OS, though - but I’m an efficiency and optimization fanatic, so I’m weird about things like that :grin:

I’ve been encountering this issue too for a few days now. Can anyone connect it to a particular update?

I had the same behavior about 6 months ago, but upgraded the Pi 3B+ to an SSD and used the hacky swap-increase trick. It had been running buttery smooth, but suddenly the issues are back. Although swap usage is only at 12%, RAM is always around 80% or higher and the system freezes once a day.

Has anyone found a solution other than upgrading the hardware or installing HA Core on another OS?

I have a Pi3+SSD install and finally traced my general ‘performance’ issues down to the amount of data being stored and retrieved, particularly by high-frequency sensors.

I have a Tesla Powerwall integration which was recently ‘updated’, causing the 12 sensors provided by the integration to update once every 5 seconds, a six-fold increase on the previous update rate of once every 30 seconds. This is great in terms of accuracy for energy use calculations, but not so good for displaying that data in the UI. My own ‘Energy’ dashboard became very slow at bringing up the data, often causing the core to reboot when viewing the page. So I removed these sensors from the recorder and set up equivalent ‘lower-frequency’ sensors that I use for displaying the info on various dashboards, e.g.

    sensor:
      - platform: template
        sensors:
          powerwall_site_now_rounded:
            friendly_name: "Grid"
            unit_of_measurement: "kW"
            value_template: "{{ states('sensor.powerwall_site_now') | round(1) }}"
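
And excluding the raw high-frequency sensors from the recorder is just an exclude list; a rough sketch (the entity IDs here are examples - the actual Powerwall entity names may differ):

    recorder:
      exclude:
        entities:
          - sensor.powerwall_site_now    # raw 5-second sensors, no longer recorded
          - sensor.powerwall_load_now
          - sensor.powerwall_solar_now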

This dramatically cuts down the amount of data stored by the sensors and makes the dashboards displaying the data instantly responsive. I did the same for processor, memory, disk and swap usage, all of which are displayed in the UI. The issue there was that clicking on any of these sensors in the UI pulls 24 hours’ worth of data to display in the entity’s history graph, which could be 100K data points. This caused the CPU usage to spike, followed by various ‘integration took more than 10 seconds to respond’ messages in the logs.
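
On the dashboard side, pointing a card at the rounded sensor with a shorter time window also keeps each query small; a sketch of a standard history-graph card (the entity matches the template sensor above, the window length is arbitrary):

    type: history-graph
    title: Grid
    hours_to_show: 6          # pull much less than the 24 h default
    entities:
      - sensor.powerwall_site_now_rounded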

I’m guessing there is a threshold somewhere in the codebase for pulling data like this, where retrieving, for example, 10K data points is OK, but if you want 12K’s worth of data it has to go off and allocate more memory, resize buffers etc., causing a resource-limited platform like the Pi3 to grind to a halt.

Or it could be something totally unrelated - I’m just speculating here… Suffice it to say that anything which triggers large volumes of data to be retrieved from the database and sent to the client-side browser is bad…

Overall, in the scheme of things these data volumes are tiny, and I’ve seen apps that pull millions of data points into client-side browsers for processing without issue. What we need is a drive to make the platform more efficient instead of telling people to buy ‘better’ hardware.

Yesterday I finally migrated my HA to a Pi4/8GB, and I immediately noticed something interesting today: out of nowhere, NodeRED suddenly started to go wild on CPU usage for some hours last night. There was nothing special going on during that time… The Pi4 just shrugged it off, but a Pi3 would probably have locked up…