Hang every day at 0400

bwduncan · February 28, 2024, 12:44am

Recently my HA installation (Docker, RPi 3, Debian bullseye) has started hanging at around the same time each day, around 4am. It just grinds to a halt; the logs only have network timeout errors, entity update “is taking over 10 seconds” warnings, and connection errors (even for hosts on the LAN). I do have a ton of custom components, and yes, it is probably due to one of those. The process is still running, but one thread is CPU bound at 100%. There’s no excessive memory or FD use. It takes a while to fully die, maybe 20 minutes? I don’t know because I’m asleep. By the time I wake up, the frontend has stopped responding, automations have stopped, statistics have stopped, etc.

How do I debug this? I can run strace against it, but it’s just a bunch of futex_wait calls. It’s in Docker, so I have created a Docker image with gdb in it, but it still doesn’t show me a useful backtrace. All the threads are just in musl, and then gdb says the stack frame is corrupted. Should I persevere with pdb in a Docker image, or is there a better way?

Thanks

WallyR · February 28, 2024, 6:57am

At 04:00 it could be a massive database maintenance task, which will at times require lots of memory and CPU resources.
How much memory does your RPi 3 have? 1 GB is really not enough anymore, it seems.

francisp · February 28, 2024, 7:20am

All Pi 3 models have 1 GB. More memory only became available with the Pi 4.

bwduncan · February 28, 2024, 12:14pm

Could be but doesn’t feel like it… Got 386M “available” according to free and it’s not thrashing. It’s not even doing any IO. It’s not filled up swap, and in fact the home assistant process is using zero swap.

If it were a database task I would expect it to eventually complete, or make some progress. In strace I can see that some threads are still receiving MQTT messages, but the main thread is spinning. How can I see what it’s doing?

bwduncan · February 28, 2024, 12:20pm

I turned on debug logging for the custom component which I added most recently, but it is logging timeout errors while everything is grinding to a halt, so I don’t think that’s to blame. I really just need to see a Python backtrace for the main thread!

It’s bizarre that MainThread is able to send log messages stating the network is dead, and then 20 minutes later everything is stuck.
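
One way to get a Python-level backtrace without gdb is the standard library's `faulthandler` module, which can dump every thread's stack when the process receives a signal. The following is a minimal sketch, assuming the snippet can be made to run inside the Home Assistant process at startup (for example from a small custom component). The function name `install_traceback_dump` and the choice of `SIGUSR1` are illustrative, not anything Home Assistant provides.

```python
import faulthandler
import signal
import sys


def install_traceback_dump(signum=signal.SIGUSR1, file=sys.stderr):
    """Dump the Python stack of every thread whenever `signum` arrives.

    Hypothetical helper: the name and the default signal are assumptions.
    `file` must stay open for as long as the handler is registered,
    because faulthandler writes straight to its file descriptor.
    """
    # all_threads=True prints every thread, not only the one that
    # happened to receive the signal.
    faulthandler.register(signum, file=file, all_threads=True, chain=False)


if __name__ == "__main__":
    install_traceback_dump()
    # Simulate a stuck main thread; from another shell run
    #   kill -USR1 <pid>
    # and the stacks are written to stderr.
    while True:
        pass
```

Because the dump is produced by a C-level signal handler rather than by Python code, it should still work while the main thread is spinning. With Docker, the signal could be sent from the host with `kill -USR1` against the container's Python process ID, and the output would appear wherever the container's stderr goes.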

bwduncan · March 1, 2024, 1:26pm

Hung again last night. Still not sure how to debug this.

boheme61 · March 1, 2024, 1:52pm

Free out of how much total? If you have 1 GB, HA takes about 50% in “normal” use cases.
If you have 2 GB, HA also takes roughly 50% in “normal” use cases.
So HA, like an OS, keeps a share of the total in RAM for easier/faster access.
If you then “only” have 386 MB left and a ton of processes (and add-ons) start doing “maintenance”, you will most likely run dry (at around 4am). Is your “no swapping”/hanging observation based on that time span?

francisp · March 1, 2024, 2:05pm

Try py-spy

bwduncan · March 2, 2024, 12:06am

Thanks, that looks really promising. Unfortunately it doesn’t work with Python 3.12 yet.
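
Until py-spy is an option, a stdlib-only fallback might be a watchdog thread that notices when the asyncio event loop stops responding and records every thread's stack at that moment, so the evidence is captured at 4am without anyone being awake. This is a rough sketch under assumptions: `LoopWatchdog`, `format_all_stacks`, and the 30-second threshold are made-up names and values, and it has to be started from code running inside the Home Assistant process. It relies on `sys._current_frames()`, which is CPython-specific.

```python
import asyncio
import sys
import threading
import time
import traceback


def format_all_stacks() -> str:
    """Return a Python-level stack trace for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append(f"--- {names.get(ident, 'unknown')} ({ident}) ---\n")
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


class LoopWatchdog:
    """Report all thread stacks when the event loop misses its heartbeat."""

    def __init__(self, loop, stall_seconds=30.0, report=print):
        self._loop = loop
        self._stall = stall_seconds
        self._report = report  # callable taking one string
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    def _beat(self):
        # Runs on the event loop thread; proves the loop is still alive.
        self._last_beat = time.monotonic()

    def _run(self):
        # Wake up twice per stall period until stop() is called.
        while not self._stop.wait(self._stall / 2):
            try:
                self._loop.call_soon_threadsafe(self._beat)
            except RuntimeError:
                return  # the loop has been closed
            if time.monotonic() - self._last_beat > self._stall:
                self._report(format_all_stacks())
                # Reset so a long stall is reported once per period.
                self._last_beat = time.monotonic()

    def start(self):
        threading.Thread(
            target=self._run, name="loop-watchdog", daemon=True
        ).start()

    def stop(self):
        self._stop.set()
```

Passing a `report` callable that appends to a file would keep the trace across a restart. One caveat: the watchdog is ordinary Python code and needs the GIL, so if the main thread is stuck inside a C extension that never releases it, nothing will be reported and a signal-based dump is the better bet.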