Raspberry Pi 4 hangs loading HAOS

I’ve had HAOS running in a VM on a Raspberry Pi 4B (8GB) for several months with no issues. It’s a bit of an odd setup: the Pi runs 64-bit Ubuntu 22.04 with some other services, and HAOS runs in a qemu/libvirt VM. When I first installed it, I used the haos_generic-aarch64-16.3.qcow2 image. Everything has been fine for a while, and I’ve done several updates, all with no problem.

Recently it stopped working, and I found the RPi had completely frozen. I restarted with a monitor attached, and it halts in about the same place on every boot, while loading the HAOS VM. So I put it in safe mode and disabled libvirt/qemu, and everything else starts up fine and is stable. When I try to start the HAOS VM, the system halts about 2-3 seconds later. I haven’t yet been able to get the VM’s console started in time to see the messages, and it hard-locks the system with no obvious logs of what’s going on.

This is where it gets strange(r).
I tried to create a new VM with the same image I originally used and the same commands/options, in case it was a HAOS image that broke things. That image halts and crashes the Pi in exactly the same manner. Thinking it might be a hardware issue, I tried a brand-new, out-of-the-box RPi4 8GB and got the same result.

There are no USB devices connected, apart from a wireless keyboard dongle I added while troubleshooting (it was not there for the initial crashes). The RPi is connected to wired Ethernet and uses what I believe to be a solid, stable power supply: a PoE dongle rated for 2.4A/5V output. I’ve gotten no voltage warnings or errors while using it, unlike the UPS the RPi was briefly plugged into, which sprayed frequent voltage warnings. The case is a GeekPi passive heatsink case, and temperatures don’t appear to be approaching anything that would cause concern. The other workloads on this Pi are also minimal: a TP-Link Omada controller, UNMS/UISP (whatever they call it this week) running in Docker, and Tailscale. I’ve stopped all of those and started up the VM, and still got the same system hang.

The only other thing that’s changed since initial install is some of the Ubuntu packages, but I did not perform any system updates/upgrades between when HAOS was working, and when it started crashing the system.

If it were just the VM hanging, I could at least try to troubleshoot that, but it’s much harder when it’s the entire RPi. Any suggestions beyond moving to an RPi5 or other hardware, or adding a second RPi dedicated to HAOS (which I’d prefer not to do, but… could be an option, I suppose)?

SD card died… They all die. They are not designed to run a database with a lot of writes, which HA does.

This SD card is not that old, and it’s a 256GB Samsung card without much on it, so there is lots of flash for it to spread the writes over. At this point it probably hasn’t even completed one full drive write (DW). Additionally, everything works fine up until starting the HAOS VM image; a new image (like 16.3) crashes in the same way. A new image created from the original 16.1 file crashes in the same way. There are no write or read errors coming from the uSD card. So, while I think that’s absolutely a common culprit, I don’t think it’s the issue here.
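For anyone who wants to put a rough number on the "not even one full drive write" estimate, here is a minimal Python sketch. It assumes the card shows up as `/sys/block/mmcblk0` (a guess at the device name, adjust as needed) and relies on the kernel's block-layer `stat` file, where the seventh field is sectors written in 512-byte units. The counter resets at every boot, so this only shows writes since the last boot, not lifetime wear.

```python
from pathlib import Path

SECTOR_BYTES = 512  # /sys/block/<dev>/stat counts sectors in 512-byte units


def bytes_written(stat_text: str) -> int:
    """Return bytes written since boot, from the contents of a block 'stat' file."""
    fields = stat_text.split()
    # Field 7 (index 6) is the number of sectors written.
    return int(fields[6]) * SECTOR_BYTES


def drive_writes(written_bytes: int, capacity_bytes: int) -> float:
    """Express written bytes as a fraction of one full drive write."""
    return written_bytes / capacity_bytes


if __name__ == "__main__":
    stat_path = Path("/sys/block/mmcblk0/stat")  # assumed device name
    capacity = 256 * 10**9  # nominal 256GB card, as described above
    written = bytes_written(stat_path.read_text())
    print(f"{written / 10**9:.2f} GB written since boot "
          f"({drive_writes(written, capacity):.4f} drive writes)")
```

Multiplying the per-day figure by the card's age gives only a ballpark, since write amplification inside the card is not visible from the host.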

Nonetheless, I’ll grab a new uSD (I have a pack of 64GB cards here), put a new Ubuntu install on it, and see if I can replicate the crash. Thanks for the suggestion.

Are ALL the ‘my new install hangs’ issues getting reported in these forums related? I’ve seen a few already today. Starting to see a common pattern emerging.

Check the time on your system. The VM time. Tell us what it is. Is the network stable?

Preserve your system logs. Report a problem on GitHub against HA Core. I suspect it is faulty.

The workaround most people try is to reboot, after which it seems to catch up and continue.

Keep us posted.

I’m not sure which are the “obvious logs” in Ubuntu, you tell me … After you’ve checked them all, maybe reinstall Ubuntu and “apply” your VM file again … I know for sure that with VMs on Linux you should be very careful which packages you install.
It’s not like on Windows, where e.g. VMware is an isolated app (don’t hate that word, we have to live with apps everywhere now) :grin:

Like death and taxes! Inevitable. They usually fail with warnings in the system log, not abruptly. The A2 variants might get you a longer life.

I’ve seen brand new ones out of the bubble wrap mess up.

Whaaat, they don’t stress test them all before selling?

I think the SD card might not be the underlying issue here. There are too many ‘VM hanging’ reports for it to be a batch of faulty SD cards out there, even though that possibility should still be ruled out by asking the question and getting a confirmation here, as has been offered.

Well if it’s a fake branded with Samsung, then it was tested, rejected, and sold off the back of a truck.

What I mean is - the system completely hangs, so there is nothing visible in /var/log or using journalctl. It hangs early enough in the haos VM boot process that everything in the image is still read-only. The only thing possibly useful I’ve gotten is that sometimes if I leave it sit long enough, I get a bunch of soft lockup - CPU Stuck errors, like these:

Well, that one only tells me that it “Starved” :grin: (maybe it wants more “Jiffies”, or fewer, I don’t know).
(But please check IOT7712’s earlier question about your clock, though it may well be that the CPU couldn’t handle “something”.)
I was actually referring to your logs in Ubuntu, not HA’s log, since you say it hangs the whole system! But maybe I misunderstood.

Just thinking, how can a VM hang a host? … Check your Ubuntu logs.

Off on another tangent:
Can you try something for me?
Very gingerly, because you might get burnt.
Can you just check how warm the processor chip is on your Raspberry Pi? Use the back of your hand before using your index finger.
Warm, hot, nuclear?
Is it heatsinked or fan cooled?

After you have done that, can you try again, this time after it has been fully powered down and cooled for a few minutes.

Right, that’s what I’m trying to understand… how can a VM hang the whole system? And yet, it does. Hence, I don’t have logs from Ubuntu OR the VM, because the whole thing hangs. If it were just the VM, I would at least have something I could poke at with a sharp pointy stick. I assume perhaps it’s passing through some unsupported (or buggy?) CPU instruction, but it’s hard to say for sure.

I’ll try to start it again and see if I can get on the console early enough to tell what it’s doing. The system clock itself is fine/accurate, but it’s hard to tell on the VM because it’s crashing so early (and taking the host system with it).

Heatsink (GeekPi passive case, where the entire case acts as a heatsink). I tried exactly that on a new RPi board after it crashed (I used the same uSD to keep everything else the same), and the chip was warm, maybe uncomfortably so if I held my finger there long enough, but not “melt your flesh off” hot. I’ve run some CPU stress tests in this case and the CPU maxed out around 46C.
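If a number is more convincing than a fingertip, a small Python sketch like the one below can print the SoC temperature once a second while the VM starts, so the last reading before the hang stays visible on the monitor or SSH session. It assumes the CPU sensor is exposed as `/sys/class/thermal/thermal_zone0/temp` (the usual location on a Pi, but treat the zone number as an assumption); the kernel reports the value in millidegrees Celsius.

```python
import time
from pathlib import Path

# Assumed sensor location; the zone number may differ on other boards.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw: str) -> float:
    """Convert the sysfs reading (millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


if __name__ == "__main__":
    # Sample once a second for a minute; start the VM in another terminal.
    for _ in range(60):
        temp_c = parse_millidegrees(THERMAL_ZONE.read_text())
        # flush=True so the line is shown immediately, before any hang.
        print(f"{time.strftime('%H:%M:%S')} {temp_c:.1f}C", flush=True)
        time.sleep(1)
```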

OK, though I find that unlikely … every OS has debug options, too.

Thanks. Cross that one off the checklist. Gotta pin it down to software, not hardware, and then home in for the kill.

To be clear - Ubuntu has logs, but only right up to the crash, because the entire system hangs. I’ll dig up what it does have of state pre-crash and report back. I’m happy to turn on debugging, but have very little qemu experience, particularly with debugging… will have to do a bit of research.
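As a starting point for digging up the pre-crash state, here is a rough Python sketch that pulls the kernel messages from the previous boot with `journalctl -k -b -1` and filters them for lockup-style lines. It assumes journald keeps a persistent journal (otherwise there is no previous boot to read), and the keyword list is just a guess based on the “soft lockup”, “stuck” and “starved” messages mentioned earlier in the thread, not a complete list of kernel hang signatures.

```python
import subprocess

# Keywords are assumptions drawn from the messages quoted in this thread.
CLUES = ("soft lockup", "stuck for", "starved")


def find_hang_clues(lines, clues=CLUES):
    """Return the log lines that contain any of the clue keywords."""
    return [line for line in lines if any(c in line.lower() for c in clues)]


def previous_boot_kernel_log():
    """Read kernel messages from the previous boot (needs a persistent journal)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    log = previous_boot_kernel_log()
    print("Last 40 kernel lines before the hang:")
    for line in log[-40:]:
        print(line)
    print("\nLines that look like lockups or stalls:")
    for line in find_hang_clues(log):
        print(line)
```

A hard lock can lose the last few seconds of log that never reached the card, so an empty result here does not prove the kernel had nothing to say.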

OK, another option is to take out the SD card, open it on a laptop/PC, and dig into the VM file (for the HA log).
Though I would have reinstalled the OS and the HA VM from scratch, then restored from an “older” backup.
I already get the feeling that this procedure would save more time.

Gasp!
If I had pearls I would clutch them immediately.