HAOS VM intermittently freezes at night: history stops, UI unreachable

Setup:

  • Home Assistant OS as a VM on Proxmox, with Proxmox HA (high availability) across 3 nodes
  • Timezone: Europe/Berlin
  • Access constraint: I only have core_ssh

Symptoms:

  • On multiple nights, at a varying time during the night, history recording stops (e.g., temperatures flatline).
  • By morning the HA web UI is unreachable.
  • In Proxmox the VM still shows “running.”
  • A VM reboot immediately restores normal operation (data gap remains).

What I’ve checked (quick):

  • Host (HAOS) logs around incidents: no OOM, no kernel panic, nothing obvious.
  • Core logs: no recorder/database/sqlite/sqlalchemy/history errors at the time of the stall.
  • “Hypervisor initiated shutdown” only appears when I reboot in the morning (so not the cause).
  • Mosquitto “API not ready” messages only during boot, not at the failure times.

What I’m looking for (open-ended):

  • Ideas on where to look first on Proxmox and HA when the VM is “running” but UI/History stall at night.
  • Tips to capture evidence while it happens (lightweight logs/alerts I can enable from core_ssh and from Proxmox).
  • Known gotchas matching these symptoms (snapshots, HA failover/migrations, storage hiccups, etc.) — real-world cases welcome.
  • Minimal workarounds that helped you stop similar night-time stalls.

Happy to share sanitized log snippets if helpful. Thanks!

Do you have nightly VM backups enabled, by any chance? I’m not sure exactly how Proxmox handles it, but hypervisors typically stun the VM (temporarily deschedule it), take a snapshot, resume the VM, copy the snapshot to a backup, and then remove the snapshot.

My guess would be something breaking in that process. Could be that you’re out of space for the snapshot (so the VM remains stunned). Could be that the resume never triggers because of some other reason.

Perhaps not as detailed a response as you’re after (I don’t know Proxmox logging, unfortunately). Hopefully it gives you somewhere to start and something to rule in or out, though.
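
One way to test the stun theory from inside the guest is a simple heartbeat: write a timestamp every minute, then look for holes afterwards. A VM that was paused or frozen shows up as a gap between two consecutive beats. Below is a minimal sketch, assuming Python 3 is available wherever you run it (I have not checked whether core_ssh ships it). The log path, the 60-second interval, and all function names are made up for illustration.

```python
import os
import time


def append_heartbeat(path, now=None):
    """Append one line with the current epoch time (whole seconds) to `path`."""
    ts = time.time() if now is None else now
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{ts:.0f}\n")
        fh.flush()
        os.fsync(fh.fileno())  # push the beat to disk so a hard freeze loses little


def read_heartbeats(path):
    """Read the heartbeat file back into a list of integer epoch seconds."""
    with open(path, encoding="utf-8") as fh:
        return [int(line) for line in fh if line.strip()]


def find_gaps(timestamps, expected_interval=60, slack=30):
    """Return (last_beat, next_beat) pairs spaced further apart than expected."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected_interval + slack:
            gaps.append((prev, cur))
    return gaps


def run(path, interval=60, beats=1440):
    """Write `beats` heartbeats, sleeping `interval` seconds between them."""
    for _ in range(beats):
        append_heartbeat(path)
        time.sleep(interval)


# Example (hypothetical path): run("/share/heartbeat.log") for 24 hours, then
# print(find_gaps(read_heartbeats("/share/heartbeat.log"))) in the morning.
```

If the last beat before the morning reboot lines up with the moment the history flatlines, the whole guest stopped at that point rather than just the recorder or the UI. If the beats continue through the night, the guest OS was still running and the problem is more likely inside Home Assistant or the network path.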

Hi @erict7 — thanks for the suggestion.

  • I checked Proxmox and excluded automatic snapshots/backups (they were disabled).
  • I checked the logs and found nothing notable.
  • The VM now has 12 GB fixed RAM (ballooning disabled).

That is all I have verified so far. If you can point to specific log files/entries or a targeted command you want me to run (e.g. under /var/log/pve/tasks/ or a qm agent check), I will run it and post the sanitized results.
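
To make the task-log check concrete, here is a rough sketch for filtering Proxmox task entries by VM and time window, so anything that touched the VM around the stall stands out (backup, snapshot, migration). It assumes each task line starts with a UPID of the form `UPID:node:pid:pstart:starttime:type:id:user:` with `starttime` as hexadecimal epoch seconds. That is my understanding of the format, so please compare it against a real line from your own task index first. The VM ID and function names are placeholders.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Task(NamedTuple):
    node: str
    started: datetime  # UTC; convert to Europe/Berlin when comparing with HA graphs
    task_type: str
    target_id: str
    user: str


def parse_upid(line):
    """Parse the leading UPID token of a task-log line; return None if it does not fit."""
    tokens = line.split()
    if not tokens:
        return None
    parts = tokens[0].split(":")
    # Assumed layout: UPID:node:pid:pstart:starttime:type:id:user:
    if len(parts) < 8 or parts[0] != "UPID":
        return None
    try:
        started = datetime.fromtimestamp(int(parts[4], 16), tz=timezone.utc)
    except ValueError:
        return None
    return Task(parts[1], started, parts[5], parts[6], parts[7])


def tasks_for_vm(lines, vmid, window_start, window_end):
    """Return parsed tasks for `vmid` that started inside [window_start, window_end)."""
    hits = []
    for line in lines:
        task = parse_upid(line)
        if task is None or task.target_id != str(vmid):
            continue
        if window_start <= task.started < window_end:
            hits.append(task)
    return hits
```

Feeding it the lines of the task index on each node, with a window of a few hours around the flatline, should show quickly whether any task ran against the VM at that time. An empty result on all three nodes would be a fairly strong argument against the backup/migration theory.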

You are using a VLAN. Make a copy of the VM and try it without the VLAN and without setting the MTU. You do not need 6 cores; 4 are enough. You could also try setting the processor type to host or kvm64 (if memory serves, your current AES setting can cause issues for HA).
See below for an example of a working VM.