HA crashes every few hours

florian.baethge · February 17, 2025, 2:30pm

Hi,
last month I installed HAOS on a RPi5 with SSD and configured the basics. I also installed and configured the Cloudflared AddOn which works fine so far.

After a few days though I’ve noticed that HA crashes and becomes unavailable about once a day, sometimes at night, sometimes during the day. When opening the website or App it still shows the cached version of the dashboards, only the History Graph Card doesn’t load. When I then try to access the configuration.yaml over the File Editor or open the Energy Dashboard, it also cannot load.

However, MQTT is still working. I regularly publish to a topic on my Mosquitto broker (which runs on another local RPi), and that one keeps receiving updates even when the device is down. Other data, such as history and energy data, is not recorded while it is down.

When I then completely reload the website or Companion App, it no longer loads. The only way to recover is to unplug the Pi and plug it back in a bit later. Sometimes I even have to do this twice.

I don’t see anything suspicious in the logs while it’s running, though. I cannot connect through SSH when it is offline. Ping still works, and the device also shows up in my Unifi Network Console.

Any ideas how I can continue to find this bug?

fleskefjes · February 17, 2025, 2:35pm

Start by looking through the logs; home-assistant.log.1 should contain the relevant entries up until the latest crash.

MaxK · February 17, 2025, 2:46pm

Molodax · February 17, 2025, 3:48pm

I had exactly the same issue with my RPi5 running from a microSD card, without any SSD. The only workaround that works is to downgrade HAOS from 14.x to 13.2.
I’m still trying to find the culprit but it seems related to OS v.14.x (tried 14.1, 14.2 with the same result).

TRS-80 · February 17, 2025, 4:00pm

Just wanted to reiterate this. You have addressed the SD card issue by using SSD instead. But do not fail to heed the power supply warnings. These are the 2 most common issues on SBCs in general. All sorts of “weird” issues just go away when using good quality storage media and power supplies. At least, these should be looked at first / eliminated before looking at other potential issues.

donbaq · February 17, 2025, 6:09pm

How much RAM does your RPi have?

I wonder how the vm.dirty_ratio and friends are set up on HAOS, if it’s the default values you may have run into an issue with write buffers filling up and slowing everything down once they’re flushed. If your disk is not fast enough or reports errors during writes (e.g. not enough power or electrical noise or whatever) the Linux kernel will throttle every disk-accessing process until the queued writes are complete. If processes and data are already in RAM, they’ll work mostly fine.

IOW try lowering your sysctl vm.dirty_ratio to something small especially if you’ve got a lot of RAM so the write queue can’t get too big.

florian.baethge · February 17, 2025, 9:17pm

@MaxK @TRS-80
Thanks for that info.
It’s a new RaspPi 5 with 8GB RAM. The power supply also is a new official 27W Raspberry Pi USB C Power supply.

Using an NVMe PCIe board on the RaspPi, I use a "WD Blue SN580 NVMe SSD 1 TB (PCIe Gen4 x4, up to 4,150 MB/s read, M.2 2280, nCache 4.0 technology)".

Or is this drive not sufficient?

There are no other peripherals connected to the RaspPi yet. And it’s connected via Ethernet.

florian.baethge · February 17, 2025, 9:22pm

@donbaq It’s an 8GB version. See my other reply with more specs of the used SSD…

I’ve now set up direct SSH access and can read the dirty-params:

# sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

donbaq · February 17, 2025, 10:10pm

Try setting these

vm.dirty_background_ratio = 3
vm.dirty_ratio = 6
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100

You have a lot of RAM, the defaults don’t make sense.

You also have a very fast disk which should handle everything Home Assistant throws at it easily… alas. I can’t promise these will help a lot, but they just might help enough to not freeze the machine.
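
To put rough numbers on this, here is a minimal Python sketch of how the two ratios translate into bytes of unwritten data. It assumes the whole 8 GB counts as available memory, which overstates it (the kernel applies the ratios to free plus reclaimable memory, not total RAM). `dirty_limits` is just an illustrative helper, not part of any tool.

```python
GIB = 1024 ** 3


def dirty_limits(available_bytes: int, background_ratio: int, ratio: int) -> tuple[int, int]:
    """Return (background_threshold, hard_threshold) in bytes.

    background_threshold: dirty data at which background writeback starts.
    hard_threshold: dirty data at which writing processes get throttled.
    """
    background = available_bytes * background_ratio // 100
    hard = available_bytes * ratio // 100
    return background, hard


# Assumption: treat all 8 GiB as available memory (an upper bound).
available = 8 * GIB

for label, bg_ratio, hard_ratio in (("defaults", 10, 20), ("suggested", 3, 6)):
    bg, hard = dirty_limits(available, bg_ratio, hard_ratio)
    print(
        f"{label}: background writeback from {bg / GIB:.2f} GiB, "
        f"writers throttled at {hard / GIB:.2f} GiB"
    )
```

The point is only the order of magnitude: with the defaults, well over a gigabyte of writes can queue up before anything is throttled, while the lower values cap the queue at a few hundred megabytes.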

MaxK · February 17, 2025, 10:23pm

What’s in the logs?

florian.baethge · February 18, 2025, 7:53am

@donbaq Thanks, I will try to change those settings and see if it helps.

@MaxK Last night it went down again for a few hours until I restarted it this morning (the last energy data was collected between 1:00 and 2:00 at night, and I restarted at about 6:34 this morning). Weirdly, even unplugging it and plugging it back in 2-3 times didn’t bring it back up this morning. So I am now also trying a different power source: instead of the 27W one that came with the set, I plugged it into an old MacBook USB-C charger that delivers 67W. Maybe that helps.

Regarding the logs: yesterday, there was nothing in the logs I got by following “Debugging the Home Assistant Operating System” (Home Assistant Developer Docs).

While it was down today, I wasn’t able to connect via SSH (directly, not via Add-On). So now after restarting I can see the following logs on the website:

Supervisor:

2025-02-18 01:41:47.141 INFO (MainThread) [supervisor.resolution.check] Starting system checks with state running
2025-02-18 01:41:47.141 INFO (MainThread) [supervisor.resolution.checks.base] Run check for dns_server_failed/dns_server
2025-02-18 01:41:47.224 INFO (MainThread) [supervisor.resolution.checks.base] Run check for trust/supervisor
2025-02-18 01:41:47.229 INFO (MainThread) [supervisor.resolution.checks.base] Run check for disabled_data_disk/system
2025-02-18 01:41:47.229 INFO (MainThread) [supervisor.resolution.checks.base] Run check for free_space/system
2025-02-18 01:41:47.229 INFO (MainThread) [supervisor.resolution.checks.base] Run check for pwned/addon
2025-02-18 01:41:47.229 INFO (MainThread) [supervisor.resolution.checks.base] Run check for multiple_data_disks/system
2025-02-18 01:41:47.229 INFO (MainThread) [supervisor.resolution.checks.base] Run check for dns_server_ipv6_error/dns_server
2025-02-18 01:41:47.230 INFO (MainThread) [supervisor.resolution.checks.base] Run check for docker_config/system
2025-02-18 01:41:47.230 INFO (MainThread) [supervisor.resolution.checks.base] Run check for ipv4_connection_problem/system
2025-02-18 01:41:47.230 INFO (MainThread) [supervisor.resolution.checks.base] Run check for security/core
2025-02-18 01:41:47.230 INFO (MainThread) [supervisor.resolution.check] System checks complete
2025-02-18 01:41:47.230 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
2025-02-18 01:41:47.296 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
2025-02-18 01:41:47.296 INFO (MainThread) [supervisor.resolution.fixup] Starting system autofix at state running
2025-02-18 01:41:47.296 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
2025-02-18 01:43:19.226 INFO (MainThread) [supervisor.homeassistant.api] Updated Home Assistant API token
2025-02-18 01:43:19.650 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/brenner-tobias/ha-addons repository
2025-02-18 01:43:19.652 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/hacs/addons repository
2025-02-18 01:43:19.655 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/music-assistant/home-assistant-addon repository
2025-02-18 01:43:19.660 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/esphome/home-assistant-addon repository
2025-02-18 01:43:19.663 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/hassio-addons/repository repository
2025-02-18 01:43:19.670 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/home-assistant/addons repository
2025-02-18 01:43:20.535 INFO (MainThread) [supervisor.store] Loading add-ons from store: 83 all - 0 new - 0 remove
2025-02-18 01:43:20.535 INFO (MainThread) [supervisor.store] Loading add-ons from store: 83 all - 0 new - 0 remove
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/udev.sh
[05:34:23] INFO: Using udev information from host
cont-init: info: /etc/cont-init.d/udev.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun supervisor (no readiness notification)
services-up: info: copying legacy longrun watchdog (no readiness notification)
[05:34:23] INFO: Starting local supervisor watchdog...
s6-rc: info: service legacy-services successfully started
2025-02-18 05:34:24.376 INFO (MainThread) [__main__] Initializing Supervisor setup
2025-02-18 06:34:24.424 INFO (MainThread) [supervisor.bootstrap] Setting up coresys for machine: raspberrypi5-64
2025-02-18 06:34:24.428 INFO (MainThread) [supervisor.docker.supervisor] Attaching to Supervisor ghcr.io/home-assistant/aarch64-hassio-supervisor with version 2025.02.1

Now after reboot I can read from journalctl, but only get info from after the restart:

Feb 18 05:34:22 kiwi systemd[1]: Starting HassOS supervisor...
Feb 18 05:34:22 kiwi docker[1100]: hassio_supervisor
Feb 18 05:34:22 kiwi systemd[1]: Started HassOS supervisor.
Feb 18 05:34:22 kiwi hassos-supervisor[1109]: [INFO] Starting the Supervisor...
Feb 18 05:34:23 kiwi hassos-supervisor[1145]: hassio_supervisor

The logs from docker logs homeassistant also don’t show a real error for the timeframe:

2025-02-18 00:14:09.004 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
2025-02-18 00:39:53.391 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
2025-02-18 00:46:57.111 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
2025-02-18 01:00:26.752 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
2025-02-18 01:26:30.231 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
2025-02-18 01:54:54.501 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun home-assistant (no readiness notification)
s6-rc: info: service legacy-services successfully started
2025-02-18 06:35:04.845 WARNING (SyncWorker_0) [homeassistant.util.yaml.loader] YAML file /config/configuration.yaml contains duplicate key "script". Check lines 25 and 162
2025-02-18 06:35:04.903 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration landroid_cloud which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2025-02-18 06:35:04.904 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration liquid-check which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2025-02-18 06:35:04.905 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration hacs which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2025-02-18 06:35:06.069 WARNING (Recorder) [homeassistant.components.recorder.util] The system could not validate that the sqlite3 database at //config/home-assistant_v2.db was shutdown cleanly
2025-02-18 06:35:06.087 WARNING (Recorder) [homeassistant.components.recorder.util] Ended unfinished session (id=93 from 2025-02-17 12:42:24.762334)
2025-02-18 06:35:07.955 WARNING (SyncWorker_4) [homeassistant.util.yaml.loader] YAML file /config/configuration.yaml contains duplicate key "script". Check lines 25 and 162
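
Since neither log shows an error, one way to narrow down when the system stopped logging is to look for the longest silence between timestamped lines. A minimal Python sketch, assuming the `YYYY-MM-DD HH:MM:SS.mmm` prefix seen above; `largest_gap` is a hypothetical helper and the sample lines are copied from the output above:

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Matches the timestamp prefix used by Home Assistant Core and Supervisor logs.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d{3}\b")


def largest_gap(lines: Iterable[str]) -> Optional[tuple[timedelta, datetime, datetime]]:
    """Return (gap, last_line_before, first_line_after) for the longest silence.

    Lines without a timestamp prefix (e.g. s6-rc output) are skipped.
    Returns None if fewer than two timestamped lines are found.
    """
    stamps = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))

    best = None
    for previous, current in zip(stamps, stamps[1:]):
        gap = current - previous
        if best is None or gap > best[0]:
            best = (gap, previous, current)
    return best


sample = [
    "2025-02-18 01:26:30.231 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0",
    "2025-02-18 01:54:54.501 WARNING (MainThread) [homeassistant.components.nanoleaf] Received unknown touch gesture ID 0",
    "s6-rc: info: service s6rc-oneshot-runner: starting",
    '2025-02-18 06:35:04.845 WARNING (SyncWorker_0) [homeassistant.util.yaml.loader] YAML file /config/configuration.yaml contains duplicate key "script". Check lines 25 and 162',
]

result = largest_gap(sample)
if result is not None:
    gap, before, after = result
    print(f"Longest silence: {gap} (from {before} to {after})")
```

This only brackets the time window of the hang; it does not explain it.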

However, is it normal that the filesystem is still read-only when connecting via SSH directly? Whenever I try to, e.g., write logs into a separate file or just create a file, I get

touch: foo: Read-only file system

# dmesg | grep EXT4
[    1.469020] EXT4-fs (nvme0n1p7): mounted filesystem 8bb15302-089d-45de-bcf9-1e449a4dc0c9 r/w with ordered data mode. Quota mode: none.
[    2.107528] EXT4-fs (nvme0n1p8): mounted filesystem 3862c59c-1810-4a47-83d7-adae7227f111 r/w with ordered data mode. Quota mode: none.
[    2.130874] EXT4-fs (nvme0n1p8): resizing filesystem from 244004017 to 244004017 blocks
[    2.224449] EXT4-fs (zram2): mounted filesystem c17f9bf7-400d-4a69-ac1b-ed6edc84dc4b r/w without journal. Quota mode: none.

# mount | grep ' / '
/dev/nvme0n1p5 on / type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)

However, this is while it seems to be running fine for me. I can access the page, edit files via the File Editor etc… only via SSH it says it’s read-only.

Any other tips?
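
A note on the read-only question: erofs is a read-only filesystem type, so a read-only `/` in the `mount` output above is expected and does not by itself point to corruption; the ext4 partitions in the dmesg output are mounted r/w. A small Python sketch for pulling the options out of a `mount` line (`parse_mount_line` is an illustrative helper, not an existing tool):

```python
def parse_mount_line(line: str) -> tuple[str, str, set[str]]:
    """Split '<device> on <mountpoint> type <fstype> (<options>)'.

    Returns (mountpoint, fstype, options).
    """
    _device, rest = line.strip().split(" on ", 1)
    mountpoint, rest = rest.split(" type ", 1)
    fstype, options = rest.split(" (", 1)
    return mountpoint, fstype, set(options.rstrip(")").split(","))


# Line copied from the `mount | grep ' / '` output above.
root_line = "/dev/nvme0n1p5 on / type erofs (ro,relatime,user_xattr,acl,cache_strategy=readaround)"

mountpoint, fstype, options = parse_mount_line(root_line)
state = "read-only" if "ro" in options else "read-write"
print(f"{mountpoint} is {fstype}, mounted {state}")
```

Running the same check against the line for the data partition would show whether that one has been remounted read-only, which would be the more worrying sign.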

MaxK · February 18, 2025, 12:06pm

As asked above:

What is in home-assistant.log.1

This is the log file you need to read after a crash.

Also, as mentioned in the link I provided earlier, your file system may be corrupted from the crashes. You may need to reinstall fresh and restore from a backup.

florian.baethge · February 18, 2025, 12:36pm

In the /config folder I only have the home-assistant.log and home-assistant.log.1 files. The latter one only contains entries starting this morning from 6:34, after I restarted… and the other one only even newer ones… So no other insights from this sadly. If it crashes again I’ll try to read those out again.

Also, if it keeps hanging, I will probably have to reinstall and restore from a backup.

florian.baethge · February 20, 2025, 8:29am

OK, after switching the power source and changing the vm.dirty settings, the system ran fine for over 24h and I was hopeful.

Then it crashed again, so I decided to restore from a backup. I downloaded the backup, took out the SSD and flashed HAOS onto it again, selected the restore option during onboarding, and was successfully up and running again…

Then this morning, while I was checking again, it crashed while I was on the page. So I was able to open the Settings > System > Logs page and see the Home Assistant Core errors shown there.

Here are the logs as text:

https://gist.github.com/fbaethge/c469970ba65c888075cbe807be946076

Logger: homeassistant.components.recorder.core
Source: components/recorder/core.py:1192
integration: Recorder (documentation, issues)
First occurred: 06:44:50 (76 occurrences)
Last logged: 06:48:32

Error in database connectivity during commit: Error executing query: (sqlite3.OperationalError) disk I/O error (Background on this error at: https://sqlalche.me/e/20/e3q8). (retrying in 3 seconds)
Error in database connectivity during commit: Error executing query: (sqlite3.OperationalError) unable to open database file (Background on this error at: https://sqlalche.me/e/20/e3q8). (retrying in 3 seconds)

So it seems to be some kind of database error now? Is there a way to repair/fix this malformed database somehow?

florian.baethge · February 20, 2025, 7:12pm

I’ve now also tried to change the database to MariaDB which worked fine and also changed the commit_interval for the recorder from 5 to 10, in case I was receiving too much sensor load.

However, I still got the database error again.

MaxK · February 20, 2025, 10:12pm

Perhaps delete the database and let it rebuild from scratch?

Edit: also, can you post the output of

Settings → system → repairs → [three dots upper right] → system information

florian.baethge · February 21, 2025, 8:12am

@MaxK When switching to MariaDB, don’t I get a new DB? Still it also crashed there, basically on a new database…

Here is the system information from the repairs page:

System information
Version	core-2025.2.4
Installation type	Home Assistant OS
Development	false
Supervisor	true
Docker	true
User	root
Virtual environment	false
Python version	3.13.1
Operating system family	Linux
Operating system version	6.6.62-haos-raspi
CPU architecture	aarch64
Timezone	Europe/Berlin
Configuration directory	/config
Home Assistant Community Store
GitHub API	ok
GitHub Content	ok
GitHub Web	ok
HACS Data	ok
GitHub API Calls Remaining	5000
Installed Version	2.0.5
Stage	running
Available Repositories	1547
Downloaded Repositories	11
Home Assistant Cloud
Logged In	false
Reach certificate server	ok
Reach authentication server	ok
Reach Home Assistant Cloud	ok
Home Assistant Supervisor
Host operating system	Home Assistant OS 14.2
Update channel	stable
Supervisor version	supervisor-2025.02.1
Agent version	1.6.0
Docker version	27.2.0
Disk total	916.2 GB
Disk used	9.2 GB
Healthy	true
Supported	true
host_connectivity	true
supervisor_connectivity	true
ntp_synchronized	true
virtualization	
Board	rpi5-64
Supervisor API	ok
Version API	ok
Installed add-ons	Advanced SSH & Web Terminal (20.0.1), Cloudflared (5.2.9), File editor (5.8.0), Get HACS (1.3.1), Mosquitto broker (6.5.0), Studio Code Server (5.18.2), MariaDB (2.7.2)
Dashboards
Dashboards	3
Resources	8
Views	7
Mode	storage
Network Configuration
Adapters	lo (disabled), end0 (enabled, default, auto), docker0 (disabled), hassio (disabled), veth22f9c17 (disabled), veth2063d35 (disabled), veth026441a (disabled), veth20a3e0f (disabled), veth618dba5 (disabled), veth3e95daf (disabled), veth276899a (disabled), vethd98a322 (disabled)
IPv4 addresses	lo (127.0.0.1/8), end0 (172.16.2.10/16), docker0 (172.30.232.1/23), hassio (172.30.32.1/23), veth22f9c17 (), veth2063d35 (), veth026441a (), veth20a3e0f (), veth618dba5 (), veth3e95daf (), veth276899a (), vethd98a322 ()
IPv6 addresses	lo (::1/128), end0 (fe80::eedf:f2e2:84bf:fa71/64), docker0 (fe80::42:ffff:fef7:b888/64), hassio (fe80::42:9cff:fe19:2729/64), veth22f9c17 (fe80::bc2a:8ff:fe5f:a877/64), veth2063d35 (fe80::8ced:b6ff:fe83:8763/64), veth026441a (fe80::acb1:f9ff:fe07:65bc/64), veth20a3e0f (fe80::f85a:cff:fedd:5f3c/64), veth618dba5 (fe80::4484:47ff:fe87:a6cf/64), veth3e95daf (fe80::4ca6:95ff:fe1c:bb98/64), veth276899a (fe80::e437:b6ff:fe5e:e23c/64), vethd98a322 (fe80::f80c:94ff:feb8:b999/64)
Announce addresses	172.16.2.10, fe80::eedf:f2e2:84bf:fa71
Recorder
Oldest run start time	February 20, 2025 at 13:43
Current run start time	February 21, 2025 at 09:09
Estimated database size (MiB)	26.70 MiB
Database engine	mysql
Database version	10.11.6
Core metrics
Processor usage	0.1 %
Memory usage	6.4 %
Supervisor metrics
Processor usage	0 %
Memory usage	1.6 %

MaxK · February 21, 2025, 11:50am

Does it crash when running in safe mode?

General troubleshooting - Home Assistant.

florian.baethge · February 21, 2025, 5:03pm

I will try that

donbaq · February 22, 2025, 5:07pm

commit_interval for the recorder from 5 to 10

If you have a lot of sensors you should lower this number instead of raising it. You want less data written to disk at the same time, not more.
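
As a back-of-the-envelope illustration of that trade-off, a minimal Python sketch: with a steady stream of state changes, the commit interval sets how many rows land in each write. The rate of 50 events per second is a made-up assumption, not a measurement from this system, and both helpers are purely illustrative.

```python
def rows_per_commit(events_per_second: float, commit_interval_s: int) -> float:
    """Approximate number of rows buffered between two recorder commits."""
    return events_per_second * commit_interval_s


def commits_per_hour(commit_interval_s: int) -> float:
    """How often the recorder writes to disk per hour at a given interval."""
    return 3600 / commit_interval_s


# Assumption: a hypothetical steady load of 50 recorded events per second.
RATE = 50.0

for interval in (5, 10):
    print(
        f"commit_interval={interval}s: "
        f"~{rows_per_commit(RATE, interval):.0f} rows per commit, "
        f"{commits_per_hour(interval):.0f} commits per hour"
    )
```

Doubling the interval halves the number of commits but doubles the size of each one, which is the burst size the advice above is trying to keep small.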