Persistent Home Assistant Failure: Configuration.yaml Disappearing Issue

frieck · March 28, 2024, 12:34pm

Hey everyone,

I’ve been encountering a recurring problem with my Home Assistant setup, and I’m at my wit’s end trying to solve it. Every 1 or 2 weeks, my Home Assistant machine suddenly stops responding, and upon attempting to restart, I encounter an error stating that there is no configuration.yaml file.

Here’s a quick rundown of the steps I’ve taken to troubleshoot the issue:

Hardware Swaps: I’ve switched out both the machine itself (now using a new RPi 4 with 8GB of RAM) and the NVMe card, thinking it might be a hardware issue causing the problem.
Power Supply Change: I’ve also switched to an official Argon One power supply for my M.2 board, ruling out potential power supply issues.

Despite these efforts, the problem persists, and I’m left scratching my head. I’ve scoured forums and documentation but haven’t found a solution that addresses this specific issue.

If anyone has encountered a similar problem or has any insights into what might be causing this configuration.yaml disappearance glitch, I would greatly appreciate your help and expertise. Any suggestions for further troubleshooting steps or potential fixes would be invaluable.

Thank you in advance for any assistance you can provide!

MaxK · March 28, 2024, 12:46pm

Some additional steps you could try:

francisp · March 28, 2024, 1:21pm

Running HA OS or running HA Container (docker) ?

If running Container, check you created a persistent volume.

frieck · April 1, 2024, 2:26pm

I am not sure the power supply is the issue: this happened with two original power supplies. I also still haven’t found any correlation in the logs… any suggestions on how to find one?

frieck · April 1, 2024, 2:27pm

I am running HA OS.

MaxK · April 1, 2024, 2:33pm

Have you tried a new card, fresh install, and recover from a backup (step 4 in the guide)?

frieck · April 5, 2024, 7:31pm

Yes!
I am running on a new card, with a new NVMe drive, on a fresh install restored from a backup…
I will wait for the next time it fails and have a proper look into the whole .log file.

frieck · April 10, 2024, 12:41pm

My Home Assistant instance recently experienced a complete breakdown, leaving it completely inaccessible. Previous occurrences were glitchy but still allowed access to the app (restarting was enough to “fix” it).

In the home-assistant.log.1 file, I found the following entries:

2024-04-09 22:33:26.096 ERROR (MainThread) [homeassistant.components.wiz] Error fetching LLamp data: Failed to update device at 192.168.1.185
2024-04-09 23:08:37.073 WARNING (MainThread) [aioesphomeapi.connection] sensor-multi-mini-02 @ 192.168.20.33: Connection error occurred: [Errno 104] Connection reset by peer
2024-04-10 00:00:01.993 ERROR (MainThread) [homeassistant.components.speedtestdotnet.coordinator] Error fetching speedtestdotnet data: Unable to connect to servers to test latency.
2024-04-10 00:08:57.303 WARNING (SyncWorker_34) [homeassistant.helpers.frame] Detected that custom integration 'dwains_dashboard' uses deprecated 'SafeLineLoader' instead of 'PythonSafeLoader'...
2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546789750464] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.
2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546823375296] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.

These entries suggest a variety of issues including connection errors, deprecated integrations, and high system load warnings (which is odd), particularly the “Client unable to keep up with pending messages” errors. But there is nothing reporting a crash… could this mean that the SSD was inaccessible (so no log was written)?

Any suggestion on how to mitigate this?
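
One way to get an overview of a log like the excerpt above is to count entries per level and logger, so recurring errors stand out from one-off noise. The following is a minimal sketch, not an official Home Assistant tool: the regular expression only assumes the line layout visible in the excerpt, and the names `LINE_RE`, `SAMPLE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt:
# "<date> <time> <LEVEL> (<thread>) [<logger>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) "
    r"\[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)

# Three lines copied from the log excerpt in the post.
SAMPLE = [
    "2024-04-09 23:08:37.073 WARNING (MainThread) [aioesphomeapi.connection] sensor-multi-mini-02 @ 192.168.20.33: Connection error occurred: [Errno 104] Connection reset by peer",
    "2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546789750464] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
    "2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546823375296] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
]


def summarize(lines):
    """Count entries per (level, logger); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["logger"])] += 1
    return counts


def summarize_file(path):
    """Same as summarize(), but reads the lines from a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)


# Print the most frequent (level, logger) pairs first.
for (level, logger), count in summarize(SAMPLE).most_common():
    print(f"{count:5d}  {level:8s} {logger}")
```

To run it against the real file, call `summarize_file("home-assistant.log.1")` with whatever path the log has on your system.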

MaxK · April 10, 2024, 1:31pm

The answer to that is here: Client unable to keep up with pending messages · Issue #68030 · home-assistant/core · GitHub

The deprecated integration, dwains_dashboard, should be addressed. I would test removing it to see if that makes the system more stable.

The M.2 SSD can draw a peak of ~6 watts. The RPi4 USB can handle ~1.2 Amps max (~6 watts). So, I’m still leery of the power draw of an M.2 SSD connected directly to the RPi4. It does not matter how big the power supply powering the RPi4 is: the RPi4 has its own limits.
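
The arithmetic behind that concern, as a small sketch. The 1.2 A and 6 W figures are the ones quoted in this post and are not verified here; 5 V is the nominal USB bus voltage, and the function names are made up for illustration.

```python
USB_VOLTAGE_V = 5.0       # nominal USB bus voltage
USB_MAX_CURRENT_A = 1.2   # figure quoted in the post, not verified here
SSD_PEAK_W = 6.0          # figure quoted in the post, not verified here


def usb_budget_watts(volts, amps):
    """Power available from the port: P = V * I."""
    return volts * amps


def headroom_watts(budget_w, peak_draw_w):
    """Margin left when the device draws its peak; zero or negative means none."""
    return budget_w - peak_draw_w


budget = usb_budget_watts(USB_VOLTAGE_V, USB_MAX_CURRENT_A)
margin = headroom_watts(budget, SSD_PEAK_W)
print(f"USB budget: {budget:.1f} W, SSD peak: {SSD_PEAK_W:.1f} W, margin: {margin:.1f} W")
```

With these numbers the margin is zero, which is why a peak draw from the SSD could be a problem regardless of how large the power supply feeding the RPi4 is.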

Have you tried using a powered USB hub and plugging your M.2 SSD into that? (RPi4 → powered USB hub → M.2 SSD).

There is no guarantee that a log file entry will be written during a system crash. I have found that there usually is nothing in the log file related to the actual system crash event. But I look for clues prior to the crash.
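
To look at the clues prior to a crash, it can help to pull out just the entries from the last couple of hours before the log goes silent. A minimal sketch, assuming every entry starts with a timestamp such as `2024-04-10 03:17:37.206`; the function names and the 120-minute default window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 23  # length of "2024-04-10 03:17:37.206"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def entries_before_silence(lines, window_minutes=120):
    """Return the timestamped lines within window_minutes of the last entry."""
    stamped = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            stamped.append((ts, line.rstrip("\n")))
    if not stamped:
        return []
    cutoff = stamped[-1][0] - timedelta(minutes=window_minutes)
    return [text for ts, text in stamped if ts >= cutoff]


# Shortened sample lines using timestamps and loggers from the thread.
sample = [
    "2024-04-10 00:08:57.303 WARNING (SyncWorker_34) [homeassistant.helpers.frame] ...",
    "2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] ...",
    "2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] ...",
]
for entry in entries_before_silence(sample):
    print(entry)
```

Lines without a leading timestamp, such as traceback continuation lines, are dropped by this sketch, so check the raw file as well if the window looks suspiciously empty.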

Also when there is a system crash, there is a high probability of file system corruption. So, getting the system restarted may not mean that you have corrected the corruption. Do some system checks:

ha core check

ha core rebuild

Or rebuild fresh and restore from a good backup.

frieck · April 10, 2024, 2:35pm

I am going to remove Dwains Dashboard… let’s see how it goes.