I’ve been encountering a recurring problem with my Home Assistant setup, and I’m at my wit’s end trying to solve it. Every 1 or 2 weeks, my Home Assistant machine suddenly stops responding, and upon attempting to restart, I encounter an error stating that there is no configuration.yaml file.
Here’s a quick rundown of the steps I’ve taken to troubleshoot the issue:
Hardware Swaps: I’ve replaced both the machine itself (now a new RPi 4 with 8GB of RAM) and the NVMe card, in case faulty hardware was causing the problem.
Power Supply Change: I’ve also switched to an official Argon One power supply for my M.2 board, ruling out potential power supply issues.
Despite these efforts, the problem persists, and I’m left scratching my head. I’ve scoured forums and documentation but haven’t found a solution that addresses this specific issue.
If anyone has encountered a similar problem or has any insights into what might be causing this configuration.yaml disappearance glitch, I would greatly appreciate your help and expertise. Any suggestions for further troubleshooting steps or potential fixes would be invaluable.
Thank you in advance for any assistance you can provide!
I am not sure the power supply is the issue: this happened with two original power supplies. I also haven’t found any correlation in the logs… any suggestions on how to find one?
Yes!
I am running on a new card, with a new NVMe drive, on a fresh install restored from a backup…
I will wait for the next time it fails and have a proper look into the whole .log file.
My Home Assistant instance recently broke down completely and became inaccessible. Previous occurrences were glitchy but still allowed access to the app (restarting was enough to “fix” it).
In the home-assistant.log.1 file, I found the following entries:
2024-04-09 22:33:26.096 ERROR (MainThread) [homeassistant.components.wiz] Error fetching LLamp data: Failed to update device at 192.168.1.185
2024-04-09 23:08:37.073 WARNING (MainThread) [aioesphomeapi.connection] sensor-multi-mini-02 @ 192.168.20.33: Connection error occurred: [Errno 104] Connection reset by peer
2024-04-10 00:00:01.993 ERROR (MainThread) [homeassistant.components.speedtestdotnet.coordinator] Error fetching speedtestdotnet data: Unable to connect to servers to test latency.
2024-04-10 00:08:57.303 WARNING (SyncWorker_34) [homeassistant.helpers.frame] Detected that custom integration 'dwains_dashboard' uses deprecated 'SafeLineLoader' instead of 'PythonSafeLoader'...
2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546789750464] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.
2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546823375296] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.
These entries suggest a variety of issues including connection errors, deprecated integrations, and high system load warnings (which is odd), particularly the “Client unable to keep up with pending messages” errors. But there is nothing reporting a crash… could this mean that the SSD was inaccessible (so no log was written)?
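To make patterns like these easier to spot, here is a minimal Python sketch that counts WARNING and ERROR entries per logger. It assumes every entry follows the layout of the lines quoted above (timestamp, level, thread in parentheses, logger in square brackets, then the message); the names `LOG_LINE` and `summarize` are just illustrative, not part of Home Assistant.

```python
import re
from collections import Counter

# Layout assumed from the entries quoted above:
# "<date> <time> <LEVEL> (<thread>) [<logger>] <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) "
    r"\[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count (level, logger) pairs for WARNING and ERROR entries."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not match (e.g. traceback continuations) are skipped.
        if match and match["level"] in ("WARNING", "ERROR"):
            counts[(match["level"], match["logger"])] += 1
    return counts


# Three of the entries from the post, used as sample input.
sample = [
    "2024-04-09 23:08:37.073 WARNING (MainThread) [aioesphomeapi.connection] sensor-multi-mini-02 @ 192.168.20.33: Connection error occurred: [Errno 104] Connection reset by peer",
    "2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546789750464] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
    "2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546823375296] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
]

for (level, logger), count in summarize(sample).most_common():
    print(f"{count:3d}  {level:7s}  {logger}")
```

To run it on a real file, pass an open file handle instead of `sample`, e.g. `summarize(open(path))`, where `path` points at wherever your `home-assistant.log.1` lives (the location depends on the install, so that part is an assumption).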
The deprecated integration, dwains_dashboard, should be addressed. I would test removing it to see if that makes the system more stable.
The M.2 SSD can draw a peak of ~6 watts. The RPi4 USB can handle ~1.2 A max (~6 watts at 5 V). So I’m still leery of the power draw of an M.2 SSD connected directly to the RPi4. It does not matter how big the power supply feeding the RPi4 is: the RPi4 has its own limits.
Have you tried using a powered USB hub and plugging your M.2 SSD into that? (RPi4 → powered USB hub → M.2 SSD).
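For what it’s worth, the arithmetic behind that concern can be sketched in a few lines of Python. The 1.2 A and ~6 W figures are the approximate numbers from this post, not measurements, and 5 V is the nominal USB bus voltage; the constant names are made up for illustration.

```python
def watts(volts, amps):
    """Electrical power in watts: P = V * I."""
    return volts * amps


USB_VOLTS = 5.0        # nominal USB bus voltage
USB_MAX_AMPS = 1.2     # approximate RPi4 USB current limit quoted above
SSD_PEAK_WATTS = 6.0   # approximate M.2 SSD peak draw quoted above

usb_budget = watts(USB_VOLTS, USB_MAX_AMPS)
headroom = usb_budget - SSD_PEAK_WATTS

print(f"USB budget: {usb_budget:.1f} W")
print(f"Headroom at SSD peak: {headroom:.1f} W")
```

With these figures the peak draw uses up essentially the whole USB budget, leaving no margin for anything else plugged into the ports, which is why a powered hub is worth testing.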
There is no guarantee that a log file entry will be written during a system crash. I have found that there usually is nothing in the log file related to the actual system crash event. But I look for clues prior to the crash.
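One way to look for those clues is to pull out only the entries written shortly before the log stops. Below is a rough Python sketch, assuming each entry starts with a timestamp in the same format as the lines quoted earlier in the thread; `entries_before_end` and the 120-minute window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the log lines quoted in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 23  # len("2024-04-10 03:17:37.206")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def entries_before_end(lines, minutes=120):
    """Return the timestamped lines from the last `minutes` of the log."""
    stamped = [(parse_ts(line), line) for line in lines]
    # Lines without a timestamp (e.g. traceback continuations) are dropped.
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    if not stamped:
        return []
    last = max(ts for ts, _ in stamped)
    cutoff = last - timedelta(minutes=minutes)
    return [line for ts, line in stamped if ts >= cutoff]


# Three of the entries posted earlier in the thread, used as sample input.
sample = [
    "2024-04-09 22:33:26.096 ERROR (MainThread) [homeassistant.components.wiz] Error fetching LLamp data: Failed to update device at 192.168.1.185",
    "2024-04-10 01:47:36.279 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546789750464] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
    "2024-04-10 03:17:37.206 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [546823375296] FRieck from 192.168.1.154: Client unable to keep up with pending messages. Stayed over 1024 for 5 seconds.",
]

for line in entries_before_end(sample, minutes=120):
    print(line)
```

The timestamp of the last entry is also useful on its own: comparing it with the time you noticed the system was down gives a rough idea of when logging stopped.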
Also, when there is a system crash, there is a high probability of file system corruption. So getting the system restarted does not necessarily mean the corruption has been fixed. Do some system checks: