Preventing or Handling a Restart with a Bad configuration.yaml

Feature Request
I’ll cut to the chase with a few different ideas to mitigate the possibility of a bad configuration.yaml preventing Home Assistant Core from starting:

Idea 1: If I have a bad configuration.yaml and Home Assistant Core errors out when restarting, fall back to the “last known good” config and use that to start up. Also save the bad config under an obvious filename and notify me that the problem occurred once things are running again.
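
In the meantime, a rough user-level approximation of this idea is possible today. The sketch below is mine, not an official feature, and the service name, alias, and “.last_good” file name are made up for the example: it copies configuration.yaml to a “last known good” file each time Core starts successfully, so there is at least a recent working copy to restore by hand.

```yaml
# Sketch only: keep a copy of the config that last produced a successful start.
# "save_last_good_config" and the .last_good file name are example names.
shell_command:
  save_last_good_config: "cp /config/configuration.yaml /config/configuration.yaml.last_good"

automation:
  - alias: "Save last known good configuration.yaml"
    trigger:
      - platform: homeassistant
        event: start
    action:
      - service: shell_command.save_last_good_config
```

(If your automations live in automations.yaml via an !include, the automation piece would go there instead of under a top-level automation: key.)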

Idea 2: If a bad config is currently saved, surface that as a persistent message and/or a sensor state, so I know that any restart (including a power loss) puts me at risk of being unable to boot Home Assistant Core.
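
As a stopgap along these lines, something like a command_line binary sensor could at least flag a configuration.yaml that no longer parses. This is only a sketch: it catches YAML syntax errors rather than doing full config validation (that’s what “ha core check” does), the sensor name and interval are examples, and depending on your Core version the command_line syntax may differ.

```yaml
# Sketch only: reads "OFF" when configuration.yaml fails to parse as YAML.
# Not equivalent to "ha core check", which validates the whole configuration.
command_line:
  - binary_sensor:
      name: "configuration_yaml_parses"
      command: "python3 -c 'import yaml; yaml.safe_load(open(\"/config/configuration.yaml\"))' && echo ON || echo OFF"
      payload_on: "ON"
      payload_off: "OFF"
      scan_interval: 1800  # re-check every 30 minutes
```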

Idea 3: Run “check configuration” before Home Assistant OS updates the same way it is available for Home Assistant Core updates. The same could be done for Supervisor, though I don’t believe a Supervisor update requires a Core restart, so that may not be necessary. Note: I use the update.home_assistant_[name] entities and an entity filter card on my dashboard to inform me of updates. I know there is a notification as well, but I’m uncertain whether the UI/UX is different in that case; I’ll update this thread when I can confirm.
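
For reference, the dashboard piece I mean is roughly the card below; the entity IDs are examples and will differ per install.

```yaml
# Sketch of an entity-filter card that only shows updates in the "on" (pending) state.
# Entity IDs are examples; substitute the update.* entities from your own system.
type: entity-filter
entities:
  - update.home_assistant_core_update
  - update.home_assistant_operating_system_update
  - update.home_assistant_supervisor_update
state_filter:
  - "on"
card:
  type: entities
  title: Pending updates
```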

Problem Statement
Yesterday, I started an edit to configuration.yaml but decided against the change. By happenstance, I was using a different editor (the VS Code Community Addon), which I believe auto-saved the file without my realizing it.

Later that day … I saw there was a Home Assistant OS update and I kicked that off.

Because of the bad configuration file, Home Assistant Core could not restart after the OS update.

Current State: Steps I took to resolve
I recently switched to a Home Assistant Yellow, so when Core didn’t come back after the OS update, I initially thought the OS update itself had failed.

I do have backups, but I didn’t want to jump the gun and make things worse. I started by watching the LEDs (see the Home Assistant Yellow Guide) and interpreted that I had:

  • Red = On
  • Yellow = Steady cadence of two blinks (which I assume is the heartbeat)
  • Green = Steady cadence of single blinks (problem), plus a series of blinks (3-5, I couldn’t be sure) once every several minutes; the Raspberry Pi Documentation - Configuration page gave me some idea of what these meant

Searching for problems with the OS update, I came across one thread with reports of failures and the need to restore the OS image. Another thread suggested checking the observer page at http://x.x.x.x:4357/ to see if the OS was really down. Mine showed Connected/Supported/Healthy, which was intriguing.

I then brought a keyboard and monitor over to my Home Assistant Yellow, only to promptly realize I needed to connect from a laptop instead. These instructions for Linux/Mac were helpful: Home Assistant Yellow Guide. I first tried my Mac following the Linux instructions, but quickly discovered this was a case where they weren’t going to translate.

I flipped over to a Windows laptop and was briefly held up by a missing driver. I could identify the COM port anyway, which led to some lost time trying to connect without the necessary driver. Once I sorted that out, I was in the terminal.

I tried the suggested commands “ha supervisor logs” and “ha network info”. The supervisor logs ran from boot, and I couldn’t clearly see the failure point; I did spot a “core did not boot, restarting” warning, but nothing else that was clear to me.

I then explored the available commands and ran “ha core check”. This took a bit to run, then kicked back an error with the exact line of bad config … I immediately thought, “Oh, THAT? That was the thing I backed out of this morning! Shoot, I bet it saved and I didn’t realize it.”

I knew Core runs in a container, but I’m less familiar with container structures, so after some feeble cd and ls attempts I googled some more and found the exact instructions I needed to open the file in the terminal and remove the bad config: Edit configuration.yaml with Hass.io CLI - #10 by SmartHomeGuy

One “ha core restart” later and I was back up and running. Success. In complete fairness, I made mistakes: I had never verified my SSH/terminal option beforehand (I’ll improve that this evening), and I could take a more immutable approach (restore from backup rather than repair in place) so I don’t resist that option in the event of a failure.

Regardless, I feel other users could find themselves in a similar situation, with a bad config preventing Core from starting, and any of the ideas above (or others someone may have) to further mitigate that would likely be a win.