Help Building Resilience?

My internet connection is not always reliable, and on top of that all of the networking equipment here reboots nightly. I have cron jobs that reboot my HA hardware shortly after each scheduled network hardware reboot completes, for any hardware that would affect HA's connection (hard-wired ethernet to 1 Gig fiber service).

However, sometimes the internet goes down at random times. An outage of just 5 seconds does not seem to affect HA, but if it lasts much longer, my HA installation is unable to reconnect properly: it has access to the network, but many of my integrations have trouble re-establishing their connectivity.

I came up with the idea of a conditional reboot: once a tail of my syslog contains a line with the phrase “eth0: Link is Down”, reboot only after the word “ERROR” has then appeared 100 times in that log. I can noodle through how to do this within HA, although this log-scraping approach (using tail with command line sensors) might not be the best way to handle it. Does anyone else have any ideas, or is there an integration available that helps with this?
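
For illustration, here is a minimal Python sketch of that rule, not a tested solution. The function name `reboot_condition_met` and the choice to restart the count at every new link-down line are assumptions, and it counts lines containing "ERROR" rather than individual occurrences of the word.

```python
def reboot_condition_met(lines, link_marker="eth0: Link is Down",
                         error_word="ERROR", threshold=100):
    """Return True once `threshold` ERROR lines follow the latest link-down line."""
    errors_since_down = None  # None means no link-down line has been seen yet
    for line in lines:
        if link_marker in line:
            errors_since_down = 0  # assumption: each link-down restarts the count
        elif errors_since_down is not None and error_word in line:
            errors_since_down += 1
    return errors_since_down is not None and errors_since_down >= threshold
```

The function only decides; it does not reboot anything. The caller would feed it the tail of the syslog and act on the result.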

My setup (RPI 4b w/ 8G of Ram booted off a 1TB SSD on fully supported clean installation of HA Supervised on 64bit Debian) is pasted here:

System Information

version core-2022.11.2
installation_type Home Assistant Supervised
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.10.7
os_name Linux
os_version 5.10.0-19-arm64
arch aarch64
timezone America/New_York
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4972
Installed Version 1.28.3
Stage running
Available Repositories 1210
Downloaded Repositories 22
AccuWeather
can_reach_server ok
remaining_requests 46
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Debian GNU/Linux 11 (bullseye)
update_channel stable
supervisor_version supervisor-2022.11.dev0401
agent_version 1.4.1
docker_version 20.10.21
disk_total 915.4 GB
disk_used 13.6 GB
healthy true
supported true
supervisor_api ok
version_api ok
installed_addons Log Viewer (0.14.0), Samba share (10.0.0), Home Assistant Google Drive Backup (0.109.1), File editor (5.4.2), Duck DNS (1.15.0), AdGuard Home (4.7.5), Terminal & SSH (9.6.1), Core DNS Override (0.1.1), Mosquitto broker (6.1.3), AppDaemon (0.10.0)
Dashboards
dashboards 4
resources 15
views 25
mode storage
Recorder
oldest_recorder_run October 10, 2022 at 9:44 AM
current_recorder_run November 9, 2022 at 9:14 PM
estimated_db_size 1448.00 MiB
database_engine sqlite
database_version 3.38.5

To be honest, I think you might be focusing on the wrong problem here…

But first a clarification: When you say “internet”, are you talking about a connection to the outside world or your local network? And if it’s a local network, do you have control over this network (its configuration and devices)? It’s a best practice to keep a home automation system as local as possible. The internet connection then would be mostly about remote access and location tracking.

With a properly functioning local network, your devices should have no trouble reconnecting to HA (although there have been specific bugs in some integrations and devices).

If you need to restart all your network gear, and then the computers that are attached to it, every day, then you need to fix your network gear.

It is all my own network, and it is total overkill. We had a flood where we almost lost everything until I repaired a broken sump pump while standing in the basement in knee-deep water. A week later my boss called and asked me why I had applied for unemployment (meaning someone got my SS# and tried to get benefits in my name). Something changed in my mind (maybe I lost it!) when I yelled upstairs to the wife “this will never ever ever ever happen again”.

So there are quick-release brackets holding in a new sump pump, with a water line leading into the pit for testing purposes, and a spare pump ready on a shelf nearby that I test every 3 months. There are sensors under every single sink, toilet, tub, washing machine, dishwasher, dryer, hot water heater and furnace. Then there is enterprise-level networking equipment running with two separate VPN providers, whose headquarters are in countries that are not members of the nine-eyes group. I use them as a VPN client, rotating randomly between the two and selecting locations like Zurich, Romania or Spain. I then changed every single password I use to 30 digits.

All the equipment is rebooted in a specific order daily in the middle of the night, on purpose, after the configs are all automatically backed up. The RPI is rebooted automatically by a cron job a few minutes after any piece of equipment whose reboot causes it an issue.

My main flaw in hardware and software selection is that the leak, door, window and garage door sensors are all YoLink, which relies upon the cloud. If they do not come out with a hub that does not rely upon the cloud (which they say they are working on), I will slowly replace each sensor with one that does not rely upon the cloud.

Hope that helps you understand my setup. Just this morning I was able to verify that the RPI crontab jobs are rebooting the RPI at just the right time to gracefully handle all the other equipment reboots without any issue.

Lastly, what I was asking was whether, on top of all of that, if say the network goes down because of an ISP issue and then comes back up, there is a way to use something in the logs to determine a good time to reboot the RPI automatically.

No, actually that is by choice - see my other response which explains my setup in more detail. I was looking for a way that my RPI would be able to intelligently reboot only when needed. That would have to be done carefully, in that I have to be able to completely avoid any flawed logic which would make it reboot endlessly due to some bug or weird condition!
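
One way to guard against the endless-reboot case is to record each automatic reboot and refuse another one if it would come too soon. This is a rough Python sketch under assumed limits (at most 2 automatic reboots per 24 hours, at least an hour apart). The state file, the function names and the limits are all hypothetical.

```python
import json
import time
from pathlib import Path

DAY_S = 24 * 3600

def _load_history(state_file):
    """Return the list of recorded reboot timestamps (seconds since epoch)."""
    try:
        return json.loads(Path(state_file).read_text())
    except FileNotFoundError:
        return []  # first run: no automatic reboots recorded yet
    # A corrupt file raises ValueError on purpose, so the caller does not reboot.

def reboot_allowed(state_file, now=None, min_gap_s=3600, max_per_day=2):
    """True only if the recent reboot history leaves room for one more."""
    now = time.time() if now is None else now
    recent = [t for t in _load_history(state_file) if now - t < DAY_S]
    if len(recent) >= max_per_day:
        return False
    return all(now - t >= min_gap_s for t in recent)

def record_reboot(state_file, now=None):
    """Append the current time to the history, dropping entries older than a day."""
    now = time.time() if now is None else now
    history = [t for t in _load_history(state_file) if now - t < DAY_S]
    history.append(now)
    Path(state_file).write_text(json.dumps(history))
```

The idea is to call `reboot_allowed` right before any automatic reboot and `record_reboot` only when one is actually triggered, so a stuck condition can cause a couple of reboots but not a loop.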

I do have 4 backups of the RPI as well, one a hot standby SSD as well as three others on MicroSD cards for good measure. Every backup is done such that there is nothing running on the RPI other than the OS when it is booted up - so I can do whatever I would need to that storage before enabling that automated start up of all the other processes (such as weewx for my weather station etc) as well as HA before a final reboot to use that storage as my “Prod”.

Right now I can live with the scheduled reboots to stay flawlessly working with the other network equipment, but I was thinking it would be great if it could reboot at the correct time if say we lost power to the house and that came back on, or if the ISP had an outage that was then fixed, etc.

It should never need rebooting. I don’t understand why it would.

Because when you are running a lot of things from many vendors, you have an environment where sometimes things do not reconnect 100% properly when the internet goes down and comes back up. You might run into something that has a memory leak, and sometimes on-the-fly configuration changes or firmware updates require reboots. Booting up clean also ensures that if you ever do have a power outage, you know everything will come back properly when the power returns. If you don’t reboot something for several years, how do you know it will come back up properly when there is some kind of an outage?

Oh I accept that, maybe I have lost sight of what you are trying to do and/or why you need to do it.

I was considering watchdog changes but in formulating my answer to you I’ve decided to simply automatically reboot when there are > 200 instances of “ERROR” in the syslog within the last 15 minutes which would both allow the resiliency while giving me ample time to intervene if needed. (Able to be away while internet down for 2 hours then comes back while I’m not around, etc…)
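
A sketch of that threshold in Python, for illustration only. It assumes the traditional syslog timestamp at the start of each line (for example `Nov 12 04:54:01`, with no year) and English month names; the function names are made up here.

```python
from datetime import datetime, timedelta

def parse_syslog_time(line, now):
    """Parse a leading 'Nov 12 04:54:01' style timestamp; None if there is none."""
    # Syslog lines carry no year, so try the current year, then the previous
    # one (covers lines written just before a New Year rollover).
    for year in (now.year, now.year - 1):
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if ts <= now + timedelta(days=1):
            return ts
    return None

def errors_in_window(lines, now, window=timedelta(minutes=15), word="ERROR"):
    """Count lines containing `word` whose timestamp is within `window` of `now`."""
    cutoff = now - window
    count = 0
    for line in lines:
        if word not in line:
            continue
        ts = parse_syslog_time(line, now)
        if ts is not None and ts >= cutoff:
            count += 1
    return count

def too_many_errors(lines, now, threshold=200):
    """True when more than `threshold` ERROR lines fall inside the window."""
    return errors_in_window(lines, now) > threshold
```

Lines without a parseable timestamp are skipped rather than counted, so a damaged log section cannot trigger the condition by itself.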

I don’t know how to put this differently to you: You should focus on fixing the underlying issues and report bugs where necessary. You’re creating a rabbit hole with your approach. Soon, to mention but one issue, you will have to “improve” your automations by patching them to cater for startup conditions: things that will trigger due to unknown, unavailable or other state changes. Frequently rebooting any system or restarting a service simply isn’t a good idea.

I know this might seem harsh to you, but that’s really not the intent. If you can detail specific issues without the superfluous chatter, others and I can try to help you with the real issues as best as we can.

I agree 100%. The OP has serious network issues that need to be resolved. Losing internet should not require a reboot of all network hardware.

My Dream Machine Pro has been online for 7 weeks and was only shut down at that point because I was moving things around in my rack.

Exactly. And it could be as simple as replacing a bad power supply, but I need to rein myself in and not speculate more without some hard info. Electricity fluctuations can cause all kinds of havoc.

Fair enough, I am game. But it’s a can of worms! Ok, let’s open it! Currently there are no syslog errors occurring. First, here is my setup (below). Second, I will next post the syslog errors that are appearing 5 minutes AFTER I complete step #4 of this test: 1. unplug the ethernet cable from my RPI, 2. wait 5 minutes, 3. plug it back in, 4. wait 5 minutes!

System Information

version core-2022.11.2
installation_type Home Assistant Supervised
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.10.7
os_name Linux
os_version 5.10.0-19-arm64
arch aarch64
timezone America/New_York
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4863
Installed Version 1.28.3
Stage running
Available Repositories 1210
Downloaded Repositories 22
AccuWeather
can_reach_server ok
remaining_requests 29
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Debian GNU/Linux 11 (bullseye)
update_channel stable
supervisor_version supervisor-2022.11.dev0401
agent_version 1.4.1
docker_version 20.10.21
disk_total 915.4 GB
disk_used 14.8 GB
healthy true
supported true
supervisor_api ok
version_api ok
installed_addons Log Viewer (0.14.0), Samba share (10.0.0), Home Assistant Google Drive Backup (0.109.1), File editor (5.4.2), Duck DNS (1.15.0), AdGuard Home (4.7.5), Terminal & SSH (9.6.1), Core DNS Override (0.1.1), Mosquitto broker (6.1.3), AppDaemon (0.10.0)
Dashboards
dashboards 4
resources 15
views 25
mode storage
Recorder
oldest_recorder_run October 13, 2022 at 9:44 AM
current_recorder_run November 12, 2022 at 4:54 AM
estimated_db_size 1448.00 MiB
database_engine sqlite
database_version 3.38.5

You should be able to use different methods to handle your routines.
The easiest one is probably ping sensors targeting the first hop outside your network and, if the ISP uses CGNAT, also the first hop outside that.
A ping to Google (8.8.8.8), Cloudflare (1.1.1.1) or another well-known, always-online host might be an extra step to take.

You can also add sensors from your network gear here, since your router should provide the status of its ports, including the WAN port.
SNMP might be an easy way to extract this info.

An extra safety check could be relying on UptimeRobot (Integrations & Notifications).

Update: As soon as I unplugged it, it must have locked up. First, after waiting longer than the 5 minutes in #4, it was completely inaccessible, so I had to pull the power and then plug it back in to reboot it. Then, when checking the logs, there was absolutely nothing in the log from the moment I unplugged the ethernet cable. When HA finally started up after the plug-pulling reboot, everything was working fine except for some command line sensors that look at the syslog. In the log itself appears “binary match found”, which means a command line sensor call to grep the syslog thought the log was binary. Deleting the log, so that any corruption in it was gone, resolved that issue. So I had to delete the syslog and restart the RPI (properly, with a shutdown -r command), and only then did everything come back normally.

My hard-coded (crontab) scheduled reboots, with their logged and emailed reporting, are working perfectly, but over and above that, if there is any issue, my RPI is vulnerable in that way… I am surprised watchdog didn’t make it reboot. Anyway, so my idea of log scraping won’t work either in a case like this… UGH
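
On the binary-log symptom: GNU grep has an `-a` (`--text`) option that treats a file as text even when it contains stray bytes, which may avoid having to delete the log. The same tolerance can be sketched in Python by reading the file as raw bytes. The function name below is hypothetical and the sketch is untested against a real damaged syslog.

```python
def count_matching_lines(path, needle):
    """Count lines containing `needle`, tolerating NUL or invalid bytes."""
    needle_bytes = needle.encode("utf-8")
    count = 0
    # Binary mode means no decoding step, so a damaged region (for example NUL
    # bytes left behind by an unclean power-off) cannot raise a decode error.
    with open(path, "rb") as log:
        for raw_line in log:
            if needle_bytes in raw_line:
                count += 1
    return count
```

Calling `count_matching_lines("/var/log/syslog", "ERROR")` would then return a plain number for a sensor to report (the path is the usual Debian location, assumed here).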

Maybe my 200 ERROR count idea PLUS something else. Jeez

Thoughts?

The HA log from the previous session is renamed to home-assistant.log.1 on startup, so you need to actually go into the folder and read it with a text editor, or copy it out to open on another computer.

I had no idea HA did this. Since when?

From my very vague recollection, 6 months -ish.

Edit: I think since Aug 2021. See “Change logging to do rollover() instead of rotate()” by janiversen, Pull Request #55177 in home-assistant/core on GitHub.

Yes, the log in the home assistant directory will be rolled, but if you view it with journalctl -f homeassistant (e.g. on Debian) it will be continuous.

@KruseLuds let’s try to get HA out of the mix for a moment to get rid of confounding issues. Make a job that pings e.g. Google once a minute and pipes the output to a file. Or, just open a terminal, run a continuous ping and break your network again. If the ping doesn’t come back, we know this issue has nothing to do with HA directly.
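
As a rough illustration of such a job, here is a small Python sketch that pings a host at a fixed interval and appends one line per attempt to a file. It assumes the Linux iputils `ping` (`-c` for count, `-W` for the timeout in seconds); the function names and the log format are made up for this example.

```python
import os
import subprocess
import time
from datetime import datetime

def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def log_pings(host, log_path, rounds, interval_s=60,
              probe=ping_once, sleep=time.sleep):
    """Probe `host` `rounds` times and append a timestamped up/DOWN line each time."""
    with open(log_path, "a") as log:
        for i in range(rounds):
            status = "up" if probe(host) else "DOWN"
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {host} {status}\n")
            # Flush and fsync so the line has the best chance of surviving
            # a hard power-off, which is the failure being investigated.
            log.flush()
            os.fsync(log.fileno())
            if i < rounds - 1:
                sleep(interval_s)
```

For example, `log_pings("8.8.8.8", "/tmp/ping.log", rounds=30)` would cover half an hour around the cable pull (the log path is an arbitrary choice). A gap in the timestamps would suggest the machine itself stalled, while a run of DOWN lines would point at the network.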