Help Building Resilience?

KruseLuds · November 10, 2022, 3:12am

I have a situation where the internet is not always so reliable in addition to nightly reboots of all the networking equipment here. I have cron jobs rebooting my HA hardware (shortly after each scheduled network hardware reboot is completed - for any hardware which would affect the HA (hard wired ethernet to 1Gig Fiber service) connection).

However, sometimes the internet goes down at random times - sometimes for just 5 seconds, which does not seem to affect HA… but if the outage lasts much longer, my HA installation is unable to reconnect properly: it has access to the network, but many of my integrations have trouble re-establishing their connectivity.

I came up with the idea of rebooting based on the syslog: first a line with the phrase “eth0: Link is Down” has to appear in a tail of the syslog, and then the reboot happens only after the word “ERROR” appears 100 times in said log. I can noodle through how to do this within HA, although the log scraping approach above (using tail with command line sensors) might not be the best way to handle it. Does anyone else have any ideas, or is there an integration available that helps with this?
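The rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the lines are assumed to come from something like a tail of the syslog, and a newer “eth0: Link is Down” line is assumed to restart the count.

```python
from typing import Iterable

LINK_DOWN = "eth0: Link is Down"  # phrase quoted in the post
ERROR_WORD = "ERROR"
THRESHOLD = 100                   # count proposed in the post


def errors_since_link_down(lines: Iterable[str]) -> int:
    """Count lines containing ERROR that follow the most recent link-down line.

    Returns 0 when no link-down line is present at all.
    """
    count = 0
    seen_link_down = False
    for line in lines:
        if LINK_DOWN in line:
            # Assumption: a newer link-down event restarts the count.
            seen_link_down = True
            count = 0
        elif seen_link_down and ERROR_WORD in line:
            count += 1
    return count


def should_reboot(lines: Iterable[str], threshold: int = THRESHOLD) -> bool:
    """True once the ERROR count after the last link-down reaches the threshold."""
    return errors_since_link_down(lines) >= threshold
```

Errors logged before the link went down are ignored on purpose, so ordinary background errors alone never trigger a reboot.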

My setup (RPI 4b w/ 8G of Ram booted off a 1TB SSD on fully supported clean installation of HA Supervised on 64bit Debian) is pasted here:

System Information

version	core-2022.11.2
installation_type	Home Assistant Supervised
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.10.7
os_name	Linux
os_version	5.10.0-19-arm64
arch	aarch64
timezone	America/New_York
config_dir	/config

Home Assistant Community Store

GitHub API	ok
GitHub Content	ok
GitHub Web	ok
GitHub API Calls Remaining	4972
Installed Version	1.28.3
Stage	running
Available Repositories	1210
Downloaded Repositories	22

AccuWeather

can_reach_server	ok
remaining_requests	46

Home Assistant Cloud

logged_in	false
can_reach_cert_server	ok
can_reach_cloud_auth	ok
can_reach_cloud	ok

Home Assistant Supervisor

host_os	Debian GNU/Linux 11 (bullseye)
update_channel	stable
supervisor_version	supervisor-2022.11.dev0401
agent_version	1.4.1
docker_version	20.10.21
disk_total	915.4 GB
disk_used	13.6 GB
healthy	true
supported	true
supervisor_api	ok
version_api	ok
installed_addons	Log Viewer (0.14.0), Samba share (10.0.0), Home Assistant Google Drive Backup (0.109.1), File editor (5.4.2), Duck DNS (1.15.0), AdGuard Home (4.7.5), Terminal & SSH (9.6.1), Core DNS Override (0.1.1), Mosquitto broker (6.1.3), AppDaemon (0.10.0)

Dashboards

dashboards	4
resources	15
views	25
mode	storage

Recorder

oldest_recorder_run	October 10, 2022 at 9:44 AM
current_recorder_run	November 9, 2022 at 9:14 PM
estimated_db_size	1448.00 MiB
database_engine	sqlite
database_version	3.38.5

parautenbach · November 10, 2022, 5:45am

To be honest, I think you might be focusing on the wrong problem here…

But first a clarification: When you say “internet”, are you talking about a connection to the outside world or your local network? And if it’s a local network, do you have control over this network (its configuration and devices)? It’s a best practice to keep a home automation system as local as possible. The internet connection would then be mostly about remote access and location tracking.

With a properly functioning local network your devices should have no trouble reconnecting to HA (although there have been specific bugs in some integrations and devices).

nickrout · November 10, 2022, 6:38am

If you need to restart all your network gear, and then the computers that are attached to it, every day, then you need to fix your network gear.

KruseLuds · November 10, 2022, 12:51pm

It is all my own network, and it is total overkill. We had a flood where we almost lost everything until I repaired a broken sump pump while standing in the basement in knee-deep water, and then a week later my boss called and asked me why I had applied for unemployment (meaning someone got my SS# and tried to get benefits in my name). Something changed in my mind (maybe I lost it!) when I yelled upstairs to the wife “this will never ever ever ever happen again”.

So there are quick-release brackets holding in a new sump pump, with a water line leading into the pit for testing purposes, and a spare pump ready on a shelf nearby that I test every 3 months - and sensors under every single sink, toilet, tub, washing machine, dishwasher, dryer, hot water heater and furnace. Then there is enterprise-level networking equipment running with two separate VPN providers, whose headquarters are in countries that are not members of the nine-eyes group, which I use as a VPN client, rotating randomly between the two and selecting locations like Zurich, Romania or Spain. I also changed every single password I use to 30 digits.

All the equipment is rebooted in a specific order daily in the middle of the night, on purpose, after the configs are all automatically backed up, and the RPI is rebooted automatically with a cron job a few minutes after any piece of equipment whose reboot causes it an issue. My main flaw here with hardware and software selection is that the leak, door, window and garage door sensors are all YoLink, which relies upon the cloud. If they do not come out with a hub which does not rely upon the cloud (which they say they are working on), I will slowly replace each sensor with one that does not rely upon the cloud. Hope that helps you understand my setup. Just this morning I was able to verify that the RPI crontab jobs are rebooting the RPI at just the right time to gracefully handle all the other equipment reboots without any issue.

Lastly, what I was asking was: on top of all of that, if, say, the network goes down because of an ISP issue and then comes back up, is there a way to use something in the logs to determine a good time to reboot the RPI automatically?

KruseLuds · November 10, 2022, 12:59pm

No, actually that is by choice - see my other response which explains my setup in more detail. I was looking for a way that my RPI would be able to intelligently reboot only when needed. That would have to be done carefully, in that I have to be able to completely avoid any flawed logic which would make it reboot endlessly due to some bug or weird condition!

I do have 4 backups of the RPI as well, one a hot standby SSD as well as three others on MicroSD cards for good measure. Every backup is done such that there is nothing running on the RPI other than the OS when it is booted up - so I can do whatever I would need to that storage before enabling that automated start up of all the other processes (such as weewx for my weather station etc) as well as HA before a final reboot to use that storage as my “Prod”.

Right now I can live with the scheduled reboots to stay flawlessly working with the other network equipment, but I was thinking it would be great if it could reboot at the correct time if say we lost power to the house and that came back on, or if the ISP had an outage that was then fixed, etc.
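One way to rule out the endless-reboot scenario mentioned above is to put a guard in front of any automatic reboot. The sketch below is an assumption-laden illustration, not a tested recipe: the 30-minute minimum uptime, the cap of 3 reboots per day and both function names are hypothetical. Only `/proc/uptime` is a standard Linux interface.

```python
import time
from typing import List, Optional

MIN_UPTIME_S = 30 * 60    # hypothetical: never reboot within 30 minutes of booting
MAX_REBOOTS_PER_DAY = 3   # hypothetical cap on automatic reboots per 24 hours


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Return seconds since boot; the first field of /proc/uptime on Linux."""
    with open(path) as fh:
        return float(fh.read().split()[0])


def reboot_allowed(uptime_s: float, past_reboots: List[float],
                   now: Optional[float] = None) -> bool:
    """Allow a reboot only if the box has been up long enough and the
    number of automatic reboots in the last 24 hours is below the cap.

    past_reboots holds Unix timestamps of earlier automatic reboots,
    which the caller would have to persist somewhere (e.g. a small file).
    """
    now = time.time() if now is None else now
    if uptime_s < MIN_UPTIME_S:
        return False
    recent = [t for t in past_reboots if now - t < 24 * 3600]
    return len(recent) < MAX_REBOOTS_PER_DAY
```

With a guard like this, a bug in the trigger logic can cause at most a few reboots per day instead of a continuous loop.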

nickrout · November 10, 2022, 9:39pm

It should never need rebooting. I don’t understand why it would.

KruseLuds · November 11, 2022, 4:22pm

Because when you are running a lot of things from many vendors, you have an environment where sometimes things do not reconnect 100% properly when the internet goes down and comes back up. You might run into something that has a memory leak, sometimes on-the-fly configuration changes require reboots, there are firmware updates, etc. Booting up clean also ensures that if you ever do have a power outage, you will know everything comes back properly when the power returns. If you don’t reboot something for several years, how do you know it will come back up properly when there is some kind of an outage?

nickrout · November 11, 2022, 9:02pm

Oh I accept that, maybe I have lost sight of what you are trying to do and/or why you need to do it.

KruseLuds · November 12, 2022, 3:58am

I was considering watchdog changes, but in formulating my answer to you I’ve decided to simply reboot automatically when there are > 200 instances of “ERROR” in the syslog within the last 15 minutes, which would both allow the resiliency and give me ample time to intervene if needed. (For example, I could be away while the internet is down for 2 hours and then comes back while I’m not around, etc…)
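The “> 200 instances of ERROR within the last 15 minutes” rule could look something like the sketch below. It assumes the traditional syslog timestamp prefix (for example `Nov 12 03:58:01`, which carries no year) and an English locale for month names; the function names are hypothetical, and lines without a parseable timestamp are simply skipped.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

WINDOW = timedelta(minutes=15)  # window from the post
THRESHOLD = 200                 # count from the post


def parse_syslog_time(line: str, year: int) -> Optional[datetime]:
    """Parse a leading 'Mon DD HH:MM:SS' stamp; the year is supplied by the caller."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def recent_error_count(lines: Iterable[str], now: datetime) -> int:
    """Count lines containing ERROR whose timestamp falls in the last 15 minutes."""
    cutoff = now - WINDOW
    count = 0
    for line in lines:
        stamp = parse_syslog_time(line, now.year)
        if stamp is None or not (cutoff <= stamp <= now):
            continue
        if "ERROR" in line:
            count += 1
    return count


def should_reboot(lines: Iterable[str], now: datetime) -> bool:
    """True when more than THRESHOLD recent ERROR lines are present."""
    return recent_error_count(lines, now) > THRESHOLD
```

Because the year is borrowed from the current date, entries written just before midnight on December 31 would be misdated for a few minutes after New Year, which only delays a reboot rather than causing one.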

parautenbach · November 12, 2022, 7:03am

I don’t know how to put this differently to you: You should focus on fixing the underlying issues and report bugs where necessary. You’re creating a rabbit hole with your approach. Soon, to mention but one issue, you will have to “improve” your automations by patching them to cater for startup conditions: things that will trigger due to unknown, unavailable or other state changes. Frequently rebooting any system or restarting a service simply isn’t a good idea.

I know this might seem harsh to you, but that’s really not it. If you can detail specific issues without the superfluous chatter, myself and others can try to help you with the real issues as best as we can.

sparkydave · November 12, 2022, 7:12am

I agree 100%. The OP has serious network issues that need to be resolved. Losing internet should not require a reboot of all network hardware.

My Dream Machine Pro has been online for 7 weeks and was only shut down at that point because I was moving things around in my rack.

parautenbach · November 12, 2022, 7:25am

Exactly. And it could be as simple as replacing a bad power supply, but I need to rein myself in and not speculate more without some hard info. Electricity fluctuations can cause all kinds of havoc.

KruseLuds · November 12, 2022, 2:19pm

Fair enough, I am game. But it’s a can of worms! Ok, let’s open it! Currently there are no syslog errors occurring. First, here is my setup (below). Second, I will next post the syslog errors that appear after I complete step 4 of this test: 1. unplug the ethernet cable from my RPI, 2. wait 5 minutes, 3. plug it back in, 4. wait 5 minutes!

System Information

version	core-2022.11.2
installation_type	Home Assistant Supervised
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.10.7
os_name	Linux
os_version	5.10.0-19-arm64
arch	aarch64
timezone	America/New_York
config_dir	/config

Home Assistant Community Store

GitHub API	ok
GitHub Content	ok
GitHub Web	ok
GitHub API Calls Remaining	4863
Installed Version	1.28.3
Stage	running
Available Repositories	1210
Downloaded Repositories	22

AccuWeather

can_reach_server	ok
remaining_requests	29

Home Assistant Cloud

logged_in	false
can_reach_cert_server	ok
can_reach_cloud_auth	ok
can_reach_cloud	ok

Home Assistant Supervisor

host_os	Debian GNU/Linux 11 (bullseye)
update_channel	stable
supervisor_version	supervisor-2022.11.dev0401
agent_version	1.4.1
docker_version	20.10.21
disk_total	915.4 GB
disk_used	14.8 GB
healthy	true
supported	true
supervisor_api	ok
version_api	ok
installed_addons	Log Viewer (0.14.0), Samba share (10.0.0), Home Assistant Google Drive Backup (0.109.1), File editor (5.4.2), Duck DNS (1.15.0), AdGuard Home (4.7.5), Terminal & SSH (9.6.1), Core DNS Override (0.1.1), Mosquitto broker (6.1.3), AppDaemon (0.10.0)

Dashboards

dashboards	4
resources	15
views	25
mode	storage

Recorder

oldest_recorder_run	October 13, 2022 at 9:44 AM
current_recorder_run	November 12, 2022 at 4:54 AM
estimated_db_size	1448.00 MiB
database_engine	sqlite
database_version	3.38.5

WallyR · November 12, 2022, 2:46pm

You should be able to use different methods to handle your routines.
The easiest one is probably ping sensors to the first hop outside your network and, if the ISP uses CGNAT, also to the first hop outside that.
A ping to google (8.8.8.8), cloudflare (1.1.1.1) or other known and always online site might be an extra step to go.

Here you can also add sensors from your network gear, where your router should provide the status of its ports, including the WAN port.
SNMP might be an easy way to extract this info.

An extra safety check could be to rely on UptimeRobot: Integrations & Notifications | UptimeRobot
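The layered checks described above make it possible to tell where an outage sits. As a rough sketch, assuming each hop already has a reachable/unreachable result (from ping sensors, SNMP or similar), the nearest failing hop can be picked out like this. The hop names and the first two addresses are placeholders; only 8.8.8.8 comes from the post.

```python
from typing import Dict, List, Tuple

# Ordered from nearest to farthest. The router and ISP addresses are
# placeholders (203.0.113.1 is a documentation-only address).
HOPS: List[Tuple[str, str]] = [
    ("router", "192.168.1.1"),
    ("isp_first_hop", "203.0.113.1"),
    ("public_dns", "8.8.8.8"),
]


def locate_break(hops: List[Tuple[str, str]], reachable: Dict[str, bool]) -> str:
    """Return the name of the nearest hop that is unreachable, or 'none'.

    A hop with no result in `reachable` is treated as unreachable.
    """
    for name, _addr in hops:
        if not reachable.get(name, False):
            return name
    return "none"
```

If the router answers but the first ISP hop does not, the problem is outside the house and rebooting local equipment is unlikely to help.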

KruseLuds · November 12, 2022, 3:08pm

Update: As soon as I unplugged it, it must have locked up. First, after waiting longer than the 5 minutes in #4, it was completely inaccessible, so I had to pull the power and then plug it back in to reboot it. Then, when checking the logs, there was absolutely nothing in the log from the moment I unplugged the ethernet cable. When HA finally started up after the plug-pulling reboot, everything was working fine except for some command line sensors that look at the syslog - in the log itself appears “binary match found”, which means a command line sensor call to grep the syslog thought the log was binary. Just deleting the log, so any corruption in it is gone, resolved that issue. So I had to delete the syslog and restart the RPI (properly, with a shutdown -r command) - only then did everything come back normally.
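grep treats a file as binary when it contains NUL bytes or undecodable data, and a log that was being written during a hard power cut is a plausible place for such debris (with GNU grep, the `-a`/`--text` option forces text mode). As an alternative to deleting the log, a counter that tolerates the debris could be sketched like this. The function name is hypothetical, and the approach assumes the log is otherwise line-oriented text.

```python
def count_matches(path: str, needle: str) -> int:
    """Count lines containing `needle`, tolerating NUL bytes and bad encoding.

    errors="replace" swaps undecodable bytes for a placeholder character
    instead of raising, so a partly corrupted log can still be scanned.
    """
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if needle in line:
                count += 1
    return count
```

For example, `count_matches("/var/log/syslog", "ERROR")` would return the number of ERROR lines even if a block of garbage sits in the middle of the file.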

My hard-coded (crontab) logged and emailed reporting of scheduled reboots is working perfectly, but over and above that, if there is any issue, my RPI is vulnerable in that way… I am surprised the watchdog didn’t make it reboot. Anyway, so my idea of log scraping won’t work either in a case like this… UGH

Maybe my 200 ERROR count idea PLUS something else. Jeez

Thoughts?

WallyR · November 12, 2022, 7:08pm

The HA log from the previous session is renamed to home-assistant.log.1 on startup, so you need to actually go into the folder and read it with a text editor or copy it out to open it on another computer.

sparkydave · November 13, 2022, 8:58am

I had no idea HA did this. Since when?

nickrout · November 13, 2022, 9:19am

From my very vague recollection, 6 months -ish.

Edit: I think since Aug 2021. See “Change logging to do rollover() instead of rotate()” by janiversen · Pull Request #55177 · home-assistant/core · GitHub

parautenbach · November 13, 2022, 1:33pm

Yes, the log in the home assistant directory will be rolled, but if you view it with journalctl -f homeassistant (e.g. on Debian) it will be continuous.

parautenbach · November 13, 2022, 1:35pm

@KruseLuds let’s try to get HA out of the mix for a moment to get rid of confounding issues. Make a job that pings e.g. Google once a minute and pipe the output to a file. Or, just open a terminal and run a continuous ping and break your network again. If the ping doesn’t come back, we know this issue has nothing to do with HA directly.
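A minimal sketch of such a job, assuming a Linux `ping` (iputils) that accepts `-c` for the packet count and `-W` for the reply timeout in seconds. The function names, the log file name and the use of 8.8.8.8 as the target are illustrative choices, not anything prescribed in this thread.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, Optional


def ping_once(host: str, run: Callable = subprocess.run) -> bool:
    """Send one ICMP echo via the system ping; True if it exits with status 0."""
    result = run(["ping", "-c", "1", "-W", "2", host],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def log_line(when: datetime, host: str, ok: bool) -> str:
    """Format one result line, e.g. '2022-11-13T13:35:00 8.8.8.8 FAIL'."""
    return f"{when.isoformat(timespec='seconds')} {host} {'ok' if ok else 'FAIL'}"


def monitor(host: str = "8.8.8.8", logfile: str = "ping_test.log",
            interval_s: int = 60, rounds: Optional[int] = None) -> None:
    """Ping once per interval and append each result to the log file.

    The file is reopened for every write so each line is flushed and closed
    before the next ping, which helps if the machine locks up mid-test.
    """
    done = 0
    while rounds is None or done < rounds:
        ok = ping_once(host)
        with open(logfile, "a") as fh:
            fh.write(log_line(datetime.now(), host, ok) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Calling `monitor()` runs until interrupted. A gap in the timestamps after the cable is pulled would suggest the whole machine stalled, while a run of FAIL lines followed by ok lines would suggest the OS itself recovered fine.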