Post-upgrade system checks

I’m new to HA, but not new to systems management. Given the complexities involved in HA, I would like to write a script that would be run both pre-upgrade and post-upgrade. I would then use a diff tool to compare the outputs for any differences that might indicate an issue that needs to be addressed. I would like to start with a list of installed integrations and some state data about each of them. From there, I would enumerate the devices and entities, again collecting specific data on each. Unfortunately, I cannot find a way to pull integration information via Python / Jinja. Is that possible? If not, is there another way to go about this? Note that integration_entities does not help since, as far as I know, you have to provide the integration name. Thanks.
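One possible route, offered as a sketch rather than a confirmed answer: the Home Assistant REST API exposes `GET /api/config` (whose JSON includes a `components` list of loaded integration domains and a `version` field) and `GET /api/states` (one object per entity). Assuming a long-lived access token and a reachable base URL, a script outside HA could reduce both payloads to a small, sorted structure that diffs cleanly. The helper names `fetch_json`, `build_snapshot`, and `write_snapshot` are made up for this illustration, and the choice to record only ok/unavailable/unknown per entity (instead of raw state values, which change constantly) is an assumption about what is worth comparing.

```python
import json
import urllib.request

# States that usually indicate a problem after an upgrade.
PROBLEM_STATES = ("unavailable", "unknown")


def fetch_json(base_url, token, path):
    """GET a Home Assistant REST API path and decode the JSON body.

    base_url and token are assumptions, e.g. "http://homeassistant.local:8123"
    and a long-lived access token created in the user profile.
    """
    req = urllib.request.Request(
        f"{base_url}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def build_snapshot(config, states):
    """Reduce the /api/config and /api/states payloads to a diff-friendly dict."""
    entities = {}
    for item in sorted(states, key=lambda s: s["entity_id"]):
        state = item.get("state")
        # Record only availability, so normal sensor drift does not show up as a diff.
        entities[item["entity_id"]] = state if state in PROBLEM_STATES else "ok"
    return {
        "version": config.get("version"),
        "components": sorted(config.get("components", [])),
        "entities": entities,
    }


def write_snapshot(snapshot, path):
    """Write the snapshot as stable, sorted JSON so a text diff tool can compare runs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2, sort_keys=True)
        fh.write("\n")
```

Run once before and once after the upgrade, for example `write_snapshot(build_snapshot(fetch_json(url, token, "/api/config"), fetch_json(url, token, "/api/states")), "pre.json")`, then compare the two files with any diff tool. An integration that vanished should show up as a missing entry under `components`.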

This is an interesting concept and I don’t have an answer but I am curious to see where it leads.

I have gone through many manual HA post-upgrade system checks:

  1. do all automations still work as expected?
  2. are the logs clear?
  3. are all my devices available?
  4. are the system entities (memory, disk, cpu, etc.) as expected?
  5. etc.

I could imagine a script to check device and entity data pre & post upgrade but it might not be able to flag potential functional issues if the data were equal before and after upgrade. In other words: the device and entity data is ok but how it is handled in automations has changed.

Agreed. There would be some manual checks as well, but in a previous life, I wrote unit tests to do what was referred to as “operational validation” tests. They ran frequently against virtualization, storage, etc. systems as a form of health checking. I would be interested in doing something like that with HA. I just need to figure out how to get to the data.

My post upgrade checks are:

  1. Check my auto-entities card that lists unavailable and unknown entities.
  2. Check the logs for errors and warnings.

Takes less than a minute.
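For anyone who wants the first check in script form, here is a minimal sketch. It assumes the input is a list of state objects shaped like the REST API's `/api/states` response (each with `entity_id` and `state`); the function name `unavailable_entities`, the `ignore` parameter, and the sample entity IDs are hypothetical.

```python
def unavailable_entities(states, ignore=()):
    """Return the sorted entity_ids whose state is 'unavailable' or 'unknown'.

    `ignore` lists entity_ids that are expected to be unavailable
    (for example, devices that are normally powered off).
    """
    problem_states = {"unavailable", "unknown"}
    skip = set(ignore)
    return sorted(
        item["entity_id"]
        for item in states
        if item.get("state") in problem_states and item["entity_id"] not in skip
    )


# Hypothetical sample data in the shape of /api/states.
sample_states = [
    {"entity_id": "light.porch", "state": "on"},
    {"entity_id": "cover.garage_door", "state": "unavailable"},
    {"entity_id": "sensor.attic_temp", "state": "unknown"},
    {"entity_id": "media_player.guest_tv", "state": "unavailable"},
]

problems = unavailable_entities(sample_states, ignore=["media_player.guest_tv"])
print(problems)
```

An empty list after an upgrade corresponds to a clean auto-entities card.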

I already glance at the system performance metrics (CPU %, RAM, etc.) periodically, so I usually don’t bother right after an upgrade unless things have gone horribly wrong and my system has slowed to a halt. This has happened once in 6 years, and it wasn’t an upgrade that caused it (it was me).

You need to define this a bit more clearly:

Anything other than state values might require a custom integration.

I do something similar by checking that unavailable entities = 0 using this template sensor:

I don’t need a state history, so sensor seems overkill. Example of use:

[Screenshot: Administration – Home Assistant]

And the reason:

Resolution:

Click entity → Settings → Delete.

Good point. I use the sensor in an automation to notify me when the number of unavailable entities is greater than zero.

OK, I will definitely check these out. Thanks.

I have a small HA environment right now, but anticipate it growing. When I upgraded Core from 2024.6.4 to 2024.7.1, I missed the fact that the Aladdin Connect integration had been deprecated. Since the integration simply disappeared after the upgrade, it wasn’t obvious at first. So my idea was to essentially “inventory” the environment (integrations, devices, entities, etc.) before an upgrade and again afterwards. This data might be saved as Python objects and written to a file. I would then run diffs against the before-and-after objects to identify any changes. I don’t really know Python, so it would be an effort, but if written properly, it should not require a significant amount of maintenance. And the nice thing about this approach is that it will catch things that disappear for whatever reason, as well as new entities that didn’t exist prior to the upgrade. At the end of the day, perhaps I only need to look at entities.
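The before-and-after comparison described here can be sketched in a few lines of Python. This assumes each inventory has already been collected into a plain dict mapping an entity ID to some recorded value (here a simple status string); how the inventory is gathered is out of scope for this block. The function name `diff_inventory` and all entity IDs below are hypothetical.

```python
def diff_inventory(before, after):
    """Compare two {entity_id: value} dicts and report what changed.

    Returns sorted lists of keys that were removed, added, or whose
    recorded value differs between the two inventories.
    """
    before_keys = set(before)
    after_keys = set(after)
    return {
        "removed": sorted(before_keys - after_keys),
        "added": sorted(after_keys - before_keys),
        "changed": sorted(
            key for key in before_keys & after_keys if before[key] != after[key]
        ),
    }


# Hypothetical inventories taken before and after an upgrade.
pre_upgrade = {
    "cover.garage_door": "ok",
    "light.porch": "ok",
    "sensor.cpu_temp": "ok",
}
post_upgrade = {
    "light.porch": "unavailable",
    "sensor.cpu_temp": "ok",
    "sensor.new_thing": "ok",
}

report = diff_inventory(pre_upgrade, post_upgrade)
for kind, names in report.items():
    print(f"{kind}: {', '.join(names) or '(none)'}")
```

Entities that disappeared with a removed integration land in `removed`, new ones in `added`, and anything that flipped to unavailable in `changed`. The same function works on integration lists if they are first turned into dicts.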

Interesting concept. Certainly something I’ve done in production environments before.

I’m not sure inventorying entities is the right approach though. I’ve learned to carefully read not only the release notes, but the entire thread for each version. Usually anything that might impact me is pretty well documented, but even when it’s not, someone else has had the problem and posted about it. I doubt I’d miss a deprecated integration.

I document anything suspicious in my “update notes” text file for the new version, which includes a list of things to test after updating. This list only takes a few minutes to run through, during which time I’d notice any other significant failures or changes.

That said, I really like the unavailable entities template. That would be a good back-up to catch things which don’t show up just browsing my dashboards.
