HA stability - experiences. Came here from competitor

Were these breakages mentioned in the release notes and you did not fix them before updating?
Were those core integrations? Can you name some examples?

Also for me an integration breaking is not instability unless it took your whole system down, but that’s just my personal opinion.


Yes, it’s not unusual for core integration(s) not mentioned in the changelog to suddenly stop working after an update (or at least for part of their functionality to break).

I guess someone considering a move to a new platform is interested in overall reliability, not stability per se.

I could say HA is reliable if you are lucky or don’t use the system under the specific conditions which might cause problems. The few issues I’m aware of:

  1. The Supervisor may fail on update, crashing the whole of HA. Unfortunately, the Supervisor updates without user control, as we know. It happens to people in waves, roughly once every year or two.
  2. Communication with Nabu Casa breaks for no apparent reason. Probably the snitun module fails and does not recover. Happens about once a year.
  3. The transition from DST to standard time crashes half of HA’s functionality and causes high CPU load; some installations stopped working altogether (see today’s reports).
  4. Core may DDoS itself and your local network because of a DNS issue.
  5. Last but not least: all the issues of varying severity related to updates, including ones affecting components which are not listed in the changelogs. This matters because all components must be updated at once (with the whole of HA), which increases the risk of breaking one component while updating another in order to fix it.

  1. One of the reasons why I don’t use the Supervisor. But still, a failure only once or twice a year is not that much in my view. There are commercial systems that fail more often.
  2. Not using Nabu Casa anymore. Again, an online system that fails once a year is actually pretty solid; Amazon/Google services fail more often.
  3. No issues at all here. From what I could see it was mostly caused by time pattern triggers. While I agree that this shouldn’t be the case, I still haven’t seen a single automation that actually needed a time pattern trigger where other triggers would not have been more efficient.
  4. I agree, this is an issue discussed heavily in another topic and it should not be the case. Again, I don’t use the Supervisor, so I can’t comment on that.
  5. I have run HA for 5 years now and my system never went haywire due to an update. I also never had the system restart itself, not even once.

I get the feeling that people expect a military-grade reliable system for free. In maybe 95% of the cases I saw here on the forum (and I read quite a lot of them) the issue was not with the system but with the user, and more often than not it was mentioned in the release notes, but people would rather ignore them and whine afterwards than take 15 minutes to read and 10 minutes to adjust their system before updating.

The “breaking changes” section of the release notes does not list changes which break things (bugs). It is intended to list only intentional changes which will force some or all users to make configuration changes to keep things working correctly.

For a long time I didn’t know this and thought I’d be OK if I just read the release notes and fixed any breaking changes which applied to me. This is not the case. You also have to monitor these forums and Github to scan for any reported bugs with the specific integrations you use. Or just cross your fingers, update, test everything, and be prepared to back out if necessary.

Exactly. Bundling all integration changes with the core update, then not documenting bugs in one central location is setting users up for failure.

I agree there’s a lot of “pilot error.” I’ve certainly made my share, and folks here have always been kind and helpful in setting me straight.

I do question the 95% figure, but even so, it’s hard to call it whining if someone does read the release notes but is impacted by a bug in an integration which was known but not documented, especially since there’s no way for the average user to back out an integration update without backing out the whole core update.


I assure you that it happened to installations which are not using pattern triggers.
BTW, you are not using the most recommended installation type. Consider the high percentage of issues happening to users who opted for the HAOS deployment.

It would have been pretty bad if 95% of the issues had been caused by the programmers. But I bet the remaining 5% matters more to the OP.
In addition, a system which requires extensive maintenance and care from its users month after month can hardly be called “reliable”.

I never had to do this and it’s certainly not the most common thing.

This you should do anyway and always when updating a production system in my opinion.

I was talking about the whining when people update and have an issue/breaking change that was documented in the release notes. For other bugs I agree, it’s certainly not whining. I have been here for quite some time and I helped out quite a lot of people and most of the time it is a documented change.

Yes, because I consider myself an experienced/advanced user who wants to have control over his system as much as possible :slight_smile:

Extensive maintenance is an exaggeration. I have updated every month (in the past twice a month) since somewhere around 0.4x and I never had to invest more than 30 minutes, including reading the relevant breaking changes. If you are not willing to invest this time, then HA is not for you. It is still a system for tinkerers and people passionate about IoT; it’s not set and forget, and with its thousands of integrations it probably never will be.

One thing not mentioned here in terms of stability is that every Home Assistant installation comes with a built in ticking time bomb: the database.

Unfortunately the HA database design is, to not mince words, the worst possible way to store data. The “states” table is treated as if it were a flat file. Every state is stored as varchar(255), every domain is varchar(64), every entity_id is varchar(255) and, to top it off, the attributes column is of text type. Every state change adds a record containing mostly character data, and the attributes are actually stored as a JSON-encoded string containing all the attributes of the state change.

Anyone who knows the basics of database design is facepalming right now after reading this.

This is why the HA database uses 10GB to store 500MB worth of data. It is why the database is slow, why there are so many writes, why history graphs spike the CPU when they are generated, etc.
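To make the storage argument concrete, here is a rough sketch in Python using an in-memory SQLite database. It is not the actual HA schema: the table names, the entity and its attributes are made up for illustration, and the "shared" layout is just one hypothetical alternative in which an identical attribute string is stored once and referenced by id.

```python
import json
import sqlite3


def compare_layouts(num_changes: int = 1000) -> tuple[int, int]:
    """Return (flat_size, shared_size): approximate bytes of stored data."""
    conn = sqlite3.connect(":memory:")
    # Flat layout, loosely following the description above: every row
    # repeats the full JSON attribute string.
    conn.execute(
        "CREATE TABLE states_flat ("
        "entity_id VARCHAR(255), state VARCHAR(255), attributes TEXT)"
    )
    # Hypothetical alternative: attribute strings stored once, referenced by id.
    conn.execute(
        "CREATE TABLE shared_attrs (id INTEGER PRIMARY KEY, attributes TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE states_shared ("
        "entity_id VARCHAR(255), state VARCHAR(255), attrs_id INTEGER)"
    )

    entity_id = "sensor.living_room_temperature"  # made-up entity
    attributes = json.dumps(
        {
            "unit_of_measurement": "°C",
            "friendly_name": "Living room temperature",
            "device_class": "temperature",
        }
    )

    for i in range(num_changes):
        state = f"{20 + (i % 10) / 10:.1f}"  # the state changes, the attributes do not
        conn.execute(
            "INSERT INTO states_flat VALUES (?, ?, ?)", (entity_id, state, attributes)
        )
        conn.execute(
            "INSERT OR IGNORE INTO shared_attrs (attributes) VALUES (?)", (attributes,)
        )
        (attrs_id,) = conn.execute(
            "SELECT id FROM shared_attrs WHERE attributes = ?", (attributes,)
        ).fetchone()
        conn.execute(
            "INSERT INTO states_shared VALUES (?, ?, ?)", (entity_id, state, attrs_id)
        )

    # LENGTH() counts characters; all text here is ASCII, so that equals bytes.
    (flat_size,) = conn.execute(
        "SELECT SUM(LENGTH(entity_id) + LENGTH(state) + LENGTH(attributes)) "
        "FROM states_flat"
    ).fetchone()
    # Charge 8 bytes per row for the integer reference (a generous upper bound).
    (row_size,) = conn.execute(
        "SELECT SUM(LENGTH(entity_id) + LENGTH(state) + 8) FROM states_shared"
    ).fetchone()
    (attr_size,) = conn.execute(
        "SELECT SUM(LENGTH(attributes)) FROM shared_attrs"
    ).fetchone()
    conn.close()
    return flat_size, row_size + attr_size


if __name__ == "__main__":
    flat, shared = compare_layouts()
    print(f"flat layout: {flat} bytes, shared attributes: {shared} bytes")
```

The exact ratio depends entirely on the made-up attribute payload, so treat the output as an illustration of the trend rather than a measurement of a real installation.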

And left untouched with a default install, this DB runs under SQLite, which will happily grow to fill the disk, at which point HA will stall and the DB will corrupt. And if it is running on an SD card, by then the card is already burned out by writes.

HA will never be a reliable platform out of the box until the database is redesigned.


My experience with OH is that developers rush to get features / changes integrated just before the release of a Milestone or “stable” version, without proper testing in a previous snapshot release. Many times this results in breakage in OH releases. With HA, you are free to keep running a stable system. You can wait until others have found issues and patches before upgrading.

I suspect that is a tradeoff between new user friendliness / simplicity vs robustness for large installations.

Actually it’s more complicated and less achievable, because, as I already mentioned, all changes are bundled and released with core at the same time. Thus a new version might fix the issue your installation encounters, but it might also break something else.

Releasing half-baked/untested features happens to HA too (who knows, maybe OH is even worse in that respect). Just look at the most recent release, with the new, incomplete Tuya integration and the removal of the previous working integration at the same time. Not to mention the Shelly integration and several others.


HA usually tries to document them and quickly releases patches. The OH lead developer told me I had to wait over 3 months to use my newer Z-Wave devices that were in the binding database, just because they broke backward compatibility, unannounced, during a minor version development cycle.

My HA has been stable for pretty much 2 years. There have been 2 exceptions.

  1. Last night’s clock change, which didn’t take down the system as such, but it did affect the reliability of the system. I was lucky that, because it is running on a virtual server with plenty of RAM and CPU, it was able to stay up during all the IO thrashing of trying to run multiple automations 20 times every second for an hour. But in Home Assistant’s defence: whether it is Sky, Google, Apple or bank ATMs, EVERY year some computer somewhere is affected by the clocks changing. And this is the first time Home Assistant has been affected.

  2. Several releases ago, a new version of Home Assistant included a major database upgrade, and bizarrely the developers opted to run the database upgrade FIRST, before letting the rest of Home Assistant load, rather than doing the upgrade in the background. This left many of us unable to access Home Assistant for anything from 20 minutes for the lucky ones to >18 hours for the unlucky ones. For me it was around five and a half hours. There were some complaints, along with suggestions that doing the upgrade in the background, and accepting that no data is stored for its duration, would be a better option. That seems to have been accepted, because database upgrades are now done in the background.

Other than those 2 issues - I’ve pretty much had 100% uptime.


Very true. But most of the problem can be mitigated by simply not recording every single event and state change by default, and instead allowing/encouraging the user to specify which data to keep. It’s a baby step, but I suspect it is more likely to happen than a (long overdue) total recorder re-write.
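As a rough illustration of the opt-in idea, here is a minimal Python sketch of an allowlist check. The function name and the patterns are hypothetical and this is not the recorder's actual code; it only shows the principle that nothing is stored unless the user has asked for it.

```python
from fnmatch import fnmatchcase


def should_record(entity_id: str, keep_patterns: list[str]) -> bool:
    """Opt-in recording: keep an entity only if it matches a user-chosen pattern.

    An empty pattern list means nothing is recorded.
    """
    return any(fnmatchcase(entity_id, pattern) for pattern in keep_patterns)


if __name__ == "__main__":
    # Hypothetical user choices: temperature sensors and all climate entities.
    keep = ["sensor.*_temperature", "climate.*"]
    for entity in ("sensor.outdoor_temperature", "climate.hallway", "sun.sun"):
        print(entity, should_record(entity, keep))
```

With a default like this, a chatty entity that nobody graphs never reaches the database in the first place.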

I put in a feature request for this some time ago, and it got a bit of interest, but no movement yet. Your vote might help :wink:

If that is the case, the database architecture seems to have been designed to be the worst choice for both scenarios. :wink:

If that were the default, then it would be somewhat of a workaround, as would aggressive database maintenance by default. But then it would be so easy to, for example, set the option for an entity to “record all states” without the user knowing that each state change writes about 1K to the database.
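To put the "about 1K per state change" figure in perspective, here is a back-of-the-envelope calculation in Python. The entity count and update rate are invented for illustration, and the 1024-byte default simply takes the figure above at face value.

```python
def daily_growth_bytes(
    entities: int, changes_per_hour: float, bytes_per_change: int = 1024
) -> int:
    """Estimate how many bytes of state data are written per day."""
    return int(entities * changes_per_hour * 24 * bytes_per_change)


if __name__ == "__main__":
    # Hypothetical setup: 50 sensors, each reporting once a minute.
    per_day = daily_growth_bytes(entities=50, changes_per_hour=60)
    print(f"{per_day / 1024 / 1024:.1f} MiB per day")
```

Under those assumptions that is roughly 70 MiB a day, which adds up to a couple of gigabytes a month if nothing is ever purged.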


Thank you all for the comments!
I did not expect so many answers, with lots of both positive and negative ones.

To me it looks like nothing big has changed since my investigation a year ago. I am running OpenHAB on my RPi 4. HA is running on a Synology, but only collecting data, not automating anything.
In my opinion, automation should be about security and saving time. If I have to read the breaking changes every month because of updates (I like to update a lot :smiley: ), I’d rather spend my time another way.
I will give it a try someday, but due to lack of time, I will stick with the competitor…

No. Most of the breaking stuff has been in the move from YAML to GUI. I have rarely had to do anything because of a breaking change. Some integrations introduce breaking changes, but mostly because of a change on whatever 3rd party the integration is providing access to, rather than anything on the Home Assistant side of things.

My condolences.

I ran OH for 2 years and was heavily involved supporting the Z-Wave binding there. Development there is an absolute trainwreck unless you are part of the “chosen few”. Features are rushed into milestones and releases without previous snapshot testing causing high potential for instability.

A competitor who expects the users to write core documentation instead of the expert developers doing that.

Good luck!

Not that different here, obviously.

And again: Not that different here, sadly.

But this idiom comes to mind: “Don’t look a gift horse in the mouth!” :zipper_mouth_face:


Developers here value & actually act on user input. See the latest Release Notes for some work put in (Tuya) due to user input! I had the OH lead basically tell me I needed to wait 3 months to get any support for the new version of my devices because the Devs broke backward compatibility (unannounced to other devs) during a minor release schedule. They know Z-Wave depended on that compatibility.
