Database size (over 1 GB on a two-day-old stock HA) and how to proceed

So Folks,

I had a hiccup with HA two days ago. My VM host ran out of space, just a handful of MB before the VM itself would have, during an HA update. I fixed that host issue and, for good measure, resized the stock .vdi file to double its size, and by doing so destroyed my partition table and lost the whole machine. As luck would have it, my VBox install was on one of the mere three sub-versions affected by a bug in the VirtualBox core that made the resize move the blocks, yeeha.

So now I'm back with an all-new 7.4 .vdi, just two days of uptime, and only a bunch (about 28) of Tasmota switches integrated, with no non-stock data storage. And my DB is already over 1 GB and my backups are growing fast.

[screenshots: DB size and backup size]

Well, this time I run Samba Backup, so no significant space will be taken up by the 7 local backups, and my backup location will not run out of space for years to come, even if I keep adding daily backups almost indefinitely; I have it set to keep only 70 anyway.

Also, this way I will never lose my installation again :slight_smile: as my backup location gets an off-site rsnapshot every night. :smiley:

SO the QUESTION is: when I start adding weather data and other stuff that I want stored for years, not just the roughly 10 days I think HA keeps on stock settings, where will my DB size end up? Would a switch to MariaDB improve that? And what about MariaDB as a fix for the performance issues?

For space reasons, I switched the HA disk from SSD to a conventional hard drive, as my VM SSD environment is rather limited with its 2x 512 GB RAID 1 setup, while the conventional drives give me plenty of room with 4x 20 TB in a RAID 10 config. But I can't expand the SSD side, as my space, both SATA ports and physical bays, is pretty much used up in the HP ProLiant MicroServer Gen8. I could only use the SATA ports on the mainboard, and they are really slow compared to the hardware RAID controller. Hence the switch to spinning disks.

I have not seen any speed changes as of yet, but I could add a second VM drive on the SSD environment for the DB only, if that would help the speed. If so, what size should I go with?

I had quite some performance issues with graphs last time; they took 10 to 30 seconds to load in the Lovelace dashboard. And that's annoying if you just want to take a quick look at last night's temperatures to know whether you need to switch on the car's preheat to defrost it (yes, I'm upgrading that to an afterburner to automate it, but currently it's still manual).

What are your thoughts and experiences with that, guys?
Thanks

For recent data I stick to the default 10 days or so… long-term data is pushed to InfluxDB and Grafana.

Personally, I don't use the built-in database for querying large data sets, as it's not designed for analytics.

Also, try to reduce the number of entities that you record; this will certainly help reduce the local database size.

Hope this helps


Quite simply, the database design for Home Assistant is an absolute disaster and is not suitable for use, let alone storing data you actually want.

Give up the dream of using it for anything other than 1 - 2 weeks of data (and even then it will grow to multiple GBs) and instead use another database for long term storage. I believe that InfluxDB is the popular choice for long term data storage.

In case you are curious, the HA DB is treated like a flat file. Every state change creates a record where the state is stored as a VARCHAR(255) (seriously) and all the attributes are dumped as a JSON string into a field stored as TEXT (yes, really).
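For the curious, you can see this for yourself by opening the recorder file with Python's sqlite3 module and dumping a row. This is only a sketch: the table and column names (`states`, `state`, `attributes`) match older HA schemas, and newer releases move the attributes into a separate state_attributes table.

```python
# Minimal sketch: peek at the newest row of the recorder's states table.
# Assumes the older schema where states has entity_id, state and attributes columns.
import json
import sqlite3

conn = sqlite3.connect("home-assistant_v2.db")
conn.row_factory = sqlite3.Row

row = conn.execute(
    "SELECT entity_id, state, attributes FROM states "
    "ORDER BY state_id DESC LIMIT 1"
).fetchone()

if row is not None:
    print(row["entity_id"], row["state"])         # the state is just a string
    print(json.loads(row["attributes"] or "{}"))  # attributes are one JSON blob in a TEXT column

conn.close()
```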

And don’t expect the database in those backups to be useful. Often, the SQLite DB is corrupt beyond repair if restored from backup as the file was backed up during writes with open “transactions” (not that transactions are actually used, they are not).
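If you do want to know whether a restored copy is usable at all, SQLite's built-in integrity check is a quick first test. A minimal sketch, with a hypothetical path to the restored file:

```python
# Run SQLite's integrity check against a restored copy of the recorder database.
import sqlite3

conn = sqlite3.connect("restored/home-assistant_v2.db")  # hypothetical path to the restored file
result = conn.execute("PRAGMA integrity_check").fetchone()[0]
print("database looks OK" if result == "ok" else f"database damaged: {result}")
conn.close()
```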


I'm not sure where you are “heading” or what you mean. You do know SQLite is often “called” a “flat-file DB”, but that doesn't mean it's actually “flat” … or what do you mean by “treated like a flat file”?

Yes, I believe that is how a “raw” data record should be stored, in this case in a non-relational DB, or what's your idea?

Why would HA store the history data that way? I thought it obvious that this would cause performance issues on big setups.

Knowing this now, I see why I have those performance issues when displaying graphs of more than, say, half a day.

Doing what amounts to a CLI-based file search to query data can't be setting any performance records.

I would never have imagined that HA uses such a DB design… I really can say I'm totally flabbergasted.

Thanks for the Influx hint, I was already considering MariaDB.

So what is the use-case difference between Influx and Maria here on HA?

I know Influx is specifically designed for time-series data and as such a perfect match for our HA data, while Maria is just the usual SQL.

The thing is, I already have a MariaDB up and running for other stuff, including snapshots and so on; I would just need to add a home-assistant-data DB and the user for HA.

So what is the difference between the HA data going to Influx and Maria?

I found this, but it's not clear to me what the performance issues would be.

Performance-wise and DB-size-wise.
Thanks

BTW, the screenshot of the backup size, updated with today's value:

[screenshot: backup sizes]

My backups, after just 4 days of uptime, are past 1 GB compressed. That's just insane for the states of a bunch of Tasmota devices that send MQTT updates 6 times a minute to provide reasonable consumption information about the connected devices.

[screenshot: DB size]

This is the DB size of a 4-day-old HA system.

Do you mean by “deactivating” entities whose data I likely won't need?

How do you show Grafana in Lovelace? I did read that an iframe won't update itself unless you reload the Lovelace dashboard, but you can't do that if you have just a single page on a tablet showing the local information for, say, a room.

I should have been a bit clearer and said “flat text file”, the typical meaning of which is a text file where every new entry is appended to the end, such as a log file. This is how HA uses the database, in the sense that all it does is append the latest state change to the STATES table as a record that contains the state stored as a VARCHAR and all the attributes for that state stored as TEXT. It is absolutely brutal in terms of storage space, as I do not believe (I couldn't find any evidence in a quick Google) that MySQL uses any sort of compression on TEXT.
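To put a rough number on that claim for the default SQLite file, you can compare the average size of the state value against the average size of the attributes JSON. Again just a sketch; the column names assume the older schema:

```python
# Rough measure of how much of the states table is attribute JSON versus the state value.
import sqlite3

conn = sqlite3.connect("home-assistant_v2.db")
avg_state, avg_attrs, rows = conn.execute(
    "SELECT AVG(LENGTH(state)), AVG(LENGTH(attributes)), COUNT(*) FROM states"
).fetchone()
print(f"{rows} rows, avg state {avg_state or 0:.0f} bytes, avg attributes {avg_attrs or 0:.0f} bytes")
conn.close()
```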

In terms of how raw data is stored, it depends on one's goal. If forensic preservation is required, then data should be stored as “raw” as possible. But we don't really care about that in HA; we care about using the data to show histories. So data should be stored as efficiently as possible.

Store numeric data as a numeric data type, store string data as a string data type, and eliminate the “attributes” column entirely. Attributes should be stored in three linked tables: one containing the attribute definition, another the actual attribute value (as its native data type), and a third linking them together and to the states table.
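As a rough illustration of that idea (not how HA actually does it), a minimal sketch of the three tables in SQLite DDL could look like this; all table and column names are made up:

```python
# Sketch of the proposed normalized attribute storage: definitions, values, and a link table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute_defs (
    attr_id INTEGER PRIMARY KEY,
    name    TEXT UNIQUE NOT NULL              -- e.g. 'unit_of_measurement'
);
CREATE TABLE attribute_values (
    value_id  INTEGER PRIMARY KEY,
    attr_id   INTEGER NOT NULL REFERENCES attribute_defs(attr_id),
    num_value REAL,                           -- numeric attributes stored natively
    str_value TEXT                            -- strings only when there is no numeric form
);
CREATE TABLE state_attribute_link (
    state_id INTEGER NOT NULL,                -- would reference the existing states table
    value_id INTEGER NOT NULL REFERENCES attribute_values(value_id),
    PRIMARY KEY (state_id, value_id)
);
""")
conn.close()
```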

It should not be a non-relational database, because that leads to storing loads of repeated data.

Yeah, I have no idea why the database ended up like this. If I had to guess (one could go back through previous versions to confirm), I'd say the database was added with the recorder component just to keep track of states. Then attributes were added in the easiest way, by simply dumping the JSON into the DB. And there it sits.

Influx is not used for the recorder; rather, states/attributes are pushed to Influx for long-term storage. The recorder database is unchanged and you can still use SQLite or another engine, but now you can store a minimal dataset by excluding entities and setting a short purge interval to try to keep DB bloat under control.

Let me summarize what I gained from your posts.

Influx is for long-term storage of, say, weather or sensor values.

And for the recorder, choose either stock SQLite or Maria; it does not really matter, though there is a rumor that Maria boosts performance a bit…

History is parsed via Grafana from the Influx DB for the frontend or other uses, and SQLite or Maria is used only for decision-making in e.g. automations.

Did I get that right?

So this basically means you want HA to move over to MariaDB for its core, front-end, integrations, user add-ons, etc. (i.e. a total rebuild of the product)? But I really can't see how that achieves “data should be stored in the most efficient way possible”… Attribute data and other “mixed” data types need to be stored in the best way, yes (I think you know that, but actually you don't)… You want to first split the “attributes” from the state and store them in 3 additional tables: 1 for the “definition” (whatever you mean by that), 1 for the “actual” attribute (as its native data type), which would be what? VARCHAR or INT, and could even be TEXT (if it doesn't “fit” into VARCHAR or INT), and then 1 table that links them together and to the states table. I don't really see where in this equation you gain speed and efficiency compared to reading and writing a blob of strings in a TEXT-type column.

I think the database was “added” somewhere in the initial software-planning stage.

No, not at all. SQLite is a relational DB engine which can trivially handle this.

Regarding your other comments, I think they have been covered in the other thread. Linking tables together in a relational database is far more efficient than storing everything as TEXT and VARCHAR.

Yeah, OK, and I still haven't seen the DB/table models you and others refer to, i.e. weather entity (attributes), camera device entity (attributes), stock market entity (attributes), etc. Take a look at your own entities' attributes (i.e. in Developer Tools), then write down your proposed DB/table model with no (or only very small) TEXT/VARCHAR columns. Until someone can present this (or understand it), they should not complain or even suggest anything. AND do take into consideration that new integrations/devices (with their own specific attributes) will be added to HA weekly/monthly by “external” developers.

Regarding new integrations with their specific attributes, this is somewhat of a non-problem; programmatically it is an easy problem to solve. The “attributes” table can simply be kept in memory in whatever the Python equivalent of a collection or dictionary object is: name-value pairs. When attributes are written to the table linking them to entities, the ID of the attribute is pulled from that collection/dictionary based on the key (the friendly attribute name). Now if a new integration/entity is introduced with a unique attribute, it will not be found in the dictionary/collection. At that point it is simply inserted into the “attributes” table and the new ID retrieved, or it is written to the database and the entire dictionary/collection is refreshed with a single fast SELECT query. This happens fairly rarely, so it has no impact on performance.
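A minimal sketch of what that name-to-ID cache could look like, using SQLite for brevity and a made-up table name:

```python
# In-memory cache mapping attribute names to IDs, backed by an 'attributes' table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (attr_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")

attr_ids: dict[str, int] = {}  # name -> attr_id, loaded once and kept in memory

def attribute_id(name: str) -> int:
    """Return the ID for an attribute name, inserting it the first time it is seen."""
    if name not in attr_ids:
        conn.execute("INSERT OR IGNORE INTO attributes (name) VALUES (?)", (name,))
        # New attributes appear rarely, so refreshing the whole cache is cheap.
        attr_ids.update(dict(conn.execute("SELECT name, attr_id FROM attributes")))
    return attr_ids[name]

print(attribute_id("friendly_name"))  # inserted, then cached
print(attribute_id("friendly_name"))  # served from the in-memory cache
```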

This is one of many standard methods for handling this; anyone who has worked with databases will be familiar with it.
