Are HA backups useless? šŸš® (SQLite DB always corrupted)

Most likely the recorder was designed with the intent of moving to a NoSQL database at some point, or to one of the growing number of SQL databases that support JSON as a native format.

This is not a reason for a complete recorder rewrite; it is what allows the recorder to be flexible in what it stores. To badly translate a Dutch proverb: a relational SQL database is about as flexible as a lead door. This is why NoSQL databases are gaining so much traction.

Edit:

Geeesus, why did you write this? Now I just got another "new" option :slight_smile: to look into.

I don't know the reasons, but it's obvious that quite a few people seem to have "missed" that the recorder now records ALL, so people have to make their choices depending on their needs, installation etc. Flexibility at a high level… the number 1 reason I went for Home Assistant.

To be fair, they do use some native datatypes in the table. I think this was done maybe a year or two ago? I remember at some point they had to update the DB format and migrate the DB (which resulted in some cryptic error on my setup and the DB being corrupted, but that's another story). Right now, the states table format looks like this:

[screenshot: the states table schema]

Lots of varchars (some of which are unavoidable to some extent), a hideous TEXT (see below) and lots of redundant data being stored. Every single state change stores the full entity ID (as text), the full domain name (as text too) and, worst of all, a full JSON dump of all the attributes:

[screenshot: the JSON attributes dump stored with a single state]

A full JSON text dump for every single state change of every sensor.

The amount of redundancy in that DB is insane. Not only does this bloat the DB size, it also makes it slow to load. Not so much on the SQL query side, but all that redundant data needs to be parsed back into Python objects. And parsing is slow, and Python is slow. The problem here is, as you said, that there's no easy fix for this besides a full redesign of the recorder system from the ground up. State changes need to identify their sources via short index IDs or GUIDs that link back into separate tables for the entities and attributes, to cut down on both parsing overhead and redundancy.
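As a rough sketch of the kind of normalization meant here (table and column names are hypothetical, not HA's actual schema):

```sql
-- Static per-entity data, stored exactly once
CREATE TABLE entity_meta (
    metadata_id INTEGER PRIMARY KEY,
    entity_id   TEXT NOT NULL UNIQUE  -- e.g. 'sensor.living_room_temperature'
);

-- Deduplicated attribute sets, shared by every state that carries them
CREATE TABLE attribute_sets (
    attributes_id INTEGER PRIMARY KEY,
    shared_attrs  TEXT NOT NULL       -- the JSON, stored once instead of per state
);

-- Hot table: one compact row per state change
CREATE TABLE states (
    state_id      INTEGER PRIMARY KEY,
    metadata_id   INTEGER NOT NULL REFERENCES entity_meta(metadata_id),
    attributes_id INTEGER REFERENCES attribute_sets(attributes_id),
    state         REAL,               -- native value instead of a VARCHAR
    last_updated  REAL NOT NULL       -- Unix timestamp
);

CREATE INDEX idx_states_meta_time ON states (metadata_id, last_updated);
```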

I just did a simple check on a random states table entry in my DB for a temperature sensor. A single state, as it is, uses 294 bytes in the DB for a single temperature value. The exact size varies by sensor and even by state, due to the liberal use of variable-length cells. If you remove most of the redundant overhead and keep the state and event IDs, a timestamp, an additional ID linking to the source entity and attribute table, and finally an IEEE float for the value, that data can be compressed into 24 bytes. That's 8% of the original size. And technically this could be made even smaller.
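For illustration, here is one way those 24 bytes could break down (purely a sketch; SQLite's actual record format uses variable-length encoding, so real on-disk sizes differ):

```sql
-- Hypothetical fixed-size layout for one recorded sample:
CREATE TABLE states_compact (
    event_id    INTEGER NOT NULL,  --  4 bytes: link into the events table
    metadata_id INTEGER NOT NULL,  --  4 bytes: link to entity/attribute tables
    ts          REAL    NOT NULL,  --  8 bytes: timestamp
    value       REAL    NOT NULL   --  8 bytes: IEEE 754 double for the state
);                                 -- 24 bytes per state change in total
```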

You do have to take into account that what you are talking about is ONLY the data, not how this data should be stored, and not how the system/core and frontend have to call it… so you can't compare if you don't take into consideration how SQLite vs. a relational DB works… I fully agree that "a lot" is stored; one can reduce this somewhat by not naming every "entity" something like sensor.my_awesome_temperature_sensor_on_first_floor, and by not including "entities" from a device that you know you'll never use, etc. And "measurement" and "unit_of_measurement" are hardly "redundant" in this SQLite DB… and for attributes you don't have much choice, as they are individual "information" from a specific device/sensor. If you call this information "redundant", you have to study "attributes" closer, as there are "variable strings", "states" etc. etc. in the attributes; hence the TEXT type.
And I just did a "compact DB", which reduced it by a third. I know we are going towards a bigger and bigger "default" installation, with a lot more integrations and therefore a bigger DB. There is documentation in regard to this (recorder etc.) with tips on what to exclude, to reduce your installation and keep it healthy.
A new default "external" DB is hardly a way forward for a product like HA; a MariaDB won't make any difference. In the end it's about how the user chooses to use the product… by default it works for many; others HAVE TO follow up on the upgrade notes and, yes, have to make additional decisions/settings to "roll on".

PS: and Aron was talking about DB size vs. data size… it's simply wrong, and taken out of a very important context: DB-in-application.

I'm afraid you don't really understand how data is stored in databases at a low level.

The backend issues an SQL query over a time range defined by timestamps. That's all there is on the call side. The problem is the way the data is stored (the data layout) and the subsequent processing into natively readable objects.
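Something along these lines (simplified and illustrative, not the recorder's literal query):

```sql
-- History request: all states of one entity within a time window,
-- against the current denormalized layout described above
SELECT state, last_updated
FROM states
WHERE entity_id = 'sensor.living_room_temperature'
  AND last_updated BETWEEN '2022-01-01 00:00:00' AND '2022-01-02 00:00:00'
ORDER BY last_updated;
```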

Of course it is. Your unit of measurement for a temperature sensor is not going to change from one sample point to the next. Storing it for every single sample is completely redundant, as are the attributes like friendly name, etc. This is high level structural data that will never change when acquiring data points. It needs to be stored in a separate table (or not at all, friendly name for example can be retrieved from the HA config instead) and indexed from the states.

Data size is directly correlated to DB size and vice-versa in this context. The size of the management data itself is insignificant compared to the size of the stored data, as long as the redundancies are removed. In SQL, every native type has a fixed size in a cell, except for varchars (which are being much too liberally used by HA).

I just checked my entities (potential event and state "spammers"): about 300. If I use a "naming strategy" that reduces the name and entity name by, let's say, 15 letters each, that's 300 Ɨ 30 = 9,000 bytes every 10 seconds (54,000 bytes a minute). Here we are talking "redundant".

I know, but this is not how this DB works, as I understand it; in a relational DB, yes.

There I agree, and don't understand it, as it's already registered in the config files, as well as the device name… though then, if it's not stored in the event, the underlying process has to read from these config files… bump, +30% faster to read from the DB blob… so we are back to "maintenance" of the DB if we want a fast-responding UI and core.

I see 3 varchars in the example: state (can't be different, unless TEXT), entity_id (you want INT here?), domain (what was it now, 3 was it?)… yes, I don't understand what you want it to be instead :slight_smile:

This has nothing to do with "how the DB works". This is about the recorder component choosing how to store data in the DB by splitting it up into appropriate tables and interlinking them in such a way that static data gets stored only once, while dynamic data is stored in a highly optimized cell layout linking back to the immutable data.

I noticed. Don't worry about it.

I know the configuration/setup of the DB didn't come out of the blue :wink:, and I also hope that someone, some day, will offer their time and expertise to develop the DB to today's circumstances/demands (maybe even to MY satisfaction/needs and preferences) (I have lots of ideas and ways I want it), but I'm not in a position to complain about it, and don't really find it appropriate, as I have "options", AS does everyone else with just a little computer knowledge… it's just a click away, plus scrolling through a few docs, or a dozen :slight_smile:

PS: the above would be (edit:) 77,760,000 bytes a day :wink: (redundant_loooong_names)… start thinking positively, there's a lot one can do besides complaining.

I still would like to "see" that query in "action"… for e.g. a Switch 1 Basement with 4 "attributes" (Total Power Switch 1 Basement (ad-ele), Current Switch 1 Basement, Voltage Switch 1 Basement, Power Switch 1 Basement)… and your "proposed" database model: table setup, relations etc., and why not an estimated comparison with the current model, performance-wise (e.g. total opens/closes, reads from config files etc.). But of course also your use case for the average user in terms of administration etc., on a "default MariaDB", if that's your thought… or "just a new/rebuilt" SQLite. I'm sure the default image will increase a little (but we are there already, with all the new feature requests/integrations/product support etc. implemented lately), which I assume you have also taken into consideration in your judgement of the current DB's expansion (besides the new entities and automations you added during the same time, and the fact that the update notes and "heads-up" state that the recorder now records ALL events and state changes if not "told" otherwise; and, to be honest, why not also count those "extra" entries you create in the DB, regardless of DB model, when you set your sensor to "spew" every second instead of every 3 or 5).
PS: I do know what a table looks like; I have experience with Oracle and MySQL, and now also a little with how Grafana/Influx wants to be asked nicely (haven't managed to create an "unfortunate loop")… yet.

I think in my absence everything I was going to reply to was already said. I only get time to visit here occasionally and mainly on weekends.

Just a few points I want to make, really…

SQLite is a perfectly capable relational database engine, so everything said here regarding changing the database structure to proper relational tables to avoid storing redundant data applies to it as well.

Relational databases are by their very nature less flexible. That is a large part of the point, actually. Storing data within that rigid structure means storing it efficiently, predictably and in formats the database engine is optimized to work with.

The HA database is nothing special. It really isn't. It would be trivial to design a relational schema to efficiently hold the data in, say, 15 minutes. Probably most people reading this thread with some database experience have already done so in their heads. (I have.)

Relational database engines are optimized to perform best when data respects native data types and is broken down to its elements. It is very costly for a database engine to parse a TEXT field of JSON to extract a value (and yes, I'm aware that JSON types now exist in many DB engines like MariaDB, but this is not universal). It is trivial for the DB engine to simply link together a few tables via JOINs, especially if they are indexed properly.
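To make that concrete, reassembling a full history row from the normalized sketch earlier in the thread is a single indexed join (again with hypothetical table names):

```sql
-- Recombine static and dynamic data at query time via indexed joins
SELECT em.entity_id,
       s.state,
       s.last_updated,
       a.shared_attrs
FROM states AS s
JOIN entity_meta         AS em ON em.metadata_id  = s.metadata_id
LEFT JOIN attribute_sets AS a  ON a.attributes_id = s.attributes_id
WHERE em.entity_id = 'sensor.living_room_temperature'
  AND s.last_updated >= 1640995200   -- range filter on the indexed column
ORDER BY s.last_updated;
```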

If anyone is curious about how data can be stored across multiple tables and then combined in a query via relations, there are actually a lot of decent tutorials out there (I just Googled) to get you working with a relational database and make the concepts much clearer. If you have Microsoft Access, Google the famous "Northwind sample database" or install it directly from the Access templates. It's a relational database designed as an example and used in many tutorials.

As mentioned, there is a big difference between "reading about" and understanding… but if you have "already" figured it out, I look forward to seeing your proposed model.

To be honest, you might be waiting a long time. I haven't created a database schema anywhere except in my head and don't really see the need to put one down in this thread. Primarily because (I hope this does not come across as hostile; I am just lacking a better way to say it) I have nothing to prove. In a way, this is database design 101, and there is no need to prove the assertion that a relational database which eliminates the redundant data and stores data in native data types, instead of just VARCHAR and TEXT, would be orders of magnitude more efficient. It just is.

Considering that database corruption, performance and size threads have been a constant in this forum since I began with HA (sometime around version 0.40), fixing the database is evidently of very low priority in development.

I understand, you have really thought it through. Maybe you should have looked at the various "attributes" and tried to understand what those really are, so you knew before you started "thinking" out a solution, and taken into consideration the impact on performance and the additional resource usage (CPU, RAM) your "solution" would lead to.

I sincerely doubt a "dictionary" in another table will solve anything; it will just make things worse, and you will break a lot of functionality… AND before you comment any further, please go through your "attributes", as I recommended before; my "keyword" then was "variables"… (or maybe I should say "entities"… with state = Boolean, strings (bigger than VARCHAR) as well as INT, with multiple "integration-specific" values, "unit_of_measurements", names etc. etc.)… so every time a sensor updates its states/events (open/close), as now stored in a single "BLOB", this "dictionary" table (with rows corresponding to 5-10 times the number of entities, or even more, if you count device trackers etc. etc.) would have to be updated in all its related attribute rows and corresponding columns (which might have to include a few TEXT columns as well).
To be honest, I am "astonished" by your "solution".

PS: Some people have problems with their DB, others don't, for various reasons.

Clear answer: no, backups are not useless (anymore :smiley:). Because:

Seems this is working now. I had to restore an HA backup and the database looks fine so far. Details @ Fix corrupted / malformed SQLite database (home-assistant_v2.db) - #23 by e-raser.

With recent updates, HA has moved to a relational model for attributes and other parts of the database, as described in this thread by myself and others. The performance improvements and data size reduction are impressive.
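For anyone curious, the new layout is roughly along these lines (simplified; check the recorder source for the exact schema): the states table now carries small integer keys instead of repeated text, e.g.:

```sql
-- Simplified view of the direction HA took (assumed column subset):
-- states.metadata_id   -> states_meta       (entity_id stored once)
-- states.attributes_id -> state_attributes  (shared JSON, deduplicated)
SELECT sm.entity_id, s.state, sa.shared_attrs
FROM states AS s
JOIN states_meta           AS sm ON sm.metadata_id  = s.metadata_id
LEFT JOIN state_attributes AS sa ON sa.attributes_id = s.attributes_id;
```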

For me, backups are still useless. My HA instance is huge; the database file is 6 GB, and there is no way to back it up and restore it the convenient way. What a bummer. I will take the long way, manually replacing the DB files, and hope for a reliable backup process soon.