Something you might want to consider as part of this discussion is the use cases. When I choose a DB technology, I weigh various factors. This might be time-series data, but if it’s not frequently queried as such, then you don’t (necessarily) gain by storing it in a time-series DB. Time-series DBs are usually fast on insert, though, so that is a good thing for HA.
Absolutely, those are your starting-point requirements: what do we expect to store here? You could argue that things have just naturally evolved and the needs have changed without anyone formally realizing this.
So what do we see right now? The database serves two purposes: configuration management (in parallel with, and supplementing, the YAML files), and the recorder component.
I did put it in the topic, as the very first word, that I would be questioning the recorder, so let’s focus on that. I agree that using SQLite even for configuration items is a no-brainer (if we use a database for that purpose at all). Heck, splitting these two apart would even make sense here.
So looking at the recorder, its main purpose seems to be recording (historical) data of sensors/events, right? Arguing that we only store ‘slowly changing data’, e.g. sun/moon times etc., quickly becomes moot. But I fully admit I only know my own use cases. So far, my Home Assistant usage is very limited. I have my car (Renault) connected, and get sensor data there. I also have a SlimmeLezer connected, but I still only have a Ferraris meter ;p
So looking at the ‘energy monitoring’ use case, which is a very prominent part of the default installation, the time-series case becomes even stronger.
Time-series databases are designed for exactly this use case, I would argue, so it’s not illogical to think in this direction.
Perhaps also state what you think is lacking in the existing solution (and yes, I know you are making an enquiry at this point in time). Are there particular performance issues? Do you want to improve a specific way data can be retrieved or otherwise processed?
So let’s focus for now on a single use case: energy monitoring. This is an area HA seems to be pulling attention towards. A standard setup stores time-series (sensor) data of energy consumption in an SQLite database. (Yes, this is still time-series data; even if we do not store it or use it as such, the nature of the data is time series.)
So what’s lacking here? Firstly, the fact that we cause a huge number of unnecessary writes. Again, let’s assume a Raspberry Pi setup with an SD card, which is not unlikely (even if not recommended), but also consider a hosted server with SSD storage. I don’t think I have to go into detail about why this is bad, but for those unaware: each datapoint (even if somewhat cached in RAM first) can cause a (small) write, with a huge impact on SD wear. On a server farm this gets amplified by the sheer number of incoming writes, and there is a cost associated with that (it could financially impact Nabu Casa).
Secondly, querying the data can become problematic once a lot of data has been written, because reads are inefficient too (maybe that’s why we have min/max/avg in the table?). The default setup purges data after 10 days, right? However, we now have a statistical data model where we store summary data of those 10 days. I haven’t dug into how this is done exactly, but I also saw that for now it doesn’t matter, because the summaries use the same inefficient storage methodology. So after X years of writing a lot of aggregated data, we end up with a messy and slow database.
You also mention things that are not specific to the chosen DB tech, like the “duplicates” you noted. Better to keep those issues apart.
Technically correct, but conceptually I don’t think it matters. The current database design decisions hint towards a solution.
But to make it clearer: I am actually not suggesting using InfluxDB for the recorder; the database tech is too different. It could/should certainly be possible in the future, of course, to expand the recorder to also use InfluxDB.
I’m actually suggesting (based on my assumptions and lack of current knowledge, hence the initial questions I asked) that we first re-organize the database/tables to be time-series oriented. E.g., to dumb it down, a simple table with just three fields: `timestamp, metadata_id, value`. Secondly, and this is an important database design decision, evaluate whether it makes sense to keep all sensor data in a single table, or to have a table per sensor, which would reduce the columns to just two: `timestamp, value`. The `metadata_id` is then moved into the table name. The choice here is whether the dominant query is based on timestamp or on `metadata_id`.
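To make that concrete, here is a minimal sketch of both variants. Table and column names are mine, purely for illustration, not what HA actually uses:

```sql
-- Variant 1: one narrow table for all sensors; metadata_id refers to a
-- separate table describing the sensor.
CREATE TABLE states_ts (
    timestamp   INTEGER NOT NULL,   -- unix epoch (TIMESTAMPTZ on PostgreSQL)
    metadata_id INTEGER NOT NULL,
    value       REAL    NOT NULL,
    PRIMARY KEY (metadata_id, timestamp)
);

-- Variant 2: one table per sensor; the metadata_id moves into the table
-- name, leaving only two columns.
CREATE TABLE sensor_energy_consumption (
    timestamp INTEGER NOT NULL PRIMARY KEY,
    value     REAL    NOT NULL
);
```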
Let me explain (this even goes a little too deep for now). If we want to query all sensors at timestamp X, having a single table lets us state `SELECT * FROM statistics WHERE timestamp BETWEEN X-1 AND X+1`; in that case it’s best to have a single table. I can’t imagine that use case coming up much, however, and think it more likely that we just want graphs/historical data for a single sensor (e.g. `SELECT * FROM <metadata_id>` against a per-sensor table). The cross-sensor query is then still possible, but requires more complex joins, as sketched below. But due to my ignorance, I don’t even know if this ‘cross-data query’ is currently even possible with the recorder.
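Roughly, the two query patterns against those sketched tables would look like this (placeholder parameters, illustrative names again):

```sql
-- The (I suspect) dominant pattern: history of a single sensor.
-- Variant 1:
SELECT timestamp, value
FROM states_ts
WHERE metadata_id = 42
  AND timestamp BETWEEN :start AND :end
ORDER BY timestamp;

-- Variant 2 is even simpler:
SELECT timestamp, value
FROM sensor_energy_consumption
WHERE timestamp BETWEEN :start AND :end
ORDER BY timestamp;

-- The 'cross-data' query (all sensors around timestamp X) is trivial in
-- variant 1, but in variant 2 needs a UNION over every sensor table:
SELECT 'energy' AS sensor, timestamp, value
FROM sensor_energy_consumption
WHERE timestamp BETWEEN :x1 AND :x2
UNION ALL
SELECT 'car_battery', timestamp, value
FROM sensor_car_battery
WHERE timestamp BETWEEN :x1 AND :x2;
```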
I fully understand, btw, that the current database design is what it is for reasons, which I was asking about in my opening post. Those reasons of course influence which columns to put in there.
Lastly, remember that SQLite is the default DB, which makes it super easy for new users to get going. I’m pretty sure that is an important aspect here. Choosing another, less general (albeit more powerful) DB technology might create an unwanted barrier. This is probably why the third-party setup for an InfluxDB would be preferred: it’s optional and for power users, because the existing setup strikes a good compromise.
Yes/no/maybe. By that I mean: SQLite is certainly easier for newcomers. `sqlite3 file.db` and you’re off; use a nice GUI, no extra tooling required. But I’d also expect newcomers to use the recorder as the interface to the database; it abstracts the database away, doesn’t it? Regardless, as you rightfully pointed out, if we get the database format in better shape, the database tech is still irrelevant.
Now, assuming our database uses the three columns I mentioned above, with regards to database tech I’d propose adding support for TimescaleDB. I also understand that this is something most people are unaware of, so let me explain it a bit.
TimescaleDB is an extension to PostgreSQL. It works with pure ‘standard’ SQL commands. This is important, because it means SQLAlchemy can still be used. So what makes TimescaleDB different? On the surface, nothing. You use your PostgreSQL database as normal; e.g. you can run a TimescaleDB-enabled PostgreSQL instance as normal (without using any TimescaleDB advantages).
The addition to the recorder would be that, upon detecting/requesting this, the table gets converted. So what does that mean? `SELECT create_hypertable('conditions', by_range('timestamp'), migrate_data => true);` After that, the table still has the same columns and the queries are still the same, but now TimescaleDB can do its thing.
There’s a bit more with regards to creating indexes (if they don’t exist yet), e.g. (and now I’m in way over my head, but as I understood it) having an index on a view where you have multiple data types in the same table (like we do now). But also aggregation: Timescale can do aggregation ‘behind the scenes’ automatically, and a few other things (like updating an average upon a new data entry). See the sketch below.
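For what it’s worth, here is roughly what such automatic aggregation looks like in TimescaleDB (a ‘continuous aggregate’). The names build on my earlier sketch, and I’m paraphrasing the TimescaleDB docs from memory, so treat this as illustrative only:

```sql
-- Hourly min/max/avg per sensor, maintained by TimescaleDB in the background.
-- Assumes states_ts has been converted to a hypertable and that the
-- timestamp column is a TIMESTAMPTZ here.
CREATE MATERIALIZED VIEW states_hourly
WITH (timescaledb.continuous) AS
SELECT
    metadata_id,
    time_bucket(INTERVAL '1 hour', timestamp) AS bucket,
    min(value) AS min_value,
    max(value) AS max_value,
    avg(value) AS avg_value
FROM states_ts
GROUP BY metadata_id, bucket;

-- Have TimescaleDB refresh it on a schedule as new data comes in.
SELECT add_continuous_aggregate_policy('states_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```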
But! These would be optimizations on top of the current SQLite solution! The SQLite solution shall, and will, keep functioning as normal. In the case of aggregation, though, it would need to be done manually for SQLite, MariaDB etc.
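By contrast, here’s a rough idea of what that manual route could look like for SQLite (again illustrative, building on my sketch above, and explicitly not how the current statistics code works):

```sql
-- Periodic roll-up job (e.g. run from the recorder's purge task):
-- summarize raw datapoints into an hourly table, then purge the raw rows.
INSERT INTO states_hourly (metadata_id, bucket, min_value, max_value, avg_value)
SELECT
    metadata_id,
    (timestamp / 3600) * 3600 AS bucket,  -- truncate epoch seconds to the hour
    min(value),
    max(value),
    avg(value)
FROM states_ts
WHERE timestamp < :cutoff
GROUP BY metadata_id, bucket;

DELETE FROM states_ts WHERE timestamp < :cutoff;
```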