Recorder stores data inefficiently, timeseries logging probably better?

Something you might want to consider as part of this discussion is the use cases. When I choose a DB technology, I weigh various factors. This might be time-series data, but if it's not frequently referenced as such, then you don't (necessarily) gain by storing it in a time-series DB. Time-series DBs are usually fast on insert, so that is a good thing for HA.

Absolutely, those are your starting-point requirements: what do we expect to store here? You could argue that things have just naturally evolved and the needs have changed without this being formally realized.

So what do we see right now? The database serves two purposes: configuration management (in parallel with / supplementing the YAML files), and the recorder component.

I did put it in the topic, as the first word, that I would be questioning the recorder :wink: So let's focus on that, as I agree that using SQLite even for configuration items is a no-brainer (if we use a database for that purpose at all). Heck, splitting these two apart would even make sense here.

So looking at the recorder, its main purpose seems to be recording (historical) data of sensors/events, right? Arguing that we only store 'slowly changing data', e.g. sun/moon times etc., quickly becomes moot. But I fully admit I only know my own use cases. So far, my Home Assistant usage is very limited: I have my car (Renault) connected and get sensor data from there. I also have a 'slimmelezer' connected, but I still only have a Ferraris meter ;p

So looking at the 'energy monitoring' use case, which is very prominently part of the default installation, the time-series use case becomes even stronger.

Time-series databases are designed for exactly this use case, I would argue, so it's not illogical to think in this direction.

Perhaps also state what you think is lacking in the existing solution (and yes, I know you are making an enquiry at this point in time). Are there particular performance issues? Do you want to improve specific ways data can be retrieved or otherwise processed?

So let's focus for now on a single use case: energy monitoring. This is an area HA seems to pull attention towards. A standard setup stores time-series (sensor) data of energy consumption in an SQLite database. (Yes, this is still time-series data; even if we don't format it or use it as such, the nature of the data is time series.)

So what's lacking here? Firstly, the fact that we cause a huge number of unnecessary writes. Again, let's assume a Raspberry Pi setup with an SD card. Not unlikely (even if not recommended), but also consider a hosted server with SSD storage. I don't think I have to go into detail about why this is bad, but for those unaware: all these writes (even if they are somewhat cached in RAM first), one for each datapoint, cause many small writes with a huge impact on SD wear. On a server farm this gets amplified by a huge number of incoming writes, and there is a cost associated with that (which could financially impact Nabu Casa).

Secondly, querying the data can become problematic if a lot of data is written, because reads are also inefficient (maybe that's why we have min/max/avg in the table?). But the default setup purges data after 10 days, right? However, we now have a statistical data model, where we store summary data of those 10 days, right? I haven't dug into how this is done exactly, but I did see that, for now, it doesn't matter, because we use the same inefficient storage methodology. So after X years of writing a lot of aggregated data, we have a messy and slow database.

You also mention things that are not specific to the DB tech chosen, such as the 'duplicates' you noted. Better to keep those issues apart.

Technically correct, but conceptually I don't think it matters. The current database design decisions hint towards a solution.

But to make it more clear, I am actually not suggesting using InfluxDB for the recorder; the database technology is too different. It could/should certainly become possible in the future, of course, to expand the recorder to also use InfluxDB.

I'm actually suggesting (based on my assumptions and lack of current knowledge, hence the initial questions I asked) that we first re-organize the database/tables to be time-series oriented. E.g., to dumb it down, a simple table with just three fields: timestamp, metadata_id, value. Secondly, and this is an important database design decision, evaluate whether it makes sense to keep all sensor data in a single table, or to have a table per sensor, which would reduce the columns to just two: timestamp and value. The metadata_id is then moved into the table name. The choice here is whether the dominant query is based on timestamp or on metadata_id.
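A minimal sketch of what I mean, with made-up table/column names (not the actual recorder schema):

    -- Option A: one table for all sensors
    CREATE TABLE sensor_data (
        ts          TIMESTAMPTZ NOT NULL,
        metadata_id INTEGER     NOT NULL,
        value       DOUBLE PRECISION
    );

    -- Option B: one table per sensor; the metadata moves into the table name
    CREATE TABLE sensor_data_42 (
        ts    TIMESTAMPTZ NOT NULL,
        value DOUBLE PRECISION
    );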

Let me explain (this even goes a little too deep for now). If we want to query all sensors at timestamp X, having a single table lets us state 'select * from statistics where timestamp between X-1 and X+1'; in that case, it's best to have a single table. I can't imagine this use case being very common, however, and think it more likely we just want graphs/historical data for a single sensor (e.g. selecting on a single metadata_id). The other query remains possible with per-sensor tables, but requires more complex joins. Due to my ignorance, though, I don't even know whether this 'cross-sensor query' is currently even possible with the recorder. I fully understand, btw, that the current database design is what it is for reasons, which I was asking about in my opening post. It does of course influence which columns to put in there.
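To illustrate the two access patterns against the made-up single-table layout above:

    -- History of a single sensor (what I expect to be the dominant query)
    SELECT ts, value
    FROM sensor_data
    WHERE metadata_id = 42
      AND ts BETWEEN '2024-01-01' AND '2024-01-02'
    ORDER BY ts;

    -- Cross-sensor query: all sensors around one point in time
    SELECT metadata_id, ts, value
    FROM sensor_data
    WHERE ts BETWEEN '2024-01-01 12:00:00' AND '2024-01-01 12:05:00'
    ORDER BY metadata_id, ts;

With one table per sensor, the first query gets simpler (no metadata_id filter), while the second would need a UNION over all the per-sensor tables.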

Lastly, remember that SQLite is the default DB, which makes it super easy for new users to get going. I'm pretty sure that is an important aspect here. Choosing another, less general (albeit more powerful) DB technology might create an unwanted barrier. This is probably why the third-party setup for an InfluxDB would be preferred: it's optional and for power users, because the existing setup strikes a good compromise.

Yes/no/maybe. By that I mean: SQLite is certainly easier for newcomers. sqlite3 file.db and get going, or use a nice GUI; no extra tooling required. But I'd also expect newcomers to use the recorder as the interface to the database; it abstracts the database away, doesn't it? Regardless, as you rightfully pointed out, if we get the database format into better shape, the database technology is still irrelevant.

Now, assuming our database is the three columns I mentioned above, with regard to database technology I'd propose adding support for TimescaleDB. I also understand that this is something most people are unaware of, so let me explain it a bit.

TimescaleDB is an extension to PostgreSQL. It works with pure 'standard' SQL commands. This is important, because it means that SQLAlchemy can still be used. So what makes TimescaleDB different? From the client's point of view, actually nothing. You use your PostgreSQL database as normal. E.g. you can run a TimescaleDB-enabled PostgreSQL instance as normal (without using any TimescaleDB advantages).

The addition to the recorder would be that, upon detecting (or being asked to use) TimescaleDB, the table needs to be converted into a hypertable. So what does that mean? Roughly this (with illustrative table/column names):

    SELECT create_hypertable('conditions', by_range('timestamp'), migrate_data => true);

After that, the table still has the same columns and the queries are still the same, but now TimescaleDB can do its thing.

There's a bit more with regard to creating indexes (if they don't exist yet), e.g. (and now I'm in over my head, but as I understood it) having an index on a view where you have multiple data types in the same table (like we do now). But also aggregation: Timescale can do aggregation 'behind the scenes' automatically. And a few other things (like updating averages upon a new data entry).
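As a rough sketch of what that automatic aggregation looks like in TimescaleDB (a continuous aggregate over the made-up sensor_data table from earlier; none of these names come from the recorder):

    CREATE MATERIALIZED VIEW sensor_data_hourly
    WITH (timescaledb.continuous) AS
    SELECT metadata_id,
           time_bucket('1 hour', ts) AS bucket,
           avg(value) AS mean,
           min(value) AS min,
           max(value) AS max
    FROM sensor_data
    GROUP BY metadata_id, bucket;

    -- keep it refreshed behind the scenes
    SELECT add_continuous_aggregate_policy('sensor_data_hourly',
        start_offset      => INTERVAL '3 hours',
        end_offset        => INTERVAL '1 hour',
        schedule_interval => INTERVAL '1 hour');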

But! These would be optimizations on top of the current SQLite solution! The SQLite solution shall, and will, function as normal. In the case of aggregation, though, it would need to be done manually for SQLite, MariaDB, etc., as sketched below.
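For comparison, the 'manual' version the recorder would have to keep doing itself on SQLite/MariaDB might look roughly like this (SQLite dialect, still using the made-up table, and assuming ts is stored as epoch seconds):

    -- hourly buckets computed by hand; the results would need to be written
    -- to a separate summary table on a schedule
    SELECT metadata_id,
           strftime('%Y-%m-%dT%H:00:00', ts, 'unixepoch') AS bucket,
           avg(value) AS mean,
           min(value) AS min,
           max(value) AS max
    FROM sensor_data
    GROUP BY metadata_id, bucket;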

Just to avoid confusion: No configuration data goes into the db.
All the configuration is in the YAML files (old style) or in JSON in the .storage subdirectory.

The db serves 2 different purposes, though, indeed:

  • transient data: detailed state changes et al. Those are purged after 10 days, by default
  • statistical data: Long term statistics, aggregated on 5min and 1h durations.

I’m pretty sure I’ve already seen proposals to have those 2 usages split between 2 databases, which would make plenty of sense imo (separation of concerns).

There’s a little bit of metadata in the DB, but configuration actually goes to JSON-structured files in .storage (besides YAML, as you mentioned).

Since you can actually run HA without a DB (i.e. nothing going to the recorder), I’d just say that HA is really (mostly) only concerned with current states. So, from that perspective, I’d say the main goal of the DB is actually to persist current state values, and secondary, state history. Keep in mind that the recorder settings don’t apply to the statistics tables.

I mostly agree with this, except to say that energy monitoring and management is a specific subset of HA functionality – only for users that have an energy dashboard.

Just to make sure I follow you here: HA states are only written when they change, but statistics can have duplicated values, and you're referring to the latter, given your original post and example data, right? I have high write rates on the sensors from my inverter, since I get an update every second for a multitude of sensors. In that case, the number of writes cannot be minimised unless one does some filtering (in other words, the values actually do change).

Not quite. States are purged (by default) after 10 days, but statistics are retained indefinitely (at this point in time). All these tables are in one logical DB, but it's really two separate structures/schemas, with the entity ID as the only link.

In its abstract form, I understand what you’re saying. Given HA is an automation platform (primarily), I think if you want to get traction on this, you’d need to show examples of existing functionality that can be improved or what useful new functionality can be added.

For example, I’ve often had a need to check (in a condition, as part of an automation) whether a sensor was in a particular state at a certain point in time. You can achieve this e.g. with a SQL sensor or using a variety of helpers, but I’d much rather this be a built-in feature.
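For illustration, the kind of query such a SQL sensor ends up needing looks roughly like this (table/column names based on my reading of the current states/states_meta schema, so they may differ per schema version; the entity ID and timestamp are just placeholders):

    -- What state was this entity in at a given point in time?
    SELECT s.state
    FROM states s
    JOIN states_meta m ON m.metadata_id = s.metadata_id
    WHERE m.entity_id = 'sensor.example'
      AND s.last_updated_ts <= 1704106800.0   -- the point in time, as epoch seconds
    ORDER BY s.last_updated_ts DESC
    LIMIT 1;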

Thank you for pointing that out :slight_smile: That does help clear up the discussion and the function of the database!

As for splitting it into 2 databases, I'm not sure I agree. For the SQLite use case, sure, it can make some sense, but in the end you just want two tables, which we already have.

So let's look back at the current table; since short-term and long-term follow the same data model, let's (for now) speak of them as one.

Currently, the database looks like this:

 Column        | Type                     | Collation | Nullable | Default
---------------+--------------------------+-----------+----------+----------------------------------
 id            | integer                  |           | not null | generated by default as identity
 created       | timestamp with time zone |           |          | 
 created_ts    | double precision         |           |          | 
 metadata_id   | integer                  |           |          | 
 start         | timestamp with time zone |           |          | 
 start_ts      | double precision         |           |          | 
 mean          | double precision         |           |          | 
 min           | double precision         |           |          | 
 max           | double precision         |           |          | 
 last_reset    | timestamp with time zone |           |          | 
 last_reset_ts | double precision         |           |          | 
 state         | double precision         |           |          | 
 sum           | double precision         |           |          | 

My database has no data in any of the non-_ts columns. Why is that? The _ts columns, ironically, hold the timestamps, but in float8 format. Why is that?

Secondly, we keep min/max/avg values for each data entry. For the aggregation table I can somewhat understand this, but for the short_term table this is something I don't understand. Over what period is the min/max/avg? The last X entries? Some time interval? Why can't this be calculated in real time? What's the historical value here?

The schema changes over time. Old columns were migrated, but I've never seen them removed (potentially due to the risk involved if there's a bug).

To really compare, create a clean HA instance. I cannot confirm, but I suspect you won’t see the old columns.

My opinion, and this thread is a perfect example, is that "transient" data might stay in a relational DB, and SQLite is perfectly fit for the job unless you go multi-user for whatever reason, while historic data might go to some time-series DB.

So: 2 DBs, to have the opportunity for 2 different engines / usages, really.

This is a new install :smiley: well, new as of 5(?) weeks ago :stuck_out_tongue: And going from timestamp to float sounds like a downgrade in terms of functionality :slight_smile:

Yeah, I'm not sure whether old columns got removed from the schema scripts. I haven't done the comparison myself and I haven't been able to find the create scripts. They must be somewhere in the core repo, but where?

True, it doesn’t help with inspecting data, but certain native types can be better for performance. Still, it’s fairly simple to convert it in a query if needed. Given this topic is mostly about optimisation, I’d think convenience is of a lesser concern.
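For example, the conversion could look something like this (against the recorder's statistics_short_term table as I understand it):

    -- SQLite
    SELECT datetime(start_ts, 'unixepoch') AS start_time, state
    FROM statistics_short_term;

    -- PostgreSQL
    SELECT to_timestamp(start_ts) AS start_time, state
    FROM statistics_short_term;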

I don’t disagree, and I thought of this as well; but look what’s inside the database. It mostly IS timeseries data already in there …

True, it doesn’t help with inspecting data, but certain native types can be better for performance. Still, it’s fairly simple to convert it in a query if needed. Given this topic is mostly about optimisation, I’d think convenience is of a lesser concern.

Ah, well, TimescaleDB can create a hypertable where one column needs to be a timestamp (preferred), an integer (which is basically a timestamp), or a date (a shorter timestamp). A float isn't supported. So my question comes back: a timestamp is a 64-bit int, and ints are generally more efficient in terms of CPU and probably also in SQLite. So why was it converted from a sensible format (timestamp) to a float? (Btw, a timestamp is just a 64-bit integer with a certain representation during 'select', I would think. And it's something SQL requires anyway …)
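For what it's worth, the conversion needed before create_hypertable would accept the table could look roughly like this (PostgreSQL, assuming the float column really is epoch seconds; the new column name is just illustrative):

    -- derive a proper timestamptz column from the float epoch column
    ALTER TABLE statistics ADD COLUMN start_time TIMESTAMPTZ;
    UPDATE statistics SET start_time = to_timestamp(start_ts);
    -- create_hypertable requires the time column to be NOT NULL
    ALTER TABLE statistics ALTER COLUMN start_time SET NOT NULL;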

A Unix timestamp as an int goes down to the second, whereas a float can do milli- and microseconds. Datetime objects typically go down to at least the millisecond.

What is your efficiency concern? It’s not the PK.

PostgreSQL: Documentation: 16: 8.5. Date/Time Types states that a timestamp is a 64-bit type with microsecond resolution; dunno how SQLite handles this.

But this is an interesting point. SQLite doesn't state anything about precision here: Date And Time Functions. So no idea whether those are 64-bit timestamps or something else; they do mention 'subsecond' precision.

So now the question turns back around to the beginning: 'what is the purpose of the recorder?'. If we want time-series data with microsecond precision, then using a kludge like SQLite with floats, because the native types do not support proper 64-bit timestamps, is a bit odd. While you are correct that you could do microsecond, nanosecond or even minute precision with a float, the general problem with floats remains: they are not exact, and working with them is 'harder' (comparing floats for equality is generally a big no-no, precisely because of that).
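The classic illustration, straight from SQLite:

    -- 0.1 + 0.2 is not exactly 0.3 in binary floating point
    SELECT 0.1 + 0.2 = 0.3;   -- returns 0 (false)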

I get the 'starter' argument, but then the argument also quickly becomes 'but it's not a serious solution', and if it's not a serious solution, then the whole MariaDB/PostgreSQL 'possibility' also becomes questionable.

If, on the other hand, we want to say the recorder will be a serious solution for time-series recording (without adding much complexity compared to what we have now), then we should look at my proposal, where we state: 'SQLite to get started, with SQLite's subsecond resolution; if you want to store data longer and more seriously, you can use the same recorder, but with PostgreSQL (or, if possible, MariaDB, or, once added in the future, InfluxDB) as the recorder backend.'

The proposed 'just use InfluxDB' doesn't integrate nicely with HA's default way of working, though, so it's a subpar solution. Also, those who would prefer a database other than InfluxDB 'suffer' as well …

Btw: core/homeassistant/components/recorder/db_schema.py at dev · home-assistant/core (github.com)

For example:

    __tablename__ = TABLE_EVENTS
    event_id: Mapped[int] = mapped_column(Integer, Identity(), primary_key=True)
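    # legacy columns stay declared in the model but are flagged as unused rather than dropped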
    event_type: Mapped[str | None] = mapped_column(UNUSED_LEGACY_COLUMN)
    ...

What are your arguments that having a separate db for purely time series / statistical purposes is subpar, exactly?
IMO, that's actually more elegant than trying to squeeze different / conflicting requirements into a single DB, not even considering that not everyone is interested in the statistical part.

The discussion regarding timestamp precision differing per engine goes my way, tbh.

What are your arguments that having a separate db for purely time series / statistical purposes is subpar, exactly?

It basically doesn't integrate nicely into Home Assistant. Everywhere the recorder is used (e.g. a graph of temperature changes on a sensor), this happens natively with the recorder. If I use InfluxDB, I don't have that. I have to add extra stuff, and it's not inherently linked: when I look at a sensor's history, I don't automatically get the InfluxDB graph.

I’m not against having different databases or squeezing in conflicting requirements. But making poor design choices because ‘it doesn’t matter, use something better if you want better’ is not ideal.

The fact remains: we have a database that does store history and does store statistics. Why would we not make it the best it can be (which doesn't really change all that much)?

E.g. using a proper time-series table format: it's just how we store the data; in the end, it's all still the same. Using a proper timestamp instead of a float goes my way too, tbh. You want better/more precision? Use a proper database.

As for not storing each and every entry, but instead only storing when the data has changed: that's just how you enter/retrieve data; the database is the same. Obviously your graphs, tables and tools need to account for the fact that, rather than having 'one entry per second', they now get 'one timestamp per entry', but for the database things don't become 'more difficult' or 'only possible in certain scenarios'. Just different.
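If a tool really wants a regular series again, a time-series engine can reconstruct it on read; in TimescaleDB that would look something like this (again using the made-up sensor_data table):

    -- rebuild a regular 1-second series from change-only data
    -- (locf = last observation carried forward)
    SELECT time_bucket_gapfill('1 second', ts) AS second,
           locf(avg(value)) AS value
    FROM sensor_data
    WHERE metadata_id = 42
      AND ts BETWEEN now() - INTERVAL '1 hour' AND now()
    GROUP BY second;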

I too have some highly opinionated ideas about how time series should be implemented, and after some initial attempts to get some power/energy/cost monitoring and consoles working in HA using helpers and automations, I'm of the opinion that what HA has right now is barely fit for purpose. I suspect most people just don't realise how bad/wrong it is and assume what it shows is correct.

To be fair, I don’t think there is a single timeseries DB I’ve seen that gets it right since RRD. They all make terrible mistakes in their core assumptions and are very inefficient. I have fragments of a doc scattered around that I should pull together and turn into a “how to do a TSDB right” document.

However HA’s attempt seems particularly bad, and the number of third-party efforts to replace it like Turn HA into ULTIMATE Data Analysis platform shows that other people are not happy with it either.

I might start pumping some effort into helping these efforts…

Still not sure why some people want to fit HA into something it was never meant to be: a data analysis tool or a TSDB.

If they are not satisfied with the arguably little it provides, do like all businesses in the world do: export the needed data to a proper tool for their needs.

I work for a software company, and we do exactly that: provide some baseline tools to get simple metrics, and possibilities to integrate with 3rd parties for specific needs.

At the end of the day, we want HA to be the perfect home automation tool, and every effort put into something else than this core objective just delays it.

For me, a lot of what I want to see on dashboards and quite a lot of the automation I want requires good historical data, and ways to summarise it accurately. That means good timeseries support.

For example, the current Energy Dash lies to me because I can't (yet) figure out how to accurately convert the battery AC power input sensor (which goes negative when discharging and only updates every 15m) into the total-energy-in and total-energy-out sensors the dash needs. I've created helper template sensors and integrators for these and they almost work, but HA has some strange behaviours that are making this not work correctly, and the energy output graph shown is clearly not the integral of the power output graph.

Part of the problem seems to be HA’s behaviour of ignoring sensor updates where the value doesn’t change. I’ve attempted to work around that by adding an automation to do “update entity” on all the relevant sensors but that doesn’t seem to work. Using Max sub-interval on the integrators also doesn’t seem to do the right thing. I’ve seen people saying they work around this by making their template sensors add a tiny random offset to the value, but man, does that feel dirty.

Also the energy dash seems to (try to) show energy usage and cost for each day, with the current day only showing the day so far. I’d like to be able to see what it looks like summarised for the past week, or the past 24h.

In the long term I’d like the automation to use historical trends in household power consumption, solar production, and my energy provider’s dynamic demand/supply pricing to optimise when to charge/discharge the battery.

I’ll keep fiddling… maybe I can make it work, but I feel like it shouldn’t be this hard.

If the source data is wrong, no amount of DB optimization will solve that.

If you think some calculations are not done properly in HA, neither. That would be a bug.

You can change the timespan in the top bar

@dbaarda, thank you for sharing your frustration :slight_smile:

But we have to agree that there are some things that could be done to improve the situation, yeah? Baby steps :slight_smile:

My two biggest issues are that a) we don't use timestamps, but floats instead, so converting to a TSDB is not possible; this could be fixed easily.

Secondly, duplicate data storage. This is somewhat easy to handle with stored database procedures, but I don't think those work with SQLite, so a read-assess-write cycle might be needed, which of course puts an extra burden on the database. However, I'm not familiar enough yet with TSDBs to know whether they will actually remove duplicates or whether this is something the application would have to deal with.
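Just to sketch what the stored-procedure route could look like on PostgreSQL (purely illustrative, again using the made-up sensor_data table):

    -- skip an insert when the value hasn't changed since the last stored row
    CREATE OR REPLACE FUNCTION skip_unchanged_value() RETURNS trigger AS $$
    DECLARE
        previous DOUBLE PRECISION;
    BEGIN
        SELECT value INTO previous
        FROM sensor_data
        WHERE metadata_id = NEW.metadata_id
        ORDER BY ts DESC
        LIMIT 1;

        IF previous IS NOT DISTINCT FROM NEW.value THEN
            RETURN NULL;  -- drop the duplicate row
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER dedup_sensor_data
    BEFORE INSERT ON sensor_data
    FOR EACH ROW EXECUTE FUNCTION skip_unchanged_value();

Whether you'd actually want this server-side rather than in the recorder itself is of course part of the discussion.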