In the year overview of the Energy dashboard, retrieving the statistics is quite slow (more than 6 seconds) when using MariaDB. I did some database query analysis and confirmed that the statistics retrieval itself is what takes so long.
I added the following index (ix_statistics_metadata_id_start_ts_sum_mean) to the statistics table using HeidiSQL, and now the retrieval takes less than a second.
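For reference, the index implied by that name could be declared roughly like this with SQLAlchemy's Index construct (a sketch only: the column order is assumed from the index name, and the Table definition is a simplified stand-in for the real recorder schema):

from sqlalchemy import Column, Float, Index, Integer, MetaData, Table

metadata_obj = MetaData()

# Simplified stand-in for the recorder's statistics table.
statistics = Table(
    "statistics",
    metadata_obj,
    Column("id", Integer, primary_key=True),
    Column("metadata_id", Integer),
    Column("start_ts", Float),
    Column("sum", Float),
    Column("mean", Float),
)

# Covering index: metadata_id/start_ts drive the lookup, and sum/mean are
# included so the statistics query can be answered from the index alone.
Index(
    "ix_statistics_metadata_id_start_ts_sum_mean",
    statistics.c.metadata_id,
    statistics.c.start_ts,
    statistics.c.sum,
    statistics.c.mean,
)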
Everyone would have to carry the cost of keeping that very large index up to date, as well as the storage space for it; the trade-off is only going to be worth it for very few people.
A lot of the work is still in Python code: to reduce the data from hourly to yearly, every data point in the database has to be iterated. Historically that has been very slow, and most of the work was being done in the Python code. We have significantly optimized that in the last three months, but it would be surprising if we managed to make that operation faster than the SQL data retrieval, as it was previously profiling out at roughly 95% of the time.
If it’s something you need, adding the index manually might make sense, but that query should already be covered by:
CREATE UNIQUE INDEX ix_statistics_statistic_id_start_ts ON statistics (metadata_id, start_ts);
Mind you, the index is confusingly named since it covers (metadata_id, start_ts) and not (statistic_id, start_ts), so maybe this was a manual migration and the index above was created with the wrong columns?
@bdraco, I agree that the ix_statistics_statistic_id_start_ts index name is confusing and should have been named ix_statistics_metadata_id_start_ts. It is the existing name of the index and I didn’t change it. I checked some old SQLite backups and the index is named the same there.
For me the (metadata_id, start_ts) index does not provide the right performance, but adding the index with sum and mean included does. I agree that there is always a trade-off when adding an (extra) index, because it slows down inserts (index maintenance) and requires more storage.
Did you ever consider letting the database handle the grouping of the data instead of doing that in Python code? The SQL statement below automatically groups the data per month for several sensors and is very fast (using the new index).
MariaDB
SELECT metadata_id, FROM_UNIXTIME(start_ts, '%m') AS month_num, MAX(sum) - MIN(sum) AS month_sum
FROM statistics
WHERE
  metadata_id IN (6, 7, 8, 9, 10, 352) AND
  start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, FROM_UNIXTIME(start_ts, '%m')
SQLite
SELECT metadata_id, STRFTIME('%m', start_ts, 'unixepoch') AS month_num, MAX(sum) - MIN(sum) AS month_sum
FROM statistics
WHERE
  metadata_id IN (6, 7, 8, 9, 10, 352) AND
  start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, STRFTIME('%m', start_ts, 'unixepoch')
We would be more than happy to offload all the work being done in Python to the database.
We would need a query that works with SQLAlchemy so we don't have to maintain a lot of branching. The challenge is getting that working across all the databases we support: SQLite, MariaDB, MySQL, and PostgreSQL.
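For illustration, the dialect-independent part of such a query could look roughly like this in SQLAlchemy Core (a sketch reusing the simplified statistics table from above; month_expr stands for the dialect-specific month extraction, see the per-dialect expressions later in this thread):

from sqlalchemy import func, select

def monthly_sums(statistics, month_expr, metadata_ids, start, end):
    # Sketch: group statistics rows per month in the database instead of
    # in Python. month_expr is the dialect-specific expression extracting
    # the month from start_ts.
    return (
        select(
            statistics.c.metadata_id,
            month_expr.label("month_num"),
            (func.max(statistics.c.sum) - func.min(statistics.c.sum)).label("month_sum"),
        )
        .where(statistics.c.metadata_id.in_(metadata_ids))
        .where(statistics.c.start_ts >= start)
        .where(statistics.c.start_ts < end)
        .group_by(statistics.c.metadata_id, month_expr)
    )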
I will have a look and see what I can do.
Unfortunately I am not a Python developer (more of a C#/Swift developer), so it could be a steep learning curve.
I think that the performance impact for me has to do with the number of records that are returned and need to be processed by Home Assistant (Python).
For the above 6 energy sensors the default query returns around 51,000 records. I have 19 energy sensors in total, so the number of records that have to be processed really adds up.
Reducing data in pure Python is always going to be slow.
Building a way to offload this to the database would be a nice win, since it would reduce the amount of data coming back to be processed by orders of magnitude.
On a side note, it's nice that we are discussing 6 seconds being too slow here, since last year similar queries were taking more than 5 minutes.
You are totally right. The performance optimization work that has been done so far is tremendous. Hopefully we can get this part optimized as well, because the Energy dashboard is being used more and more.
Question (not a Python dev here, mind you): is it possible to integrate a native (compiled) code block, with well-established in/out data interfaces and no side effects, into Python code? Data reduction operations like this, or other heavy data processing, are trivial in a language like C and would be orders of magnitude faster. It would also open the possibility of using SIMD intrinsics (for the most commonly supported architectures).
For the other route, offloading the work to the database, I did some more investigation:
I think that the easiest way to work around these database-specific functions is to introduce extra columns which hold the month/day/hour values and are filled in by Python during the INSERT. Those columns can then be used by the GROUP BY clause of the retrieval query.
I did some tests, and the query would then look something like below.
SELECT metadata_id, start_ts_month AS month_num, MAX(sum) - MIN(sum) AS month_sum
FROM statistics
WHERE
  metadata_id IN (6, 7, 8, 9, 10, 352) AND
  start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, start_ts_month
To get the best performance, the new column should be included in the index.
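For illustration, the value for such a column could be derived from start_ts at insert time with plain Python (a sketch; the helper name is hypothetical, and note that it pins the month to UTC, which ties into the time-zone concern raised later in the thread):

from datetime import datetime, timezone

def start_ts_month(start_ts: float) -> int:
    # Hypothetical helper: derive the month (1-12) to store alongside
    # start_ts. This interprets the timestamp as UTC.
    return datetime.fromtimestamp(start_ts, tz=timezone.utc).month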
I didn’t have time yet to dive into SQLAlchemy and what is supported there; that’s on my todo list. This at least provides a way to work around the use of database-specific functions.
Checking online, I found a good example explaining the use of date functions in SQLAlchemy. My conclusion is that this is handled per database: you can call the needed function directly within the SQLAlchemy statement. The downside is that you have to manage the different database dialects yourself.
Is this something that is allowed within the Home Assistant architecture?
That’s a non-issue. This code would be compiled for the platform during installation, and binaries could be provided for the most common platforms (ARMv7 and x86). This is already done in HA for its Rust dependencies.
About SIMD (if used): two specific code paths for ARMv7/v8 and x86/64 could be added, which probably covers 99% of the real-world use cases. Anything else would fall back to non-SIMD code (which would still be ridiculously faster than its Python equivalent).
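As a minimal sketch of the kind of integration being described (every name here is hypothetical), a small compiled C library can be called from Python via ctypes:

import ctypes

# Hypothetical: libreduce.so is a small compiled C library exposing
#   double range_sum(const double *values, size_t n);
# which returns max(values) - min(values) over the array.
lib = ctypes.CDLL("./libreduce.so")
lib.range_sum.argtypes = (ctypes.POINTER(ctypes.c_double), ctypes.c_size_t)
lib.range_sum.restype = ctypes.c_double

hourly_sums = (ctypes.c_double * 4)(120.5, 121.0, 123.4, 125.9)
monthly_delta = lib.range_sum(hourly_sums, len(hourly_sums))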
Adding new string columns to the database like that would significantly increase the size of the database.
Another problem would be daylight saving time, and the user also has the ability to change the time zone of the instance, which would make all the stored values incorrect.
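To make that concrete (a small illustration, not from the thread): the same start_ts can fall in different months depending on the time zone, so a month value baked in at insert time goes stale when the instance time zone changes:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ts = 1672527000  # 2022-12-31 22:50:00 UTC
print(datetime.fromtimestamp(ts, tz=timezone.utc).month)                  # 12
print(datetime.fromtimestamp(ts, tz=ZoneInfo("Australia/Sydney")).month)  # 1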
You are right that the extra columns would increase the database size, but only slightly, because they are integer columns and not strings. The value of the columns is calculated from the start_ts timestamp, so it should reflect the correct value.
Like I said, it would be a workaround to avoid the use of database-specific functions. My preference would be to use the database-specific functions instead of a workaround. After digging into the database differences, the functions/syntax below should be used in SQLAlchemy per database.
SQLite: STRFTIME('%m', start_ts, 'unixepoch')
MariaDB: FROM_UNIXTIME(start_ts, '%m')
MySQL: FROM_UNIXTIME(start_ts, '%m')
PostgreSQL (not tested): EXTRACT(MONTH FROM to_timestamp(start_ts))
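A sketch of how those per-dialect expressions could be built in SQLAlchemy (the helper name is hypothetical; func.<name> renders as the SQL function of that name, and extract() renders the EXTRACT syntax):

from sqlalchemy import extract, func

def month_expr(dialect_name, start_ts):
    # Hypothetical helper returning the month-extraction expression for
    # the given dialect; the same object is usable in SELECT and GROUP BY.
    if dialect_name == "sqlite":
        return func.strftime("%m", start_ts, "unixepoch")
    if dialect_name in ("mysql", "mariadb"):
        return func.from_unixtime(start_ts, "%m")
    if dialect_name == "postgresql":
        return extract("month", func.to_timestamp(start_ts))
    raise ValueError(f"unsupported dialect: {dialect_name}")

The dialect name is available at runtime from engine.dialect.name, so the same query-building code can pick the right expression per backend.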