Improve Energy dashboard performance with multiple energy sensors

My Energy dashboard has multiple energy sensors:

  • 2 grid consumption sensors
  • 2 return to grid sensors
  • 4 solar production sensors
  • 1 gas consumption sensor
  • 1 water consumption sensor
  • 9 device sensors

In the year overview of the Energy dashboard, retrieving the statistics is quite slow (more than 6 seconds) on MariaDB. I did some database query analysis and saw that most of the time was spent retrieving the statistics.

I added an index (ix_statistics_metadata_id_start_ts_sum_mean) on the statistics table using HeidiSQL, and now the retrieval takes less than a second.
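
The statement was along these lines (shown through SQLAlchemy for illustration; the column order is inferred from the index name, and the connection string is a placeholder):

# Sketch: create the covering index described above.
# Column order (metadata_id, start_ts, sum, mean) is inferred from the index name.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/homeassistant")  # placeholder DSN
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX ix_statistics_metadata_id_start_ts_sum_mean "
        "ON statistics (metadata_id, start_ts, sum, mean)"
    ))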

This could help if you run into the same issue.

@bdraco Nick, worthwhile optimisation to include?


Everyone would have to carry the cost of keeping that very large index up to date, as well as the storage space for it; the trade-off is only going to be worth it for very few people.

A lot of the work is still in Python code that reduces the data from hourly to yearly, and each data point in the database has to be iterated. Historically that has been very slow, with most of the work being done in the Python code. We have significantly optimized that in the last three months, but it would be surprising if we managed to make that operation faster than the SQL data retrieval, as it previously profiled out at roughly 95% of the time.

If it’s something you need, adding the index manually might make sense, but that query should already be covered by:

CREATE UNIQUE INDEX ix_statistics_statistic_id_start_ts ON statistics (metadata_id, start_ts);

Mind you, the index is confusingly named, since it covers (metadata_id, start_ts) and not (statistic_id, start_ts); maybe this was a manual migration and the index above was created with the wrong columns?
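
One way to verify which columns an index actually covers, sketched with SQLAlchemy's inspector (the connection string is illustrative):

# Sketch: list the indexes on the statistics table to check their actual columns.
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:////config/home-assistant_v2.db")  # illustrative path
for index in inspect(engine).get_indexes("statistics"):
    print(index["name"], index["column_names"])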


@bdraco, I agree that the ix_statistics_statistic_id_start_ts index name is confusing and should have been ix_statistics_metadata_id_start_ts. It is the existing name of the index; I didn’t change it. I checked some old SQLite backups and the index is named the same there.

For me the (metadata_id, start_ts) index does not give the needed performance, but the index with sum and mean included does: with those columns present, the query can be answered from the index alone (a covering index) without touching the table rows. I agree that there is always a trade-off when adding an (extra) index, because it has to be maintained on every insert and takes extra storage.

Did you ever consider letting the database handle the grouping of the data instead of handling it in Python code? The SQL statement below automatically groups the data per month for several sensors and is very fast (using the new index).

MariaDB

SELECT metadata_id, FROM_UNIXTIME(start_ts, '%m') AS month_num, MAX(sum) - MIN(sum) AS month_sum FROM statistics
WHERE
metadata_id IN (6,7,8,9,10,352) AND
start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, FROM_UNIXTIME(start_ts, '%m')

SQLite

SELECT metadata_id, STRFTIME('%m', start_ts, 'unixepoch') AS month_num, MAX(sum) - MIN(sum) AS month_sum FROM statistics
WHERE
metadata_id IN (6,7,8,9,10,352) AND
start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, STRFTIME('%m', start_ts, 'unixepoch')

We would be more than happy to offload all the work being done in Python to the database.

We would need a query that works with SQLAlchemy so we don’t have to maintain a whole lot of branching. The challenge is getting that working across all the databases we support: SQLite, MariaDB, MySQL, and PostgreSQL.
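
For illustration, the dialect-neutral part of that reduction might look like this in SQLAlchemy Core (table definition abbreviated to the columns used; the month expression is the dialect-specific piece and is passed in):

# Sketch: the portable part of the monthly reduction in SQLAlchemy Core.
from sqlalchemy import Column, Float, Integer, MetaData, Table, func, select

metadata_obj = MetaData()
statistics = Table(
    "statistics",
    metadata_obj,
    Column("metadata_id", Integer),
    Column("start_ts", Float),
    Column("sum", Float),
)

def monthly_sums(month_expr, ids, start_ts, end_ts):
    """MAX(sum) - MIN(sum) per sensor per month, mirroring the raw SQL above."""
    return (
        select(
            statistics.c.metadata_id,
            month_expr.label("month_num"),
            (func.max(statistics.c.sum) - func.min(statistics.c.sum)).label("month_sum"),
        )
        .where(statistics.c.metadata_id.in_(ids))
        .where(statistics.c.start_ts >= start_ts)
        .where(statistics.c.start_ts < end_ts)
        .group_by(statistics.c.metadata_id, month_expr)
    )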

If that’s something you are interested in building, the code that reduces the statistics could be replaced here: core/statistics.py at e36fd5f222887e78eb5df983291e8c30ccb28f64 · home-assistant/core · GitHub

I will have a look and see what I can do.
Unfortunately I am not a Python developer (more of a C#/Swift developer), so it could be a steep learning curve 🙂

I think that the performance impact for me has to do with the number of records that are returned and need to be processed by Home Assistant (Python).

For the above 6 energy sensors, the default query returns around 51,000 records (which matches roughly 6 sensors × 8,760 hourly rows per year). I have 19 energy sensors in total, so that really adds up in terms of the number of records that have to be processed.

Reducing data in pure Python is always going to be slow.

Building a way to offload this to the database would be a nice win, since it would reduce the amount of data coming back to be processed by orders of magnitude.

On a side note, it’s nice that we are discussing 6 seconds being too slow here, since last year similar queries were taking more than 5 minutes 😉

You are totally right. The performance optimization work done so far has been tremendous. Hopefully we can get this part optimized as well, because the Energy dashboard is being used more and more.

I was able to cut nearly 50% of the run time off the Python code in Improve performance of sums in the energy dashboard by bdraco · Pull Request #91342 · home-assistant/core · GitHub


Question (not a Python dev here, mind you): is it possible to integrate a native (compiled) code block, with well-established in/out data interfaces and no side effects, into Python code? Data-reduction operations like this, or other heavy data processing, are trivial in a language like C and would be orders of magnitude faster. It would also open up the possibility of using SIMD intrinsics (for the most commonly supported architectures).

Nice work! What is the exact optimization?

For the other route, offloading the work to the database, I did some more investigation:

I think the easiest way to work around these database-specific functions is to introduce extra columns which hold the month/day/hour values and are filled in by Python during the INSERT (see the sketch after the query below). Those columns can then be used in the GROUP BY clause of the retrieval query.


I did some tests, and the query would then look something like the one below.

SELECT metadata_id, start_ts_month AS month_num, MAX(sum) - MIN(sum) AS month_sum FROM statistics
WHERE
metadata_id IN (6,7,8,9,10,352) AND
start_ts >= 1640991600 AND start_ts < 1672527600
GROUP BY metadata_id, start_ts_month

To get the best performance, the column should be included in the index.
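
As a sketch of the insert side (start_ts_month is the hypothetical extra column from the query above):

# Sketch: fill the extra month column in Python when the statistics row is written,
# so the GROUP BY needs no database-specific date function.
# Note this evaluates in UTC; the timezone question comes up below.
from datetime import datetime, timezone

def month_for(start_ts: float) -> int:
    """Month number (1-12) for an epoch timestamp, evaluated in UTC."""
    return datetime.fromtimestamp(start_ts, tz=timezone.utc).month

row = {
    "metadata_id": 6,
    "start_ts": 1640991600.0,
    "start_ts_month": month_for(1640991600.0),  # stored alongside the row
}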

I haven’t had time yet to dive into SQLAlchemy and what is supported there; that’s on my todo list 🙂. This at least provides a way to work around the use of database-specific functions.

A quick Google search tells me that it is possible, but you would sacrifice platform/hardware independence, which is something you don’t want.

Checking online, I found a good example explaining the use of date functions in SQLAlchemy. My conclusion is that it is handled per database: you can call the needed function directly within the SQLAlchemy statement. The downside is that you have to manage the different database dialects yourself.

Is this something that is allowed within the Home Assistant architecture?

That’s a non-issue. This code would be compiled for the platform during installation, and binaries could be provided for the most common platforms (ARMv7 and x86). This is already done in HA for its Rust dependencies.

As for SIMD (if used), two specific code paths for ARMv7/v8 and x86/64 could be added, which probably covers 99% of the real-world use cases. Anything else would just fall back to non-SIMD code (which would still be dramatically faster than its Python equivalent).
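
For illustration, the ctypes route could look roughly like this; libreduce.so and reduce_monthly are hypothetical, not existing Home Assistant code:

# Sketch: calling a hypothetical compiled reduction routine from Python via ctypes.
import ctypes

lib = ctypes.CDLL("./libreduce.so")  # hypothetical C library, compiled per platform
lib.reduce_monthly.argtypes = [
    ctypes.POINTER(ctypes.c_double),  # input sums, ordered by (metadata_id, start_ts)
    ctypes.c_size_t,                  # number of input samples
    ctypes.POINTER(ctypes.c_double),  # output buffer, one value per month
]
lib.reduce_monthly.restype = None

samples = (ctypes.c_double * 3)(1.0, 2.5, 4.0)
out = (ctypes.c_double * 12)()
lib.reduce_monthly(samples, len(samples), out)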

Adding new string columns like that would significantly increase the size of the database.

Another problem would be daylight saving time, and the user also has the ability to change the timezone of the instance, which would make all the stored values incorrect.

You are right that the extra columns would increase the database size, but only slightly, because they are integer columns and not strings. The values of the columns are calculated from the start_ts timestamp, so they should reflect the correct value.

Like I said, it would be a workaround to avoid the use of database-specific functions. My preference would be to use the database-specific functions instead of a workaround. After digging into the database differences, the functions/syntax below should be used in SQLAlchemy per database (see the sketch after the list):

  • SQLite
    STRFTIME('%m', start_ts, 'unixepoch')
  • MariaDB
    FROM_UNIXTIME(start_ts, '%m')
  • MySQL
    FROM_UNIXTIME(start_ts, '%m')
  • PostgreSQL (not tested)
    EXTRACT(MONTH FROM to_timestamp(start_ts))
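
In SQLAlchemy, that per-dialect branching could be sketched like this (dialect names as SQLAlchemy reports them; func.* renders the named SQL function verbatim, so these mirror the statements listed above):

# Sketch: pick the month-extraction expression per dialect.
from sqlalchemy import extract, func

def month_expr(dialect_name, start_ts_col):
    if dialect_name == "sqlite":
        return func.strftime("%m", start_ts_col, "unixepoch")
    if dialect_name in ("mysql", "mariadb"):  # older SQLAlchemy reports MariaDB as mysql
        return func.from_unixtime(start_ts_col, "%m")
    if dialect_name == "postgresql":
        return extract("month", func.to_timestamp(start_ts_col))
    raise ValueError(f"unsupported dialect: {dialect_name}")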

You would also need to wrap the functions to convert to the time zone that Home Assistant is configured with.

I think you might be underestimating the cost of storing varchars; I expect it would be around a 30% increase in the size of the table.

You can do that in multiple ways.

  • Using the timezone of the connection (default behaviour)
  • Setting the timezone specifically for a session

SET time_zone = 'America/New_York';

  • Explicitly converting to a timezone in the query if the stored timestamp is in a specific timezone

CONVERT_TZ(datetime, from_tz, to_tz)
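
Applied to the month extraction, the explicit conversion could be sketched like this for MariaDB/MySQL (assuming start_ts is stored as UTC epoch seconds, with 'America/New_York' standing in for the configured timezone):

# Sketch: extract the month in the instance's timezone rather than UTC
# (MariaDB/MySQL; start_ts assumed to be UTC epoch seconds).
from sqlalchemy import func

def local_month_expr(start_ts_col, tz_name="America/New_York"):
    local_dt = func.convert_tz(func.from_unixtime(start_ts_col), "UTC", tz_name)
    return func.date_format(local_dt, "%m")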