Custom integration to import long-term statistics from a file like CSV or TSV

I managed to import several years of energy usage data. I was struggling a little to get the transition between the imported data and the previously logged data to align properly with the sum value. I had to look up the value in the database and use Excel to calculate new historical values.

My mistake! The data being deleted again was caused by an incorrect file format.
I’ve already imported a few things and the integration works perfectly.
Thanks again for the great work Klaus!

This is cool! I am going to try this for my solar data. Will report back here!

Can you share more information regarding the incorrect file format? That it causes the import not to work is clear to me, but how can the data be imported first and then deleted a few minutes later?

In any case, I would like to add some checks (e.g. valid timestamps, valid numbers), and what you ran into could also be a good check.

You are welcome, good to read that others can use this integration!

Great!

Yes, min/mean/max is much simpler than sum/state. I had the same issue. Any ideas for improvements here are welcome.

Probably you already found out, but here is my understanding:

A sum/state at 12:00 in the CSV is the value of the counter at 12:00, let's say 4.
A sum/state at 13:00 in the CSV is the value of the counter at 13:00, let's say 10.

The value for the hour between 12 and 13 is then the difference between the two values above → 6.
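That difference logic can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of the integration):

```python
# Hypothetical helper for illustration: derive hourly usage from
# cumulative counter readings, as described above.
def hourly_usage(readings):
    """readings: list of (hour, counter_value), sorted by hour."""
    return [
        (curr_hour, curr_val - prev_val)  # usage in the hour ending at curr_hour
        for (_, prev_val), (curr_hour, curr_val) in zip(readings, readings[1:])
    ]

print(hourly_usage([("12:00", 4), ("13:00", 10)]))  # [('13:00', 6)]
```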

I don’t know of any easy solutions to this. But the service could have a parameter with an offset (positive or negative) that is added to every sum read from the file. In that case I would be able to skip the recalculation of the sum value in Excel, but I still have to find out what offset value to use.

But I had to use Excel anyway for calculating state values, so modifying the sum value was only a small part of a larger process. An offset value as a parameter is not important, I think.

Very simple: I used “sum/state” for a table that was “min/mean/max”, because the “max” column was missing in the restored data. Some process then apparently runs a plausibility check and deletes the data again. I'm now also noticing problems on individual days; as they say, “sum/state” is complicated. There are values where the “sum” value doesn't fit and I suddenly have -2000 kWh consumption in one hour. I haven't figured that out yet. Somehow the difference doesn't match any values, e.g.:

sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 05:00	132.407	40.741
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 06:00	132.418	40.752
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 07:00	132.737	41.071
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 08:00	133.165	41.498
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 09:00	133.61	41.944

I still have no idea how this happens.
This is the error with the -2167.99 kWh:

sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 18:00	2259.646	2167.98
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 19:00	2259.659	2167.993
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 20:00	2259.659	0.0
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 21:00	2259.659	0.0

It's clear how the value comes about, but I still have to research how to fix it.

Addendum:
I don't know if I understand this correctly, but I'll write it down. There are two variants. First, values that keep increasing, such as PV production since installation; “sum” is then always the difference from the first “state”. But there are also values that regularly go back to 0, daily PV production for example; there, “sum” is the sum of all increases. So it's actually always the sum of all increases. What I don't understand yet: when do I start with 0? If I import values now, there are already current values; if the “sum” of my imported values is already high, what happens if the current value is lower?
Has anyone come further than me?
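My reading of the semantics above, as a hedged sketch (a hypothetical helper, not the actual recorder code): “sum” accumulates only the increases of “state”, so a counter reset to 0 simply contributes nothing:

```python
# Sketch of "sum is the sum of all increases", including counters that
# reset to 0 (e.g. daily PV production). Not the actual recorder code.
def running_sum(states):
    total, prev, sums = 0.0, None, []
    for state in states:
        if prev is not None and state > prev:
            total += state - prev  # only increases count
        # a drop (reset to 0) adds nothing; counting resumes from the new state
        prev = state
        sums.append(total)
    return sums

print(running_sum([0, 5, 12, 0, 4]))  # [0.0, 5.0, 12.0, 12.0, 16.0]
```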

At least I found that you can repair the values in Developer Tools → Statistics. I'll take a look at what the results look like in the database starting this evening.

I'll continue writing now; maybe it will help someone who also has partially corrupt data.
I've now played with correcting the values in the Developer Tools statistics:
If I correct a value like the “2259.659 / 0.0” row above, the entire “sum” column is recalculated up to the current value and all values after that are OK. This is actually a quick and practical method.

Hi @chofstaetter ,

I did not really understand much of this, but what I did for importing counters is use a negative sum value. This worked perfectly.

So, let's say the already existing HA data for PV production starts at 01.01.2024 00:00, with state = 1000 and sum = 0 (from what I understood, state is the value of the sensor, and sum will always start with 0 at the beginning).

When I have a daily PV production of 10, the values at 31.12.2023 are state = 990 and sum = -10, and so on. The difference between two timestamps should always be identical for state and sum. This is totally OK in your first table.
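The backwards calculation can be sketched like this (an illustrative helper under the assumptions above, not part of the integration):

```python
# Illustrative helper: given the first row already in HA (state = 1000,
# sum = 0) and the daily production for the days *before* it (newest first),
# compute the (state, sum) pairs to backfill with negative sums.
def backfill_rows(anchor_state, daily_production_newest_first):
    state, total = anchor_state, 0.0
    rows = []
    for production in daily_production_newest_first:
        state -= production  # counter was lower by one day's production
        total -= production  # sum goes negative before the anchor point
        rows.append((state, total))
    return rows

# Two days before 01.01.2024, 10 kWh/day each:
print(backfill_rows(1000, [10, 10]))  # [(990, -10.0), (980, -20.0)]
```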

I did not understand how your error happens. What did you import, and what was the result? If you imported your second table on Feb 11th, then it is clear that you get -2167.99 between 19:00 and 20:00. Or was this a wrong export of your “old” database that you want to correct? If so, you should also be able to recalculate the sum column in e.g. Excel and import it afterwards. But good to know that repairing values works this way.

Not sure if that helped :wink:

Maybe I find some time to explain the counter import with more details in the readme.

First of all: Awesome effort! Thanks.

I'm wondering if it's mandatory to include all hours. I have exports from my energy supplier from years back that I would like to import, but I only have values per day. Can I just leave the other values empty? Or do I need to specify all hours for every day and only apply the growth at midnight, for example?

You should enter the values every hour, otherwise you will only get isolated lines with no connection between them. At least that's how it is for me when data is missing in between.

For me, the reason is that I had a defective SQLite database twice and some of the data was corrupted, hence these strange “sum” values. At the time of the import I did not yet understand the logic of the values, and I have simply recorded my learning process here. In my case, correcting the values after the import is probably better, since I already had last week's data in the database and had to adjust the values anyway; otherwise there would be large negative jumps at the changeover from old to new.
Sorry if it looked like a question, I just wanted to contribute my findings on the topic; unfortunately I can't program.

That would not be a huge issue for me. I don’t have the data per hour, so there is also no need to be able to see it per hour. Seeing it per day is enough.
What would you suggest? Just increase once per day and keep all hours equal?

It's a matter of taste whether you want to see a staircase or a ramp. But if the original data is daily, an increase at midnight would be the most honest representation.

I also had only daily data and added hourly values with only one change per day (the rest of the hourly values are identical). But you can also start with the other approach; if it's fine for you, it's fine. If not, do another import with the changed approach.
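The expansion from daily to hourly rows can be sketched like this (a hypothetical helper; the timestamp format is only an example):

```python
from datetime import datetime, timedelta

# Hypothetical helper: expand daily cumulative readings into 24 hourly rows
# per day, so the value steps once at midnight and stays flat otherwise.
def daily_to_hourly(daily):
    """daily: list of ('DD.MM.YYYY', cumulative_value), ascending by date."""
    rows = []
    for date_str, value in daily:
        midnight = datetime.strptime(date_str, "%d.%m.%Y")
        for hour in range(24):
            ts = (midnight + timedelta(hours=hour)).strftime("%d.%m.%Y %H:%M")
            rows.append((ts, value))
    return rows

rows = daily_to_hourly([("01.06.2020", 100.0), ("02.06.2020", 112.5)])
# 48 rows; the value steps from 100.0 to 112.5 at "02.06.2020 00:00"
```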

Unfortunately I have to correct myself: my statement that correcting an early “sum” value fixes all values afterwards is not true. It worked once, but I think there was just a single error in that case. Today I had sensors with several errors, so each one needs to be corrected individually. It is better to correct the data before the import, or with a routine during the import.

So my situation is a bit similar to what is described above. I have 3 years of energy/gas data in HA, and I have the remaining values in an Excel sheet. So obviously the sum starts at 0 when the sensor was initialised in HA 3 years ago. I have created a sheet where I manually calculated a couple of weeks before those 3 years of data, so the SUM gets negative, similar to the example TSV provided. But the problem is that the oldest value then has a very negative SUM, and therefore the first value in the energy dashboard is also negative.

This is part of the data:

Any idea how we could avoid this?

One solution I was thinking about is to extract all data for the specific sensor from the statistics table and remove its entries from the database, then merge the HA data and the offline data in a new table and recalculate all the SUM values, and then use your script to re-import everything. I guess it should work, but I'm not sure what this does to the costs that are also maintained somewhere. But I guess it should work, as the costs seem to be in another sensor that is not connected directly to the statistics.

Maybe a workaround: just import e.g. one more year without changing the values, so that you have zero every hour. Then the negative value is so far in the past that it does not hurt anymore, because you never look at it?


I'm currently working on recalculating the sum column with an SQL procedure in pgAdmin4. I have a script that does this for PostgreSQL, but I haven't tested it in the production environment yet. If I get around to it at the weekend and it works, I'll be happy to share it here. But it is made specifically for pgAdmin and will not work with SQLite without adjustments.


Procedure for correcting the “sum” column in the “statistics” table

The procedure is ready and tested. It corrects the entire “sum” column of the “statistics” table for the specified entity. It could also be used for the “statistics_short_term” table, but you would need to replace all instances of “statistics” with “statistics_short_term”. However, I simply deleted this table because it will be rebuilt anyway.

Here is the procedure:

CREATE OR REPLACE FUNCTION fix_statistics_sum(entity_id VARCHAR(255))
RETURNS VOID
AS $$
DECLARE
    curr_id INT;
    prev_state FLOAT := 0;
    curr_state FLOAT;
    curr_sum FLOAT;
    sum_diff FLOAT;
    changed_rows INT := 0;
    first_value FLOAT;
    last_value FLOAT;
    curr_row RECORD;
BEGIN
    -- Fetch the first stored sum and state (the first sum is kept as the anchor;
    -- initialising prev_state from the first state keeps the first row unchanged)
    SELECT sum, state INTO curr_sum, prev_state
    FROM statistics
    WHERE metadata_id = (SELECT id FROM statistics_meta WHERE statistic_id = entity_id)
    ORDER BY start_ts
    LIMIT 1;

    -- Walk through the rows one by one in chronological order
    FOR curr_row IN 
        SELECT id, state
        FROM statistics
        WHERE metadata_id = (SELECT id FROM statistics_meta WHERE statistic_id = entity_id)
        ORDER BY start_ts
    LOOP
        curr_id := curr_row.id;
        curr_state := curr_row.state;

        -- Calculate the difference to the previous state
        sum_diff := curr_state - prev_state;

        -- Only update the sum if the state is greater than or equal to the previous value
        IF sum_diff >= 0 THEN
            -- Increase the sum by the difference
            curr_sum := curr_sum + sum_diff;
            changed_rows := changed_rows + 1;
        END IF;

        -- Set the sum for the current row
        UPDATE statistics
        SET sum = curr_sum
        WHERE id = curr_id;

        -- Remember the current state for the next iteration
        prev_state := curr_state;

        -- Store the first and last value
        IF changed_rows = 1 THEN
            first_value := curr_sum;
        END IF;
        last_value := curr_sum;
    END LOOP;

    -- Report the number of changed rows and the first and last values
    RAISE NOTICE 'Number of changed rows: %', changed_rows;
    RAISE NOTICE 'First value: %, Last value: %', first_value, last_value;
END;
$$ LANGUAGE plpgsql;

Start command:
SELECT fix_statistics_sum('sensor.placeyourentityhere');

The procedure is written for PostgreSQL and tested with pgAdmin4. It will not work with SQLite without modifications.

What to do:

  1. Open the database in pgAdmin.
  2. Open the “statistics” table with View/Edit Data.
  3. Copy the procedure into the “Query” window and execute it with “Execute Script”.
  4. The procedure can now be found under “Functions”.
  5. Delete the contents of Query and insert the start command(s), of course with the desired sensor.
  6. Stop Home Assistant now at the latest, otherwise there will be a statistics error at the time of insertion.
  7. Execute the procedure with “Execute Script”, multiple lines can also be executed at once.
  8. Under “Messages” it is written out how many records were processed and the first and last “sum” value.
  9. Optionally, delete the “statistics_short_term” table.
  10. Restart HA.

Notes:

  • The procedure must be executed for each entity separately.
  • The start command must be adapted for each sensor.
  • The procedure can be used to correct errors in the “sum” column.
  • The procedure can also be used to update the “sum” column after adding new data.
    After one hour, I had huge jumps in the consumption data at the point where the imported data met the new data. I was not able to adjust the procedure so that it calculates the sum column backwards from the current value. However, I was able to easily fix this using the statistics correction in the developer tools.

I hope this is helpful!
