Custom Integration to import long term statistics from a file like csv or tsv

Very simple , I used “sum/state” for a table “min/mean/max” because the “max” column was missing in the restored data. Some process should then carry out a plausibility check and delete the data again. I’m now also noticing problems on individual days, as they say, the “sum/state” is complicated. There are values ​​where the “sum” value doesn’t fit and I suddenly have -2000 kWh consumption in one hour. I haven’t figured that out yet. Somehow the difference doesn’t match any values:
z.B.:

sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 05:00	132.407	40.741
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 06:00	132.418	40.752
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 07:00	132.737	41.071
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 08:00	133.165	41.498
sensor.pv_i1_ac_energy_kwh	kWh	08.10.2023 09:00	133.61	41.944

I still have no idea how this happens.
This is the error with the -2167.99 kWh:

sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 18:00	2259.646	2167.98
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 19:00	2259.659	2167.993
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 20:00	2259.659	0.0
sensor.pv_i1_ac_energy_kwh	kWh	11.02.2024 21:00	2259.659	0.0

It’s clear how the value comes about, but I still have to research how I fix it.

Addendum:
I don’t know if I understand this correctly, but I’ll write it down. There are 2 variants, values ​​that are getting higher and higher, such as PV production since construction. “Sum” is always the difference from the first “state”. But then there are also values ​​that regularly go to 0, PV daily production for example. Since “sum” is the sum of all increases. So it’s actually always the sum of all increases. What I don’t understand yet, when do I start with 0? If I import a value now, then there are already current values, if the “sum” of my imported values ​​is already high, what happens if the current value is lower?
Has anyone come further than me?

At least I found that you can repair the values ​​in the developer tools - statistics. I’ll take a look at what the results look like in the database starting this evening.

I’ll continue writing now, maybe it will help someone who also has partially corrupt data.
I’ve now played with correcting the values ​​in the developer tools statistics:
If I correct a value like “2259,659 0.0” above, then the entire “sum” column is recalculated up to the current value and all values ​​after that are OK. This is actually a quick and practical method.

Hi @chofstaetter ,

I did not really understand much of this, but what I did for importing counters is, I used a negative sum value. This worked perfectly.

So, lets say the already existing HA data for PV production start at 01.01.2024 00:00, with state = 1000 and sum = 0 (from what I understood, state is the value of the sensor, and sum will always start with 0 at the beginning).

When I have a daily PV production of 10, the values at 31.12.2023 are state = 990 and sum = -10, and so on. The difference between two time stamps should always be identical for state and sum. This is totally OK in your first table.

I did not understand how your error happens. What did you import, and what was the result? If you import your second table at Feb, 11th, then it is clear that you get a -2167.99. between 19:00 and 20:00. Or was this a wrong export of your “old” database you want to correct? If this is true, you should also be able to recalculate the sum column in e.g. excel and import it afterwards. But good to know that repair values works in this way.

Not sure if that helped :wink:

Maybe I find some time to explain the counter import with more details in the readme.

First of all: Awesome effort! Thanks.

I’m wondering if it’s mandatory to include all hours. I have exports from my energy supplier from years back that I would like to import. However, I only have the values per day. Can I just leave the other values empty? Or do I need to specify all hours for every day and only do the growth at midnight for example?

You should enter the values ​​every hour, otherwise you will always have lines and no connection between them. At least that’s how it is for me when data is missing in between.

For me the reason is that I had a defective SQLite twice and some of the data was corrupted, hence these strange “sum” values. At the time of import, I did not yet understand the logic of the values, and I have now simply recorded the understanding process here. In my case, correcting the values ​​is probably better after the import, since I already had last week’s data in the database and had to adjust the values ​​anyway, otherwise there would be large negative jumps when changing from old to new.
Sorry if it looked like it was a question, I just wanted to contribute my findings on the topic, unfortunately I can’t program.

That would not be a huge issue for me. I don’t have the data per hour, so there is also no need to be able to see it per hour. Seeing it per day is enough.
What would you suggest? just increase once per day and keep all hours equal?

It’s a matter of taste whether you want to see a staircase or a ramp. But if the original data is daily, an increase at midnight would be the most honest representation

I also had only daily data, and added hourly values, with only one change per day (the rest of the hourly values are identical). But you can also start with the other approach, if its fine for you, its fine. If not, do another import with the changed approach.

Unfortunately I have to correct myself, my statement that correcting an early “sum” value corrects all values ​​afterwards is not true. It worked once, but I think there was just one mistake. Today I had sensors with several errors, so each one needs to be corrected. It is better to correct the data before or with a routine during the import.

So my situation is a bit similar as described above. I have 3 years of energy/gas data in HA and I have the remaining values in an excel sheet. So obviously the sum starts at 0 when the sensor was initialised in HA 3 years ago. I have created a sheet where I manually calculated a coupe of weeks before the 3 years of data. So the SUM gets negative. Similar to the example TSV provided. But the problem is that the oldest value than has a SUM that negative. And therefor the first value in the energy dashboard is also negative.

This is part of the data:

Any idea how we could avoid this?

One solution I was thinking about is to extract all data from the specific sensor from the statistics table and remove all entries from the database. Then merge the HA data and the offline data in a new table and recalculate all the SUM values. Then use your script to re-import everything again. I guess it should work, but I’m not sure what this does with the costs that are also maintained somewhere. But I guess it should work as the costs seem to be in another sensor that is not connected directly to the statistics.

Maybe a workaround: Just import e.g. one more year without changing the values, so that you have zero every hour. Then the negative value is so much in the past that it does not hurt anymore, because you never look at it?

1 Like

I’m currently working on recalculating the sum column using the SQL procedure, in pgAdmin4. I also have a script that does this for PostgreSQL, but I haven’t tested it in the productive environment yet. If I get around to it at the weekend and it works, I’ll be happy to share it here. But it is made specifically for pgAdmin and will not work with SQLite without adjustments.

2 Likes

Procedure for correcting the “sum” column in the “statistics” table

The procedure is ready and tested. It corrects the entire “sum” column of the “statistics” table for the specified entity. It could also be used for the “statistics_short_term” table, but you would need to replace all instances of “statistics” with “statistics_short_term”. However, I simply deleted this table because it will be rebuilt anyway.

Here is the procedure:

CREATE OR REPLACE FUNCTION fix_statistics_sum(entity_id VARCHAR(255))
RETURNS VOID
AS $$
DECLARE
    curr_id INT;
    prev_state FLOAT := 0;
    curr_state FLOAT;
    curr_sum FLOAT;
    sum_diff FLOAT;
    changed_rows INT := 0;
    first_value FLOAT;
    last_value FLOAT;
    curr_row RECORD;
BEGIN
    -- Hole die erste Summe
    SELECT sum INTO curr_sum
    FROM statistics
    WHERE metadata_id = (SELECT id FROM statistics_meta WHERE statistic_id = entity_id)
    ORDER BY start_ts
    LIMIT 1;

    -- Öffne einen Cursor, um die Daten Zeile für Zeile zu durchlaufen
    FOR curr_row IN 
        SELECT id, state
        FROM statistics
        WHERE metadata_id = (SELECT id FROM statistics_meta WHERE statistic_id = entity_id)
        ORDER BY start_ts
    LOOP
        curr_id := curr_row.id;
        curr_state := curr_row.state;

        -- Berechne die Differenz zum vorherigen Zustand (state)
        sum_diff := curr_state - prev_state;

        -- Aktualisiere die Summe nur, wenn der Zustand (state) größer oder gleich dem vorherigen Wert ist
        IF sum_diff >= 0 THEN
            -- Aktualisiere die Summe um die Differenz
            curr_sum := curr_sum + sum_diff;
            changed_rows := changed_rows + 1;
        END IF;

        -- Setze die Summe für die aktuelle Zeile
        UPDATE statistics
        SET sum = curr_sum
        WHERE id = curr_id;

        -- Setze den vorherigen Zustand für die nächste Iteration
        prev_state := curr_state;

        -- Speichere den ersten und letzten Wert
        IF changed_rows = 1 THEN
            first_value := curr_sum;
        END IF;
        last_value := curr_sum;
    END LOOP;

    -- Gib die Anzahl der geänderten Zeilen und die ersten und letzten Werte aus
    RAISE NOTICE 'Anzahl der geänderten Zeilen: %', changed_rows;
    RAISE NOTICE 'Erster Wert: %, Letzter Wert: %', first_value, last_value;
END;
$$ LANGUAGE plpgsql;

Start command:
SELECT fix_statistics_sum(‘sensor.placeyourentityhere’);

The procedure is written for Postgresql and tested with pgAdmin4. It will not work with Sqlite without modifications.

What to do:

  1. Open the database in pgAdmin.
  2. Open the “statistics” table with View Edit Data.
  3. Copy the procedure into the “Query” window and execute it with “Execute Script”.
  4. The procedure can now be found under “Functions”.
  5. Delete the contents of Query and insert the start command(s), of course with the desired sensor.
  6. At the latest now, exit Home Assistant, otherwise there will be a statistics error at the time of insertion.
  7. Execute the procedure with “Execute Script”, multiple lines can also be executed at once.
  8. Under “Messages” it is written out how many records were processed and the first and last “sum” value.
  9. Optionally, delete the “statistics_short_term” table.
  10. Restart HA.

Notes:

  • The procedure must be executed for each entity separately.
  • The start command must be adapted for each sensor.
  • The procedure can be used to correct errors in the “sum” column.
  • The procedure can also be used to update the “sum” column after adding new data.
    After one hour, I had huge jumps in the consumption data, at the time when the imported data met the new data. I was not able to adjust the procedure so that it calculates the sum column back from the current value. However, I was able to easily fix this using the statistics correction in the developer tools.

I hope this is helpful!
​​
​​

1 Like

Klaus thanks so much for this great integration, I was able to import all my old power usage. I ran into a roadblock for which I wanted to share my solution to hopefully save someone else a bit of head scratching. I was getting an “Invalid timestamp” error during importing. I checked my timestamp formatting, tried different formats, tried changing timezone, etc, all of which were fine. Ultimately I discovered that Home Assistant statistics will only accept timestamps at the top of the hour (eg, 01:00, 02:00, 03:00, etc), but my power company provides data every 15 minutes (eg 01:00, 01:15, 01:30, etc). Home Assistant core recorder statistics will throw an error if the minutes is set to anything but 0.

So all I had to do is modify my data to add up the total for each hour, then it imported perfectly!

1 Like

Great that it worked!

I’ll add the issues here (like the timestamp) to the readme hopefully soon

Hello Klaus,
if my energy data, which i want to import, already contains the timestamp format as date, how can i use it, without converting it to date format ?

Hi @TTr0n
sorry, I do not understand your question. The timestamp must be available, it must be an exact hour (no minutes, no seconds), and it must be formatted in a ways that python/pandas understands it. When you use the timestamp as in the example files, you are on a safe path

Your example shows 07.01.2024 23:00 , my “Timestamp” is 1710601200
Do i have to convert this to 16.03.2024 16:00 ?

Yes, you have to.

Thx for the answer.
For using this, i need hour based data, but what to do, if i have some hours without any data? Do I really have to add every single hour from 0-23?
What about the missing hours? Do I have to fill them in manually?

And second question:
Your integration puts the data into statistics table and there i need a sum value, because its a counter value with state/sum.
Whats the best practice in m case? Delete the last 4 days in statistics table and generate this data new with the data from statistics_short_term combined with the old data from 2023 till 13.03. ? Or does it matter and my old data can also start by zero ?