Will statistics backfill missing data or only fill from now forward?

I had some sensor data corrupted by large erroneous spikes, which I deleted manually from the home-assistant_v2.db states table.
Of course, the short- and long-term statistics were also corrupted by the spikes.

If I just delete the statistics for the relevant metadata_id from now back to before the first spike, will the statistics recorder automatically regenerate all the now-missing timeslots from before the first spike up to the present, or will it only record statistics in real time from now on?

What if I wiped out all the statistics data for the relevant sensor?
(I was hoping not to have to wipe out the entire statistics history, and instead just have it rebuild from just before the first bad data point.)

Regardless, what, if anything, can I do to trigger a historical recalculation of the statistics data for a particular sensor?

LTS (long-term statistics) are not calculated for existing data, only for new data.

So is there any way to fill in and regenerate old statistics (other than generating them manually with some SQL-fu)?

You can try this… Recorder - Spook :ghost: a scary powerful toolbox for Home Assistant.

Thanks, cool reference, but it only seems to allow importing long-term (1-hour) statistics, and you still need to recalculate those yourself :slight_smile:

So, I ended up writing a bash script that, when fed a list of timestamps corresponding to out-of-bounds states for a given sensor, does the following (a rough SQL sketch follows the list):

  1. Deletes the states from the states table
  2. Fills the holes by correcting old_state_id for the next state in the chain
  3. Corrects the min/max/mean for the statistics 1-hour bucket containing each deleted state in the list
  4. Corrects the min/max/mean for the statistics_short_term 5-min bucket containing each deleted state in the list

(Of course, sensors with ‘sum’ instead of ‘mean’ would need to be treated differently, but that is even easier to do.)
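
For the curious, the core of steps 1 and 2 looks roughly like this for a single bad timestamp. This is a sketch only, assuming the post-2023.4 recorder schema with epoch-float timestamp columns; `:meta` and `:bad_ts` are placeholders for the sensor’s states_meta id and the bad row’s last_updated_ts, and you should back up home-assistant_v2.db before trying anything like it:

```sql
-- Step 2 first (while the bad row still exists): re-link the chain so the
-- next state points at the bad row's predecessor
UPDATE states
SET old_state_id = (SELECT old_state_id FROM states
                    WHERE metadata_id = :meta AND last_updated_ts = :bad_ts)
WHERE old_state_id = (SELECT state_id FROM states
                      WHERE metadata_id = :meta AND last_updated_ts = :bad_ts);

-- Step 1: delete the bad state row itself
DELETE FROM states
WHERE metadata_id = :meta AND last_updated_ts = :bad_ts;

-- Steps 3 and 4 are corresponding UPDATEs against the statistics and
-- statistics_short_term buckets containing :bad_ts
```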

Works beautifully and my history charts are all clean again :slight_smile:

Not sure if I follow your thought.
In order to have statistics, you need to feed data somewhere and keep it there so it becomes statistics. If you have statistics values outside of HA, then you can import them. There is nothing ‘calculating’ them retroactively.
What you are doing is not LTS (or does not seem to be), but you are happy with it, and that is what counts.

I think many could use your SQL-fu, so it would be great if you could post it with a disclaimer that it is provided as-is and there is no guarantee it will not mess it all up. :wink:

What I mean is the following.
My use case is that I want to remove an outlier data point that distorts the statistics history.
For example, the rtl_433 MQTT integration sometimes gives a spurious reading that can be 100x the actual value. This one bad data point can severely skew average statistics.

So I would like to delete the bad data point.
But then I need to correct the statistics.
All the work is in recalculating the statistics for the affected interval after removing the bad data point.
It doesn’t help me much to have to import a whole file when all I really need to do is update the min/max/mean for the single corresponding entries in the statistics and statistics_short_term tables.
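
For reference, the single-bucket correction boils down to one UPDATE per table, roughly like the sketch below (again assuming the current epoch-based schema; `:meta_states`, `:meta_stats`, and `:bucket_start` are placeholders for the sensor’s states_meta id, its statistics_meta id, and the epoch start of the hour to fix). One caveat: the recorder computes a time-weighted mean, so a plain AVG() over the raw states is only an approximation:

```sql
UPDATE statistics
SET min = (SELECT MIN(CAST(state AS REAL)) FROM states
           WHERE metadata_id = :meta_states
             AND state NOT IN ('unknown', 'unavailable')
             AND last_updated_ts >= :bucket_start
             AND last_updated_ts < :bucket_start + 3600),
    max = (SELECT MAX(CAST(state AS REAL)) FROM states
           WHERE metadata_id = :meta_states
             AND state NOT IN ('unknown', 'unavailable')
             AND last_updated_ts >= :bucket_start
             AND last_updated_ts < :bucket_start + 3600),
    mean = (SELECT AVG(CAST(state AS REAL)) FROM states
            WHERE metadata_id = :meta_states
              AND state NOT IN ('unknown', 'unavailable')
              AND last_updated_ts >= :bucket_start
              AND last_updated_ts < :bucket_start + 3600)
WHERE metadata_id = :meta_stats
  AND start_ts = :bucket_start;
-- The statistics_short_term fix is the same with a 300-second window.
```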

Just don’t record the outlier in the first place: https://www.home-assistant.io/integrations/filter/#outlier
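
For example, per the docs the outlier filter discards values too far from the median of the recent window, so something along these lines should drop single-sample spikes (sensor names below are made up; window_size and radius tune how aggressive the rejection is):

```yaml
sensor:
  - platform: filter
    name: "temperature filtered"           # illustrative name
    entity_id: sensor.some_temperature     # illustrative source sensor
    filters:
      - filter: outlier
        window_size: 4
        radius: 10.0                       # in the source sensor's units
```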

Cool to know about those filters… Thanks for sharing!!!

Definitely could be good to get rid of true outliers, like when battery % or humidity is higher than 100, or temperatures are out of reasonable ranges, etc.

Indeed, I posted a more sophisticated version of amr2mqtt that, for “consumption” variables, checks whether the current value differs from the previous one by too much absolutely and/or relatively (percent), since rtlamr was occasionally picking up electricity or gas total consumption at 100x the previous value, or at zero.
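
The check amounts to something like this (a hypothetical sketch of the idea, not the actual code from my fork; the thresholds are made-up defaults you would tune per meter):

```python
def is_plausible(new: float, prev: float,
                 max_abs_jump: float = 100.0,  # max allowed absolute jump, in meter units
                 max_rel_jump: float = 2.0) -> bool:
    """Reject a consumption reading that jumps too far from the last accepted one."""
    if new == 0:
        # a zero total on a cumulative meter is almost certainly a bad decode
        return False
    if abs(new - prev) > max_abs_jump:
        return False
    return new <= prev * max_rel_jump
```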

But in my signal processing experience, it’s not always easy to know in advance what is an outlier, or what belongs in the high-pass vs. the low-pass band, versus what is normal (or even important abnormal) signal variation.

Indeed, sometimes it’s a PITA to properly specify outliers.
For example, my freezer has an auto-defrost cycle. Usually the freezer is around 0 degF, but it spikes to about 26-30 during a defrost, with a morphology and timing that I can recognize.
But it’s not trivial to write a filter that treats a defrost cycle at 26 as normal while flagging the 33.8 reading I recently got with the latest release of OpenMQTTGateway as spurious, especially since it’s hard to specify all possible scenarios prospectively.

Said another way: if your filtering is too loose, you still get annoying outliers; if it is too tight, you might miss valid signals signifying important real underlying events.

So, I tend to be conservative on my filtering at the expense of getting occasional annoying outliers that the OCD side of me then wants to correct manually.

There’s another annoying issue too: if you receive an outlier as the first value after a restart, it becomes the new “normal” and all actual valid readings get rejected.

I made a similar mistake in setting the outlier bounds too tight on my water meter… I was wondering why it showed no water consumption… and was worrying that maybe my rtl_sdr hardware had failed or the software had crashed…

Then I tested it manually and it all worked… and I remembered that I had set an outlier bound that was obviously exceeded, so each subsequent reading ended up further and further from the last valid reading, making those outliers too…
So I ended up relaxing my outlier bound to cover the worst possible case and beyond :slight_smile:

For a few points (or a single one), that would usually be done via the Dev Tools > Statistics tab, but (!!) this has not been working for a while: updating the value changes the ‘sum’ but does not change the ‘state’ (both in the statistics table), leading to a mismatch in the table. I already raised a ticket in core.