Will statistics backfill missing data or only fill from now forward?

I had some sensor data corrupted by large erroneous spikes, which I deleted manually from the home-assistant_v2.db states table.
Of course, the short- and long-term statistics were also corrupted by the spikes.

If I just delete the statistics for the relevant metadata_id from now back to before the first spike, will the statistics recorder automatically regenerate all the now-missing timeslots from before the first spike up to the present, or will it only record statistics in real time from now on?

What if I wiped out all the statistics data for the relevant sensor?
(I was hoping not to have to wipe out the entire statistics history, and instead just have it rebuild from just before the first bad data point.)

Regardless, what, if anything, can I do to trigger a historical recalculation of the statistics data for a particular sensor?

LTS (long-term statistics) are not calculated for existing data, only for new data.

So is there any way to fill in and regenerate old statistics (other than generating them manually with some SQL-fu)?

You can try this… Recorder - Spook :ghost: a scary powerful toolbox for Home Assistant.

Thanks, cool reference, but it only seems to allow importing long-term (1-hour) statistics, and you still need to recalculate those yourself :slight_smile:

So, I ended up writing a bash script that, when fed a list of timestamps corresponding to out-of-bounds states for a given sensor, does the following (a rough SQL sketch follows the list):

  1. Deletes the states from the states table
  2. Fills the holes by correcting old_state_id for the next state in the chain
  3. Corrects the min/max/mean for the statistics 1-hour bucket containing each deleted state in the list
  4. Corrects the min/max/mean for the statistics_short_term 5-min bucket containing each deleted state in the list

(Of course, sensors with ‘sum’ instead of ‘mean’ would need to be treated differently, but that is even easier to do.)
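
For the curious, the core of steps 1 and 2 looks roughly like this for a single bad timestamp. This is a sketch only, assuming the post-2023.4 recorder schema with epoch-float timestamp columns; `:meta` and `:bad_ts` are placeholders for the sensor’s states_meta id and the bad row’s last_updated_ts, and you should back up home-assistant_v2.db before trying anything like it:

```sql
-- Step 2 first (while the bad row still exists): re-link the chain so the
-- next state points at the bad row's predecessor
UPDATE states
SET old_state_id = (SELECT old_state_id FROM states
                    WHERE metadata_id = :meta AND last_updated_ts = :bad_ts)
WHERE old_state_id = (SELECT state_id FROM states
                      WHERE metadata_id = :meta AND last_updated_ts = :bad_ts);

-- Step 1: delete the bad state row itself
DELETE FROM states
WHERE metadata_id = :meta AND last_updated_ts = :bad_ts;

-- Steps 3 and 4 are corresponding UPDATEs against the statistics and
-- statistics_short_term buckets containing :bad_ts
```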

Works beautifully and my history charts are all clean again :slight_smile:

Not sure if I follow your thought.
In order to have statistics, you need to feed data somewhere and keep it there so it becomes statistics. If you have statistics values outside of HA, then you can import them. There is nothing ‘calculating’ them retroactively.
What you are doing is not LTS (or does not seem to be), but you are happy with it, and that is what counts.

I think many could use your SQL-fu, so it would be great if you could post it with a disclaimer that it is provided as-is and there is no guarantee it will not mess it all up. :wink:

What I mean is the following.
My use case is that I want to remove an outlier data point that distorts the statistics history.
For example, the rtl_433 MQTT integration sometimes gives a spurious reading that can be 100x the actual value. This one bad data point can severely skew average statistics.

So I would like to delete the bad data point.
But then I need to correct the statistics.
All the work is in recalculating the statistics for the affected interval after removing the bad data point.
It doesn’t help me much to have to import a whole file when all I really need to do is update the min/max/mean for the single corresponding entries in the statistics and statistics_short_term tables.
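
For reference, the single-bucket correction boils down to one UPDATE per table, roughly like the sketch below (again assuming the current epoch-based schema; `:meta_states`, `:meta_stats`, and `:bucket_start` are placeholders for the sensor’s states_meta id, its statistics_meta id, and the epoch start of the hour to fix). One caveat: the recorder computes a time-weighted mean, so a plain AVG() over the raw states is only an approximation:

```sql
UPDATE statistics
SET min = (SELECT MIN(CAST(state AS REAL)) FROM states
           WHERE metadata_id = :meta_states
             AND state NOT IN ('unknown', 'unavailable')
             AND last_updated_ts >= :bucket_start
             AND last_updated_ts < :bucket_start + 3600),
    max = (SELECT MAX(CAST(state AS REAL)) FROM states
           WHERE metadata_id = :meta_states
             AND state NOT IN ('unknown', 'unavailable')
             AND last_updated_ts >= :bucket_start
             AND last_updated_ts < :bucket_start + 3600),
    mean = (SELECT AVG(CAST(state AS REAL)) FROM states
            WHERE metadata_id = :meta_states
              AND state NOT IN ('unknown', 'unavailable')
              AND last_updated_ts >= :bucket_start
              AND last_updated_ts < :bucket_start + 3600)
WHERE metadata_id = :meta_stats
  AND start_ts = :bucket_start;
-- The statistics_short_term fix is the same with a 300-second window.
```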

Just don’t record the outlier in the first place: https://www.home-assistant.io/integrations/filter/#outlier
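
For example, per the docs the outlier filter discards values too far from the median of the recent window, so something along these lines should drop single-sample spikes (sensor names below are made up; window_size and radius tune how aggressive the rejection is):

```yaml
sensor:
  - platform: filter
    name: "temperature filtered"           # illustrative name
    entity_id: sensor.some_temperature     # illustrative source sensor
    filters:
      - filter: outlier
        window_size: 4
        radius: 10.0                       # in the source sensor's units
```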

Cool to know about those filters… Thanks for sharing!!!

Definitely could be good to get rid of true outliers, like when battery % or humidity is higher than 100, or temperatures are out of reasonable ranges, etc.

Indeed, I posted a more sophisticated version of amr2mqtt that, for “consumption” variables, checks whether the current value differs from the previous one by too much absolutely and/or relatively (percent), since rtlamr was occasionally picking up electricity or gas total consumption at 100x the previous value, or at zero.
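
The check amounts to something like this (a hypothetical sketch of the idea, not the actual code from my fork; the thresholds are made-up defaults you would tune per meter):

```python
def is_plausible(new: float, prev: float,
                 max_abs_jump: float = 100.0,  # max allowed absolute jump, in meter units
                 max_rel_jump: float = 2.0) -> bool:
    """Reject a consumption reading that jumps too far from the last accepted one."""
    if new == 0:
        # a zero total on a cumulative meter is almost certainly a bad decode
        return False
    if abs(new - prev) > max_abs_jump:
        return False
    return new <= prev * max_rel_jump
```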

But in my signal processing experience, it’s not always easy to know in advance what is an outlier, or what belongs in the high-pass vs. the low-pass band, versus what is normal (or even important abnormal) signal variation.

Indeed, sometimes it’s a PITA to properly specify outliers.
For example, my freezer has an auto-defrost cycle. Usually the freezer is around 0 degF, but it spikes to about 26-30 during a defrost, with a morphology and timing that I can recognize.
But it’s not trivial to write a filter that treats a defrost cycle at 26 as normal while flagging the 33.8 reading I recently got with the latest release of OpenMQTTGateway as spurious, especially since it’s hard to specify all possible scenarios prospectively.

Said another way: if your filtering is too loose, you still get annoying outliers; if it is too tight, you might miss valid signals signifying important real underlying events.

So, I tend to be conservative on my filtering at the expense of getting occasional annoying outliers that the OCD side of me then wants to correct manually.

There’s another annoying issue too: if you receive an outlier as the first value after a restart, it becomes the new “normal” and all actual valid readings get rejected.

I made a similar mistake in setting the outlier bounds too tight on my water meter… I was wondering why it showed no water consumption… and was worrying that maybe my rtl_sdr hardware had failed or the software had crashed…

Then I tested it manually and it all worked… and I remembered that I had set an outlier bound that was obviously exceeded, so each subsequent reading ended up further and further from the last valid reading, making those outliers too…
So I ended up relaxing my outlier bound to cover the worst possible case and beyond :slight_smile:

For a few points (or a single one), that would usually be done via the Dev Tools > Statistics tab, but (!!) this has not been working for a while: updating the value changes the ‘sum’ but does not change the ‘state’ (both in the statistics table), leading to a mismatch in the table. I already raised a ticket in core.