SQL to insert missing statistics rows

rrozema · May 9, 2024, 3:28pm

We’ve all been there, a device is temporarily not available and you only notice it after a few days. The missing data is causes gaps in our graphs and it just doesn’t look nice. Not much of a deal to most, but a nuisance to some like me.

If however you’re using custom additions you’ll pretty soon find that they not all deal to well with such gaps in your data: the results can vary, but often you’ll get un-explainable spikes or holes.
examples are:

The problem is these tools don’t seem to correctly retrieve the statistics value when the data point at the start of the period you’ve selected is missing. I’ve tried addressing this with some of the developers, but found it is hard to get through in most cases. And I do understand this: they’ve put in a lot of work already and the problem is only sporadic and highly relying on the data plus input at hand. So I’ve found myself a work around: whenever I find some data missing in my database, I add the missing data points by running a sql statement to insert them as the average of the last item before and after the gap. Since the query to do this is pretty hard to write (I am a professional in this area) I want to share it here with all of you. I tested it with HA’s default database engine sqlite and it works there. The query only uses pretty standard sql, so I guess it will work in some other sql dialects as well, but I can give no guarantees it will work if you’re using another db engine.

with cte as (
	select 
		m.id as metadata_id, 
		ss.start_ts, 
		se.start_ts as end_ts,
		ss.state, 
		se.state as state_end,
		ss.[sum],
		se.[state] - ss.[state] as state_difference,
		se.[sum] as sum_end,
		se.[sum] - ss.[sum] as sum_difference
	from (
			select m.id, m.statistic_id, m.unit_of_measurement
			from statistics_meta m 
			where m.has_sum = 1
				and m.statistic_id = 'sensor.energy_produced'
		) m 
		inner join statistics ss on ss.metadata_id = m.id
		inner join statistics se on se.metadata_id = ss.metadata_id and se.start_ts > ss.start_ts and not exists (
				select * 
				from statistics se2
				where se2.metadata_id = ss.metadata_id
					and se2.start_ts > ss.start_ts
					and se2.start_ts < se.start_ts
			)
),
cteNumbers as (
	select row_number() over (order by (select null)) as n
	from statistics_meta
)
insert into statistics( metadata_id, start_ts, state, [sum])
select
	metadata_id,
	start_ts + n.n * 3600 as start_ts,
	state + n.n * (state_difference / cast(((end_ts - start_ts) / 3600) as float)) as state,
	[sum] + n.n * (sum_difference / cast(((end_ts - start_ts) / 3600) as float)) as [sum]
from cte
	inner join cteNumbers n on (n.n < ((end_ts - start_ts) / 3600))
where 
	start_ts >= strftime('%s', '2024-04-30')
	and start_ts < strftime('%s', '2024-05-02')
	and end_ts - start_ts > 3600
order by 
	metadata_id,
	start_ts, 
	n.n;

To use it, you’ll need to run sqlite3 from a terminal window. Then replace the entity name (‘sensor.energy_produced’ in the above query) by your entity name and replace the date values to give a range. The query will only add rows when there is actually a gap in the range specified, so no harm will be done if you specify an incorrect date range. The range is only needed to make the query perform better, it could theoretically be left off, but you’ll most likely have to wait a very long time.