History accumulates lag over time

Hi folks,

For a few months now, maybe years, I’ve been facing a very frustrating issue: history accumulates lag over time.

After a day, I typically get history graphs lagging a couple hours behind.
By that I mean the latest events in history graphs date from a couple hours ago.
I can easily accumulate 12 hours of lag before I need to restart HA for unrelated reasons, at which point I get a fresh, up-to-date history (and lose the previous “lag time” data forever).

I’ll admit I have a lot of sensors, some of which update very dynamic values every 5 seconds.
The machine I run HAOS on (a mini PC with an N5095 Celeron, 8 GB of RAM and a 250 GB SSD) shows no sign of strain: the CPU is typically 95% idle and RAM 60% free.
The Home Assistant container in Docker reports 5 to 15% CPU and uses 2 GB of RAM.
I run a separate (offsite) PostgreSQL server for the database, on very capable hardware, that is not running into any kind of performance issue.

Any idea what the problem might be?
Or how to debug this further?

Thanks!

I agree: 12 hours is HUUGE! Data loss is definitely unacceptable.
Is the buffering for the remote server the bottleneck? How much traffic passes between them? Is your connection speed (overall - end to end) sufficient?
Anything polling history endlessly? Dashboards with lots of graphs?
Are history queries going back to day zero instead of a fixed back period?
Are any of your add-ons, apps and devices going around in tight loops, not ceding resources and hogging them?
Unnecessary database locking?
Are you using MQTT heavily?

My home internet connection is 10G PON, so about 8G/8G in practice, with very low latency.
The remote database server is an Incus container hosted on a dedicated server with Hetzner; it has a 1G/1G fiber connection with low latency as well.
The connection between the two sites goes through a WireGuard tunnel, and overall latency between the two servers hovers around 46 ms.
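A quick back-of-envelope check of what that 46 ms costs: if every event needed its own synchronous database round trip, the link latency alone would cap the commit rate. This is a worst-case sketch with a made-up sensor count; Home Assistant’s recorder normally batches many events per commit, so real numbers depend on the actual commit pattern.

```python
# Worst-case sketch: one synchronous database round trip per event
# over the 46 ms WireGuard link. SENSORS is a hypothetical figure.

RTT_S = 0.046          # measured round-trip time between the two sites
SENSORS = 200          # hypothetical count of busy sensors
UPDATE_INTERVAL_S = 5  # "every 5 seconds" from the post above

events_per_sec = SENSORS / UPDATE_INTERVAL_S  # 40.0 events/s arriving
max_commits_per_sec = 1 / RTT_S               # ~21.7 round trips/s possible

# Backlog grows whenever events arrive faster than they can be committed:
backlog_growth = events_per_sec - max_commits_per_sec  # ~18.3 events/s
print(f"{events_per_sec:.1f} in, {max_commits_per_sec:.1f} out, "
      f"backlog +{backlog_growth:.1f}/s")
```

If the recorder batches many events per commit (it does by default), the numbers change completely, which is why measuring the actual commit pattern matters far more than raw bandwidth.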

I don’t have anything that continuously polls history, all the graphs I have on my dashboards are only displaying current status data.
Most of the history I consult day to day is clicking on a sensor to get more info, and that usually queries 24h of data.

I don’t think any part of the system is particularly resource intensive. At least, whole-system htop and glances show very low WAIT or INTERRUPT CPU time, so nothing seems to be keeping the system locked.

If the database is being locked in use, that would be by HA anyway, so I’d assume HA could handle it correctly. The database is not even queried outside HA.
The only other thing happening to it is continuous streaming replication of all the databases on the server to two other remote locations, but that has no measurable impact on overall performance, and none on the other applications using other databases on the same server.

And finally, no, I do not use MQTT at all; why? Is MQTT supposed to be very resource intensive?

Thanks for the info.
This database stream replication may actually be counterproductive, causing data loss rather than preventing it. Any reason you are applying this kind of technology to what is a home-based, hobby-grade application?
At what level is replication happening? Transaction level? Database level? Is transaction rollback configured?
What database is Home Assistant using? MariaDB? PostgreSQL?
Why can’t it be local rather than across a network? You could back it up daily, risking at most 24 hours of data, rather than the fairly consistent 12 hours of lag you appear to currently have, and have been putting up with.

Local disk access, including cache, is far faster than anything that can ever be achieved over your link.
Can you run a network throughput test between the local and remote systems? Not published speeds, but actual speeds. 46 ms is not low latency in my books; I get single digits with a far slower connection.
Can you give Docker an extra 2 GB of RAM for headroom and see what happens to Home Assistant? Memory page thrash and fragmentation may also be an issue, and that should reduce it.
If you are collecting that much data from your sensors, MQTT to your remote server may be another option.

I would start there…

I’m on PostgreSQL 17, and the replication is streaming replication, a built-in Postgres mechanism that works at the server level with WAL shipping.
If a replica goes offline, the primary keeps the WAL segments locally until the replica reconnects, at which point it catches up and the primary deletes the WAL segments.
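One way to confirm the replicas really are keeping up is to compare the sent and replayed WAL positions the primary reports in `pg_stat_replication`. As a sketch (the LSN values below are illustrative, not from this setup), the lag in bytes can be computed directly from the `pg_lsn` strings:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL pg_lsn string like '16/B374D848' to a byte offset.

    The format is two hex numbers: the high 32 bits and the low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """How many WAL bytes the replica still has to replay."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

# Illustrative values, as would appear in pg_stat_replication's
# sent_lsn and replay_lsn columns on the primary:
print(replication_lag_bytes("16/B374D848", "16/B3740000"))  # 55368 bytes
```

Newer PostgreSQL versions also expose `write_lag`, `flush_lag` and `replay_lag` columns directly, which answer the same question in time rather than bytes.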
There is no transaction rollback in the plan, I’m more concerned about hardware failure and service interruption than a catastrophically wrong action.
It is extremely unobtrusive and commonly used in enterprise setups that are far more intensive than even the busiest HA setup you could imagine!

On the other hand, daily backups of the database are a far more complex thing to achieve.
I decided to move the database off the HAOS host because of load concerns and, as mentioned, for an easy backup setup, all while centralizing all the databases I use across various applications, some of which work natively with replication (they do the write transactions on the primary but can offload read transactions to the replicas).

Actual speeds (iperf between the servers) are indeed slower, but ample enough: ~370 Mbps TX, ~150 Mbps RX.

Docker does not limit memory usage per container on HAOS apparently:

CONTAINER ID   NAME            CPU %     MEM USAGE / LIMIT    MEM %     NET I/O   BLOCK I/O       PIDS 
07e244830a90   homeassistant   14.69%    1.737GiB / 7.55GiB   23.01%    0B / 0B   413MB / 970MB   101 

Any reason? I mean beyond the fact that it’s not usual?

Well, you are trying to fix some delay in your history data being populated, right?

Then you are using some history method that I’ve never heard of as opposed to the stuff that Home Assistant is tweaked to use.

My first thought is throw that away and use the default, OR, if you know your system and what is happening, fix that.

You are kinda in your own version of things here so you are the support.
I hope you find help here, but not able to help myself.

Ditch WAL replication for your Home Assistant database. You are losing your data, so it is ineffective anyway, as it cannot roll back lost transactions. You are applying enterprise solutions inappropriately to a cut-down, Linux-based, hobby-grade solution.

Your local SSD connects to your local hardware at the usual 6 Gbps, guaranteed, not multiplexed or shared with others. You are crippling your system by effectively forcing data reads and writes across a WAN.

There are tools for database tuning and network performance and tweaking things like OS Buffers to improve throughput. I would suggest that using them on HomeAssistant would be like putting lipstick on a pig - pretty but deep down, still very ugly. You still end up with a grunting mess.

Alternatively, I suspect the WAL is being buffered to memory locally by Docker rather than written to disk, hence no rollback is possible when you pull the rug out from under it. You’re kind of defeating the five-nines enterprise goals by doing this. Carefully consider where your data is at any given time. Is it in memory, at risk of power loss, OS crash, or hardware fault? Is it in limbo somewhere in a network cache? Is it in memory, unwritten, on the remote (shared, hosted, virtual) server? Has the WAL buffer been flushed to disk at the hosted server? Has that WAL replication transaction now been cleared back on your local server? Has a new WAL transaction been created for every transaction to replicate to the other two servers, TWICE? Have they been transmitted, applied, and purged on the other replicas?
Massive overhead. Crippling, actually.

The reason I asked about MQTT is that it is a very light protocol and survives well in a distributed enterprise network environment. Get your sensors to securely send their data directly to the remote host, where the MQTT broker there can dump it into the Home Assistant database, and then replicate it back the other way. Put the load out there and see if the replication traffic is lighter.
MQTT has delivery guarantees too (the Q in MQTT actually stands for queuing), coping with network interruptions to prevent data loss. Losing local hardware or software will not interrupt the local data collection process, as the sensors talk to the remote MQTT broker directly. If the data doesn’t get through, say due to the other variable, network reliability, it gets buffered locally. Robustness objective achieved with less angst.
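To illustrate the buffering idea in miniature (this is a sketch of the behaviour, not real MQTT; client libraries such as paho-mqtt handle this for you at QoS 1 and 2):

```python
from collections import deque

class BufferedPublisher:
    """Minimal sketch of the 'buffer until acknowledged' behaviour an
    MQTT client provides at QoS 1: unacknowledged messages are kept
    locally and retried, in order, once the link comes back."""

    def __init__(self, send):
        self.send = send          # callable(topic, payload) -> bool (acked?)
        self.pending = deque()    # messages not yet acknowledged

    def publish(self, topic, payload):
        self.pending.append((topic, payload))
        self.flush()

    def flush(self):
        # Retry oldest first; stop at the first failure so ordering is kept.
        while self.pending:
            topic, payload = self.pending[0]
            if not self.send(topic, payload):
                return            # network down: keep buffering locally
            self.pending.popleft()

# Simulate an outage: the first two send attempts fail, then the link recovers.
results = iter([False, False, True, True])
delivered = []
def fake_send(topic, payload):
    ok = next(results)
    if ok:
        delivered.append((topic, payload))
    return ok

p = BufferedPublisher(fake_send)
p.publish("home/temp", "21.5")   # send fails, message stays buffered
p.publish("home/temp", "21.6")   # retry of the first fails, both buffered
p.flush()                        # link restored: both delivered, in order
print(delivered)  # [('home/temp', '21.5'), ('home/temp', '21.6')]
```

The real protocol does this with PUBACK (QoS 1) and the PUBREC/PUBREL/PUBCOMP handshake (QoS 2); the point is the same: an outage delays data instead of losing it.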

Alright, MQTT seems like an interesting option, but it will be quite the undertaking to move sensor data over it!
I’ll need to work on architecture before committing to the move.

Do you have resources on MQTT architecture and weighing the differences between brokers? (Or is Mosquitto still the de-facto standard?)