When I opened the interface this morning, it was unavailable for a couple of minutes. But then it suddenly worked and everything was back to normal. The logs are blank.
I don't know how you run HA, but if it's on a virtual machine it might be an option to install MySQL on the host? Then storing 2 years' worth would not be an issue.
What hardware are you running on? How many sensors are you recording? How big is maria’s HA database after 24h?
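If you want to check that size yourself, here's a rough sketch - the schema name `homeassistant` and the client invocation are assumptions, adjust for your install:

```shell
# Hypothetical schema name -- change to whatever your recorder db_url uses.
DB="homeassistant"
QUERY="SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
       FROM information_schema.tables WHERE table_schema = '${DB}';"

# Run it on the DB host, e.g.:  echo "$QUERY" | mysql -u homeassistant -p
echo "$QUERY"
```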
Clearly something is overwhelming the hardware, if the 100% CPU reading is to be trusted. If you can connect to Maria, SHOW PROCESSLIST; or SHOW FULL PROCESSLIST; will show you exactly what commands are being run - but if it's a single-CPU machine that's pegged, connecting may not be possible.
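A quick sketch of doing that from the shell - this assumes the `mysql` client is installed and that your user has the PROCESS privilege (without it you only see your own connections); the username is a placeholder:

```shell
# Dump what MariaDB is doing right now. Hypothetical user -- adjust;
# -p prompts for the password.
PLIST="SHOW FULL PROCESSLIST;"
if command -v mysql >/dev/null 2>&1; then
    mysql -u homeassistant -p -e "$PLIST" </dev/null || echo "connection failed - check credentials/host"
else
    echo "mysql client not installed on this host"
fi
```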
Also, I wouldn't use MariaDB to hold 2 years of data; use Influx instead and select exactly what you want to record. It all works really well with Grafana.
I've been using MariaDB for years and have had none of the issues described above - but I do only hold 3 days' worth (about 1.6 GB rolling).
The only reason I suggested Influx is because it’s actually designed for long term numerical data whereas MariaDB isn’t. If you are going to use MariaDB just make sure you peel back the recorder to only hold what you want because there are some tables in there that will get awfully large awfully quickly.
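As a sketch of what "peeling back the recorder" looks like in `configuration.yaml` - the `db_url` credentials and the excluded entities here are made-up examples, substitute your own:

```yaml
recorder:
  db_url: mysql://hass:password@127.0.0.1/homeassistant?charset=utf8mb4
  purge_keep_days: 3            # roll off anything older than 3 days
  exclude:
    domains:
      - automation              # don't record state history for these
      - media_player
    entity_globs:
      - sensor.*_linkquality    # chatty diagnostics you don't need history for
```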
I don’t know exactly, but that’s the only thing that changed.
Very small, <100 MB. I disabled it for now.
Running on a VM with 2 vCPUs and 2 GB RAM. There is no problem increasing the limits if necessary. The host is 2x Xeon / 128 GB RAM.
It looks more like a software problem to me. I did manage to notice that the VM sat at 100% on one CPU core before it fixed itself.
The problem is that I cannot reproduce this situation. I have tried calling purge manually, but it has no effect.
For reference, I'm running HA in Docker with 2 years of database retention and many hundreds of sensors. I run MariaDB on the host VM (Debian 11) and the database is only 678 MB.
Your hardware sounds a little light on memory for a database run within it, and it may be worth throwing another couple of GB at it if you get into a position where you can replicate this issue. That said, 2 vCPUs and 2 GB on a reasonably modern Xeon should be fine, even for HASS and MariaDB running in it.

100% CPU is not in itself a hugely useful statistic, not without breaking down what that core is actually doing. Slow or overly contended storage will easily cause one or more CPUs to hit 100% due to IOWAIT, which isn't really CPU cycles at all. Quick tip: run htop, turn on the detailed CPU meters and watch the breakdown. And since you're running in a VM, you probably already have historical storage stats from the hypervisor - worth a peep at those for that time window, then you'll see if disk access was causing the load.

If it was IOWAIT, that could still be MariaDB doing an OPTIMIZE after data has been deleted (unlikely given your small database), poor indexes (IME, HA's schema is pretty good and I haven't noticed this myself), a backup (discounted here), or anything else overwhelming the storage. Note that a storage snapshot/backup from your VM manager can cause high IOWAIT and high CPU on VMs as it copies an image (with or without trying to quiesce the file system).
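htop shows the IOWAIT share directly, but if you want a quick number without it, here's a minimal sketch reading /proc/stat (Linux only; field order as documented in proc(5); it's rough because it ignores irq/softirq/steal time):

```shell
# Sample the aggregate "cpu" line from /proc/stat twice, one second apart,
# and report what share of the elapsed jiffies were spent in iowait.
# Fields after "cpu": user nice system idle iowait ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat

total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
pct=$(( 100 * (w2 - w1) / total ))
echo "iowait over the last second: ${pct}%"
```

A sustained double-digit percentage here while a core shows 100% is a strong hint the bottleneck is storage, not compute.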
As @aceindy suggests, running MariaDB elsewhere is very possible if you have an existing install - it doesn't need to be on the same VM. But if it's the only thing you want Maria for, you might as well keep it within the HA VM and just make sure it's resourced properly. I use it primarily to feed historical data to several Grafana dashboards, so it's quite useful to me.
Sorry, waffling:
Takeaways: If it is reproducible: check storage load, check CPU details in the VM (specifically IOWAIT), and check processes to see what's using the CPU (again, htop makes it easy).
If it's random: add some RAM temporarily and/or try to nail it down to a trigger by reading the logs.
If it was just that once and it's been fine since: not every mystery in sysadminning is solvable…
There was no abnormal load on the disk. What was abnormal was that this "freeze" condition stopped when I tried to open the UI - that is, the UI somehow "fixed" it. Unfortunately HA's logs are very sparse, and I found no traces there; only messages about sensors reconnecting once it was "fixed". Which suggests that it wasn't only the database that hung - communication with the sensors also stopped.
These are zfs snapshots - they happen instantly, without any load on the disk.
If it happened once, it can happen again. And according to Murphy's law, it will happen exactly when the leak sensor fires and the water pump needs to be powered off.