MariaDB hangs HAOS?

Yesterday I installed the MariaDB backend in my HAOS instance:

recorder:
  db_url: mysql://homeassistant:PWD@core-mariadb/homeassistant?charset=utf8mb4
  purge_keep_days: 730

Tonight at about 4 o'clock (purge time?) the whole instance hung, using 100% CPU according to the charts.

When I opened the interface this morning, it was unavailable for a couple of minutes. But then it suddenly worked and everything was back to normal. The logs are blank.

Can anyone suggest what happened?

No idea what happened, but this:

purge_keep_days: 730

:astonished:

is probably not a good idea unless you have a very small Home Assistant setup.

Use InfluxDB for anything longer than a month.

Thanks for the advice, but I don't think that's the reason. As I wrote before, I installed MariaDB less than 24 hours ago; there is almost no data.

Neither do I. That's not what I said. I just said it was a bad idea.

I don't know how you run HA, but if it's on a virtual machine, it might be an option to install MySQL on the host? Then it would not be an issue to store 2 years :wink:

Yes, I'm running HAOS in a VM. The 2 years is set for test reasons; I wanted to see how the DB would grow over time.

My MySQL database holds 100 days (on an external host) and is now 8.1 GB (not too big a system, about 300 entities).

How do you know it was MariaDB?

What hardware are you running on? How many sensors are you recording? How big is MariaDB's HA database after 24h?

Clearly something is overwhelming the hardware if the 100% CPU figure is to be trusted. If you can connect to MariaDB, SHOW PROCESSLIST; or SHOW FULL PROCESSLIST; will show you exactly what commands are being run, but if it's a single-CPU machine then that may not be possible.
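
Something like this could work for both checks, assuming the official MariaDB add-on and the user/host taken from the recorder URL at the top of the thread, run from a shell on the internal network that can reach core-mariadb (the SSH add-on, for example); adjust the credentials to your own setup:

# Database size per schema, in MB (prompts for the password)
mysql -h core-mariadb -u homeassistant -p -e "SELECT table_schema, ROUND(SUM(data_length+index_length)/1024/1024,1) AS size_mb FROM information_schema.tables GROUP BY table_schema;"

# What the server is executing right now
mysql -h core-mariadb -u homeassistant -p -e "SHOW FULL PROCESSLIST;"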

Do you have automatic backups switched on?

Also, I wouldn't use MariaDB to hold 2 years of data; instead use InfluxDB and select exactly what you want to record. It all works really well with Grafana.

Inside HAOS - no. I have daily/weekly snapshots of the VM image.

As I wrote above, it's set that way for test reasons.
But really, I see no problem with storing the history of numerical values for 2 or 20 years.

So it's not the backup.

I've been using MariaDB for years and have had no issues like the one described above - but I only hold 3 days' worth (about 1.6 GB rolling).

The only reason I suggested InfluxDB is that it's actually designed for long-term numerical data, whereas MariaDB isn't. If you are going to use MariaDB, just make sure you peel back the recorder to only hold what you want (something like the sketch below), because there are some tables in there that will get awfully large awfully quickly.
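
For example, something along these lines; purge_keep_days and the exclude filters are standard recorder options, but the specific domains and entities excluded here are only illustrations, so pick whatever is noisy in your own setup:

recorder:
  db_url: mysql://homeassistant:PWD@core-mariadb/homeassistant?charset=utf8mb4
  purge_keep_days: 30            # keep a month in SQL rather than two years
  exclude:
    domains:
      - automation               # illustrative: skip whole domains you never graph
    entity_globs:
      - sensor.*_uptime          # illustrative: counters that change constantly
    entities:
      - sun.sun                  # illustrative: changes often, rarely useful in history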

I don’t know exactly, but that’s the only thing that changed.

Very small, <100 MB. I disabled it for now.
Running on a VM with 2 vCPUs and 2 GB RAM. There is no problem increasing the limits if necessary. The host is 2x Xeon / 128 GB RAM.

It looks more like a software problem to me. I managed to notice that the VM was stuck at 100% of one CPU core before it fixed itself :slight_smile:

The problem is that I cannot reproduce this situation. I have tried calling purge manually, but it has no effect.
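
For reference, a manual purge call from Developer Tools / Services looks roughly like this (recorder.purge with keep_days and repack is the standard service; the values here are only illustrative):

service: recorder.purge
data:
  keep_days: 730   # only rows older than this are deleted
  repack: true     # also ask the database to reclaim freed space, which can be heavy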

For now I can't use MariaDB because of the risk of the whole system hanging. I will either have to stay on SQLite or try to run PostgreSQL on the host.

or install MySQL on the host :grin:

For reference, I'm running HA in Docker with 2 years of database retention and many hundreds of sensors. I run MariaDB on the host VM (Debian 11) and the size of the database is only 678 MB.

Your hardware sounds a little light on memory for a database run within it, and it may be worth throwing another couple of GB at it to test that if you get into a position where you can replicate this issue. But 2 vCPUs and 2 GB on a reasonably modern Xeon should be fine, even for HASS and MariaDB running in it.

100% CPU is not in itself a hugely useful statistic, not without breaking down what that core is actually doing. Slow or overly contended storage will easily cause one or more CPUs to hit 100% due to IOWAIT, which isn't really CPU cycles at all. Quick tip: run htop, turn on the extended CPU details and see the breakdown (or watch IOWAIT directly, as sketched below). But you're running in a VM, so you probably already have historical access to storage stats; worth a peep at those for that time, then you'll see if it was disk access that was causing the load.

If it was IOWAIT, then that could still be MariaDB doing an OPTIMIZE after data has been deleted (unlikely given your small database size), poor indexes (IME, HA's schema is pretty good and I haven't noticed this myself), something else like a backup (discounted here), or anything else overwhelming the storage. Note that a storage snapshot/backup from your VM manager can cause high IOWAIT and high CPU on VMs as it copies an image (with or without trying to quiesce the file system).
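
A rough sketch of that check from a shell, assuming vmstat/iostat are available (they may not be on a stock HAOS install, in which case look at the hypervisor's storage graphs instead):

# one sample every 5 seconds; the "wa" column is the percentage of CPU time
# spent waiting on I/O (IOWAIT)
vmstat 5

# with sysstat installed, per-device utilisation and average wait times
iostat -x 5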

As @aceindy suggests, running MariaDB elsewhere is very possible if you have an existing install - it doesn't need to be on the same VM. But if HA is the only thing you want MariaDB for, then it might as well stay within the HA VM; just ensure it's resourced properly. I use it primarily to feed historical data to several Grafana dashboards, so it's quite useful to me.

Sorry, waffling:

Takeaways: if it is reproducible, check storage load. Check CPU details in the VM, specifically for IOWAIT. Check processes to see what's using the CPU (again, htop makes it easy).

If it's random: add some RAM temporarily and/or try to nail it down to a trigger by reading the logs.

If it was just that once and it's been fine since: not every mystery in sysadminning is solvable…

Adding to the above - if it's started using swap (as 2 GB is not a lot to play with) it will start to grind along.

There was no abnormal load on the disk. What was abnormal was that this "freeze" condition stopped when I tried to open the UI. That is, the UI somehow "fixed" it. Unfortunately HA has very sparse logs; I did not find any traces there, only messages about sensors reconnecting once it was "fixed". Which suggests that it was not only the database that hung; communication with the sensors also stopped.

These are ZFS snapshots - they happen instantly, without any load on the disk.

If it happened once, it can happen again. And according to Murphy's law, it will happen exactly when the water pump needs to be powered off because of a leak sensor signal :slight_smile:


# cat /proc/meminfo  
MemTotal:        2023912 kB
MemFree:          356096 kB
MemAvailable:    1498820 kB
Buffers:          124540 kB
Cached:           934732 kB
SwapCached:            0 kB
Active:           970036 kB
Inactive:         520352 kB
Active(anon):     425400 kB
Inactive(anon):     1500 kB
Active(file):     544636 kB
Inactive(file):   518852 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        667888 kB
SwapFree:         667888 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        431216 kB
Mapped:           269240 kB
Shmem:              1120 kB
KReclaimable:     112168 kB
Slab:             143020 kB
SReclaimable:     112168 kB
SUnreclaim:        30852 kB
KernelStack:        6352 kB
PageTables:         7864 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1679844 kB
Committed_AS:    3053348 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       12200 kB
VmallocChunk:          0 kB
Percpu:             1016 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      117304 kB
DirectMap2M:     1978368 kB
DirectMap1G:           0 kB
#

It looks like there really is not enough memory, even though swap is not used (Committed_AS is about 3 GB against a CommitLimit of about 1.6 GB). I will add another 2 GB, thanks for the tip :slight_smile:
