Compression comparison for DB backups during upgrade

For those of you using Docker with the standard DB (or even a non-Docker install) who want to back up your config prior to pulling a new release, you may also want to back up your DB file, even if it is large.

That calls for compression, so I compared the speed and output size of several methods on a 1GB DB file. Files are read off a SATA SSD and written to another SSD; the processor is a 4-core, 4-thread Haswell Xeon. The DB is from the 2021.3 release.

XZ:
Built into tar via the -J flag, and xz itself ships with most distributions
Best compression but slow
tar -cJf HABACKUPDB.tar.xz /var/homeassistant/home-assistant_v2.db
5 min 26 sec, 99.4MB

PLZip:
High performance implementation compatible with regular lzip decompression
4 times faster than XZ with almost the same compression ratio (both use LZMA)
tar -c -Iplzip -f HABACKUPDB.tar.lz /var/homeassistant/home-assistant_v2.db
1 min 12 sec, 104.5MB

PBZip2:
High performance implementation compatible with regular bzip2 decompression
Very fast, but just OK compression
tar -c -Ipbzip2 -f HABACKUPDB.tar.bz2 /var/homeassistant/home-assistant_v2.db
29.0 sec, 146.6MB

ZSTD level 10:
Super fast, with compression similar to bzip2
tar -c -I"zstd -10 -T0" -f HABACKUPDB.tar.zstd-10 /var/homeassistant/home-assistant_v2.db
10.2 sec, 150.4MB

ZSTD level 5:
Even faster but compression still not great:
tar -c -I"zstd -5 -T0" -f HABACKUPDB.tar.zstd-5 /var/homeassistant/home-assistant_v2.db
4.7 sec, 165.4MB
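
If you want to repeat this comparison on your own database, here is a rough sketch of a helper that times one compressor and reports the output size. The function name bench_one and the /tmp output paths are my own placeholders, not part of the tests above. It pipes tar into the compressor, which is the same thing tar -I does under the hood, and it only measures whole seconds.

```shell
# bench_one LABEL SRC OUT COMPRESSOR [ARGS...]
# Archives SRC with tar, pipes the stream through the given compressor,
# and prints "LABEL<TAB>seconds<TAB>bytes".
# Note: without pipefail, only the compressor's exit status is checked.
bench_one() {
    local label="$1" src="$2" out="$3"
    shift 3
    local start end bytes
    start=$(date +%s)
    tar -cf - "$src" | "$@" > "$out" || return 1
    end=$(date +%s)
    bytes=$(( $(wc -c < "$out") ))
    printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$bytes"
}

# Example calls (assumes zstd and plzip are installed):
# bench_one zstd-10 /var/homeassistant/home-assistant_v2.db /tmp/db.tar.zstd zstd -10 -T0
# bench_one plzip   /var/homeassistant/home-assistant_v2.db /tmp/db.tar.lz   plzip
```

Run it once per compressor and compare the lines. Results will vary with your CPU, disks and data.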

So who wins? It depends on the need. ZSTD-10 takes less time than pulling the new docker image (if done in parallel), but level 5 is obviously the fastest. ZSTD will also have the fastest decompression time. PBZip2 retains compatibility with bz2 files, which means compatibility with pretty much any archiving software, even ancient ones. PLZip has the best combination of file size and speed; it is 4 times faster than XZ because it can use all 4 cores. LZip is also apparently a better format from a data recovery perspective, and it is well supported.

I did not bother with gzip and lzop because of their poor compression ratios. I also tested ZSTD level 15, but for this type of data the compressed size was within 1% of level 10 at half the speed. Level 17 is where the size starts to drop, but at that point it is as slow as PLZip and still nowhere close to its compression ratio.

I would use PLZip if retaining lots of backups is important, or if you have larger databases.
I would use ZSTD-5 if you retain a single backup, have a small db, or want the shortest downtime between upgrades.
Faster systems with ample memory, like this one, make better use of ZSTD-10.

As for memory requirements, PLZip and ZSTD-10 can use huge amounts of memory, over a gig in some cases. PBZip2 and ZSTD-5 use far less. ZSTD level 9 will typically use half the compression memory of level 10. Memory-constrained systems should use single-threaded LZip for best compression or ZSTD levels 4 through 9 for best speed. Level 4 is a good chunk faster and uses even less memory, but the compression ratio takes a similar hit.

On systems with 2 or fewer cores, ZSTD-5 is probably the best option. Single-threaded LZip has the best compression and uses far less memory than the multithreaded version, but runs at roughly 1/cores of its speed.
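
Those rules of thumb could be written down as a small function, sketched below. The name pick_zstd_level and the 2048 MB cut-off for "ample memory" are assumptions of mine for illustration, not measured values, so adjust them for your own system.

```shell
# pick_zstd_level CORES MEM_MB
# Prints a zstd level following the rules of thumb in this post.
# ASSUMPTION: 2048 MB as the "ample memory" threshold is a placeholder.
pick_zstd_level() {
    local cores="$1" mem_mb="$2"
    if [ "$cores" -le 2 ]; then
        echo 5      # few cores: favour speed
    elif [ "$mem_mb" -lt 2048 ]; then
        echo 9      # memory constrained: stay in the 4 through 9 range
    else
        echo 10     # fast system with plenty of memory
    fi
}

# Example on Linux (nproc is from coreutils, MemTotal is reported in kB):
# level=$(pick_zstd_level "$(nproc)" "$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)")
# tar -c -I"zstd -$level -T0" -f HABACKUPDB.tar.zstd /var/homeassistant/home-assistant_v2.db
```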

As for the rest of the config dir, it is much more compressible and much smaller unless you have a bunch of media files in there or something:

tar --exclude=home-assistant_v2.db -c -Iplzip -f HABACKUP.tar.lz /var/homeassistant
2.64 sec, 1.7MB

tar --exclude=home-assistant_v2.db -c -I"zstd -10 -T0" -f HABACKUP.tar.zstd-10 /var/homeassistant
0.42 sec, 1.8MB

Going with ZSTD there for sure! At that speed someone might assume the operation failed to complete. I would expect the log file to compress by well over 90%, so a large log file would not add much to the archive size; in a previous test, a 30MB log added only 0.3MB to the file size using LZMA-based compressors. Levels higher than 10 do not help much, if at all.

I created a bash script that performs the backup with a dated file name and configurable source and destination directories. Just change ha_dir and target_dir.

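#!/bin/bash
# Source and destination directories (no trailing slash)
ha_dir='/var/homeassistant'
target_dir='/mnt/backups/Home Assistant'
# UTC timestamp used in both file names
use_date=$(date -u +%F-%H-%M-%S)
# Config directory without the database
tar --exclude=home-assistant_v2.db -c -I"zstd -10 -T0" -f "$target_dir/HABACKUP_$use_date.tar.zstd" "$ha_dir"
# Database on its own, at a lower level for speed
tar -c -I"zstd -5 -T0" -f "$target_dir/HABACKUPDB_$use_date.tar.zstd" "$ha_dir/home-assistant_v2.db"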

The source and destination must NOT have a trailing /, or it will not work correctly.

I then call this from my docker compose upgrade script, before the pull:
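
If you would rather not rely on remembering the trailing slash rule, one option is to normalize the paths before using them. This is a sketch only; strip_trailing_slash is a hypothetical helper name and is not part of my script.

```shell
# strip_trailing_slash PATH
# Prints PATH with any trailing slashes removed; a bare "/" is left alone.
strip_trailing_slash() {
    local p="$1"
    while [ "${#p}" -gt 1 ] && [ "${p%/}" != "$p" ]; do
        p="${p%/}"
    done
    printf '%s\n' "$p"
}

# Example:
# ha_dir=$(strip_trailing_slash '/var/homeassistant/')
# target_dir=$(strip_trailing_slash '/mnt/backups/Home Assistant/')
```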

docker-compose down
./backup.sh
docker-compose pull
docker-compose up -d --build homeassistant
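
A possible variation, for anyone who wants the upgrade to stop when the backup fails: the sketch below wraps the same four steps in a function. The function name upgrade_ha and the COMPOSE and BACKUP_SCRIPT variables are assumptions of mine, and restarting the old containers after a failed backup is my own choice rather than something the original script does.

```shell
# upgrade_ha
# Runs down, backup, pull, up in order and stops if the backup fails.
# COMPOSE and BACKUP_SCRIPT are hypothetical overrides with defaults
# matching the commands used in this post.
upgrade_ha() {
    local compose="${COMPOSE:-docker-compose}"
    local backup="${BACKUP_SCRIPT:-./backup.sh}"
    $compose down || return 1
    if ! "$backup"; then
        echo "backup failed, bringing the old containers back up" >&2
        $compose up -d
        return 1
    fi
    $compose pull && $compose up -d --build homeassistant
}

# Example:
# upgrade_ha
```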

Now I only need to run a single script to update HA, and it backs up all the config first in case I need to roll back (knock on wood).
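
For the rollback itself, here is a minimal sketch of unpacking one of these archives into a scratch directory so you can inspect it before copying anything back. restore_backup is a hypothetical helper, and the file name in the example is made up. GNU tar strips the leading / when it creates the archive, so the files land under the destination as var/homeassistant/...

```shell
# restore_backup ARCHIVE DEST DECOMPRESSOR [ARGS...]
# Decompresses ARCHIVE with the given command and unpacks it into DEST.
restore_backup() {
    local archive="$1" dest="$2"
    shift 2
    mkdir -p "$dest" || return 1
    "$@" < "$archive" | tar -xf - -C "$dest"
}

# Example (file name is illustrative only):
# restore_backup "/mnt/backups/Home Assistant/HABACKUPDB_2021-03-14-09-26-53.tar.zstd" /tmp/ha-restore zstd -dc
```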

I could probably make the DB compression level configurable, but I am going for the shortest downtime, and I will manually cull the older backups.
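
If culling by hand gets old, something like the sketch below could keep only the newest few archives. prune_backups is a hypothetical helper, not part of my script. It relies on the date format in the file names sorting in chronological order, which holds for the %F-%H-%M-%S stamp used above.

```shell
# prune_backups DIR PREFIX KEEP
# Deletes the oldest DIR/PREFIX_*.tar.zstd files, leaving the newest KEEP.
# The shell expands the glob in sorted order, so the oldest stamp comes first.
prune_backups() {
    local dir="$1" prefix="$2" keep="$3"
    set -- "$dir/$prefix"_*.tar.zstd
    # If nothing matched, the pattern is left unexpanded and does not exist
    [ -e "$1" ] || return 0
    while [ "$#" -gt "$keep" ]; do
        rm -f -- "$1"
        shift
    done
}

# Example: keep the 5 newest database backups
# prune_backups '/mnt/backups/Home Assistant' HABACKUPDB 5
```

The prefix HABACKUPDB does not match the HABACKUP_ config archives, so the two sets can be pruned separately.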