In my experience, the 2-mic HAT is superior to the S3Box3, at least as long as the S3Box3 within HA uses only one of its two built-in microphones. The Willow project (https://heywillow.io/) has fully integrated the S3Box3, and its voice recognition is quite impressive. But for now I am quite happy with the wake word detection on the S3Box3.
Regarding the general STT part, I favor the approach the Snips voice platform took: generating/training an individual model based on the user's needs. Only the intents actually in use were trained, which helped keep false positives low, especially in a multi-language environment (see the sketch below).
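To illustrate the idea (not the actual Snips implementation), here is a minimal Python sketch of matching against a closed, per-user sentence set: anything outside the trained phrases is rejected instead of guessed, which is what keeps false positives low. The intent names and patterns are made up for illustration.

```python
import re

# Hypothetical per-user "grammar": only the intents this household actually
# uses are included, so anything else is rejected instead of mis-recognized.
INTENT_PATTERNS = {
    "turn_on_light": re.compile(r"^turn on the (?P<name>kitchen|living room) light$"),
    "set_temperature": re.compile(r"^set the thermostat to (?P<value>\d{2}) degrees$"),
}

def match_intent(transcript: str):
    """Return (intent, slots) if the transcript fits the closed grammar, else None."""
    text = transcript.strip().lower()
    for intent, pattern in INTENT_PATTERNS.items():
        m = pattern.match(text)
        if m:
            return intent, m.groupdict()
    return None  # out-of-grammar -> reject rather than produce a false positive

print(match_intent("Turn on the kitchen light"))  # ('turn_on_light', {'name': 'kitchen'})
print(match_intent("Order a pizza"))              # None
```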
I'm thinking of something like a (Nabu Casa) service that generates/trains an individual model based on the exposed entities and custom sentences/automations in your local HA instance. Although, to be honest, this sounds more like a legacy approach that will mainly stay useful for low-end devices. With AI and the increasing need for local AI processing power, the way to go may be a dedicated GPU (e.g. CUDA) at home (e.g. Pocket AI | Portable GPU | NVIDIA RTX A500 | ADLINK, https://coral.ai/).
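As a rough sketch of what such a service could consume: assuming the standard Home Assistant REST API (`/api/states` with a long-lived access token), one could generate a per-instance sentence list from friendly names. The URL, token, template phrases, and the simple domain filter standing in for "exposed entities" are all placeholders, not an actual Nabu Casa API.

```python
import requests

HA_URL = "http://homeassistant.local:8123"  # placeholder: your local HA instance
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder: created in your HA profile

def fetch_entities():
    """Pull all entity states from the local HA instance via the REST API."""
    resp = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def generate_sentences(states, domains=("light", "switch", "cover")):
    """Build candidate training sentences from friendly names.

    Real exposure settings live in Assist; filtering by domain here is only
    a stand-in for "exposed entities".
    """
    sentences = []
    for state in states:
        domain = state["entity_id"].split(".")[0]
        name = state.get("attributes", {}).get("friendly_name")
        if domain in domains and name:
            sentences.append(f"turn on the {name.lower()}")
            sentences.append(f"turn off the {name.lower()}")
    return sentences

if __name__ == "__main__":
    corpus = generate_sentences(fetch_entities())
    # This per-instance corpus is what a training service could turn into a
    # small, user-specific STT/NLU model.
    print("\n".join(corpus))
```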