@donburch888 Regarding your soapbox comments, I’ve found several things over time that seem to prevent really good voice experiences.
The first by far is the vast difference between what groups of users want and are capable of running (or willing to run). Assist’s default intent recognizer is notoriously rigid, partly because of the target hardware (Pi 3/4) but mostly because any dependencies on machine learning libraries are a nightmare to maintain in HA core. Pushing better stuff into add-ons/Docker containers is the only realistic way around this, but that makes the installation process quite a bit more complex.
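To make the rigidity concrete, here’s a toy sketch (my illustration, not Assist’s actual recognizer) of what template-based matching looks like: anything that strays from the listed sentence patterns simply fails, which is exactly what an ML-based recognizer would smooth over.

```python
# Toy illustration of template-based intent matching (NOT the real Assist code).
import re

# A hypothetical template in the spirit of Assist's sentence definitions.
TEMPLATES = [
    (re.compile(r"^turn (?P<state>on|off) (the )?(?P<name>.+)$"), "HassTurnOnOff"),
]

def recognize(text: str):
    """Return (intent, slots) if the text matches a template exactly, else None."""
    text = text.lower().strip()
    for pattern, intent in TEMPLATES:
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return None

print(recognize("turn on the kitchen lights"))    # matches the template
print(recognize("could you turn the lights on"))  # None: no template covers this phrasing
```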
The biggest pain point is GPUs. Many users want fully local voice, and you can get amazing results, but only if you have the right hardware. Even better results are possible with additional training (fine-tuning), but creating an add-on/Docker container for a training environment with a good user interface, and keeping it up to date, would be a full-time job in itself!
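As a concrete example of the hardware dependency, here’s a minimal local speech-to-text sketch using the faster-whisper library (model size and audio path are placeholders I chose for illustration): the CUDA path needs a compatible NVIDIA GPU and driver stack, while the CPU fallback runs anywhere but is far slower.

```python
# Minimal local STT sketch with faster-whisper (pip install faster-whisper).
# "medium" and "audio.wav" are placeholder choices for this example.
from faster_whisper import WhisperModel

try:
    # Fast path: GPU inference in 16-bit floats (needs NVIDIA GPU + CUDA/cuDNN).
    model = WhisperModel("medium", device="cuda", compute_type="float16")
except Exception:
    # Fallback: 8-bit quantized CPU inference -- works anywhere, much slower.
    model = WhisperModel("medium", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", language="en")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```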
Hardware is the second issue, which we are fortunately working on in the form of our VoiceKit. It will have a good XMOS audio chip on board for noise filtering and echo cancellation, run ESPHome, and be capable of playing media (an external 3.5mm port lets you attach a better speaker than the built-in one). This is what I’ve been focused on lately, and probably will be for most of the rest of the year.
Lastly, I’ve found that I simply can’t keep up with the deluge of questions, issues, and PRs for all of the things I’ve been part of over the years. I honestly want to help, but I don’t know how best to do it. Many people have suggested giving certain contributors more rights to specific projects. I’d be happy to do this, but not many people have volunteered yet.