I’ve just released a small project that allows you to use the Voxtral models for local speech-to-text (STT) recognition within Home Assistant Assist.
The goal is to provide a powerful drop-in alternative to the popular Whisper STT option. When I tried out Whisper, I was disappointed by its quality, at least for German. Especially for non-English languages, Voxtral will hopefully set a new state of the art.
There is certainly still a lot of optimization potential in my project, but it is already working quite well for me, and the first STT results for English and German are quite promising.
If you’re interested, I’d love to get your feedback of any kind!
PS: I hope it’s okay to post in this category, even though my project is technically not a custom integration. You can connect to the server via the regular Wyoming integration.
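If you want to poke at the server before wiring it into Assist, a quick smoke test over the Wyoming protocol could look roughly like this (just a sketch, not part of the project; I’m assuming the server listens on port 10300 and that you feed it a 16-bit WAV file, so adjust host, port, file name and language to your setup):

```python
# Minimal Wyoming STT smoke test (sketch only, not part of the project).
# Assumptions: server on localhost:10300, `pip install wyoming`, 16-bit WAV input.
import asyncio
import wave

from wyoming.asr import Transcribe, Transcript
from wyoming.audio import AudioChunk, AudioStart, AudioStop
from wyoming.client import AsyncTcpClient


async def transcribe(path: str, host: str = "localhost", port: int = 10300) -> str:
    async with AsyncTcpClient(host, port) as client:
        # Announce that we want a transcription (language is optional).
        await client.write_event(Transcribe(language="de").event())

        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            channels = wav.getnchannels()
            await client.write_event(AudioStart(rate=rate, width=width, channels=channels).event())
            # Stream the file in small chunks, similar to how Assist streams microphone audio.
            while chunk := wav.readframes(1024):
                await client.write_event(
                    AudioChunk(rate=rate, width=width, channels=channels, audio=chunk).event()
                )
        await client.write_event(AudioStop().event())

        # Wait for the transcript event coming back from the server.
        while event := await client.read_event():
            if Transcript.is_type(event.type):
                return Transcript.from_event(event).text
    return ""


if __name__ == "__main__":
    print(asyncio.run(transcribe("sample.wav")))
```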
I love to see people working on such projects, especially since I’d like to use a German model. Thank you for sharing!
Great README you wrote there. I think MAX_SECONDS could be explained a little better: I am not sure which audio files you are referring to and what happens after MAX_SECONDS is reached.
My current HA host is pretty potato-like. However, I am planning on upgrading it soonish (as soon as my budget allows it) to something more powerful (looking at the Intel Core Ultra 5 235 at the moment) and will test it out then for sure. Do you know whether this benefits from NPU chips on the CPU?
PS: I hope it’s okay to post in this category, even though my project is technically not a custom integration. You can connect to the server via the regular Wyoming integration.
Yes! It is closely related to HA and fits perfectly here IMO.
I’ll clarify the MAX_SECONDS setting in the docs. It’s the maximum length of the received input audio; if the duration is exceeded, only the first MAX_SECONDS of audio get analyzed. I introduced this mainly to speed up debugging, and in production it shouldn’t really be relevant. I’ve kept it anyway because a voice command lasting longer than 60 seconds is most likely not intended, so it serves as a small protection against overloading or blocking the server with accidental voice input.
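Conceptually it’s nothing more than cutting off the received audio buffer, roughly like this (a simplified sketch, not the actual implementation; the 16 kHz / 16-bit mono format is only an assumption for illustration):

```python
# Simplified sketch of the MAX_SECONDS cutoff (not the actual implementation).
# Assumes 16 kHz, 16-bit (2 bytes per sample) mono PCM audio.
MAX_SECONDS = 60
SAMPLE_RATE = 16_000
SAMPLE_WIDTH = 2  # bytes per sample


def truncate_audio(pcm: bytes) -> bytes:
    """Keep only the first MAX_SECONDS of audio; anything beyond is dropped."""
    max_bytes = MAX_SECONDS * SAMPLE_RATE * SAMPLE_WIDTH
    return pcm[:max_bytes]
```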
Regarding the NPU, I’m not sure. If the budget allows it, a GPU is quite likely the better match. I haven’t spent time analyzing it further, but some spontaneous tests on my Apple Silicon M2 Max were surprisingly slow. On the GPU it’s pretty fast, though.
I’m not sure whether you can fit it into your RAM, as CPU-only inference enforces the fp32 data type, which requires more memory. It should be easy to give it a try, though. If you can’t make the default models work (fast enough), you could try one of the quantized models. Looking forward to your feedback.
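As a rough, weights-only back-of-the-envelope calculation for a ~3B parameter model (ignoring activations and runtime overhead, so real usage will be noticeably higher):

```python
# Rough weight-only memory estimate for a ~3B parameter model.
# Ignores activations and runtime overhead, so actual usage is higher.
params = 3e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("~4-bit", 0.5)]:
    print(f"{name:>9}: ~{params * bytes_per_param / 1e9:.0f} GB")
```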
I just tried the mini model in a virtual environment, and the model only ran after I made 18 GB of memory available to it. I used an LXC container, so there should be very little overhead, and it indicated actual memory usage just a few MB above 16 GB. Therefore I think you would need to use quantized models. I want to try them too, because I don’t have 16 GB of memory to spare. @Johnson_145 I don’t know that much about all the technical terms in the AI world. Based on your link I found bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF · Hugging Face; it says it works with any llama.cpp backend. Does that include your project?
Yes, any of those models should work, though I haven’t tested any of the quantized models yet. At least in theory, all you need to do is set the MODEL_ID option to bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF. Now that I think about it, I may need to make some adjustments to the DATA_TYPE option to actually support it correctly. I’ll have a closer look later on, but maybe you can already give it a quick try?
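To make it a bit more concrete what kind of adjustments I mean, the DATA_TYPE handling might need something along these lines (purely a hypothetical sketch with assumed names and logic, not the project’s actual code):

```python
# Hypothetical sketch of DATA_TYPE handling for quantized checkpoints
# (names are assumptions, not the project's actual code).
from typing import Optional

import torch

DTYPES = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}


def resolve_dtype(data_type: str, model_id: str, has_gpu: bool) -> Optional[torch.dtype]:
    if model_id.lower().endswith("gguf"):
        # GGUF files already encode their own quantization scheme, so forcing an
        # explicit torch dtype may conflict with (or be ignored by) the loader.
        return None
    if not has_gpu:
        # CPU-only inference generally ends up in fp32, hence the higher RAM need.
        return torch.float32
    return DTYPES.get(data_type, torch.bfloat16)
```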
Just to be sure: You do not have a GPU available, but only a CPU, right?
Yes, only a CPU currently. I just changed the MODEL_ID and the container refuses to start with the error message “Failed to initialize Voxtral backend: Unrecognized processing class in bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF”. I’m going to create an issue on GitHub with the full details.