Hi everyone, Florian here from SEPIA Open Assistant.
Since I’ve built about a dozen SEPIA smart-speakers over the last few years, based on open software and hardware, I thought I’d quickly share my experiences so far.
In general I’d say there are 3 basic types of devices:
- The minimal satellite for close-range voice input (~20$ - e.g. mic + ESP32-S3 or RPi Zero)
- The basic smart-speaker (~90$ - e.g. RPi4 with mic and speaker)
- The fully featured smart-speaker/display (>150$ - e.g. Mycroft Mark II)
Type 1:
The device is basically a remote microphone. It outsources all processing to a remote server (HA, SEPIA, Rhasspy, etc.) and can be built with cheap components. One of my favorite builds is the Raspberry Pi Zero 2 W with ReSpeaker 2-Mic HAT. ESP32-based devices can be even cheaper, but in my opinion it’s easier to write the software for a real Linux system.
The advantages are price and size. The disadvantage is that the features are usually pretty basic, since it does not implement the full client with feedback, multi-turn dialog etc. The classic use-case is: push-to-talk, one sentence, close range.
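To make the "remote microphone" idea a bit more concrete, here is a rough sketch of what the satellite side boils down to: cut one push-to-talk recording into small frames and hand each frame to whatever transport talks to the server. The audio format (16 kHz, 16-bit, mono), the 30 ms frame length and the `send` callback are assumptions for illustration, not the actual SEPIA, HA or Rhasspy protocol.

```python
SAMPLE_RATE = 16000  # assumption: 16 kHz mono, a common input format for ASR
SAMPLE_WIDTH = 2     # assumption: 16-bit signed PCM (2 bytes per sample)
FRAME_MS = 30        # assumption: 30 ms per frame


def frame_size_bytes(sample_rate=SAMPLE_RATE, sample_width=SAMPLE_WIDTH,
                     frame_ms=FRAME_MS):
    """Number of bytes in one frame of raw PCM audio."""
    return sample_rate * frame_ms // 1000 * sample_width


def split_into_frames(pcm, size):
    """Yield consecutive chunks of `size` bytes; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def stream_utterance(pcm, send):
    """Pass one recording to `send` frame by frame and return the frame count.

    `send` is a hypothetical callback, e.g. a WebSocket or MQTT publish
    function provided by the client that talks to the server.
    """
    count = 0
    for frame in split_into_frames(pcm, frame_size_bytes()):
        send(frame)
        count += 1
    return count
```

With these numbers one frame is 960 bytes and one second of audio is 32000 bytes, so bandwidth is not the problem for a Type 1 device. Everything else (wake-word, ASR, dialog) happens on the server.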
Type 2:
With a Raspberry Pi 4, a microphone HAT and a small speaker (~10W, 4/8 Ohm) you can build a more advanced client that is even able to run speech recognition on-device. The RPi4 is too slow for Whisper, but I’ve had good experiences with Vosk + custom language models. This is actually my SEPIA daily driver at the moment.
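For anyone who wants to try on-device recognition, here is a minimal sketch using the Vosk Python bindings. It assumes a mono 16-bit WAV file and a model folder that was downloaded beforehand; the function names, paths and the phrase list are made up for illustration. Restricting the recognizer to a phrase list is only a simple stand-in for a real custom language model, and it should only work with Vosk models that support a dynamic vocabulary (typically the small ones).

```python
import json
import wave


def extract_text(result_json):
    """Pull the recognized text out of a Vosk JSON result string."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path, model_path, phrases=None):
    """Transcribe a mono 16-bit PCM WAV file with Vosk.

    `phrases` is an optional list of allowed phrases (illustrative only).
    """
    # Imported here so the helper above can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_path)
    with wave.open(wav_path, "rb") as wf:
        if phrases:
            # "[unk]" lets the recognizer reject out-of-vocabulary speech.
            grammar = json.dumps(list(phrases) + ["[unk]"])
            rec = KaldiRecognizer(model, wf.getframerate(), grammar)
        else:
            rec = KaldiRecognizer(model, wf.getframerate())

        parts = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has ended.
            if rec.AcceptWaveform(data):
                parts.append(extract_text(rec.Result()))
        parts.append(extract_text(rec.FinalResult()))
    return " ".join(p for p in parts if p)
```

A call could look like `transcribe_wav("command.wav", "model", ["turn on the light", "turn off the light"])`, where both the file and the model folder are placeholders.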
I’ve probably tried all the RPi mic HATs you can find: ReSpeaker 2-mic, 4-mic circular/linear, 6-mic, Waveshare, IQaudio etc. I even built my own, but in the end I stuck with the classic ReSpeaker 2-mic HAT. The biggest problem is that the open-source software for microphone arrays (beam forming) is not very good, so you don’t really benefit from more than 1 microphone right now. The same is unfortunately true for all the other DSP functions you need to compete with something like an Echo device. I’ve spent hours and hours with PulseAudio plugins for noise reduction, beam forming and acoustic echo cancellation (AEC), but in the end the results were never really good enough to play music and listen for wake-words at the same time, or to use the microphone reliably from a distance of more than 3m.
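For readers who haven't looked into beam forming yet, the textbook starting point is delay-and-sum: shift one channel so that sound from the target direction lines up in both microphones, then average. The sketch below is only that textbook version for two microphones with whole-sample delays; the 6 cm spacing in the usage note is a made-up value, not a measurement of any of the HATs above, and this is not what PulseAudio or the XMOS chips actually run.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C


def steering_delay_samples(mic_spacing_m, angle_deg, sample_rate):
    """Arrival-time difference between two mics, in whole samples.

    Assumes a far-field source; 0 degrees means straight ahead (broadside),
    90 degrees means along the line through both microphones.
    """
    delay_s = mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(delay_s * sample_rate)


def delay_and_sum(ch_a, ch_b, delay_b):
    """Average two channels after undoing the lag of channel B.

    `delay_b` is the number of samples by which ch_b lags behind ch_a for
    the target direction. Samples shifted in from outside are treated as 0.
    """
    n = len(ch_a)
    out = []
    for i in range(n):
        j = i + delay_b
        b = ch_b[j] if 0 <= j < n else 0.0
        out.append(0.5 * (ch_a[i] + b))
    return out
```

With a hypothetical 6 cm spacing at 16 kHz, `steering_delay_samples(0.06, 90, 16000)` gives 3 samples at most. That tiny range is one reason why serious implementations need fractional delays or frequency-domain processing, and why a naive version like this brings very little on a 2-mic HAT.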
Type 3:
This is the best you can build right now, and the main difference to Type 2 is the microphone. Mycroft built the Mark II with a custom voice HAT for the RPi4, the SJ201 daughterboard. I believe they did this because they came to the same conclusion as me: to make open-source speech recognition work, you need the best microphone + DSP you can get, with state-of-the-art performance in AEC, noise suppression and beam forming, and an integrated speaker driver. The SJ201 board has an XMOS XVF-3510 voice processor that works really well. The only comparable device I’ve tested so far is the ReSpeaker 4-Mic USB array with an older XMOS DSP (~70$), but this is limited to 16 kHz audio out (bad for music). There may be USB conference microphones out there that can do the same, but I haven’t had much luck so far. I tried the Anker PowerConf S330 once, but the results were pretty disappointing for the price (~60$).
So, to sum this up:
If you need close-range, push-to-talk, single-turn voice input, I think almost anything will work for you. But if you want to build something similar to an Echo Dot etc. that can play music and listen for wake-words at the same time, works from a certain distance and uses open-source ASR, you need a microphone with the best on-board DSP you can get, and that usually starts at around 80$ I’d say.
The Mycroft SJ201 is open-hardware and could be a solid base for new developments, but I don’t know how expensive it would be as a stand-alone product (definitely not cheap).
The other path could be a massive effort to improve all open-source algorithms for noise suppression, beam forming and most importantly acoustic echo cancellation (for music + wake-word).
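To give an idea of what the AEC part involves: the usual building block is an adaptive filter that learns how the speaker signal shows up in the microphone and subtracts that estimate. Below is a minimal NLMS (normalized least mean squares) sketch in plain Python. The filter length, step size and function name are arbitrary illustrative choices. A real room echo is hundreds of milliseconds long, and a usable canceller also needs double-talk detection and residual suppression, which is exactly where the effort would go.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic:  microphone samples (wanted signal plus echo of the speaker)
    ref:  samples sent to the speaker (the music being played)
    taps: length of the adaptive filter (illustrative, far too short for a room)
    mu:   NLMS step size, stable for 0 < mu < 2
    Returns the echo-reduced signal.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # latest reference samples, newest first
    out = []
    for m, r in zip(mic, ref):
        history = [r] + history[:-1]
        # Current estimate of the echo contained in the microphone sample.
        echo_est = sum(w * x for w, x in zip(weights, history))
        err = m - echo_est
        # Normalizing by the reference power keeps the step size stable.
        power = sum(x * x for x in history) + eps
        weights = [w + mu * err * x / power for w, x in zip(weights, history)]
        out.append(err)
    return out
```

In a toy case where the microphone only hears a delayed, attenuated copy of the speaker signal, the output quickly drops to near silence. As soon as someone talks over the music the filter gets disturbed, which is the hard part in practice.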
That was a bit more text than I had planned, but I hope it helps.