Home Assistant Satellites (client hardware for Home Assistant): my wishlist

Now that the Year of the Voice is coming along nicely (it only needs a wake word), Home Assistant Satellites will be a thing.

Satellites is, I think, the best word for client hardware that provides an interface to Home Assistant,
replacing those cloud-connected wiretaps like Google Nest and Alexa.

Although I trained as an electrical engineer, I ended up in quality assurance, so I’m not that comfortable making a device from scratch myself. All I can do is make a wishlist and hope it gets picked up.

My ideal device would have the following:

HA Satellite mini:

  • Wi-Fi
  • Voice assistant
  • Snapcast client
  • Audio out for music (from Snapcast)

Snapcast is a client/server setup for streaming multiple channels of music to various clients. Each client can have its own delay configured (there will be a delay if you use a Bluetooth speaker, for example), allowing a perfectly synced system.
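
For illustration, Snapcast exposes a JSON-RPC control interface (TCP port 1705 by default) where that per-client latency can be set. A minimal sketch, assuming a client id of "kitchen" (you'd look up the real ids via Server.GetStatus):

```python
# Minimal sketch: set a Snapcast client's playback latency via the
# JSON-RPC control interface (TCP port 1705 by default).
# The server address and client id below are placeholders.
import json
import socket

def set_snapclient_latency(server: str, client_id: str, latency_ms: int) -> dict:
    request = {
        "id": 1,
        "jsonrpc": "2.0",
        "method": "Client.SetLatency",
        "params": {"id": client_id, "latency": latency_ms},
    }
    with socket.create_connection((server, 1705), timeout=5) as sock:
        # Messages are newline-delimited JSON-RPC.
        sock.sendall((json.dumps(request) + "\r\n").encode())
        return json.loads(sock.makefile().readline())

# e.g. delay a Bluetooth-attached client by 80 ms so it lines up:
# set_snapclient_latency("192.168.1.10", "kitchen", 80)
```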

HA Satellite big:
You could think of a nice system with a touchscreen. Personally, I like buttons (my bedroom VoIP phone has buttons to control Home Assistant).

Anyway, there are several Home Assistant Satellite designs possible. If we write up the requested features, we can create demand; with demand and specs, a crowdfunded solution would have much more velocity. And if we keep the designs open source and the case 3D-printable, we could produce it anywhere in the world, supporting local techies and reducing worldwide shipping of satellites, which is better for the environment.

What do you all think?

Edit:
We’ve got software: GitHub - synesthesiam/homeassistant-satellite: Streaming audio satellite for Home Assistant

7 Likes

I like the concept a lot. I already have speakers throughout the house (Sonos) that voice assistant audio can be output from, and I don’t need additional dashboards. My ideal setup is:

A discrete microphone that I can address with a wake word, with output going through my existing speaker system via the new announce feature, so any playing music is momentarily lowered.
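
If it helps, here's a rough sketch of triggering that announce behavior from outside HA via the REST API; the host, token, entity id and clip URL are all placeholders for your own setup:

```python
# Rough sketch: play an announcement on an existing speaker through
# Home Assistant's REST API. Host, token, entity_id and media URL
# are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

def announce(entity_id: str, media_url: str) -> None:
    resp = requests.post(
        f"{HA_URL}/api/services/media_player/play_media",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "entity_id": entity_id,
            "media_content_id": media_url,
            "media_content_type": "music",
            "announce": True,  # duck/resume playback instead of replacing it
        },
        timeout=10,
    )
    resp.raise_for_status()

# announce("media_player.sonos_living_room", "http://my.server/doorbell.mp3")
```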

1 Like

Snapcast is actually the tricky part, because it requires a time-indexed audio stream, and shared audio channels are not time-indexed.

Currently the only way to work around this is to mix the audio into the stream before it is handed over to the Snapcast server, which introduces a delay.
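
As a sketch of that workaround, assuming snapserver is configured with a pipe source (the classic /tmp/snapfifo, raw 48 kHz / 16-bit / stereo PCM), you could mix the TTS into the music before it enters the pipe:

```python
# Sketch of mixing a TTS clip into the music *before* the Snapcast
# server sees it, by writing mixed PCM into snapserver's pipe source.
# FIFO path and PCM format are assumptions about your server config.
import audioop  # stdlib; deprecated in 3.11+, but fine for a sketch

CHUNK = 4096  # bytes per write

def mix_into_fifo(music_pcm, tts_pcm, fifo_path="/tmp/snapfifo"):
    """music_pcm / tts_pcm are file-like objects of 16-bit PCM."""
    with open(fifo_path, "wb") as fifo:
        while True:
            music = music_pcm.read(CHUNK)
            if not music:
                break
            tts = tts_pcm.read(len(music))
            if tts:
                music = audioop.mul(music, 2, 0.3)  # duck the music
                music = audioop.add(music, tts.ljust(len(music), b"\0"), 2)
            fifo.write(music)
```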

Hi everyone, Florian here from SEPIA Open Assistant.
Since I’ve built about a dozen SEPIA smart speakers over the past few years, based on open software and hardware, I thought I’d quickly share my experiences so far.

In general I’d say there are 3 basic types of devices:

  1. The minimal satellite for close-range voice input (~20$ - e.g.: mic + ESP32-S3 or RPi Zero)
  2. The basic smart-speaker (~90$ - e.g. RPi4 with mic and speaker)
  3. The fully featured smart-speaker/display (>150$ - e.g. Mycroft Mark II)

Type 1:
The device is basically a remote microphone. It outsources all processing to a remote server (HA, SEPIA, Rhasspy, etc.) and can be built with cheap components. One of my favorite builds is the Raspberry Pi Zero 2 W with the ReSpeaker 2-Mic HAT. ESP32-based devices can be even cheaper, but in my opinion it’s easier to write the software for a real Linux system :slight_smile:.
Advantages are the price and size; the disadvantage is that the features are usually pretty basic, since it does not implement the full client with feedback, multi-turn dialog, etc. The classic use case is: push-to-talk, one sentence, close range.
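
To make the Type 1 idea concrete, here is a toy sketch (not any project's real protocol; Wyoming, Rhasspy and SEPIA each have their own) that just ships raw 16 kHz mono PCM from the mic to a server socket:

```python
# Toy sketch of a Type 1 satellite: capture 16 kHz mono PCM and stream
# it raw over TCP to a processing server. Host/port are placeholders,
# and real projects wrap the audio in their own protocol.
import socket
import sounddevice as sd  # pip install sounddevice

SERVER = ("192.168.1.10", 10700)  # hypothetical audio endpoint

def stream_mic(seconds: int = 60) -> None:
    with socket.create_connection(SERVER) as sock:
        def on_audio(indata, frames, time_info, status):
            sock.sendall(bytes(indata))  # forward each chunk as-is
        with sd.RawInputStream(samplerate=16000, channels=1,
                               dtype="int16", callback=on_audio):
            sd.sleep(seconds * 1000)  # run forever in a real satellite
```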

Type 2:
With a Raspberry Pi 4, a microphone HAT and a small speaker (~10W, 4/8 Ohm) you can build a more advanced client that is even able to run speech recognition on-device. Whisper is too slow for it, but I have good experience with Vosk + custom language models. This is actually my SEPIA daily driver at the moment.
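
For anyone curious what the Vosk side looks like, a minimal sketch; the model path is a placeholder for whatever small model or custom-LM directory you've downloaded:

```python
# Minimal sketch of on-device recognition with Vosk; the model path
# is a placeholder for a downloaded (or custom-LM) model directory.
import json
from vosk import Model, KaldiRecognizer  # pip install vosk

model = Model("vosk-model-small-de")  # placeholder model directory
rec = KaldiRecognizer(model, 16000)   # expects 16 kHz 16-bit mono PCM

def feed(pcm_chunk: bytes) -> None:
    # AcceptWaveform returns True when an utterance is finalized.
    if rec.AcceptWaveform(pcm_chunk):
        print(json.loads(rec.Result()).get("text", ""))
```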
I’ve probably tried all the RPi mic HATs you can find (ReSpeaker 2-mic, 4-mic circular/linear, 6-mic, Waveshare, IQaudio, etc.) :laughing:. I even built my own, but in the end I stuck with the classic ReSpeaker 2-mic HAT. The biggest problem is that the open-source software for microphone arrays (beam forming) is not very good, so you don’t really profit from more than one microphone right now. The same is unfortunately true for all the other DSP functions you need to challenge something like an Echo device. I’ve spent hours and hours with PulseAudio plugins for noise reduction, beam forming and acoustic echo cancellation (AEC), but in the end the results were never good enough to play music and listen for wake words at the same time, or to use the microphone reliably from a distance of more than 3 m.

Type 3:
This is the best you can build right now, and the main difference to Type 2 is the microphone. Mycroft built the Mark II with a custom voice HAT for the RPi4, the SJ201 daughterboard. I believe the reason they did this is that they came to the same conclusion as me: to make open-source speech recognition work, you need the best microphone + DSP you can get, with state-of-the-art performance in AEC, noise suppression and beam forming, plus an integrated speaker driver. The SJ201 board has an XMOS XVF-3510 voice processor that works really well. The only comparable device I’ve tested so far is the ReSpeaker 4-Mic USB array with an older XMOS DSP (~70$), but that one is limited to 16 kHz audio out (bad for music). There may be USB conference microphones out there that can do the same, but I haven’t had much luck so far. I tried the Anker PowerConf S330 once, but the results were pretty disappointing for the price (~60$).

So, to sum this up:
If you need close-range, push-to-talk, single-turn voice input, I think almost anything will work for you. But if you want to build something similar to an Echo Dot that can play music and listen for wake words at the same time, works from a distance and uses open-source ASR, you need a microphone with the best on-board DSP you can get, and that usually starts at around 80$, I’d say.
The Mycroft SJ201 is open-hardware and could be a solid base for new developments, but I don’t know how expensive it would be as a stand-alone product (definitely not cheap).
The other path could be a massive effort to improve all open-source algorithms for noise suppression, beam forming and most importantly acoustic echo cancellation (for music + wake-word).

That was a bit more text than I had planned, but I hope it will help :slight_smile:

11 Likes

It’s not like I picked it out of thin air.

It is still a dedicated Snapcast server, so there is no mixing in of other sources, like TTS audio.

Music Assistant also does this, as well as supporting suspend and resume of playback during TTS announcements.

1 Like

My vote would be an LMS client; squeezelite is a good choice.

I tried LMS, but the system can’t take devices with a delay into account.

But both are lightweight, so it could be the one or the other.

Music Assistant does nothing at this point because it’s back in beta. It has zero integration with HA at the moment.

…at the moment.

Edit: HA integration has been in place for some time now.

Music Assistant is just a streaming service.
Snapcast is a streaming service with synchronization of players.

This one is not actively developed anymore; I would suggest having a look at this fork:

It is working well even on a small ESP32 without PSRAM.

Type 1 does not need to be push-to-talk. Snips used satellite mics (a Pi Zero with a ReSpeaker) with trigger words, and then passed the following audio to the main server to be processed. There’s no need to limit this to push-to-talk or single-phrase activity.

1 Like

The RPi Zero can do wake-word detection with e.g. Porcupine, I agree. The reason I mentioned push-to-talk is that not all cheap hardware will be able to do it, and you usually have to keep these devices rather close anyway, unless you use cloud ASR, stick to English for open-source, or massively reduce your vocabulary (for German I can use Vosk + a ~1000-word custom LM at a range of about 2 m).
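
For reference, Porcupine wake-word detection is compact enough for a Pi Zero 2; a sketch, assuming a (free-tier) Picovoice access key and a hypothetical hand-off function:

```python
# Sketch: on-device wake-word detection with Porcupine, then hand the
# following audio to the server. The access key is required by
# Picovoice; stream_to_server() is a hypothetical hand-off.
import struct
import pvporcupine        # pip install pvporcupine
import sounddevice as sd  # pip install sounddevice

porcupine = pvporcupine.create(access_key="YOUR_PICOVOICE_KEY",
                               keywords=["porcupine"])

with sd.RawInputStream(samplerate=porcupine.sample_rate, channels=1,
                       dtype="int16",
                       blocksize=porcupine.frame_length) as stream:
    while True:
        data, _overflowed = stream.read(porcupine.frame_length)
        pcm = struct.unpack_from(f"{porcupine.frame_length}h", data)
        if porcupine.process(pcm) >= 0:
            print("wake word detected")
            # stream_to_server(stream)  # hypothetical hand-off
```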

“Single-phrase activity” might have been a misunderstanding. I was talking about multi-turn conversations, which require at least a speaker or display and increase the complexity of the client a lot.

1 Like

The satellite only needs to do wake-word detection.
The rest can be offloaded to a powerful server.

3 Likes

I think there are 3 main questions that everybody needs to ask themselves when thinking about satellites:

  • At what distance do I want to use it? (1m, 2m, more?)
  • Do I want to play music on the device?
  • How much am I willing to spend? (20$, 100$, more?)
1 Like

I found something that could work: the ReSpeaker Core v2.0.
https://www.pbtech.com/product/SBCSED0006/Seeed-ReSpeaker-Core-v20-Powered-by-Axol-Core-Modu

From Seeed’s own website: ReSpeaker Core v2.0 | Seeed Studio Wiki
It seems perfect for this type of use.

1 Like

I don’t think that’s quite correct; I believe MA supports sync under some conditions (see below). Note that the item below, while a planned change, references existing capabilities… but the changes seem to be in the furthest-out schedule state.

Link: Music Assistant (V2) backlog · GitHub - Support (experimental) sync of different speakers/ecosystems within Universal group

Content (item status: Draft; opened by marcelveldt on Apr 19, edited):

The Universal Group provider only syncs speakers of the same ecosystem.
Players that do not support sync at all will not be synced and also speakers of different ecosystems will not sync together.

Add an experimental toggle to allow some basic sync (timestamp based) of players so at least they more or less start playing the same song at the same time and not drift seconds apart.

  • start delay (prepend silence or cut frames from the beginning)
  • sync based on elapsed time (best effort, not accurate)
  • standard drift

It is just a newly added draft with no assigned developers.
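
To illustrate the “start delay” idea from the draft (not Music Assistant’s actual implementation), padding the faster players with leading silence might look roughly like this:

```python
# Toy illustration of the "start delay" bullet above (not Music
# Assistant's actual code): pad the faster players with silence so
# everyone starts together; no ongoing drift correction.
RATE, CHANNELS, WIDTH = 48000, 2, 2  # assumed PCM format

def start_padding(latencies_ms: dict[str, int]) -> dict[str, bytes]:
    slowest = max(latencies_ms.values())
    frame_bytes = CHANNELS * WIDTH
    return {
        player: b"\x00" * (int((slowest - lat) * RATE / 1000) * frame_bytes)
        for player, lat in latencies_ms.items()
    }

# start_padding({"sonos": 120, "snapcast": 30})
# -> sonos gets no padding, snapcast gets ~90 ms of leading silence.
```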

1 Like