asked the Music Assistant dev to add the Respeaker and Koala to the logic that the PE uses (playback should be smoother)
added actions to start and stop the Voice Assistant from Home Assistant (e.g. from automations or a button click). Actions: esphome.{device_name}_start_va and esphome.{device_name}_stop_va.
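For reference, here is a minimal Home Assistant automation sketch calling these actions (the device name respeaker and the trigger entities are placeholders; substitute your own):

```yaml
# Hypothetical automations - they assume a device named "respeaker"
# that exposes the start_va/stop_va actions described above.
automation:
  - alias: "Start voice assistant on button press"
    trigger:
      - platform: state
        entity_id: input_button.start_listening
    action:
      - service: esphome.respeaker_start_va
  - alias: "Stop voice assistant at night"
    trigger:
      - platform: time
        at: "23:00:00"
    action:
      - service: esphome.respeaker_stop_va
```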
Thanks for your amazing work, BCE.
Can’t wait for either something like Snapcast support in ESPHome or synced playback between ESPHome media players.
Then I feel this would be the ultimate assistant and I could ditch my Google Home Minis; I only keep them around for some background music for now.
I have my assistant set up with “Prefer handling commands locally” and Assist enabled. When my local LLM is not available/reachable there seems to be no timeout: the purple LED keeps flashing continuously and the assistant is no longer usable. I haven’t found the proper settings in the YAML. Can someone point me in the right direction to fix this behaviour?
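The closest hook I have found so far is the voice_assistant on_error trigger in ESPHome. A sketch like the one below (led is a placeholder for whatever status light your config uses) might at least recover the device, though I am not sure this trigger even fires when the LLM backend is unreachable:

```yaml
voice_assistant:
  # ... existing microphone/speaker settings ...
  on_error:
    # Stop the pipeline and clear the status LED so the device
    # stays usable; code/message are provided by the trigger.
    - voice_assistant.stop:
    - light.turn_off: led
    - logger.log:
        format: "Voice assistant error %s: %s"
        args: [code.c_str(), message.c_str()]
```

If someone knows whether on_error covers this case, I would appreciate a pointer.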
@LaxC Do you mean the fact that it is missing an XMOS xCORE chip? It will work with Home Assistant without one, but nowhere near as well; it is a night-and-day difference in user experience. Without the XMOS xCORE it will be useless for end users (or annoyingly bad at best), so nobody who has tried one with the chip and one without for comparison will buy another one without it.
The newly released official Home Assistant Voice Preview Edition (from Nabu Casa), which also uses an XMOS xCORE, is now the gold standard that will be used as the reference hardware. From now on, if your design is not similar, it will be seen as obsolete/outdated and unreliable for practical use by end users, who require the better quality that the XMOS xCORE offers.
Today I would not recommend trying to make and sell a new commercial voice assistant hardware product marketed as “Made for Home Assistant” and targeting end users (not only developers) if it does not have an XMOS xCORE chip.
As for what the future holds, there will probably be better alternatives to the XMOS xCORE chips released by others (and I am sure Espressif will sooner or later come up with a new MCU and/or a better voice SDK that can compete), but for now this type of XMOS xCORE-based solution will be the gold standard.
@lurvig I see more and more people raising this problem. Let’s make it louder, so the HA/ESPHome devs know it matters!
I don’t think Squeezelite will be a thing; it’s too heavy to co-exist with the voice assistant on the ESP32-S3. But synchronizing Nabu media players should be possible.
Probably, as open source already has two great wireless audio solutions, from Squeezelite to a free open-source Sonos competitor such as Snapcast, which likely runs better on a Pi, with its cottage industry of high-quality audio HATs for the Raspberry.
From Thread border routers to Snapcast, a Pi Zero 2 could run them all just by adding USB radios.
If it wasn’t for an LLM ASR such as Whisper, a Pi 5 running Vosk for ASR plus an NLP toolkit would likely run intents just as accurately for a big drop in compute.
Is the ESP32 such a good idea for a VA? Running TFLite, you can fine-tune via on-device training, so collecting local KW captures would make a KWS that learns to be more accurate for its users.
Not being an ESP dev, I don’t know if TF4Micro supports on-device training: On-Device Training with LiteRT | Google AI Edge | Google AI for Developers
Futureproof Homes has had requests for an additional ESP32-C6 running as a Thread border router.
With an Arm board, especially a Pi 5, it could be a complete VA without needing the cloud or an N100 middleware server (or RK3588 boards); it is Whisper that dictates the hardware requirements, whilst Piper is aimed at low-end embedded. Thread border router, Snapcast server and more…
If it were on an Arm board it could do everything that is currently being asked for, but unfortunately an ESP platform was chosen.
I keep saying Snapcast is too heavy for the ESP32, but I noticed two Snapclient repos that make this untrue, though it is still a squeeze…
I used Snapcast and Squeezelite on the ESP32. They have their own bottlenecks, but all in all they work okay. What is impossible is fitting a lot of functionality into one ESP chip.
The same will be true for the Raspberry Pi, though. Not only is it bigger and less physically versatile, it’s harder to write for than ESPHome, so users would basically rely on the manufacturer for new features or optimizations.
I bet all the requested functionality could fit into the 16 MB PSRAM and processor of an ESP32-S3-class chip, but it would probably require lower-level support from Espressif and getting rid of the ESPHome overhead, and thus would have to be written in plain C++.
That would matter if you had to write it for the Pi, but you don’t, as the code already exists.
From TFLite to Snapcast to even a Thread border router: Raspberry Pi | OpenThread
HA itself benefits from boards like the Pi that can use the many existing Python libs and much else, so you don’t need lower-level support from manufacturers.
TensorFlow itself is a Python API that runs the highly optimised C++ libs provided by Google. Snapcast itself is written in C++ and provides an API for use.
The reality is the opposite of your argument: the ESP was chosen because of ESPHome, but it’s a square peg in a round hole as a VA.
You can bet that’s not going to happen: requiring lower-level support from Espressif, getting rid of the ESPHome overhead, and thus writing it in plain C++.
The closed source that was purchased in was the XMOS far-field speech enhancement, and it can run over I2S or as a USB audio device, so you could run multiple on a Pi.
I tried to use the Wyoming satellite alone, and it is unusable. Every time you need to change something, you have to edit services. Things that are already done for the ESP are missing. Volume ducking is awful. And it’s only a satellite. Add Snapcast to it, and you will get an unsupportable, unstable thing.
Anyway, let’s stop derailing this thread with such conversations.
Snapcast is not like what you describe with Wyoming: its clients just work, as the software is mature and has been production-ready for a long time, dating from its initial development a decade or more ago. The licence in the repo is 10 years old…
Add Snapcast on a Pi and it works…
But yeah, on the ESP32 it would be a massive dev investment, though the two repos above do exist.
You’re not stopping there, I see…
When you try to merge two or more technologies that work with the same piece of hardware (e.g. the speaker, in this case), the number of edge cases to solve doesn’t grow arithmetically, it grows exponentially.
If you’re saying it’s easy and doable, then I think you can do it. A Pi + Respeaker Lite via USB is sufficient hardware, and you already have all the software you need. Please do it; I will be the first to test.
I don’t dev on XMOS or microcontrollers, as that’s too low-level for me.
I had a look at the XMOS docs to try and work out what they were using for speech enhancement; it could be some form of Conv-TasNet.
I did notice that the USB audio device software exists in their repos, so it could likely be made into a USB device, with maybe the 2 mics on a 70 mm daughterboard, perhaps with balanced line drivers.
It’s you who is saying microcontroller dev is easier.
The ESPHome people have managed the other libs, so I presume the USB audio lib is equally possible, but that is microcontroller dev at a lower level than I work at.
Likely I would be better off taking 2-channel audio via a USB sound card and seeing how well a 2-channel Conv-TasNet works and quantises on a Pi, using the TensorFlow toolset and already-existing repos.
I doubt I will, as I’m somewhat confused about where VA is going at the moment, but you could probably drop the XMOS onto a Pi; the AEC on the XMOS does seem to work super well.
There is a lack of low-level audio DSP expertise in the community, hence why the XMOS chip and its libs were purchased in.
OK, I hardly understand what was written, but I assume you won’t write anything.
Let’s keep this thread clean. The discussion can be started in the HA Discord, or in a dedicated topic here.
I am talking about the Respeaker Lite, which is based on the XMOS chip and its libs.
If you don’t understand a comment about the very hardware in the title of the thread, then maybe you just shouldn’t comment. Acoustic echo cancellation (fwk_voice/modules/lib_aec at main · xmos/fwk_voice · GitHub) seems to work well on the XMOS chips, which doesn’t seem true of the Linux libs, WebRTC or Speex.
It has something to do with the RTOS of a microcontroller giving exact routine timings, which a scheduled OS such as Linux, even with preempt, does not.
The bit of magic in the far-field processing might actually be a TFLite model that could run on a Pi, or once more it might be better on an RTOS.
We do seem to have a lack of audio DSP expertise in the community, and as said, that’s why the XMOS chip and libs are purchased in.
Talking specifically about this hardware: the reference designs show it supports USB Audio Class 2.0 and USB Audio Class 1.0.
This is exactly my point, and why it belongs here. Rather than restricting the XMOS XU316, through its I2S, to the constraints of an ESP32-S3 running software that doesn’t care whether it’s on an RTOS or a preemptive OS, it could use its USB interface, where there are no such constraints. From an N100 NUC to a Mac Mini to a Pi, USB Audio Class 2.0 and 1.0 are cross-platform standards.
You could even have multiple XU316 devices on your platform of choice.
So you get the best of both worlds: the DSP that seems to prefer an RTOS, without the shackles of just another microcontroller such as the ESP32-S3.
It would still have the mics, but on a daughterboard connected by a 3.5 mm jack, with the headphone out still a 3.5 mm jack; essentially it would look like a USB sound card that comes with a wired 2-mic microphone…
That way all these requests for further functionality are no problem, as long-established code already exists, simply by getting rid of the ESP32-S3 and utilising the XMOS USB audio libs…
The reason for posting here is that the ESP32 and microcontroller nerds and gurus are the very people who could make the XMOS act as a sound card for Arm and x86 devices.
ESPHome folks, if you wish to make a constrained version using an ESP32-S3, do so; but if you are doing dev work with the XMOS, maybe have a look at dropping the ESP32-S3 and using the XMOS libs so it becomes a USB device that can be used on a wide range of much more powerful devices…
So it’s not just for @formatBCE, who doesn’t seem to know what the Respeaker Lite is, but also for any devs who do; maybe the USB device idea might find favor…
What I hope/wish is that both the Nabu Casa / Home Assistant developers and Espressif get behind, and become a driving force for, the new “Matter Casting” (“Matter Casting Client”) standard in Home Assistant, ESPHome and Music Assistant, for standardized media playback on their voice assistants.
What Matter probably won’t do is enable multiroom music, a feature likely to remain at the ecosystem level. Additionally, don’t hold out hope that Apple and Google, and possibly Amazon, will enable their speakers as Matter device types. While it’s technically possible with Matter, controlling your Apple HomePod from your Google Nest Hub just doesn’t feel likely. I’d love to be proven wrong, though.
All three companies have been bullish about Matter as an industry collaboration, but I can see Apple and Google drawing the line at making their speakers interoperable with competitors outside of their ecosystems. However, Amazon may be more open to the idea, as it has already implemented Matter Casting for its Prime Video service as a nonproprietary alternative to AirPlay and Chromecast for streaming video.
So I dunno; casting to different ecospheres is a pain but likely to continue, and for me that is what HA should be: a cast controller that enables the different types and converts between them, as it does with so many device protocols, while the actual media player should be the best of open source and, again, choice. I doubt many will open-source their cast source code, but I guess we will have to wait and see.
I don’t have trouble with the technical part of your thoughts. I struggle with your grammar and unfinished sentences.
Your theory is very… interesting… And I have tried to use the Respeaker Lite via USB; it works. What I’m implying is that there’s no practical solution to the problem you describe: no software that will bind Snapcast, the VA and the Respeaker together on the Pi platform and connect that to HA.
So, since you claim to be pretty experienced in it, I’m asking you to start making it instead of theorizing.