The era of open voice assistants has arrived

guix77 · December 20, 2024, 9:52am

This 100% open-source platform, designed to be heavily extended, is really what we needed to get to the next level. The hardware and code possibilities with ESPHome are so wide that we really lacked a common starting point. Lossless audio streaming really is the cherry on top. Great vision and outstanding achievement!

OK Nabu. Take my money, send me the PE, and just leave and take a well-earned Christmas holiday now!!!

finity · December 20, 2024, 9:58am

skynet-network · December 20, 2024, 10:01am

Build in a 3.5 mm audio jack and a streaming player, otherwise it is not interesting for me.

Good pairing with Music Assistant.

Lakini · December 20, 2024, 10:02am

It has one.

Hedda · December 20, 2024, 10:03am

Can someone please make a design as replacement PCBs for all Google Nest / Google Home speakers?

That is, now that the production-ready IC components are finalized and the PCB design/schematics are open source, all of this can serve as a reference hardware design. I hope some people in this community are skilled electrical engineers and interested in making replacement PCB designs, with open-source schematics, for existing smart speaker products, so we can retrofit/convert them into ESP32 + XMOS hardware running ESPHome firmware.

I mean, I would love the option to just swap out the circuit board internals and repurpose most existing Google Nest / Google Home and Amazon Echo smart speaker hardware, as many of those already have nice enclosures and built-in speakers that are good enough to play music for multi-room audio (at least if you are not too picky about your Hi-Fi audio quality).

Similar to the Onju Voice project, which previously released open-source schematics for a replacement PCB for Google Nest Mini / Google Home Mini speakers. Updated PCB designs were requested once it became clear that XMOS was going to be used (in combination with ESP32-S3):

github.com/justLV/onju-voice

[REQUEST] New PCB with both an ESP32-S3 and an xCORE chip from XMOS for advanced audio processing as a ESPHome compatible voice-kit?

opened 11:24AM - 07 Jul 24 UTC

Hedda

Any chance someone with PCB engineering skills, time and interest could re-design a new/updated custom "Onju Voice" PCB for the Google Nest Mini and Google Home Mini speaker series that includes an xCORE DSP (XU316-1024-QF60B-C32) IC chip from XMOS?

* https://www.xmos.com/processor-catalogue/
* https://www.xmos.com/download/XU316-1024-QF60B-xcore.ai-Datasheet(3).pdf

Not sure if all of you have followed the news about Nabu Casa's upcoming voice-assistant hardware project, but I just heard Paulus Schoutsen reveal on [Home Assistant's ESPHome Summer Release Party on YouTube](https://www.youtube.com/watch?v=hbcGz3ZlUX4) that Nabu Casa's ESPHome developers are working on a new open-source "Assist Satellite" development platform with matching open-source reference hardware for their voice-assistant products. It will be based on an ESP32-S3 in combination with a very powerful [XMOS xCORE DSP chip](https://www.xmos.com/develop/xcore-voice) for advanced audio processing, with the ESP32 running ESPHome firmware with many new voice, speaker, and media player features that are in the works (or at least in the theoretical planning stages).

Update! Check out the repositories where ESPHome + Nabu Casa developers are experimenting with custom Voice Satellite firmware for ESP32 + XMOS:

* https://github.com/esphome/home-assistant-voice-pe
* https://github.com/esphome/voice-kit/

Some other independent ESPHome developers are, by the way, also working on related audio component enhancements, see:

* https://github.com/esphome/feature-requests/issues/2859

As I understand it, once new feature/function components or enhancements/changes to existing ESPHome components have been tested enough there, they plan to back-port them upstream to the main ESPHome repository. The goal is to create a single, standardized and homogeneous development environment for Home Assistant's "Assist Satellite" (remote voice satellite hardware for voice control of Home Assistant):

* https://github.com/home-assistant/architecture/discussions/1114
* https://www.home-assistant.io/voice_control/

Anyway, XMOS xCORE DSP (Digital Signal Processor) chips are designed to use the I2S (Inter-Integrated Circuit Sound) interface for high-speed interprocessor communication, so that they work as a low-latency, high-performance sound/audio co-processor for microcontrollers like the ESP32.

* https://www.xmos.com/xcore-ai

The use case for ESPHome Voice Satellite development is in-line off-loading of audio noise removal (voice clean-up) from the microphone input(s), such as Interference Cancellation (IC), Acoustic Echo Cancellation (AEC), Noise Suppression (NS), Automatic Gain Control, etc., and/or other audio post-processing algorithms to improve the solution's voice recognition capabilities, as well as similar in-line off-loading for multichannel audio output to speakers. Depending on which XMOS chip they use, the XCORE-VOICE framework could technically also allow for up to 16 PDM microphones to be connected to a single xCORE device (with a different PCB design).

* https://www.xmos.com/develop/xcore-voice
* https://www.xmos.com/usb-multichannel-audio/

UPDATE: News about the upcoming ESPHome voice-kit hardware platform by Nabu Casa is also on the Home Assistant roadmap, and that ESPHome-based voice-kit hardware has since also been mentioned by Mike and Kevin during the Voice Chapter 7 livestream:

https://www.home-assistant.io/blog/2024/06/12/roadmap-2024h1/#current-priority-2-make-assist-easier-to-start-with

_"**Current priority 2: Make Assist easier to start with**" ... "**we’re exploring building our voice satellite hardware to create a more plug-and-play experience.**"_

https://www.home-assistant.io/blog/2024/06/26/voice-chapter-7/

I'm paraphrasing, but one of the representatives working on that "***voice-kit***" project more or less wrote there that, while ESPHome and Home Assistant voice developers from Nabu Casa are now focusing on voice assistant features, they are still figuring out how to make use of external audio processors. While they are currently only testing the XMOS xCORE chip as a candidate for an ESPHome-based voice-kit reference hardware design for the official Home Assistant Voice Assistant development kit, they also plan to work on an "audio processor" component for ESPHome with a hardware-independent architecture. It will not be reliant on specific hardware configurations or dependent specifically on the XMOS xCORE DSP chip, but will instead allow others to add support for additional DSPs as audio processors (i.e. sound co-processors) in the future. They will also make it so that all the I2S settings and pins are still configurable in YAML, meaning that it should at least be possible to add support for other DSP types to the "audio processor" component if they work similarly to XMOS xCORE DSP chips, as well as for different board designs that use other I2S settings and pins.

That representative also wrote "_**we will add all the code to the base ESPHome project once things are stable and working well**_", and noted that ESPHome and Home Assistant / Nabu Casa developers are right now moving very fast and breaking things as they go, so they are working on code for the new voice-kit related components for ESPHome in a separate repository on GitHub:

* https://github.com/esphome/voice-kit/

By the way, I think similar chips from XMOS, like their XU-316 AI (xCORE XU316), are used in Amazon Alexa Voice Service (AVS) Development Kit(s), and in some Amazon Echo products as well as other popular products:

* https://www.xmos.com/develop/xcore-voice
* https://www.xmos.com/xk-voice-l71
* https://www.xmos.com/xmos-delivers-first-amazon-alexa-voice-service-development-kit-with-linear-mic-array-for-far-field-voice-capture/
* https://www.xmos.com/fully-offloaded-giving-smart-tvs-the-voice-power-they-deserve/
* https://www.xmos.com/making-smart-speakers-feel-at-home/
* https://developer.amazon.com/en-US/blogs/alexa/post/edde861e-65a4-4319-b5f1-422b0c626673/xmos-expands-device-types-supported-by-dev-kits-for-avs-with-a-far-field-linear-mic-array-based-reference-solutio
* https://developer.amazon.com/en-US/alexa/devices/alexa-built-in/development-resources/sdk

As far as I can tell, the complete source code for XMOS's xcore-voice firmware is available on GitHub in the sln_voice repo:

https://github.com/xmos/sln_voice

More information about that is in the user guide for their XK-VOICE-L71 evaluation board:

https://www.xmos.com/download/XVF3610-User-Guide(v5_7_3).pdf

Since Nabu Casa's designs are said to be open-source hardware, and XMOS integration will probably be added to ESPHome's Media Player Components (and Microphone Components), I for one am hoping that it could and will be extended to different types of speakerless appliance solutions with an AUX-output/audio-output and AUX-input/audio-input port, and not only to voice assistants.

Personally I would also love to see inexpensive speakerless network-streamer player/receiver hardware without microphones, with only an AUX-out that can connect to any of your existing amplifiers or speakers with built-in amplifiers, in order to replace products like [Chromecast Audio](https://support.google.com/chromecast/answer/6279371?hl=en) and [Amazon Echo Input / Echo Link Amp](https://en.wikipedia.org/wiki/Amazon_Echo#Speakerless_devices) (i.e. devices with no on-board speakers that must be connected to external speakers for audio output (AUX-output)). That is, I am sure that not everyone only wants "smart speakers" with a voice assistant, and that many would instead be happy to have network streamers/players without a microphone whose only purpose is to receive and output the highest quality audio possible from Music Assistant to your "dumb" speakers. I for one still have loads of [Chromecast Audio](https://support.google.com/chromecast/answer/6279371?hl=en) audio-only receivers connected to various models and brands of speaker/receiver systems in each room, used to achieve multi-room music playback on a budget (because I could not afford Sonos speakers in all rooms). So even though Nabu Casa's hardware will initially primarily be designed as a "Home Assistant Satellite" (also known as "Wyoming Satellite") for voice-assistant appliances, such open-source hardware, just like the ESPHome firmware, does have a lot of potential for different use cases.

Also on my wishlist is network streamer receiver hardware with an AUX-input and ADC to get music from an analog audio source, as an easy way to achieve a remote AUX input into Music Assistant from an external analog audio source like a vinyl record player (LP turntable) or cassette player. What I want to achieve is a solution that is easy to install, maintain and use, and that allows my wife to stream music from a vinyl record player (LP turntable) to any speaker or group of speakers in our home. The vinyl record player (turntable) setup she has includes a pre-amp with phono (RCA) output ports for analog audio in stereo.

* Architecture example: Analog audio source with preamp -> ADC network appliance -> music stream -> Music Assistant -> Any speakers

I would therefore prefer if we could buy some kind of networked (Wi-Fi) appliance, like a music streamer with a stereo AUX input port, that performs on-the-fly analog-to-digital conversion (ADC) + encoding for streaming to a Music Provider inside Music Assistant. I do however think that such a solution needs its own non-proprietary audio-only streaming protocol for high-quality music streams?

stuartiannaylor · December 20, 2024, 10:11am

Really, open source should be interoperable with already existing open-source software.
There are 2 great pieces of open-source wireless audio software: Squeezelite, which runs on ESP32 (see RASPIAUDIO on GitHub), and Snapcast (badaix/snapcast on GitHub, a synchronous multiroom audio player), which runs on a Pi.

Squeezelite is more limited than Snapcast, as Snapcast is a full-blown open-source Sonos challenger for wireless multichannel audio.

You place your speakers in the best place for speakers, which is usually a stereo pair on a facing wall giving room coverage.
This allows your microphone to be optimally placed: close by and away from your speakers. By not cloning the commercial smart speakers there is also always far more choice, as not only are they your smart speakers, they can be cast to by any device you set up with open-source casting software.
You can pick whatever amplifier you wish, and choose whether each speaker is an active wireless one or whether a receiver drives several speakers.

My setup is Snapcast with a Pi that needs no enclosure, as it's stuck on the back of a subwoofer I got from eBay for £20. There is a whole load of very cheap but amazing-quality gear out there, as class D amp boards have improved so much.

If you want a little more quality, then WONDOM (Sure Electronics) make some great audio boards.
I have 2x bookshelf speakers, which again were second-hand eBay buys, as some great bargains can be had.

Not embedding a speaker creates choice and opens up to other devices that can cast to them so those speakers can be the output for all room media not just a ‘smart speaker’ …
It also makes enclosure design much easier, as the engineering that goes into the Google devices and the rest is actually immense. You can check out a Nest Audio and its rigid cast-metal body, which stops resonance in its casing to help isolate the speaker from the microphone array. https://www.youtube.com/watch?v=4-3VodA-Nlo

Separating the microphone just makes the software and engineering needs so much easier. Enclosures… stick your amps to the back of your speakers on hex pillars and feed them from a 24 V brick PSU…

tomas1 · December 20, 2024, 10:26am

I might have misunderstood it, but how do the timers work? Do they run on the device itself? Because that is the biggest problem I have with timers in Alexa: they run on that device. Is it possible to say “set tea timer for 2 minutes” and have it start an actual timer entity in HA, so I can display it on a dashboard and have it announced in whatever room I’m currently in, instead of on the original device?

pimw · December 20, 2024, 10:32am

Congrats on the product release! I’ve ordered one.

Question:

I’d like to use it with local control only
The only thing I want to do is have the assistant execute at most 10 scripts
No interest at all in LLMs and all the fancy stuff
English language is fine

Is it reasonable to do this on an Intel i5-10500, with 2 cores exposed to Home Assistant OS, without an (i)GPU?

Lakini · December 20, 2024, 10:37am

They added more files overnight; this page is pretty extensive by now: Downloads – Home Assistant Voice Preview Edition

BeastHouse · December 20, 2024, 11:04am

Bummer, they must be out of stock now!

stuartiannaylor · December 20, 2024, 11:22am

It’s sort of strange, as the audio out is being returned from the upstream ASR, when really the upstream ASR and intent response could just stream to wireless audio.

A Pi with a ReSpeaker 2-mic can do the same, but I dislike the lag the driver software has behind each new version of RaspiOS, and prefer using stereo USB soundcards such as the Plugable USB Audio Adapter (Plugable Technologies)
or the ADA-15 USB HQ MINI audio (Axagon).

I am not really a fan of how Mike sets out the voice infrastructure as, yeah, audio out is central and likely should not need to be in a microphone enclosure.
The Python Wyoming ‘open standard’ is freshly created, whilst Linux has a huge array of high-performance C libs for audio. It doesn’t make a lot of sense, for me at least, when we have everything from ALSA to Pulse and the newer PipeWire. You could even just pipe to a network socket if you wished, which again uses existing high-performance Linux libs rather than Python creating unnecessary load for embedded…
But when you have great opensource wireless audio such as Squeezelite or Snapcast it makes even less sense to me.

I have been following Rhasspy and Mycroft from the early days and have a repo at StuartIanNaylor on GitHub, but I am thinking of starting again with LinuxVoiceContainer on GitHub, just to create some tutorials on how to DIY and use some of the first-class, high-performance audio libs Linux already has to offer.
Building a beamforming microphone array on a Pi Zero2 or Radxa, as I think I can do better with open source than opting for closed-source hardware such as the XMOS…
Next couple of days I will be making some vids and tutorials on LinuxVoiceContainer · GitHub as an alternative to the HA offering as the implementation often has me bemused.
Only little things, but they add up. With stereo beamforming you generally have a front-facing device, where the enclosure itself acts to attenuate sound from the rear.
Facing up, as with the HA unit with its 2 mics on top, the beamforming is only on the x axis, as three mics in a triangular config is the minimum to also include the y axis of the plane…
I guess you could use the HA unit on its side, but the wheel and button don’t lend themselves to that in the manner it’s been constructed.
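To illustrate the two-mic point above, here is a minimal Python sketch (not taken from any HA or XMOS code) of the far-field time difference of arrival (TDOA) for a mic pair. The mic coordinates, spacing and source angles are made-up values for illustration only. With both mics on the x axis, two sources mirrored across that axis produce the same delay, so the pair cannot tell them apart; a third mic off the axis adds a second baseline that separates them.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def tdoa(mic_a, mic_b, azimuth_deg):
    """Signed far-field arrival-time difference in seconds between two mics.

    Mics are (x, y) positions in metres; the source direction is an
    azimuth in the x-y plane, measured from the x axis.
    """
    ux = math.cos(math.radians(azimuth_deg))
    uy = math.sin(math.radians(azimuth_deg))
    # Project the mic baseline onto the direction of arrival.
    path_difference = (mic_b[0] - mic_a[0]) * ux + (mic_b[1] - mic_a[1]) * uy
    return path_difference / SPEED_OF_SOUND


# Hypothetical geometry: two mics on the x axis, plus an optional third
# mic off the axis to form a triangle.
MIC_1 = (-0.03, 0.0)
MIC_2 = (0.03, 0.0)
MIC_3 = (0.0, 0.05)

for azimuth in (60, -60):
    pair_x = tdoa(MIC_1, MIC_2, azimuth) * 1e6
    pair_y = tdoa(MIC_1, MIC_3, azimuth) * 1e6
    print(f"azimuth {azimuth:+4d} deg: mic1-mic2 {pair_x:+7.1f} us, mic1-mic3 {pair_y:+7.1f} us")
```

The mic1-mic2 delay is identical for both azimuths, while the mic1-mic3 delay differs, which is the extra information a triangular layout provides.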
In fact, why have a wheel and button for a voice input… and again I am bemused.
Also, why use Whisper, as it’s huge and not that great for command sentences, and why are we waiting for HA ASR when so much existing ASR is already production proven?
HA is a great piece of open-source automation control software for nearly all home control devices and protocols.
I am confused why, like Google and the rest, they seem to be making their own embedded brand of everything from ASR and TTS to wireless audio, when so much already exists in the open-source arena.
My current favorite for ASR is WeNet (wenet-e2e/wenet on GitHub, a “Production First and Production Ready End-to-End Speech Recognition Toolkit”), as it’s massively lighter than Whisper and can run on much lighter hardware, or be a central ASR on a multi-client system where recognition latency gets very small the more hardware you throw at it.
It’s all been really frustrating, as open source does have competing software but has nowhere near the level of discipline in the datasets used to train it that big data has, and this is still true.
I am not sure why there isn’t more focus on creating true high-quality large datasets and new language models for existing software, rather than refactoring and creating own-brand modules…
But hey…

mp583 · December 20, 2024, 12:01pm

Excellent! It looks really cool, I like that I can plug it into another speaker for music. I imagine including a high quality speaker at this point bumped up the price too much. Excited for the RGB ring light too.

I think it needs a more friendly name though, something like Harvey - H(ome)A(ssistant)rV(oice)ey?

CJB · December 20, 2024, 12:50pm

Great! But please don’t forget us in New Zealand as well.

Tylast · December 20, 2024, 1:42pm

Is there a way to have an intercom feature between 2 of these devices?

domain_int · December 20, 2024, 1:49pm

I know it’s a pipe dream, but it would make me purchase 10 of these: can they sync music???

Praying it’s a yes.

domain_int · December 20, 2024, 1:49pm

This would be cool

stuartiannaylor · December 20, 2024, 2:13pm

Don’t think so, and it’s a shame it doesn’t use existing open-source wireless audio software and just act as a client to one of those.
Not sure how many resources are left on the ESP32-S3, but Squeezelite has been ported to ESP32, as in https://raspiaudio.com/, whilst Snapcast, the full-blown open-source Sonos alternative, has much tighter sync that feathers the time sync with zero glitches. For Snapcast a Pi Zero is the minimum, with a Pi Zero2 probably being a better bet, as it’s a huge step up for only £5 more.
Squeezelite and Snapcast are great pieces of open-source wireless audio software. One is for lighter hardware (Squeezelite), and the other you could argue is even better than Sonos, with tighter sync and up to 96 kHz multichannel if your hardware can cope; it’s all written in high-performing C and supposedly still runs on the original Zero.
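As a rough guide to what "if your hardware can cope" involves, the raw PCM data rate is just sample rate times bit depth times channel count. The sketch below is plain arithmetic in Python, not Snapcast code; the bit depths and channel counts are assumed examples, and any compressing codec used by the streaming software would lower the rate on the network.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Example stream formats; the bit depths and channel counts are assumptions.
formats = [
    ("CD-quality stereo", 44_100, 16, 2),
    ("96 kHz stereo", 96_000, 24, 2),
    ("96 kHz 5.1", 96_000, 24, 6),
]

for name, rate, bits, channels in formats:
    mbps = pcm_bitrate_bps(rate, bits, channels) / 1_000_000
    print(f"{name}: {mbps:.3f} Mbit/s")
```

Even the assumed 5.1 case works out to under 14 Mbit/s uncompressed.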

Mosher · December 20, 2024, 2:20pm

Is there a way to improve response speed of the voice assistant by adding some hardware accelerator like Hailo AI?

HVR88 · December 20, 2024, 2:30pm

MUTE only means one thing - to stop something from producing sound. Maybe someone would use the word to imply silencing a mic, but that’s incorrect usage of the word.

ndom91 · December 20, 2024, 2:42pm

One of the annoying things to me on the S3-Box-3 was that when timers triggered and the sound played, it could only be stopped by physically pressing the button on the box.

Is it possible to stop the timer alert sound via voice now?

For example, with Google Assistant you set your timer, it triggers and the bell rings, and then you can just say “stop” to have it stop ringing.

I’d love this sort of functionality here.