The era of open voice assistants has arrived

I don't have a compulsion to collect junk and electronic waste. Hopefully you can read the screenshot yourself, if you understand what's written, as well as the transcript from the release video.

I prefer being realistic over blind cheering and expect to get what was promised for my money.

That says "today" (implied: under the applied constraints).

Very obviously, one could build better hardware for a higher price, even today.

2 Likes

@Molodax Just sell the one you bought on eBay or some other marketplace for used stuff. They are sold out everywhere, so you should have no issues selling it to someone else who was not fast enough to buy one from the first batch and is now eagerly waiting for the next batch to be manufactured.

If you want citations, then start with the FAQ on the product page at https://www.home-assistant.io/voice-pe/; here are just a few quotes that should not be open to interpretation about what this product is and is not intended to be:

  • Why is this called the Preview Edition?: It is our vision to make open, local, and private voice assistants a reality in any language. While we have made great strides in realizing this, it is such a massive undertaking that we need our worldwide community to participate in its development. An essential ingredient for the community to drive the project forward is a standardized hardware platform for voice, built for Home Assistant from the ground up: Home Assistant Voice Preview Edition. While for some, the current state of our voice assistant may be all they need, we think there is still more to do before it is ready for every home in every country, and until then, we’ll be selling this Preview of the future of voice assistants. Taking back our privacy isn’t for everyone - it’s a journey - and we want as many people as possible to join us and make it better.

  • Can I play music on this device?: Yes, if you plug an external speaker into the 3.5mm audio port. The built-in speaker is meant for voice feedback and is not optimized for listening to music, but the included DAC is capable of playing lossless audio on a suitable external speaker. We recommend using Music Assistant to control music playback.

  • Can this replace my Google Mini, Apple HomePod, Amazon Echo, or other Big Tech devices?: In the future, we intend to match and then surpass the Big Tech voice assistants, but for now, this Preview Edition can not yet do everything those devices can. For some, the current capabilities of our voice assistant will be all they need; especially those who just want to set timers, manage their shopping list, and control their most used devices. For others, we understand they want to ask their voice assistant to make whale sounds or to tell them how tall Taylor Swift is - our voice assistant doesn’t do those things… yet.

And again, the two main announcement posts contain loads of not-so-subtle sentences that try to convey what they want to achieve by releasing a preview edition of this new open-source hardware in order to start a new open-design movement. I will quote a few of them from https://www.home-assistant.io/blog/2024/12/19/voice-preview-edition-the-era-of-open-voice/ and https://www.home-assistant.io/blog/2024/12/19/voice-chapter-8-assist-in-the-home/:

  • Since we began developing our open-source voice assistant for Home Assistant, one key element has been missing - great hardware that’s simple to set up and use. Hardware that hears you, gives you clear feedback, and seamlessly fits into the home. Affordable and high-quality voice hardware will let more people join in on its development and allow anyone to preview the future of voice assistants today. Setting a standard for the next several years to base our development around.

  • The era of open, private voice assistants begins now, and we’d love for you to be part of it.

  • Our main goal with Voice Preview Edition was to make the best hardware to get started with Assist, Home Assistant’s built-in voice assistant. If you’re already using other third-party hardware to run Assist, this will be a big upgrade. We prioritized its ability to hear commands, giving it an industry-leading dedicated audio processor and dual microphones

  • Why Preview Edition: For some, our voice assistant is all they need; they just want to say a couple of commands, set timers, manage their shopping list, and control their most used devices. For others, we understand they want to ask their voice assistant to make whale sounds or to tell them how tall Taylor Swift is - this voice assistant doesn’t entirely do those things (yet). We think there is still more we can do before this is ready for every home, and until then, we’ll be selling this Preview of the future of voice assistants. We’ve built the best hardware on the market, and set a new standard for the coming years, allowing us to focus our development as we prepare our voice assistant for every home. Taking back our privacy isn’t for everyone - it’s a journey - and we want as many people as possible to join us early and make it better.

  • "Fully open and customizable: We’re not just launching a new product, we’re open sourcing all of it. We built this for the Home Assistant community. Our community doesn’t want a single voice assistant, they want the one that works for them – they want choice. Creating a voice assistant is hard, and until now, parts of the solution were locked behind expensive licenses and proprietary software. With Voice Preview Edition being open source, we hope to bootstrap an ecosystem of voice assistants. We tried to make every aspect of Voice Preview Edition customizable, which is actually pretty easy when you’re working hand-in-hand with ESPHome and Home Assistant. "

  • We also made the hardware easy to modify, inside and out. For instance, the included speaker is for alerts and voice prompts, but if you want to use it as a media player, connect a speaker to the included 3.5mm headphone jack and control it with software like Music Assistant. The included DAC is very clean and capable of streaming lossless audio.

  • Community-driven: The beauty of Home Assistant and ESPHome is that you are never alone when fixing an issue or adding a feature. We made this device so the community could start working more closely together on voice; we even considered calling it the Community edition. Ultimately, it is the community driving forward voice - either by taking part in its development or supporting its development by buying official hardware or Home Assistant Cloud. So much has already been done for voice, and I can’t wait to see the advancements we make together.

  • Conclusion: So many new innovations and improvements for Assist have happened in the past couple of months, and this speaks to the power of having good hardware to build our software on. Voice Preview Edition is the best open voice hardware available today, and even with it only in the hands of a couple of hundred people today, it’s making a noticeable difference. Whether that’s writing code, improving language support, making blueprints, or even just reporting bugs. The momentum we will build having this in the hands of thousands will be game-changing - it’s why we’ve declared that the era of open voice assistants has arrived.

PS: I will now ignore your future posts, since to me you simply look like someone whining after buying a product without even reading the announcement or the product page, which left you with very unrealistic expectations that you now seem to have a personal problem understanding and dealing with (and hopefully you will come to grasp and accept that this is an inexpensive product: the $59 US price is not a lot today, as post-pandemic inflation, the chip shortage, and the shipping crisis have driven prices up sharply compared to pre-COVID-19).

10 Likes

If you like tinkering, you can use the Grove port on the bottom to connect an I2S DAC and amplifier for a speaker. If you want a finished product, you'll just have to be patient.

EDIT:
Someone with 3D CAD skills, please create a base for the Voice PE to sit on that has a spot for a big ol' speaker. That would be sweet!

I don't think the Grove port supports I2S; it does I2C, but that is not the same. I2S has an L/R (word select) clock, a bit clock, and a data line, plus the master clock most DACs need, so at least 3 pins, not 2.
Also, the current I2S pins are taken (?), so you would have to bit-bang GPIO…
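For anyone exploring the external-DAC route anyway, this is roughly what a generic ESPHome config for an external I2S DAC looks like. It is only a minimal sketch: the GPIO numbers are placeholders, not the Voice PE's actual pinout, and as noted above the I2S buses on the Voice PE may already be claimed by the onboard audio path.

```yaml
# Minimal generic ESPHome sketch for an external I2S DAC.
# GPIO numbers are placeholders for illustration only; on the
# Voice PE the I2S buses may already be taken by onboard audio.
i2s_audio:
  - id: i2s_out
    i2s_lrclk_pin: GPIO33   # word select (L/R clock)
    i2s_bclk_pin: GPIO19    # bit clock

media_player:
  - platform: i2s_audio
    name: "External DAC"
    i2s_audio_id: i2s_out
    dac_type: external      # the DAC sits outside the ESP32
    i2s_dout_pin: GPIO22    # serial data out to the DAC
    mode: stereo
```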

There is a DAC on the 3.5mm jack, but it's a shame it's not wireless audio, where a client would just select which stream to play, as in Snapcast.
Then you would have complete choice of placement, DAC, and speaker(s), plus a choice of codecs, all interfacing with standard Linux audio.

While I agree that such features would be very nice to have, you have to understand that they have almost nothing to do with this hardware or the ESPHome firmware that runs on it.

All such features would instead need to be implemented in Music Assistant (for the server/streaming side) as well as in the media player integration component (for the media controls on the UI/dashboard side), not in the ESPHome client device, which in this case is just a dumb audio output. So instead, check out those projects and read their documentation.
There you might, by the way, notice that Snapcast is already supported in Music Assistant, and that Slimproto (a.k.a. Squeezebox) is an alternative open protocol that is currently the best-supported player/client for Music Assistant (again, as a player for Music Assistant, i.e. a media player streaming client).

However, if you have more specific new ideas for improvements related to that, then you should address them as feature requests to Music Assistant and the HA core/frontend teams. See here:

That is, it is Music Assistant that is the integration platform and glue between the audio source/provider and the audio player/client for all audio streaming; Home Assistant more or less only acts as an interface for it (as well as an optional automation framework).
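As a concrete illustration of that split: Home Assistant only issues the service call, and Music Assistant does the actual streaming to the player. A sketch (the entity id and content id are made-up placeholders; check the Music Assistant docs for the exact media formats it accepts):

```yaml
# Sketch only: entity id and media_content_id are placeholders.
# Music Assistant resolves and streams the media; Home Assistant
# just targets the MA-provided media_player entity.
service: media_player.play_media
target:
  entity_id: media_player.kitchen_squeezebox
data:
  media_content_id: "My Favourite Playlist"
  media_content_type: playlist
```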

1 Like

When it is a preview product, as you so often like to point out, nothing is fixed yet, especially not opinions, or where and with whom you can share them, as this is supposedly open source.

The Snapcast issue brings up the point that much, from the TensorFlow model to items such as Snapcast, shows the limitations and raises the question of whether the ESP32 was the right hardware, due to its limitations.
If the XMOS board implemented the USB Audio Class 1/2 drivers, and a Pi5 or N100 made up a complete smart speaker, the answer would be yes to all those additions.
It would also open the system up to HA's most common language, Python, rather than the current bottlenecks of some very advanced and complex microcontroller C routines.

The era of open voice assistants has arrived, but will it be confined, and will future problems be caused, by choosing an extremely limited platform to develop on?
The ESP32 is superb for sensors and a whole range of devices.
Get to the complexity of a locally based smart speaker, though, and it is no longer a really good fit.

If you wish to raise this issue where you suggest, then please do…
Slimproto (a.k.a. Squeezebox) is limited due to the constraints of the ESP32. There are some efforts to bring snapclient to the ESP32, which could work, but a smart speaker needs to be both server and client.
That means it can serve streams to remote players as well as itself, and choosing local or remote playback is just a matter of selecting a stream…

That will never fit on an ESP32, and much else that could be added will not either.

If you want a Raspberry Pi or PC-powered version, that's certainly possible; I've built three of those following the steps laid out by FutureProofHomes. However, those are not something that the average person, the target audience of the Home Assistant Voice or Satellite1, would be willing to go through to get something set up. Not to mention that the Google and Amazon voice assistants are not really any more powerful than what Nabu Casa has released, because most of the power behind them is cloud-based, not local.

If you can, and you don't mind enabling the USB Audio Class 2 driver on a ReSpeaker Lite, please do.
Google and Amazon used a mixture of silicon: application processors plus DSP microcontrollers.
The original Echo Dot was actually Pi3-like, but still massively more compute than an ESP32. I think they included a https://www.cadence.com/en_US/home/tools/silicon-solutions/compute-ip/hifi-dsps/hifi-4.html DSP, as Cadence is probably the biggest hi-fi DSP supplier. They had hybrid custom silicon, as they have economies of scale.

You need to do some teardowns, as what you say is not true of the 1st gen, and especially not of the later generations that have more powerful application processors, likely running NPU-model-based DSP rather than the original beamforming.
Both Google and Apple have complete offline solutions using AI accelerators, with Google in the lead. That is why I have a Pixel 6 phone: to test the new TPU and offline models and see where mobile SotA was at.
That level of tech isn't much more than a Pi5/RK3588 and likely less than an N100 (the Pixel 6 is an octa-core: 2x 2.80 GHz Cortex-X1, 2x 2.25 GHz Cortex-A76, 4x 1.80 GHz Cortex-A55).
The cloud is leaking money badly, and likely the only reason we don't have local big-brand tech of that sort is that they are holding out to run LLMs locally, such as Gemini Nano, which only runs on later Pixel phones.

You don't need the cloud to run a smart speaker with ASR, but a big problem is the adoption of a huge LLM-based ASR such as Whisper, which isn't that great for command sentences, as its reported accuracy comes from a 30-second context, and it runs pretty badly even on a Pi5 or RK3588 board.
Why we have Whisper, and why everything has to be HA refactors and rebrands rather than just using any of the existing, much lighter open-source options, is a mystery.
Same with using an LLM for control: with NLP there really should be no need for LLM levels of compute, when NLP can recognise entities and there are lightweight open-source frameworks that do so.
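For what it's worth, Home Assistant's built-in Assist already does exactly this kind of lightweight template matching (via hassil) without any LLM. A minimal custom-sentence sketch, assuming the standard custom_sentences layout from the docs (the phrasings are just examples):

```yaml
# config/custom_sentences/en/switch_on.yaml
# Minimal hassil template sketch: {name} is matched against your
# exposed entity names; no LLM compute involved.
language: "en"
intents:
  HassTurnOn:
    data:
      - sentences:
          - "switch [the] {name} on"
          - "power up [the] {name}"
```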

A Pi5/RK3588 would currently make a great all-in-one smart speaker that is extensible and plugs into your TV. If you want an LLM base, then you will likely go for more compute.
The bit we have been missing is a far-field microphone system, and it's such a shame this is still excluded when the XMOS USB Audio Class 2 libs are there; they just need some microcontroller people to implement them.
If that had been done, the huge array of devices that accept plug-and-play audio, from PCs and tablets to phones, could have VA applications developed for them without anywhere near the huge restrictions in compute and storage that an ESP32-S3 imposes.
Even a $15 Pi Zero 2 has much more compute than an ESP32, and the only reason the ESP32-S3 was chosen is ESPHome branding, which is a shame, as VA doesn't seem to sit very well on it.

Send it to me! Please pay postage.

:wink:

1 Like

No worries, there is very good recycling here :recycle:

I would wait for new firmware; it will likely be incremental for quite some time.

I'm pleased with it so far. I love how to-the-point it is, and that it doesn't try to be too clever and read between the lines like Google does. The only thing is that I need something that speaks a little more loudly; I am over 70 and the ears don't work like they used to.

I think a lot of it is an element of luck in how well your voice fits what mWW sees as a positive KW.
KW classification can be really strange: what you might think sounds similar can give very different results.
I did quite a bit of experimentation with google-research/kws_streaming at master · google-research/google-research · GitHub, which mWW is supposedly extracted from, testing the different types of models.
Is this repository more accurate then OpenWakeWord ? · Issue #28 · kahrendt/microWakeWord · GitHub
It's hard not to be critical if you run through the training method and actually do some simple listening and examination.
Piper creates 1,000 synthetic KW samples in 2 American English voices (male/female) with very little variation, which overfits the start of the dataset to a narrow pattern.
It then gets a bit weird and applies reverberation via recorded RIRs, all at 1.5 m; the odd part is that most of the RIRs are huge hall types, municipal buildings, and even forests, with some smaller, more normal rooms, but all at the same fixed distance.
So for reverberation the dataset is a single mic at 1.5 m, which again overfits to a narrow, strange selection, especially when you have bought in the speech enhancement of a 3 m far-field XMOS chip with a mic array that supposedly attenuates reverberation and extracts voice.
The community is sort of blind, as the tools/software/firmware to use the XMOS USB Class 1/2 audio drivers have not been implemented, and for some reason no one has provided a basic always-on Wyoming streamer. With one, you could at least capture data, analyse it, and get some idea of how well the XMOS algorithms work.
Things get even stranger, though, as TF4Micro has some very rudimentary NS & AGC that must be poor versions of the duplicate processes on the XMOS chip, as otherwise there would be little need to buy in the XMOS. Unfortunately they sit upstream, so the quality of the NS & AGC is set by the lowest common denominator.
Apparently there is no plan to turn off the (at the very least wasted) ops of TF4Micro NS & AGC, because when testing with a model where TF4Micro NS & AGC were already trained in, enabling NS & AGC was the most accurate… Turn off the microfrontend autogain and noise_suppression? · Issue #279 · esphome/home-assistant-voice-pe · GitHub. Maybe test a model that has been trained with the options off…
The dev team really do need to read up on the importance of the dataset and how the current methods are riddled with basic 101 errors.
There are a hell of a lot of basic ML concepts available online, but for some reason they seem not to have been read. Datasets: Data characteristics | Machine Learning | Google for Developers

The dataset is hugely important, and small, lightweight models can be extremely accurate, but as the old saying goes, garbage in, garbage out; the current datasets are extremely poor.
For some reason, the easiest and best method to actually capture in-use data and allow an opt-in to publish it to an open-source dataset is ignored.
The current initiatives to capture data are also deeply flawed: Misunderstanding the problems with room reverberation (RIR) · Issue #11 · OHF-Voice/wake-word-collective · GitHub

One massively effective way to work with limited synthetic data would be on-device training; not on-device as in the ESP32-S3, but on an N100 middle server or in the Nabu Casa cloud.
Again, much info is available: On-Device Training with LiteRT | Google AI Edge | Google AI for Developers. Whether TF4Micro supports it is questionable, whilst for application SoCs (Pis), TFLite is what the examples use.
A small collected dataset is used to bias the weights of a larger pretrained model, so the ML learns the environment of use and the users, becoming more accurate.

Datasets, from AI to voice control, are the moats big data has over open source, yet for some reason the collection of both KW and command sentences still has little presence, and the manner of collection and the lack of metadata will collate poor results. You can collect excellent demographic metadata without infringing on identity, but hey.

Issues and information have been posted about how certain methods are flawed; how long they are ignored, or whether they stay ignored, I guess will be a matter of time.
Because of the constraints of the ESP32-S3, I have doubts whether it can all be accomplished, and maybe it wasn't the best platform to choose.
There are ways to create very accurate KWS via big gold-standard datasets, which likely have regional versions. That could be backed up by some form of on-device training, or merely OTA shipping of newly trained models built from ever-increasing captured datasets.
Time and the devs will tell, though; for some people, voice will work, and others will honestly be tearing their hair out at what they think is e-waste, as the dataset currently is overfitted, without doubt.

If you want a USB audio far-field device that badly, I suggest getting one; they range from cheap and nearly useless to pricey and phenomenal, and are generally sold as hands-free or full-blown conference audio solutions. The way I read it, even Seeed itself offers one, with more mics than the hat to begin with.

In the meantime, I do not think most around here would be OK with throwing Cortex-A76s, much less Intel cores, at voice assistant duty. I know I am not. I'm perfectly happy to have an ESP32 monitor wake words and then offload everything else to a proper multi-modal frontier LLM, really…

1 Like

Finally got mine a few days ago. It's good, but the wake word recognition is nowhere near Alexa, not even close. I also find it frustrating that it doesn't automatically lower the volume of the speakers in the same area when voice is recognized. I tried creating an automation, but it doesn't seem to work yet. That said, it's fine.

I successfully made an automation that mutes the receiver when the living room assistant triggers.

I was trying to lower the Sonos player volume when I speak to the assistant, just like Alexa does.

Can you pause the Sonos instead? That’s a useful workaround.

One observation here: it isn't a problem with the Voice PE but with your automation to lower the volume.
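For anyone attempting this, here is a minimal sketch of a ducking automation. The entity ids, state names, and volume levels are assumptions for illustration only; check Developer Tools → States for what your device actually exposes (on recent Home Assistant releases the Voice PE shows up as an assist_satellite entity).

```yaml
# Sketch only: entity ids, state names, and volume levels are
# placeholders; verify them against your own setup.
- alias: "Duck Sonos while the assistant listens"
  trigger:
    - platform: state
      entity_id: assist_satellite.voice_pe_assist_satellite
      to: "listening"
  action:
    - service: media_player.volume_set
      target:
        entity_id: media_player.living_room_sonos
      data:
        volume_level: 0.10

- alias: "Restore Sonos volume afterwards"
  trigger:
    - platform: state
      entity_id: assist_satellite.voice_pe_assist_satellite
      from: "listening"
      to: "idle"
  action:
    - service: media_player.volume_set
      target:
        entity_id: media_player.living_room_sonos
      data:
        volume_level: 0.35
```

Restoring a fixed level is crude; you could snapshot the current volume into a helper first, but two fixed levels keep the sketch simple.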