Using Mic without Push to Talk

Honestly just wondering how possible this is.
My original plan was to hide the ESP behind my screen and just have the mic peeking over so it could pick up my commands, but all the projects I am seeing are for push-to-talk. Just wondering if it’s possible to do a wake word instead (though I’m not sure an ESP would have enough brain power for that, but I’m no expert).

Wake word detection may be possible on an ESP32; however, there is significant engineering work required to collect training data (i.e. MANY thousands of voice recordings), process the clips into a trained model, then shrink that model into software capable of running continuously on low-CPU edge devices.

It is a classic cost / complexity / quality trade-off, as has recently been demonstrated with TTS and STT (high quality needs big hardware).

So - it may be possible, but not for some time, and small CPUs are likely to limit the quality (i.e. accuracy, giving both false and missed triggers).

Using an ESP32 with a mic array and streaming voice to a larger device (an Intel NUC perhaps?) running the voice models works for STT where you’re using push-to-talk (you only record while PTT is pressed), but it could make a real mess of your network if attempted for wake word detection, as that needs to run continuously, 24/7.

FOSS projects like Mozilla Common Voice are collecting voice samples to help open projects train models, but I’m not sure what Nabu Casa is planning. I’d expect a project collecting donated wake word training data (e.g. recordings of lots of different voices saying “Hey NAME_GOES_HERE”). Mycroft built an open-source wake word detection system that works on a RPi3; however, this took (from memory) about two years, and the result is specific to their wake word. I spent some time donating voice samples to Mycroft, but sadly they ran out of money (on things like beating a patent troll) and their second hardware device crowd-funding attempt failed (I personally lost several hundred pounds).

It is no coincidence that Michael Hansen was employed for a while by Mycroft, before creating Rhasspy and joining Nabu Casa.

If this helps, :heart: this post!


Other way round: Rhasspy, then Mycroft, then NC, I believe.

Wake word detection is in the wings. Have patience.


Wake word detection is generally not that complicated.
It is just pattern matching against a snippet of the waveform.
No complex training needed.

The issue is that the pattern matching is processor-heavy, because it has to check for the pattern all the time.
That means multiple times each second.
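To make the "pattern match of a wave piece" idea concrete, here is a minimal sketch (my own illustration, not any real wake engine): slide a window over the incoming audio and fire when the normalized correlation against a stored wake template crosses a threshold. Real engines use learned acoustic models rather than raw waveform correlation, so treat this purely as a toy.

```python
import math

def wake_score(window, template):
    """Pearson correlation between an audio window and the wake template (max 1.0)."""
    n = len(template)
    mw = sum(window) / n
    mt = sum(template) / n
    num = sum((w - mw) * (t - mt) for w, t in zip(window, template))
    dw = math.sqrt(sum((w - mw) ** 2 for w in window)) or 1e-9
    dt = math.sqrt(sum((t - mt) ** 2 for t in template)) or 1e-9
    return num / (dw * dt)

def detect_wake(stream, template, hop=160, threshold=0.8):
    """Slide over the stream one hop at a time; fire when the score crosses threshold.

    Note the cost: every hop (e.g. every 10 ms at 16 kHz) re-scores a full
    template-length window -- this is exactly the continuous CPU load
    described above.
    """
    n = len(template)
    for start in range(0, len(stream) - n + 1, hop):
        if wake_score(stream[start:start + n], template) >= threshold:
            return True
    return False
```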

If you look at this post by the willow developer, you will see that he has quite a contrary view to yours. Willow Voice Assistant - #17 by kristiankielhofner

The newer ESP32 is close to being able to handle it, but not quite there yet.
The issue is the continuous scanning for the wake word pattern, especially in a noisy environment.
You need to do this in real time, because if you cannot, your queue will keep growing with the continuous recordings.
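The "queue keeps growing" point is simple arithmetic, sketched below with assumed numbers (20 ms frames, 25 ms of processing per frame): whenever a frame costs more CPU time than its own duration, the backlog grows without bound.

```python
def backlog_after(n_frames, frame_ms, process_ms):
    """Milliseconds of audio queued after n_frames of continuous capture.

    Audio arrives in real time, one frame every frame_ms; each frame costs
    process_ms of CPU to scan for the wake pattern. If process_ms exceeds
    frame_ms, every frame adds (process_ms - frame_ms) of backlog.
    """
    return max(0.0, n_frames * (process_ms - frame_ms))

# e.g. 20 ms frames that each take 25 ms to scan: 5 ms of backlog per frame,
# so after 100 frames (2 seconds of audio) you are already half a second behind.
```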

Willow has optimized the process a bit, but at the cost of flexibility with multiple wake word systems.
And Willow still does not seem to do any extra features, like AEC, beamforming and so on, which would require even more processing power.

The ESP BOX uses the newer ESP32 S3.

Willow uses the absolute latest ESP-SR framework with their Audio Front End (AFE) framework. We place the AFE between the dual-mic I2S hardware and everything downstream, so that all audio fed to wake word detection, on-device recognition, and audio streaming to the inference server has:

  • AEC (Acoustic Echo Cancellation)
  • NS (Noise Suppression)
  • BSS (Blind Source Separation)
  • MISO (Multi Input Single Output)
  • VAD (Voice Activity Detection)
  • AGC (Automatic Gain Control)
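Conceptually, a front end like this is just a chain of per-frame transforms applied in order before any recognition sees the audio. The sketch below is my own illustration (toy stand-ins, not ESP-SR's actual API or algorithms):

```python
def make_pipeline(stages):
    """Compose DSP stages so every audio frame passes through them in order."""
    def run(frame):
        for stage in stages:
            frame = stage(frame)
        return frame
    return run

# Toy stand-ins for real stages (AEC, NS, AGC, ...): each maps a frame to a frame.
def aec(frame):
    return frame  # placeholder: real AEC subtracts the speaker reference signal

def ns(frame):
    return [s if abs(s) > 0.01 else 0.0 for s in frame]  # crude noise gate

def agc(frame):
    peak = max((abs(s) for s in frame), default=0.0) or 1.0
    return [s / peak for s in frame]  # normalize gain toward full scale

frontend = make_pipeline([aec, ns, agc])
```

Every frame that reaches the wake word engine or the streaming path goes through `frontend` first, which is why the downstream recognizers see cleaner audio than the raw mics provide.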

Additionally, the ESP BOX enclosure has been acoustically engineered by Espressif with tuned microphone cavities, etc. Because of this functionality, ESP-SR has actually been tested and certified by Amazon themselves (I see the irony) for use as an Alexa platform device.

Wake word is instant, as in imperceptible, and the VAD timeout is currently set to 100ms. We have a multitude of ESP-SR and ESP-DSP tuning parameters for any of these features. Also, while it is the same engine, we use the Alexa, Hi ESP, and Hi Lexin wake words, which have been trained and recorded by Espressif and professional audio engineers on 20,000 speech samples across 500 individual speakers (mix of genders, including 100 children) at distances of 1-3m. For each wake word.

We will be using this process to train “Hi Willow” and other wake words as it makes sense.

In looking at this process (which is pretty much industry standard for commercial grade wake word implementations) the wake training process is, in fact, very involved. You can see the metrics of wake activation, false wake, resource utilization, etc here:

We have reliable wake word detection across all of our supported wake words and clean speech for speech recognition from at least 25-30 ft away (even without line of sight to the device - around corners, etc) in acoustically challenging environments (acoustic echo, noise, etc). You can see from the benchmarks above that WakeNet activation is 94-98% reliable depending on environmental conditions, all while minimizing false wakes.

The ESP32-S3 has two cores. We assign AFE and audio tasks to core 1 with different levels of FreeRTOS scheduling priority, and leave “misc” tasks on core 0, again with priorities that depend on the task. We currently have plenty of CPU time to spare across both cores for future tasks, and we will be able to optimize this further.

Combined with our inference server or on device command recognition (via ESP-SR Multinet 6) we have response times and interactivity that is actually superior to Echo/Alexa because of local control. Please see the demo:

Since recording that on Monday we have shaved off another ~200ms or so in the recognition → transcript → send to HA pipeline.

Additionally, we have an initial audio test harness and have verified this pipeline across at least 1,000 consecutive runs with approximately 1% failure rate.

All in all, not bad for a $50 device you can just take out of the box, flash, and put on the counter :slight_smile: !


Replying to myself because I’m new here and can only post two links at a time. More details on the wake word training process:


I wish I could stop treating US$ amounts as if they applied to me. Repeat after me: with conversion to NZ$, plus GST and shipping, $50 becomes $100. Bought 2 anyway.

I use USD because I happen to be in the United States and it is impossible to determine duty, tax, local markup, currency conversion, etc for every country in the world. Another issue not often addressed - final and complete WiFi devices are regulated and certified by individual governments around the world, and this complicates matters even more. Luckily Espressif has already done this and established worldwide distribution channels with the required national and international certifications.

Question for you (as a Kiwi) - what is the local cost of a Raspberry Pi 4 with dual microphones, speaker, capacitive touch LCD display, power supply, and ready-made enclosure for all of the above?

Glad to hear you picked some up anyway :)!

Oh I am not pissed off at the price, I just get excited at the concept of a $50 toy, and by the time I have bought one, and a spare because I usually bugger the first one, I have to explain $200 on the credit card to she who is sick of unfinished projects.


Compare $50 with $399 for the Mycroft Mark II! (after failed crowdfunding), although the internal processing power is different:

The Mark II has a RPi4 which should allow local STT and TTS (albeit slowly…), whereas I suspect the ESP BOX likely needs cloud support like the Mark One does?

I’ve seen the ESP Box for ~£50 + shipping + VAT + duty on AliExpress.

Some good news for you - we have done terrible and awful things to ESP Boxes in the course of development for Willow! We have yet to brick a single device (and the team has at least 15 of them). They are EXTREMELY resilient to flash failures, etc.


The Mycroft story is a frustrating one. Over the course of my 20+ year career in this field I tend to see the same mistakes being made over and over again. I say mistakes because I’ve made all of them and more myself!

They took on WAY too many hard things WAY too early (and all at once), and the story and final result is a reflection of that. I’ve seen these movies before and they never end well.

The ESP BOX always uses local processing for wake word and AFE/DSP. However, speech recognition runs in one of two user configurable modes:

  1. Local. When local command recognition is selected, we pull the friendly names of entities from Home Assistant and dynamically build the grammar required by the on-device Multinet 6 speech recognition module. It supports a maximum of 400 commands (currently), but in terms of the hardware, model, DSP, etc. this isn’t necessarily a hard limit, although we enforce it strictly for now because that is what we have tested with. In this mode, speech/audio never leaves the ESP BOX itself.

  2. Willow Inference Server. Next week we will be releasing our highly optimized WIS implementation so users can self-host. This is what powers the best-effort Tovera hosted speech recognition server we provide by default. In this mode, as soon as wake is activated we begin to stream audio (after DSP processing) directly to WIS in real time. When voice activity detection detects end of speech, we send a small end marker and WIS takes the buffered audio and performs speech recognition, with the results sent to the device. This enables extremely low latency, high performance, and extremely accurate speech recognition of any speech across more than 40 languages. We provide the detected language to Home Assistant, so you can walk up to Willow and speak in any of these 40 languages (without extra configuration or prior knowledge) and it will send the output to Home Assistant with the detected ISO language code, complete with UTF-8 encoding of the various character sets.
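To illustrate the local-mode idea of expanding entity friendly names into a fixed grammar capped at 400 commands, here is a rough sketch (my own illustration; the function name, command templates, and the source of the names are all assumptions, not Willow's actual code):

```python
def build_grammar(friendly_names,
                  templates=("turn on {}", "turn off {}"),
                  max_commands=400):
    """Expand entity friendly names into a fixed command list.

    The cap mirrors the 400-command limit described above: once the list
    is full, further entities are simply dropped.
    """
    commands = []
    for name in friendly_names:
        for tpl in templates:
            if len(commands) >= max_commands:
                return commands
            commands.append(tpl.format(name.lower()))
    return commands
```

In a real integration the `friendly_names` list would come from Home Assistant's entity registry; here it is just an input parameter.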

In both cases the speech transcript is sent to the Home Assistant pipeline or conversation API over WebSockets or HTTP REST (with or without TLS), depending on the version of Home Assistant, component, transport, etc. we detect.
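For the REST path, the hand-off to Home Assistant can be sketched as assembling a POST to HA's `/api/conversation/process` endpoint. The function below only builds the request (it does not send it); the base URL and token are placeholders you would substitute for your own install:

```python
import json

def conversation_request(transcript, language="en",
                         base_url="http://homeassistant.local:8123",
                         token="YOUR_LONG_LIVED_TOKEN"):
    """Assemble (but do not send) a request to HA's conversation REST endpoint.

    The payload shape ({"text": ..., "language": ...}) follows Home
    Assistant's /api/conversation/process API; base_url and token are
    placeholder assumptions for illustration.
    """
    return {
        "url": f"{base_url}/api/conversation/process",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"text": transcript, "language": language}),
    }
```

A client would pass the resulting dict to whatever HTTP library it uses; the detected ISO language code mentioned above would go into the `language` field.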

So, with Willow and WIS you can say things like “Put an entry on my calendar for lunch with Josh at 2pm on Wednesday May 22nd at Napoli’s Pizza in Chicago, Illinois” and as long as your HA intents can process it you’re good to go.


Oh wow this got way more action than what I expected!!

Very interesting! I’ll definitely be checking out Willow!

Thanks for the info - the mix of local DSP sound processing for wake word and clarity, backed by a much more powerful STT and intents engine makes a lot of sense. It reduces the cost of the front end devices, and allows many to connect back to one central server resource.

Personally, my preference is for a local-only architecture without the need for cloud services - but obviously that means accepting the need to self-host. What will be interesting is whether the “back end” can be optimised down to something like an i5-class commodity desktop or micro-server, rather than a full i7-class gaming rig with a high-power GPU (I don’t know Team Red as well - sorry).

The cloud voice platforms must really have to optimise their pipelines to remove latency, as just moving packets up and down the WAN to the cloud must add many tens of ms, putting that architecture at a disadvantage versus self-hosting.

Local mode is all local on the ESP BOX itself.

Server mode uses our Willow Inference Server and we will be releasing it next week. We didn’t want to release both simultaneously because we are a small team and the response from Willow alone has been very overwhelming and we’re struggling to keep up with incoming as it is.

The Willow Inference Server is for self-hosting in server mode and you can put it anywhere. There is one “gotcha” of sorts. Our goal is to be the best voice user interface in the world and beat commercial offerings in every way possible. To do very high quality speech recognition with the sub 1s latency we target today that means GPU. I can assure you Amazon isn’t using CPUs for Alexa!

As an example, the most highly optimized CPU-only Whisper implementation is whisper.cpp. You can use it on the fastest CPU on the market and a $100 six year old GTX 1060 or Tesla P4 beats the pants off it - at a fraction of the cost AND power. GPUs are very different in terms of fundamental architecture and are significantly more well suited to tasks like speech recognition.

The Willow Inference Server can run CPU only but for Alexa quality and user experience speech recognition you will be waiting a long time for text output, and the benefits of a locally hosted high quality voice interface diminish considerably when you’re waiting three, five, or even 10 seconds or more for a response. You could try using the lighter models we offer (base) but the quality will be significantly lower - or it may work just fine for your purposes. It will still be very “slow”.

Here are some early benchmarks for the (highly optimized) Willow Inference Server across various GPUs:

| Device   | Model    | Beam Size | Speech Duration (ms) | Inference Time (ms) | Realtime Multiple |
|----------|----------|-----------|----------------------|---------------------|-------------------|
| RTX 4090 | large-v2 | 5 | 3840 | 140 | 27x |
| H100 | large-v2 | 5 | 3840 | 294 | 12x |
| H100 | large-v2 | 5 | 10688 | 519 | 20x |
| H100 | large-v2 | 5 | 29248 | 1223 | 23x |
| GTX 1060 | large-v2 | 5 | 3840 | 1114 | 3x |
| Tesla P4 | large-v2 | 5 | 3840 | 1099 | 3x |
| RTX 4090 | medium | 1 | 3840 | 84 | 45x |
| GTX 1060 | medium | 1 | 3840 | 588 | 6x |
| Tesla P4 | medium | 1 | 3840 | 586 | 6x |
| RTX 4090 | medium | 1 | 29248 | 377 | 77x |
| GTX 1060 | medium | 1 | 29248 | 1612 | 18x |
| Tesla P4 | medium | 1 | 29248 | 1730 | 16x |
| RTX 4090 | base | 1 | 180000 | 277 | 648x (not a typo) |

So you can see from this a Tesla P4 can do 3.8 seconds of speech with the > 40 language Whisper medium model in 586ms - whether self-hosted locally or over the internet that easily meets our < 1s latency target.
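The "realtime multiple" column is just the ratio of audio duration to inference time, truncated to a whole number. A quick sketch, spot-checked against rows of the table above:

```python
def realtime_multiple(speech_ms, inference_ms):
    """Realtime factor as reported in the table: audio duration / inference time,
    truncated to an integer multiple."""
    return int(speech_ms / inference_ms)

# Spot-checks against the benchmark table:
assert realtime_multiple(3840, 586) == 6     # Tesla P4, medium
assert realtime_multiple(3840, 140) == 27    # RTX 4090, large-v2
assert realtime_multiple(29248, 1612) == 18  # GTX 1060, medium
```

Any multiple above 1x means the server keeps up with live speech; the headroom beyond that is what buys the sub-second end-to-end latency target.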


Why not TPU? I got the dual core coral m2 thing and currently only using one core for frigate so would love to have something that I can use that second core for.

I find devices like the Coral M.2 interesting, and I’m not necessarily opposed to supporting them eventually, but it’s important to understand just how computationally demanding extremely high quality speech recognition with sub-one-second response times is.

We have heavily optimized our inference server implementation and quantized the models to 8-bit. The > 6 year old Tesla P4 can do 22 TOPS in int8 (an RTX 4090 is 145 TOPS). The Coral M.2 specs claim a maximum of 4 TOPS, but the models and ecosystem are nowhere near as optimized or efficient as they are in CUDA land, so a true apples-to-apples comparison is difficult. I’d be surprised if the Coral M.2 delivered even half of what they claim in the real world for this application.

Long story short, we’re not focusing on devices like the coral m2 because they fundamentally can’t provide the kind of user experience we’re designing for.