ESP32-S3-Box3 Owners rejoice! Audio Duplex, 'Stop' Halt word and more are now possible!

Maxi1134 · June 11, 2026, 9:31pm

I just spent 1.5 million Fable 5 tokens on this, with over an hour of our dear Claude thinking.

Now, I know we dislike LLM vibe-coded stuff here, so if you simply don't want to engage with this type of generative solution, feel free to move along.

But if you want the goods, and I mean the good goods, read on!

Full Audio Duplex, which comes with:
- A 'stop' halt-word to stop an ongoing Text-To-Speech answer
- A 'Barge-in' mechanism, where repeating 'Okay-Nabu' erases what you said and starts a new
  listening session
- VAD (Voice Activity Detection) to avoid those pesky false positives
- Real-time AEC (Acoustic Echo Cancellation)
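
To make the halt-word and barge-in behaviour concrete, here is a minimal Python sketch of the state logic described above. It is not the code from the YAML file or from the ESPHome-Intercom component: the `Assistant` class, its method names, and the word labels `okay_nabu` and `stop` are all hypothetical, and letting the wake word also interrupt playback is an assumption on top of what the list says.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # waiting for the wake word
    LISTENING = auto()   # capturing the user's request
    SPEAKING = auto()    # playing the Text-To-Speech answer


class Assistant:
    """Toy model of the wake-word, barge-in and halt-word behaviour."""

    def __init__(self):
        self.state = State.IDLE
        self.heard = []  # words captured in the current listening session

    def on_word(self, word):
        if word == "okay_nabu":
            # Wake word or barge-in: discard anything captured so far and
            # start a fresh listening session (assumed valid in any state).
            self.heard = []
            self.state = State.LISTENING
        elif word == "stop" and self.state is State.SPEAKING:
            # Halt word: only acts while an answer is being played.
            self.state = State.IDLE
        elif self.state is State.LISTENING:
            self.heard.append(word)

    def on_request_complete(self):
        # The request is finished, so the answer starts playing.
        if self.state is State.LISTENING:
            self.state = State.SPEAKING

    def on_tts_finished(self):
        if self.state is State.SPEAKING:
            self.state = State.IDLE
```

In this model, saying the wake word twice in a row leaves `heard` empty and the state at `LISTENING`, while `stop` only has an effect during `SPEAKING`. Full duplex with AEC is what lets the real device keep hearing those words while its own speaker is playing.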

Gain control for the microphone and speaker:
- Allowing for a louder voice (I recommend a gain of 4)
- Allowing for a more sensitive microphone (guys, I can whisper "okay nabu" and it picks it up!)
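
As a rough illustration of what a gain setting does to audio, the sketch below scales signed 16-bit PCM samples and clamps the result. It assumes the gain is a plain linear multiplier, which the post does not confirm, and `apply_gain` is a hypothetical helper, not something taken from ESPHome or the ESPHome-Intercom component.

```python
def apply_gain(samples, gain):
    """Scale signed 16-bit PCM samples by a linear gain, clamping to range."""
    lo, hi = -32768, 32767
    # Clamp after scaling so loud samples saturate instead of wrapping around.
    return [max(lo, min(hi, int(s * gain))) for s in samples]


print(apply_gain([1000, -1000, 10000], 4))  # [4000, -4000, 32767]
```

Under that assumption, quiet input such as a whisper gets lifted well above the noise floor, while already loud input saturates at the 16-bit limits, so a higher gain is not always better.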

Head to the YAML file in my repo to grab the full code.

Be sure to install the ESPHome-Intercom components locally if you pick a version other than the one linked (the one ending in _cloud).
The two other versions in the same folder use a local cache of the ESPHome-Intercom components!
They also use the little demon images you see in the video.

Here is a video for those wondering how it works (I apologize for my slow local inference time!)

As for results?
It just works.

I have been using a previous version of this code for over 2 months now, with no issues.

Credit where it's due: this is all possible thanks to n-IA-hane, who made the ESPHome-Intercom component!

esand · June 11, 2026, 11:23pm

Probably a long way away from me being able to use this, but one of the features I really liked about my Google-powered little speaker clock was being able to tell it to "stop", so this is definitely nice to see! Nice work!

Maxi1134 · June 12, 2026, 12:00am

I agree! I had a system, still present in that code, where tapping the screen toggles the listening state, either stopping it or triggering it.

But it's nowhere near as effective, and it requires walking over to the device.