Using Mic without Push to Talk

Ah ok that’s fair enough.

Is there a resource for checking TOPS? I have a Jetson that’s collecting dust and wouldn’t mind giving it a job.

Nano?

Sadly, they make the Coral M.2 look great. They’re rated at 472 GFLOPS, which is… not good.


HAHA yes, boo well that’s no fun.

I played with it for about a month and retired it pretty quickly.

I still have my Nano but it’s been collecting dust for a while :frowning:

I was able to use Porcupine to detect a wake word and send the audio to a voice assistant pipeline on a Raspberry Pi 4 with a USB mic.
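For anyone who wants to try the same thing, here’s a minimal sketch of that setup using Picovoice’s pvporcupine and pvrecorder Python packages. The access key, the built-in “porcupine” keyword, and the hand-off to the pipeline are placeholders for your own setup, not anything from my configuration:

```python
# Minimal sketch: Porcupine wake word detection on a Pi with a USB mic.
# Assumes `pip install pvporcupine pvrecorder` and a Picovoice AccessKey.
import pvporcupine
from pvrecorder import PvRecorder

ACCESS_KEY = "YOUR_PICOVOICE_ACCESS_KEY"  # hypothetical placeholder

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keywords=["porcupine"])
recorder = PvRecorder(frame_length=porcupine.frame_length, device_index=-1)
recorder.start()

try:
    while True:
        pcm = recorder.read()  # one frame of 16-bit PCM from the mic
        if porcupine.process(pcm) >= 0:
            # Wake word detected: start streaming audio to your
            # voice assistant pipeline here (placeholder).
            print("Wake word detected")
finally:
    recorder.stop()
    porcupine.delete()
```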

Bit off topic, but I saw this TinyML project on the Arduino Nano that listens for dog barks, which is pretty cool.

Hi, I haven’t tried the project in this link, but from the YouTube video it seems to work very well. It would be great if we could integrate this into ESPHome somehow. The link below is a demo project of a DIY Alexa:

atomic14/diy-alexa: DIY Alexa (github.com)

For someone who wants to implement something: TensorFlow Lite Micro has a keyword-spotting example that works very well. Espressif’s audio framework (ESP-ADF) can also do this, though the ADF isn’t open source.

Regarding the resources needed: some people say it’s barely doable, but I tested TensorFlow Lite Micro and I think it uses only a fraction of the ESP32’s computing power, without even touching the second CPU core. It basically consists of a small neural network, and I believe it’s accurate enough. (However, I haven’t had a decent microphone to do a proper test.)
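As a rough illustration of what such a model looks like outside the microcontroller, here’s a minimal sketch of running a keyword-spotting TFLite model (e.g. one exported from the micro_speech training scripts) with the standard TensorFlow Lite interpreter. The model file name and the zero-filled input are stand-ins I made up for illustration, not files that ship with the example:

```python
# Minimal sketch: run a keyword-spotting TFLite model on the desktop.
# "kws_model.tflite" is a hypothetical file exported from your own training.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="kws_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One second of audio turned into the model's expected spectrogram
# features; zeros here are a stand-in for real feature extraction.
features = np.zeros(inp["shape"], dtype=inp["dtype"])

interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])[0]
print("class scores:", scores)  # e.g. [silence, unknown, "yes", "no"]
```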

Congratulations. :rofl: After physically downsizing ALL the computers in my home, I’m buying a full-size motherboard, processor, case, and power supply for the first time in years. Not complaining, just noting the significance.


We’ll likely test a TF-Micro implementation for wake word. The ESP BOX uses the ESP32-S3, which is significantly more capable than the original ESP32, so we’ll have even more headroom.

Glad to hear!

We’re well aware that our emphasis on a GPU and/or higher-performance CPU for inference is troubling to many users.

However, we’ve noticed that many people (it sounds like you’re one of them) are using self-hosted Willow as an opportunity to re-assess their hardware configurations. Once “bitten by the bug” that is HA, Frigate, Plex, and who knows what else, a lot of people end up with a less-than-ideal random collection of Raspberry Pis, NUCs, etc. they’ve assembled over the years. Combining them on a single (larger) machine almost always results in a significantly better experience all around, while using equivalent power (or less) at an actually lower cost, assuming reasonable resale value and/or productive use of the replaced hardware for other tasks.


Recently I began testing the ODROID N2+ and saw that the board contains the Mali-G52 GPU.

Looking at your list of GPUs, they all appear to be NVIDIA. Any plans to support others?

I expect performance to be lower but better than CPU alone.

Banana Pi has the Mali-G52 on one of their boards, and they still added a TPU to bring the total up to 5 TOPS.
https://www.banana-pi.org/en/core-board-and-kit/129.html

The Mali-G52 is far too weak for this kind of work, where the boards under consideration deliver around 150 TOPS and up.

Thanks, I didn’t think to research how many TOPS the GPU could deliver. #NewToThis

Sorry, a month later (we’ve been busy)!

We only support CUDA (and CPU) because the ecosystem Nvidia has built over the past 15 years is still light-years ahead of AMD ROCm, Intel GPUs, random ARM boards, various TPUs, etc. If you can get past the Nvidia “driver issue” that rubs a lot of people the wrong way, the economics, speed, quality, etc. are currently impossible to compete with: a $100 used seven-year-old GPU stomps all over even my ridiculous Threadripper system at a fraction of the cost and power usage.

For the ARM boards, etc. that other people have mentioned, you run into two issues: memory and ecosystem/software support. Speech recognition models with the quality we consider a minimum for voice assistant tasks (Whisper small, beam size 2), plus TTS, etc., eat into memory pretty quickly. We have a lot of users who just go “all out” and use Whisper large with beam size 5 on their GPUs, using around 4 GB of VRAM when combined with TTS. From their perspective it’s still a $100 GPU that idles at ~10 watts and leaves plenty of VRAM for stuff like Plex, Frigate, Jellyfin, whatever (as I mentioned above). It also enables the use of the even more advanced stuff we support, like voice fingerprinting, speaker identification, speaker authentication, etc. With CUDA and Nvidia you can pretty much grab anything in the ML/AI/media space and it “just works” - and works very, very, very well.

Additionally, much or most of this hardware is essentially “dropped off” by the manufacturer with virtually no software support. Not to mention projects like CTranslate2 (which we and HA use) that have incredible CUDA software optimizations, often delivering 10x the performance of equivalent ROCm hardware (as one example) that on paper should be competitive. When it comes to real-world performance, an Nvidia CUDA GPU and these ARM boards aren’t even close, whatever the spec sheets say.
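To make the Whisper settings mentioned above concrete, here’s a minimal sketch using the faster-whisper package (a CTranslate2-based Whisper wrapper). The “small” model and beam_size=2 mirror the quality floor discussed above, but the audio file name and the overall setup are assumptions for illustration, not Willow’s actual server code:

```python
# Minimal sketch: CTranslate2-backed Whisper inference via faster-whisper.
# Assumes `pip install faster-whisper` and a CUDA-capable GPU.
from faster_whisper import WhisperModel

# "small" with beam_size=2 is the quality minimum discussed above;
# swap in "large" and beam_size=5 if you have the VRAM to spare.
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("command.wav", beam_size=2)  # hypothetical file
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```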

In practice, what you end up with is that 5 TOPS (to pick a number) on Nvidia hardware often looks more like 50 TOPS or more next to 5 TOPS on an ARM board, AMD ROCm, a random TPU, etc., which, for lack of software support, doesn’t come anywhere close to the theoretical performance printed on its spec sheet.