ESP32-S3 KWS

Hi, wondering if there are any devs out there who could help.

I have been frustrated by the lack of accurate open-source models for well over a year, maybe two!
I have found a method that reduces false positives substantially, with no runtime overhead apart from the work of creating the dataset.
I will tell you what that is and then explain why. Basically I have been able to do this for a while with 'my voice', but failed to find suitable datasets until MLCommons was released, with 50 languages and over 23 million audio keyword examples.

The process is really simple: add 'sounds-like' classifications around the keyword to create more cross entropy and force the model to train harder to distinguish them. Those classifications also act as additional catch-alls alongside what is commonly known as the 'unknown' classification.

I have been using the syllable method from NLTK to create a syllable database of the words in an MLCommons language dataset.

So if I take 'home assistant' with s1='ho', s2='me', s3='ass' & s4='ant', I query the database for silence-trimmed words containing those syllables and combine them to make 'sounds-like' classifications of s1.s3, s1.s4, s2.s3 and s2.s4.
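
As a rough sketch of that combination step, assuming NLTK's SyllableTokenizer is the syllable method in question (it may split slightly differently from the ho/me/ass/ant example above) and with the database query reduced to a comment:

```python
# Rough sketch of building the sounds-like class labels.
from itertools import product
from nltk.tokenize import SyllableTokenizer

tok = SyllableTokenizer()

w1, w2 = "home", "assistant"
s_first = tok.tokenize(w1)    # e.g. something like ['ho', 'me']
s_second = tok.tokenize(w2)   # e.g. something like ['as', 'sis', 'tant']

# One syllable from the first word paired with one from the second word
# gives the sounds-like class labels (s1.s3, s1.s4, s2.s3, s2.s4 ...).
sounds_like = [f"{a}.{b}" for a, b in product(s_first, s_second)]
print(sounds_like)

# For each label you would then query the sqlite syllable database for
# silence-trimmed words containing those syllables and concatenate their
# audio to fill that class; the schema of that database is not shown here.
```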

Unknown is still used, but a query makes sure only words that are not in the 'sounds-like' classifications end up in unknown, otherwise the input cross entropy would be shared and produce a lower softmax.

In the English dataset I grab 'Home' & 'Assistant' and combine them to get the KW.
I have been using the pdsounds dataset for noise.

So you have KW, Noise, Unknown, S1, S2, S3 & S4 as classifications, and with any model that structure will be far more robust to false positives.

A 2nd model is a more standard command model, where the KWs range across 'Turn, Off, On, Light, Fan, Play, Stop, Pause, Resume' plus the usual Unknown & Noise.

The accuracy of the 1st KW model is critical, and 'HomeAssistant' is actually a great, unique, multi-syllable KW; the 2nd model only runs after a 1st-model hit.

So going back to Why?

Classification models are purely graphs that organise images by parameters; there is no intelligence in this AI.
Many KWS examples littered across the internet use what has become a standard binary model of just 2 classifications, KW & Unknown. That is a really bad binary model: a very narrow variance of KW against a huge, sprawling variance of Unknown, which creates much less cross entropy. Unknown becomes a see-saw where adding selected words or noise to stop false positives changes the balance, so false positives occur elsewhere while you never gain any more cross entropy.
Some other models add a 'noise' classification so 'unknown' becomes speech alone, hugely increasing cross entropy and reducing false positives, but due to the complexity of language everything but the KW still has huge variance and still lacks cross entropy.

The answer is additional sounds-like classifications, which are much easier to deal with than the infinity-and-beyond of unknown because their syllables are known. Two-word KWs such as 'Home Assistant' & 'Hey Google' work really well, as you can combine words that give similar spectral images.
As said, your model becomes far more robust against false positives and generally more accurate overall, as during training TensorFlow is forced to work harder to find parameters that differentiate the classes.

Hardware
The vector instructions in the ESP32-S3 for both MFCC & AI mean the LX7 is up to 10x faster than the LX6 of the ESP32, and all this can definitely run on a single core with room to manoeuvre. It will also run on an ESP32 as https://github.com/42io/esp32_kws demonstrates, but it hogs a lot of resources; dropping to a standard CNN model would likely halve the load of the demonstrated DS-CNN (the repo refers to it as a DCNN, I know it as a DS-CNN) and lose less than 3% accuracy.
So the ESP32 is very possible; I am just aiming to get the best results first and see how that goes in use. If anyone has used the reference design of the ESP32-S3-Box as I have, you will find it is actually pretty impressive and does well, but compared to a Google Nest Audio it is not great.

It is strange really, as it has 2 mics with AEC & BSS that incur a lot of load, so it uses a much lesser, heavily quantised KWS model to get everything to fit on one of the bigger-allocation S3s.
The AEC & BSS are not great and fail at relatively low noise thresholds; with some lateral thought you don't need far-field distributed mic arrays, you just need distributed KWS that are accurate.

Implementation

This is the bit I could really do with some help on: I want a Home Assistant server that will receive the softmax (KW probability 0-1) values from several distributed KWS devices and pick the client with the highest KW softmax, which should represent the best signal.
If someone wants to take this further so it then streams to ASR, please do, but I am concentrating on KWS, where the KWS devices in a zone have a simple command classification of approx 10-30 KWs that covers a huge amount of common controls.
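
Something along these lines is what I have in mind for the arbitration; a rough sketch only, where the port and wire format are just placeholders:

```python
# Each satellite KWS sends "<client_id> <softmax>" over UDP on a hit and the
# server takes the highest value seen within a short collection window.
import socket
import time

COLLECT_WINDOW = 0.25  # seconds to wait for other satellites after the first hit

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5005))
sock.settimeout(0.05)

while True:
    try:
        data, _ = sock.recvfrom(64)
    except socket.timeout:
        continue
    client, score = data.decode().split()
    hits = {client: float(score)}
    deadline = time.time() + COLLECT_WINDOW
    while time.time() < deadline:
        try:
            data, _ = sock.recvfrom(64)
            c, s = data.decode().split()
            hits[c] = max(hits.get(c, 0.0), float(s))
        except socket.timeout:
            pass
    best = max(hits, key=hits.get)
    print(f"best mic: {best} (softmax {hits[best]:.2f})")
```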

Over the last 2 years I have been testing a plethora of devices, funky microphones and various open-source projects, and none of it works well: often it is not integrated, or it totally lacks the low-level DSP or sound engineering needed to make it work well.

What I have done is not clever, it is pure lateral thought: increase the accuracy of KWS substantially, firstly by using a better model and secondly by using a better dataset and classification system to reduce false positives; then cast off the relatively pointless, poor, high-load far-field implementations and replace them with multiple KWS devices that are embedded with a sensor or actuator and multi-purpose, where hopefully one is placed in near proximity to you rather than to noise.

You scrap the idea of fighting audio physics with powerful technology and replace it with multiples of low-cost devices, and it works much better than anything else we have short of the likes of Google and Amazon.

The real star of the show is 42io https://github.com/42io/esp32_kws. Whilst creating a dataset and training is simple, it can be a really finicky, tedious bore, so I am doing that currently for a 'HomeAssistant' KW.
If you have 10-30 command words I am up for suggestions, along the lines of 'Turn, off, on, lights, fan, open, close, curtain, blind…', and I will see what words I can extract out of MLCommons.

Also, if you really want to go to town, we can use 'on-device training', where a global language model can be transformed by a small, locally trained model of captured use. It is quite simple to implement, as 'on-device training' is a misnomer: it is done on the server. I have another question on how you can OTA a model from a local network, as that is all that is needed.

I work in fits and starts but will probably post some models and Python evaluation scripts this week; anyone can get pretty great KW detection working by using the methods I just detailed.
I have a dev repo at https://github.com/StuartIanNaylor/ProjectEars which is of no consequence, but there is a load of tools there that will save you some time. Hack, grab and refactor at will, as I have no long-term interest in supporting this stuff and someone will do a much better and more eloquent job of it.
It is actually quite easy to get much, much better results than we currently have, and without the likes of Google & Amazon.

https://github.com/42io/esp32_kws is such a great reference, but do a CNN rather than a DS-CNN on the ESP32 if you are not using an S3; it will fit but things will be a squeeze.
There are other models, but RNN layers such as GRU or LSTM are absent from many of the microcontroller frameworks we have, which excludes many models. It doesn't matter that much, as following the classification and dataset methods above will give you a far greater bump in accuracy than any model change.
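
For anyone who wants to see the difference, here is a rough Keras sketch of the two families (plain CNN vs depthwise-separable DS-CNN) on a 49x13 MFCC input; the filter counts and depths are illustrative, not the 42io or ARM configs:

```python
import tensorflow as tf

def kws_model(separable=True, n_classes=7):
    # SeparableConv2D gives the DS-CNN variant, Conv2D the plain CNN.
    Conv = tf.keras.layers.SeparableConv2D if separable else tf.keras.layers.Conv2D
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(49, 13, 1)),          # MFCC "image"
        tf.keras.layers.Conv2D(64, (10, 4), strides=(2, 2),
                               padding="same", activation="relu"),
        Conv(64, (3, 3), padding="same", activation="relu"),
        Conv(64, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Compare parameter counts of the two variants.
print(kws_model(separable=True).count_params(),
      kws_model(separable=False).count_params())
```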

I will throw up some models that maybe can become part of a long term model zoo.

For KWS model benchmarks, Arm sort of kickstarted things with https://github.com/ARM-software/ML-KWS-for-MCU and then Google Research provided the kws_streaming framework in the google-research/google-research GitHub repo; both give a great overview of what was, and what is, the latest and greatest in KWS.

Also, a bit hidden away in TensorFlow is the MFCC front end for microcontrollers.


I added the 42io code to my https://github.com/StuartIanNaylor/ProjectEars, so it is just a folder, as I am going through this as I write. I did test this quite a while ago, so with my memory all is forgotten, but the original script seems to fail on some relatively pointless SHA checks, so the example in the above repo should just work.

In ProjectEars there is now a 42io folder; the dataset creation should run from ./tflite_kws/dataset/google_speech_commands/dataset/main.sh
To convert that .data file to .npz, in 42io/tflite_kws/dataset I added data2npz.py and also train.py, which should train from the kws.npz file you just created.
It took a long time and took up my system for most of the day, so I suggest an overnight run or don't bother, as the resultant dcnn.tflite is also there.
I also had to do one quick fix to the compile, but all binaries and fixes are included there.
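
If you want to sanity-check the kws.npz before burning a day on training, something like this works; the array key names are whatever data2npz.py wrote, so check npz.files if they differ:

```python
import numpy as np

# Print every array stored in the npz with its shape and dtype.
npz = np.load("kws.npz")
print(npz.files)
for name in npz.files:
    print(name, npz[name].shape, npz[name].dtype)
```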

So if you put your headset on, set the mic and source realtime.txt, you should be able to test the model that I just created from the 42io repo and find out it is pretty ropey, which is why I included it as a datum.
Why is it ropey? Firstly, the Google Command Set is an awful production dataset; it has just become the benchmark and often ends up in examples.
I ran the default dataset/main.sh zero one two three four five six seven eight nine, and the script goes away and creates a dataset from the Google Command Set & pdsounds.
What it does is add pdsounds as a noise class, mix noise into 1 in 5 KW samples, and use the rest of the words in the Google Command Set as #unknown#, which is a really bad selection of a few words that in no way represents the normal phonetic diversity of speech and is full of holes.

A KWS is a classification model and it has to make a choice, and using GSCv2 for unknown is a prime cause of the results we get. A secondary cause is that many of the KWs are very short words, and it is often surprising how fast we talk: most of those words could be trimmed into a 0.4 sec sample with room to spare, so in a 1 sec wav 60% is audio whitespace, which is bad.
It would be better to create the model with parameters that fit the sample length to the longest word, and try not to mix short words with long multi-syllable words.
The model doesn't care what the words are; it is purely comparing spectral images of a set time length, and you should always think about it that way.
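
A quick way to see how much audio whitespace a dataset carries is to silence-trim each clip and compare lengths; a rough sketch assuming librosa, with top_db just a starting point:

```python
import glob
import librosa

# Report what fraction of each clip is actual speech after silence trimming.
for path in glob.glob("dataset/kw/*.wav"):
    y, sr = librosa.load(path, sr=16000)
    trimmed, _ = librosa.effects.trim(y, top_db=30)
    print(path, f"{len(trimmed) / len(y):.0%} speech")
```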

I think 42io is one of the Google Research team doing a pet project with cascading models and the ESP32, and the rough dataset is quite deliberate: if you look at the 3-tier cascade models it cuts false positives from 4787 to 10, and this is why this benchmark dataset crops up so much. What he did works so well that on a production dataset it would always be near 100% with zero false positives, as would his lesser attempts, so as a benchmark it would be useless info to him and us.

No one ever really tells you how to make a production dataset, as 100% of the examples I have seen are benchmark examples or people copying them verbatim and thinking that is it, and it is not.
The model is perfectly accurate relative to the classification system you give it, and due to what has been provided it is 'overfitting', which you can see in the training graph.

My training run was the same; I even tried upping the dropout from 0.1 to 0.2, which often helps with overfitting, but it did exactly the same and jumped up to 92% accuracy on the 1st epoch.
There is so little cross entropy that it is not that the model is inaccurate; each class is such a huge target that the model is struggling to miss.
You will know when you get it right, as the 1st epoch will be 60% or lower and the learning curve much less steep, because it takes longer to learn a more specific, narrower classification.
The early stopping means training keeps going until val_loss remains unchanged for 20 epochs; that was over 400 epochs on my PC and far too many hours, and it reports 99% accuracy because, against that dataset, it is. When you do a model, though, that is it, job done, and it can be used by many, so I guess for dev time it is quite short.
With the data it has been provided and the classification structure created it cannot miss; then you start to use it with external data and as a model it is a total WTF.
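
For reference, the training setup being described boils down to dropout plus early stopping on val_loss with a patience of 20; a self-contained sketch with a placeholder model and random data standing in for the real DS-CNN and the kws.npz arrays:

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data so the snippet runs standalone; substitute the
# real DS-CNN and the arrays loaded from kws.npz.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 13, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.1),          # the dropout being discussed
    tf.keras.layers.Dense(7, activation="softmax"),
])
x = np.random.rand(256, 49, 13, 1).astype("float32")
y = np.random.randint(0, 7, 256)

# Stop when val_loss has not improved for 20 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, validation_split=0.2, epochs=1000, callbacks=[early_stop])
```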

The great stuff from 42io is the ESP32 repo and the models and methods, not the dataset; the dataset it creates is awful, and that is what I am going to tackle next.
I will create a 'Home Assistant' dataset and train with the very same model. It has 1 KW, 'Home Assistant', but contains 7 classifications; unusually 'Home Assistant' is 4 syllables and for once is probably a good fit for 1 sec samples.
I have to go through pdsounds and remove any samples with voice in them, as that minimises any shared cross entropy with unknown, which is voice.
https://pdsounds.tuxfamily.org/ is a good collection of noise, but it does have some voice contamination and it is much better to be without it.
The Multilingual Spoken Words Corpus (50 Languages and Over 23 Million Audio Keyword Examples | MLCommons) is a huge resource where we can brute-force all phonetic combinations by selecting a few examples of every word with 3 to 5 syllables to create unknown.

Then we are going to narrow the keyword 'Home Assistant' with similar-syllable words to create more cross entropy, mixing and combining words that contain 'ho, me, ass, ant' to create 's1.s3, s1.s4, s2.s3, s2.s4', using the great resource of MLCommons with its huge diversity of words and sample count. They act as additional syllable-specific 'unknowns' and will reduce false positives without the extra load of cascading models.

Unfortunately it doesn't end there, as there isn't a dataset that doesn't contain bad samples, and MLCommons is no different; in fact, for some words, the manner in which they have been extracted can be pretty bad.

So I will probably do a short training run, set to maybe 20-30 epochs max, then test the dataset on the trained model and start deleting the bad samples.
Usually that means KW samples with a very low score, unknown samples with a high score on the noise index, and then trimming the sounds-like classes by removing sounds-like samples with a very low score.
Usually I do that over 2 passes: a mild trim of just the very bad on the 1st, as even though a low quantity is removed it can make a big difference to the next train, which gets a few more epochs and a more aggressive, higher-quantity trim, where maybe up to 10% is removed but often much less, more like 1%.
Then a final training run, so I might be a couple of days, maybe longer, as it steals my PC and it is quite a long-winded, boring process.
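
The trimming pass is nothing fancy, roughly along these lines; the paths, class index, threshold and MFCC parameters are all assumptions and have to match whatever front end the training actually used:

```python
import glob
import numpy as np
import librosa
import tensorflow as tf

# Hypothetical saved non-streaming model, KW class index and cut-off.
model = tf.keras.models.load_model("non_stream_model")
KW_INDEX, THRESHOLD = 0, 0.2

# Score every KW clip and flag the ones the model barely recognises.
for path in glob.glob("dataset/kw/*.wav"):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T      # (frames, 13)
    score = model.predict(mfcc[np.newaxis, ..., np.newaxis],
                          verbose=0)[0][KW_INDEX]
    if score < THRESHOLD:
        print("candidate for removal:", path, round(float(score), 3))
```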

I will provide a ready-made model; it may not be great, but it will be good enough for use as a dev model, and one ready for the ESP32, though I will have to do the 8-bit XNN conversion afterwards, which is fairly easy.
I have never used https://github.com/espressif/tflite-micro-esp-examples#esp-nn-integration, but it should be interesting to see how the model performs on an ESP32-S3, as it should be around 7x faster than the ESP32, though ESP-NN optimises greatly for both.

It might be a while until I get round to doing a command model, as it is such a boring and long-winded process it sucks your life force, and I will not want to look at it for at least a week.

I will get the 1st 'Home Assistant' model done some time this week, and you can just test it with Python or the 42io methods with a mic.

PS: clones of the S3 have started to appear, so hopefully it will get ever cheaper despite being a new release.

I will post this now as I am doing it: I am slightly short of 'assistant' samples, I got over 600, but there is far too much of me in there :slight_smile:

That is combining home+assistant, not yet augmented (which will add variance) and not filtered (so there are probably bad samples in there), and it looks like the KW length will be 1.4 s, so I will have to modify the model slightly.

In https://github.com/StuartIanNaylor/ProjectEars I have a folder called 'reader', which is a CLI on-screen word reader used as a prompter, as it is much less prone to error than trying to extract words from sentences.
It would be really great to get a dataset and maybe common command words, as there are likely to be other words similarly in short supply. I can make do with what I have, but more is better, and regional and gender metadata is even better for providing targeted datasets.
If not a CLI, maybe a small web app doing similar; keep 'home' and 'assistant' separate, as the 3k+ 'home' & 600+ 'assistant' samples can quickly combine to make many more, as above.

60k samples is my usual minimum, as it balances with the big variance I want in noise and unknown, and I often do 100k.

Anyway, I just thought I would mention this now: if something were available, words and voices could trickle in so it is always available in future, and having a dataset for download would be great for others who may want to do something better than me.

[EDIT]
A reminder: I am once more wondering whether noise should or should not be added to unknown. Without noise, unknown would likely have less cross entropy with noise and likely make noise easier to pick up for an inverse VAD. The flipside is that the KW samples have noise mixed in, so not having noise in unknown means less cross entropy, but maybe the 'sounds-like' classes cover that.
I always add noise to unknown, and I always forget to try it without as a test, but I will run with it.


I am not a subject matter expert, but I think that your approach to the KWS fails a main requirement: make it distinguishable and short enough…

'Home Assistant' is too common and far too long. Look at the other industry KWs: Google, Alexa, Siri, Bixby, Genie, Almond. All of them are short, and there is nothing similar you can think of that could match them.

How often do you say 'Home Assistant'? How often does anyone say 'Home Assistant'? It is a squeeze to get it into 1 sec, but it can be done.
PS: it is "Hey Google" or "OK Google", which is nearly no different in length. Alexa is a very good KW as it has 3 distinct syllables which correspond to 3 distinct spectra around the phones.
Our intonation doesn't take word length into account; it is purely the phone count of the syllables that sets the length.
In terms of syllables, short is bad, and I would say Almond is the worst there; don't think of it as words, just as a purely phonetic syllable breakdown.
Anything with an 'and' in it is already halfway there.

There are 2 good KWs there when it comes to ease and spectral uniqueness; in a contiguous sequence 3 syllables is better than 2, and even though HomeAssistant would not have been my 1st choice, with some thought it is actually quite possible, and it is also the name of the project, but it is 5 syllables with clear spectra.
It was also chosen because I could already get many 'Home' and quite a few 'Assistant' samples, as getting a dataset together as an individual is often very hard.
I did think of 'Hey H A', as the continuous phonetic stream is not something I would say, but I would have to give a shout-out for users to submit 'H' & 'A' spoken out as words.

It doesn't really matter, as the KW is a choice and not mine to dictate, but 'HomeAssistant' as an example for Home Assistant seemed a good choice.
The main thing with KWS in general, due to how softmax works, is to have far more catch-alls than just unknown, or you will get many false positives.

Softmax, in very simple terms, is the KW value divided by the sum of the values of all the current classifications, and the math looks like this:
softmax(z_i) = exp(z_i) / Σ_{j=1..K} exp(z_j)

If you have a low number of classes you can get KW hits (false positives) not because the input is like the KW, but because it is not like the other classes, so its ratio becomes a hit.
Many examples out there do not have enough classifications, firstly to create cross entropy and narrow what is acceptable, but also to increase K in the above math.
Hence creating subsets of unknown based on phonetically similar words and word combinations that are not quite the same.
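
A toy illustration of the K point, with made-up logits: the same weak KW response scores far higher when there are only 2 classes than when the sounds-like classes are added.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

weak_kw = 1.0          # a mediocre match to the KW class
others = 0.5           # an equally mediocre match to everything else

two_class = softmax([weak_kw, others])                 # KW vs Unknown
seven_class = softmax([weak_kw] + [others] * 6)        # KW, Noise, Unknown, S1-S4

print("KW softmax, 2 classes:", round(two_class[0], 2))   # ~0.62 -> easy false positive
print("KW softmax, 7 classes:", round(seven_class[0], 2)) # ~0.22 -> rejected
```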

I am here to talk about any KW; 'HomeAssistant' is just an example, but yeah, they could probably use it if they wished, though that is not for me to say.
It is like 'Hey' & 'OK', which are 2 of the most common phrases ever, but a KW isn't words, it is a unique sequential phrase, as 'HeyGoogle' or 'OKGoogle' becomes, because it is not something we are likely to say. For me, I doubt I will ever sit in my room and say 'HomeAssistant', but their KW is not my problem.

Usually I use the Google Research framework & TensorFlow Lite on the Pi, which adds the additional noise for you, but while the bash script from 42io is amazing, I find Python easier, so I had to create something. I will have a 'HomeAssistant' model that is just KW, Noise & Unknown, and then the same model with KW, Noise & Unknown plus 4 additional 'sounds-like' classifications.
Purely as an example of how you can vastly improve the robustness of a KW with no increase in load; it is just more faff to create.

The biggest advantage of a shorter KW is less load, as the MFCC image for the DS-CNN (1 sec) is 49x13 32-bit floats and for 1.4 sec it jumps up to 69x13, so far more parameters and load.
It is down to choice, but always add more classes to catch unknown, as it will not add load, only accuracy.
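
The frame counts above only work out if the usual KWS front end of a 40 ms window with a 20 ms stride is assumed, which is easy to check:

```python
def mfcc_frames(clip_ms, win_ms=40, stride_ms=20):
    # Number of analysis frames for a clip of clip_ms milliseconds.
    return 1 + (clip_ms - win_ms) // stride_ms

print(mfcc_frames(1000))   # 49 frames x 13 coefficients for a 1.0 s clip
print(mfcc_frames(1400))   # 69 frames x 13 coefficients for a 1.4 s clip
```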

Also to mention, it is a bit of a squeeze on the ESP32, but the ESP32-S3 modules are starting to get clones and are much cheaper than at 1st release, and its vector instructions are almost 10x faster than the ESP32 where it counts, with FFT for MFCC and TensorFlow AI in general.
Still, the only 2 KWS models that run on micros, specifically ESP, are the CNN & DS-CNN, and I thought I would share the code for both models and mention what a great resource Multilingual Spoken Words | MLCommons is.

PS: this is what 'homeassistant' looks like and what we are really checking for.

The main idea is a prompt for a model zoo of ESP-targeted models designed for HA.

I don't know how it is for you, but from time to time I watch Home Assistant related videos from YouTube on a TV screen, and they do say "Home Assistant" quite a lot.
I do talk about Home Assistant with my wife as well, who just calls it “SmartShits”, after her previous experience with SmartThings.

Google and Amazon filtered out some pre-recorded voice patterns after some advertisements triggered mass activations.

I get all of your points, but think of real life conditions and not just mathematical models.

I think you might find that is true of all of them apart from Google, which creates voice profiles. Alexa is often referred to as "she who shall not be named".
It doesn't matter, as I am starting a conversation on training device models and asking whether there should be a model zoo, and also on using custom training via transformers so it works like Google.
This is all possible, as you train elsewhere and OTA the new model.
My internet has gone kaput, so I am waiting for an engineer and using my phone, so verbosity is not good.

For hardware, look at the openHASP project and the switch-plate devices with an ESP32; that would make sense to combine with.

Hey Hassio


Yes, that would seem a good idea; there is also the possibility of relaying the softmax hits of several distributed KWS devices to select the best mic stream rather than just a single device.
The idea isn't to be specific to hardware, just to offer a model zoo that is Hass-centric.
Posting here as I presume there are also ESPHome & Tasmota devs, as a Hass KWS server could be a good idea that merely selects the best softmax and cancels the other streams.

[Edit]

I added the same model as dscnn.tflite and sl-dscnn.tflite, as an example that the extra classes make no difference to performance, and the dataset.npz is in there so you can see how much less steep the accuracy curve is when training.
I should have edited the train script to save the non-streaming model so I can use that model with the dataset to trim out bad samples.
Usually I do this in 2 passes: the 1st after a relatively short train of 20-30 epochs, which is a light trim of the worst offenders, as MLCommons is great but it has its fair share of bad samples and those few can have a big effect.
Using the model you are training makes it fairly easy, but you will have to look at the non-streaming Python examples in https://github.com/StuartIanNaylor/ProjectEars
Then, on that new dataset (where you would likely create excess samples and replace the bad ones, as 42io seems to insist on exactly balanced classes), do a longer train and a more aggressive trim.
Then a 3rd and final long train where you will regain much accuracy but also gain the advantage of the additional sounds-like unknowns.

The database file used to fetch files is included in the above as an example (SQLite).
Also, 42io uses an argmax count method which I don't think is that great, as due to intonation and 3rd-party noise part of the stream can sometimes dip and not be the argmax, even though overall the KW is strong.
The count method often needs to be on the short side to work and is not as accurate as a 'running average' that is reset on a KW hit, which again is in the Python examples of https://github.com/StuartIanNaylor/ProjectEars
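
The running-average idea is just posterior smoothing with a reset; something like this sketch, where the window length and threshold are tuning assumptions:

```python
from collections import deque

class RunningAverageTrigger:
    def __init__(self, window=8, threshold=0.8):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def update(self, kw_softmax):
        # Average the KW softmax over the last 'window' frames and fire
        # when the window is full and the average crosses the threshold.
        self.scores.append(kw_softmax)
        avg = sum(self.scores) / len(self.scores)
        if len(self.scores) == self.scores.maxlen and avg > self.threshold:
            self.scores.clear()      # reset on hit so one KW fires once
            return True
        return False

trigger = RunningAverageTrigger()
for frame_score in [0.2, 0.3, 0.9, 0.95, 0.92, 0.97, 0.99, 0.96, 0.94, 0.91]:
    if trigger.update(frame_score):
        print("keyword hit")
```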

Pardon my ignorance, but does your work provide an open-source way to set a custom wake word for Espressif SoCs? There is nothing on how to create your own wake word in the ESP docs; the only way is to "buy" a model from them: Espressif Speech Wake-up Solution Customization Process - ESP32-S3 — ESP-SR latest documentation

They actually say what model they are using, and I know a framework that creates those models: thanks to Google Research, the BC-ResNet that Espressif use has a training framework.
Espressif will create you a model, or you can create your own KWS and just use the ADF for AEC-NS-BSS.

You need to enable padding in the training framework.

You can use Tflite4Micro, which has a front end that should couple to the ADF.
It is in here: tflite-micro-esp-examples/components/tflite-lib/tensorflow/lite/experimental/microfrontend/lib (in the espressif/tflite-micro-esp-examples repo on GitHub)

It is basically an AGC and MFCC feed for the model(s). I say model(s) as I have been wondering how they select the right stream from the BSS (blind source separation), and I am pretty sure they use a model for each stream (nMics).

Espressif DL seems very PyTorch/ONNX friendly and likely a good way to go; I just know Tflite4Micro much better, or should say I haven't tried ONNX much.