Hi, wondering if there are any devs out there who could help.
I have been frustrated by the lack of accurate open-source KWS models for well over a year, maybe two!
I have found a method to reduce false positives substantially that has no overhead apart from the work of creating the dataset.
I will tell you what that is and then explain why. Basically I have been able to do this for a while with 'my voice', but failed to find suitable datasets until MLCommons Multilingual Spoken Words was released, with 50 languages and over 23 million audio keywords.
The process is really simple: add 'sounds-like' classifications around the keyword to create more cross entropy and force the model to train harder to distinguish, and those classifications also act as an additional catch-all alongside what is commonly known as the 'unknown' classification.
I have been using the syllable tokenizer from NLTK to create a syllable database of the words in an MLCommons language dataset.
So if I take 'home assistant' with s1='ho', s2='me', s3='ass' & s4='ant', I query the database (built from silence-trimmed words) and combine to make the 'sounds-like' classifications s1.s3, s1.s4, s2.s3 & s2.s4.
Unknown is still used, but a query makes sure that only words not already covered by a 'sounds-like' classification become part of unknown; otherwise the same input would share cross entropy between classes and produce a lower softmax.
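A minimal sketch of those two steps, using NLTK's sonority-based SyllableTokenizer. The word list, label format and pairing scheme here are my own illustration of the idea, not a fixed recipe:

```python
# Build a syllable database and derive 'sounds-like' classes plus a disjoint
# 'unknown' pool (tiny hypothetical vocabulary for illustration).
from itertools import product
from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()

def syllables(word):
    # e.g. 'assistant' -> something like ['as', 'sis', 'tant']
    return ssp.tokenize(word.lower())

# Syllable -> words index over the dataset vocabulary.
vocab = ["home", "assistant", "hose", "ant", "distant", "roam", "cat"]
index = {}
for w in vocab:
    for s in syllables(w):
        index.setdefault(s, set()).add(w)

# Cross-combine syllables from the two KW words into 'sounds-like' labels.
h, a = syllables("home"), syllables("assistant")
sounds_like = [x + "." + y for x, y in product(h, a)]

# Keep 'unknown' disjoint: drop any word that already feeds a sounds-like class.
sounds_like_words = set().union(*(index.get(s, set()) for s in h + a))
unknown = [w for w in vocab
           if w not in sounds_like_words and w not in ("home", "assistant")]
print(sounds_like, unknown)
```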
In the English dataset I grab 'Home' & 'Assistant' and combine them to get the KW.
I have been using the pdsounds dataset for noise.
So you have KW, Noise, Unknown, S1, S2, S3 & S4 as classifications, and on any model that structure will be far more robust to false positives.
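As a rough picture of what that looks like in training code, here is a minimal Keras head with those seven classes. The 49x40 MFCC input shape and the layer sizes are illustrative only, not a recommendation over the DS-CNN discussed later:

```python
import tensorflow as tf

CLASSES = ["kw", "noise", "unknown", "s1.s3", "s1.s4", "s2.s3", "s2.s4"]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),          # MFCC 'image'
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```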
A 2nd model is more standard: a command model where the KWs range over 'Turn, Off, On, Light, Fan, Play, Stop, Pause, Resume' plus the usual Unknown & Noise.
The accuracy of the 1st KW model is critical, and 'Home Assistant' is actually a great, unique multi-syllable KW; the 2nd model only runs after a 1st-model hit.
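A hedged sketch of that two-stage gate. Both models here are random stand-ins for the real TFLite interpreters, and the threshold is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
KW_INDEX = 0          # index of the 'kw' class in the 1st model's softmax
WAKE_THRESHOLD = 0.9  # tune against your false-positive budget

def wake_model(window):     # stand-in: softmax over the 7 classes above
    p = rng.random(7)
    return p / p.sum()

def command_model(window):  # stand-in: softmax over ~11 command classes
    p = rng.random(11)
    return p / p.sum()

def on_audio_window(window):
    if wake_model(window)[KW_INDEX] >= WAKE_THRESHOLD:  # 1st-model hit
        return int(np.argmax(command_model(window)))    # run the 2nd model
    return None                                         # stay asleep
```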
So going back to Why?
Classification models are purely graphs that organise images by parameters; there is no intelligence in this AI.
The standard model in many KWS examples littered across the internet is a binary model with 2 classifications, KW & Unknown. That is a really bad binary split: a very narrow variance for KW against a huge sprawling variance for Unknown, which creates much less cross entropy. Unknown becomes a see-saw where adding selected words or noise to stop false positives changes the balance so that false positives occur elsewhere, whilst never gaining any more cross entropy.
Some other models add a 'noise' classification so 'unknown' becomes speech alone, hugely increasing cross entropy and reducing false positives, but due to the complexity of language everything but the KW is still huge in variance and still lacks cross entropy.
The answer is additional 'sounds-like' classifications: known syllable-alike classes that are much easier to deal with than the infinity and beyond of unknown. Two-word KWs such as 'Home Assistant' & 'Hey Google' work really well here, as you can combine words that give similar spectral images.
As said, your model becomes far more accurate against false positives and generally more accurate overall, because during training TensorFlow is forced to work harder to find parameters that differentiate the classes.
Hardware
The vector instructions in the ESP32-S3 for both MFCC & AI mean the LX7 is up to 10x faster than the LX6 in the original ESP32, and all this can definitely run on a single core with room to manoeuvre. It will also run on the original ESP32, as https://github.com/42io/esp32_kws demonstrates, but it hogs a lot of resources; dropping to a standard CNN would likely halve the load of the demonstrated DS-CNN (the repo refers to it as a DCNN; I know it as a DS-CNN) and lose less than 3% accuracy.
So the ESP32 is very possible; I am just aiming to get the best results first and see how that goes down in use. If anyone has used the ESP32-S3-Box reference design as I have, you will find it is actually pretty impressive and does well, but compared to a Google Nest Audio it is not great.
It is strange really, as it has 2 mics with AEC & BSS that incur a lot of load, so it uses a much lesser, heavily quantised KWS model to get everything to fit on one of the larger-spec S3s.
The AEC & BSS are not great and fail at relatively low noise thresholds, and with some lateral thought you don't need far-field distributed mic arrays, you just need distributed KWS that are accurate.
Implementation
This is the bit I could really do with some help with, as I want a Home Assistant server that will receive the softmax (KW probability 0-1) value from several distributed KWS devices and pick the client with the highest KW softmax, which should represent the best signal.
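A hedged sketch of that arbitration: each client sends its KW softmax, and the server collects reports for a short window after the first one and picks the highest. The transport (UDP), port and message format are my own assumptions, not an existing Home Assistant API:

```python
import json
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5005))

def arbitrate(window_s=0.25):
    """Return (client_id, softmax) for the best-signal client."""
    best, deadline = None, None
    while True:
        sock.settimeout(None if deadline is None else
                        max(deadline - time.time(), 0.01))
        try:
            data, _addr = sock.recvfrom(256)
        except socket.timeout:
            return best
        msg = json.loads(data)            # e.g. {"id": "kitchen", "kw": 0.93}
        if deadline is None:
            deadline = time.time() + window_s  # open the arbitration window
        if best is None or msg["kw"] > best[1]:
            best = (msg["id"], msg["kw"])

while True:
    winner = arbitrate()
    print("best signal from", winner)     # hand off to the command stage here
```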
Now if someone wants to do something where this then streams to ASR, please do, but I am concentrating on KWS, where the KWS in a zone has a simple command classification of approx 10-30 KWs that covers a huge amount of common controls.
Over the last 2 years I have been testing a plethora of devices, funky microphones and various open-source projects, and none of it works well, as often it is not integrated or totally lacks the low-level DSP or sound engineering to make it work well.
What I have done is not clever, it is pure lateral thought: increase the accuracy of KWS substantially, firstly by using a better model and secondly by using a better dataset and classification system to reduce false positives; then cast off the relatively pointless, poor, high-load far-field implementations and replace them with multiple KWS devices, each embedded with a sensor or actuator and multi-purpose, where hopefully one is placed in near proximity to you rather than to noise.
You scrap the idea of fighting audio physics with powerful technology and replace it with multiple low-cost devices, and it works much better than anything else we have short of the likes of Google and Amazon.
The real star of the show is 42io https://github.com/42io/esp32_kws. Whilst creating the dataset and training is simple, it can be a really finicky, tedious bore, so I am doing that currently for a 'HomeAssistant' KW.
If you have 10-30 command words, I am up for suggestions along the lines of 'Turn, off, on, lights, fan, open, close, curtain, blind…' and will see what words I can extract out of MLCommons (a helper for that below).
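A hedged helper for pulling command-word counts out of an MLCommons MSWC split file. I am assuming a CSV with a WORD column; check the column names in your copy of the splits files and adjust:

```python
import csv

wanted = {"turn", "off", "on", "lights", "fan", "open", "close"}

def count_keywords(split_csv):
    counts = {w: 0 for w in wanted}
    with open(split_csv, newline="") as f:
        for row in csv.DictReader(f):
            word = row["WORD"].lower()
            if word in counts:
                counts[word] += 1
    return counts

# e.g. print(count_keywords("en_splits.csv")) for per-word clip counts
```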
Also if you really want to go to town, we can use 'on-device training', where a global language model is adapted by a small model trained locally on captured use. It is quite simple to implement, as 'on-device training' is a misnomer: it is actually done on the server. That leaves me with another question: how can you OTA a model from the local network, as that is all that is needed?
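One hedged answer to my own OTA question: serve the updated .tflite over plain HTTP on the LAN from the training box, and have each client poll a small version file and re-download when it changes. Directory, port and filenames here are illustrative:

```python
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

os.chdir("models")  # contains e.g. kws.tflite and kws.version
HTTPServer(("0.0.0.0", 8000), SimpleHTTPRequestHandler).serve_forever()
```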
I work in fits and starts but will probably post some models and Python evaluation scripts this week; anyone can get a pretty great KW working by using the methods I just detailed.
I have a dev repo @ https://github.com/StuartIanNaylor/ProjectEars which is of no consequence, but there is a load of tools there that will save you some time; hack, grab and refactor at will, as I have no long-term interest in supporting this stuff and someone will do a much better and more eloquent job of supporting it.
It is actually quite easy to get much, much better results than we currently have, and without the likes of Google & Amazon.
https://github.com/42io/esp32_kws is such a great reference, but do a CNN rather than a DS-CNN on the ESP32 if not using an S3; it will fit, but things will be a squeeze.
There are other models, but RNN layers such as GRU or LSTM are absent from many of the frameworks we have, which excludes many models. It doesn't matter that much, as following the classification and dataset methods above will give you a far greater bump in accuracy than any model change.
I will throw up some models that may become part of a long-term model zoo.
For KWS model benchmarks, Arm sort of kickstarted things with https://github.com/ARM-software/ML-KWS-for-MCU and then Google Research provided https://github.com/google-research/google-research/tree/master/kws_streaming; both give a great overview of what was and what is the latest and greatest in KWS.
Also a bit hidden away in TensorFlow is the MFCC front end for microcontrollers.
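That front end lives under tensorflow/lite/experimental/microfrontend; a hedged usage sketch (the parameter values are illustrative, and the dummy audio is just a placeholder for real 16 kHz PCM):

```python
import tensorflow as tf
from tensorflow.lite.experimental.microfrontend.python.ops import (
    audio_microfrontend_op as frontend_op)

audio = tf.zeros([16000], dtype=tf.int16)  # 1 s of 16 kHz PCM (placeholder)
features = frontend_op.audio_microfrontend(
    audio, sample_rate=16000, window_size=25, window_step=10, num_channels=40)
print(features.shape)  # [num_frames, num_channels] filterbank features
```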