I tried ODAS, so that's an X on that one as the quality is bad. In fact, all beamforming is bad if there is no mechanism to lock the beam.
This would likely be done via a KWS that can attain a target: you collate beamform info for the KW period and maintain that beam until the command sentence stops.
Delay-sum is the simplest sort of beamformer and I did an example, if someone ever wants to tidy and optimise my first-ever C/C++ hack.
Margin is the max TDOA in samples, dictated by mic distance (mm) divided by the distance sound travels in one sample period (speed of sound, 343 m/s = 343,000 mm/s, divided by the sample rate). 60 mm mic spacing @ 48 kHz = 60 / (343000 / 48000) ≈ 8.39, so an integer margin of 8. 16 kHz only gives 2.8, so of no use.
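For the curious, the core loop really is tiny. A minimal untested sketch of the summing part (2 channels, a fixed integer delay picked from the TDOA and bounded by the margin above):

```cpp
#include <cstdint>
#include <cstddef>

// 2-channel delay-and-sum core: align ch1 to ch0 by an integer sample
// delay (bounded by the TDOA margin) and average the pair.
// delay > 0 means ch1 leads ch0 by that many samples.
void delay_sum(const int16_t *ch0, const int16_t *ch1,
               int16_t *out, size_t n, int delay)
{
    for (size_t i = 0; i < n; ++i) {
        long k = (long)i - delay;                         // aligned index into ch1
        int32_t a = ch0[i];
        int32_t b = (k >= 0 && k < (long)n) ? ch1[k] : a; // frame edge: reuse ch0
        out[i] = (int16_t)((a + b) / 2);                  // sum and halve to avoid clipping
    }
}
```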
/tmp/ds-out contains the current TDOA, so poll it to set LEDs. To monitor: watch -n 0.1 cat /tmp/ds-out
To fix the beamformer, write a file to /tmp/ds-in: echo 1 > /tmp/ds-in sets the beam to a delay of 1. Delete the file to clear the lock and go back to following the TDOA.
You need some feedback so the KWS controls the beam, and all of that is currently sort of missing. That is why conference mics and speakerphones don't work all that well with any dynamic noise: they are free-floating, beamforming to the predominant input, with no target held for the full command sentence.
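The glue for that is simple enough; a rough sketch assuming the /tmp/ds-in and /tmp/ds-out interface above, with stub kws_hit() and sentence_done() hooks standing in for whatever KWS/VAD you actually run:

```cpp
#include <cstdio>

// Stub hooks for illustration only: wire these to your real KWS / VAD.
static bool kws_hit()       { return false; }
static bool sentence_done() { return true;  }

// On a KW hit, freeze the beam at the current TDOA for the whole
// command sentence, then release it back to free tracking.
void beam_lock_loop()
{
    for (;;) {
        if (!kws_hit())
            continue;

        int tdoa = 0;
        if (FILE *in = fopen("/tmp/ds-out", "r")) {   // current TDOA estimate
            fscanf(in, "%d", &tdoa);
            fclose(in);
        }
        if (FILE *out = fopen("/tmp/ds-in", "w")) {   // lock the beam there
            fprintf(out, "%d\n", tdoa);
            fclose(out);
        }

        while (!sentence_done()) { /* stream the locked beam to ASR */ }

        remove("/tmp/ds-in");                         // clear lock, track TDOA again
    }
}
```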
I used delay+sum on a simple 2-mic because with the more advanced algorithms and more mics, the computational needs rise exponentially.
There is MVDR / GSC and a whole rake of beamforming algorithms that are only really suited to high-speed DSPs.
So no to the Respeaker 6-mic; use their DSP via the USB version instead.
Speed of sound is 343 m/s, so 343,000 mm/s; divide that by your sample rate, say 48 kHz, = 7.14 mm for a single sample difference, and really you don't want to be going above approx 65 mm spacing due to spatial aliasing.
But @ 71.45 mm you get a 10-sample difference to do your beamforming on, and that is the only reason MEMS mics are beneficial: they hold really tight tolerances.
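If you want to play with the numbers, it is one line of arithmetic; a quick sketch that reproduces the figures above:

```cpp
#include <cstdio>

int main()
{
    const double speed_mm_s = 343000.0;              // speed of sound in mm/s
    const double rates[]    = {16000.0, 48000.0};    // sample rates
    const double spacing[]  = {60.0, 65.0, 71.45};   // mic spacings in mm

    for (double rate : rates) {
        double mm_per_sample = speed_mm_s / rate;    // distance sound travels per sample
        printf("%5.0f Hz: %.2f mm per sample\n", rate, mm_per_sample);
        for (double d : spacing)
            printf("  %6.2f mm spacing -> max TDOA %5.2f samples\n",
                   d, d / mm_per_sample);
    }
    return 0;
}
```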
So no to beamforming, as wow, it takes some computational power, and unless you're Amazon and bake it into silicon it's probably a big no, as they often run above 48 kHz to garner a bigger sample difference.
Also, watching how Google proceeded: they copied Amazon at the start, then thought hold on, this beamforming ain't that great anyway, and dropped to 2 mics with some sort of BSS instead (VoiceFilter-Lite), as likely, in that 80/20 rule of smart speakers being a domestic consumer product, the biggest noise source is often very spatially distinct from the voice command.
So likely they use a DUET-type 2-channel BSS and maybe even run without AEC.
Out of the two, Amazon is losing huge sums of money on this, whereas for Google not so much.
I have never found any realtime Linux code for 2-channel BSS, and I have been searching a long time; it will take a better opensource DSP guru than me.
Espressif do have a binary blob in their ADF:
https://www.espressif.com/en/solutions/audio-solutions/esp-afe
Any S3 will do: likely an ADC module and 2x MAX9814, and a £7 ESP32-S3 like the T7-S3 from LILYGO (it's so cute) could do the job; just use the esp-afe (audio front-end) and websockets libs (TCP).
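The upstream side is just shovelling PCM frames over a socket. A minimal sketch assuming plain TCP via lwIP's POSIX sockets rather than a websockets lib; fill_frame() is a hypothetical stand-in for the I2S/esp-afe fetch, and the server IP/port are made up:

```cpp
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

static const char  *SERVER_IP   = "192.168.1.50"; // assumption: your upstream Pi4
static const int    SERVER_PORT = 9000;           // assumption: made-up port
static const size_t FRAME_SAMPLES = 512;

// Hypothetical stand-in: replace with your I2S read or esp-afe fetch.
static size_t fill_frame(int16_t *buf, size_t n)
{
    memset(buf, 0, n * sizeof(int16_t)); // stub: streams silence
    return n;
}

void stream_task()
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(SERVER_PORT);
    inet_pton(AF_INET, SERVER_IP, &addr.sin_addr);
    if (connect(sock, (sockaddr *)&addr, sizeof(addr)) != 0) {
        close(sock);
        return;
    }

    int16_t frame[FRAME_SAMPLES];
    for (;;) {
        size_t got = fill_frame(frame, FRAME_SAMPLES);
        if (got == 0) break;
        // send raw little-endian PCM; the Pi end just reads and feeds its KWS
        if (send(sock, frame, got * sizeof(int16_t), 0) < 0) break;
    }
    close(sock);
}
```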
I hate the term satellite, as there is no such thing: in automation we deal with devices, and the term satellite seems to assume a mic system, pixel indicator, and audio out, whilst for me those are all separate items allocated to a zone or subzone (assembly), but essentially separate devices.
The ESP32-S3 Korvo is far too expensive, even the ESP32-S3-Box, as what you should be able to do is position several mics around a zone so that pickup is always favourable against any sources of noise.
Likely 2 or 4 mics arranged to suit the room setup, where simple positional physics is cheaper and of higher quality than trying to extract a voice command from a tsunami of noise.
On the ESP32-S3 the BSS splits the signal into 2 spatially different signals, and I have a hunch they run 2x KWS and the one with the hit supplies the stream for that command sentence.
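I.e. something like this; kws_score() is a hypothetical stand-in for whatever KWS you run per stream:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical stand-in: score one stream's frame for the keyword, 0..1.
static float kws_score(const int16_t *frame, size_t n)
{
    (void)frame; (void)n;
    return 0.0f; // stub: wire to your actual KWS
}

// After BSS splits the capture into two spatially distinct streams,
// run a KWS on each and let the winner own the command sentence.
// Returns 0 or 1 for the stream to route to ASR, -1 for no hit.
int pick_stream(const int16_t *bss0, const int16_t *bss1,
                size_t n, float threshold)
{
    float s0 = kws_score(bss0, n);
    float s1 = kws_score(bss1, n);
    if (s0 < threshold && s1 < threshold)
        return -1;
    return (s0 >= s1) ? 0 : 1;
}
```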
I would jettison all the dross off an ESP32-S3-Box anyway: who needs a screen so small that you have to be close to read it? At that distance voice control is pointless and it might as well be a switch.
The onboard amp is toy-like, and so is the TTS, and the ASR is pretty limited.
So I am an ESP32-S3 fan: 2x I2S ADC modules (the PCM1801, is it?) are available on AliExpress, as are MAX9814 mics, so just hook up to the AFE and even just send the 2 streams to a Pi4 for KWS.
I think with TFLite4Micro we can do exactly what David Scripka has done, but to be honest I could not care less about custom KWs; I just want a KW that works 100% and well.
Big data does this via high-quality datasets of use: they collate metadata on hardware and user, and have created huge high-quality datasets.
Allow users to opt in to collate a dataset of use, but also collate KW and command sentences locally and use transfer learning, where a smaller on-device-trained KWS model biases the weights of the pretrained one, so it creates a profile of use.
I think upstream, on a Pi or otherwise, due to the huge idle periods, it can slowly tick away training small models on the captured data and OTA the KW model to the KW devices.
So it's not exactly on-device training, it's upstream on-device training, and the KWS will get more accurate through use as it adapts to the users and the hardware it consists of.
A Pi Zero is like the M5, but at least you have a choice of audio hardware.
The 2-mic HAT is often placed facing the ceiling, so you lose all the rear rejection the PCB gives; a right-angle connector, or wall mounting, should fix that.
It's not capable of any algorithms or KWS, but like the M5 it could transmit the streams upstream, and it gives you the choice to improve things with mic preamps and analogue AGC, or just use the 2-mic as-is: it's not that bad, and an improvement over the M5 for not much more.
A Pi Zero 2, or any of the many really great clones we now have on a Cortex-A53/A55, has more power than an ESP32-S3, but we don't have any opensource BSS, say like this commercial one:
https://vocal.com/blind-signal-separation/degenerate-unmixing-estimation/
So for me currently it's only the ESP32-S3, and one that only broadcasts on a KW hit, as I hate the idea of 24/7 bugging and eavesdropping on my own home wirelessly.
I think any $20 SBC should do an even better job, but until some opensource C/C++ DSP guy(s) donate their time to create some form of BSS, we are stuck with either beamforming, which I think is a bit meh, or an ESP32-S3 and the Espressif blobs.
I should say though, there are some amazing filters: has anyone ever checked the quality of the LADSPA filter? It will run on a Pi5 (it runs on an Opi5 with just the odd buffer underrun) via ALSA or PipeWire, and its quality is actually RTX Voice-like. (Really, if someone dropped the Sonos tract engine and ported it to TFLite or ONNX, it would not be single-thread-only and would not need one big core.)
So upstreaming full-bandwidth audio to a filter is also another choice.
Also, allow users to opt in and donate metadata-rich data to the dev/provider to make better KWS & ASR, as we can build great datasets and collate as we go along.
We just need hardware, gender, age band, region, and native language; we do not need your name or address…
Then opensource can start to compete with the huge dataset advantage big data has.