[Voice PE] Only complaint: Ability to distinguish one voice from another not good

I’ve been living with and using the Voice PE for about a week now, and generally I’m really impressed with the whole system. Especially with an LLM conversation agent, it seems a lot “smarter” than my Google Homes ever were.

However, it seems to not be very good at distinguishing one speaker from another. For example, if I ask for the weather forecast but have a YouTube video playing in the background, it will take much longer to finally respond (I guess because it’s waiting for the YouTube speaker to pause haha) and then answers with something like "the current temperature is 2 degrees, but I’m not sure what to do with the fact that the Samsung S24 is much better than last year’s model" :joy:

Which, granted, it handles magnificently every time, but that’s not the desired behaviour haha. I’ve already set the speaker cutoff detection to aggressive; are there any other tips to improve this maybe?


The model from XMOS is purely a far-field voice model; it doesn’t have any form of targeted voice extraction.
It’s always going to have problems with the ‘cocktail party’ problem of multiple voices.
Maybe XMOS will release a model where you enroll a profile for targeted voice extraction, but for now low-volume concurrent speech is about the only thing that won’t affect it much.
TV and YouTube speakers are likely always going to be a problem without targeted voice extraction.

In theory, with advanced sound processing, it should be able to distinguish the physical direction of the sound source, which could be used to detect who triggered the wake word and then listen only to the person in that direction.

In practice, this has not yet been implemented in the PE. I also have concerns about whether it would be practical with only two microphones. Most devices that pinpoint sound sources use four or more microphones, so they can do the complicated maths to resolve which angle and azimuth the audio is coming from.

A hack you could use to help your situation, and something I will implement later today at home, is to duck or mute sound from other media players around the home whenever a wake word is triggered. This, of course, requires pretty much every media player in the home to be under Home Assistant’s control, which is not a problem for me, but may be for you.
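
Something like this minimal Python sketch against the Home Assistant REST API would do the ducking part. The host, token, entity IDs, and volume levels are all placeholders, and wiring `duck_all`/`restore_all` up to whatever fires on wake word is left to your setup:

```python
# Minimal sketch: duck other media players via the Home Assistant REST API
# whenever the wake word fires. Host, token, and entity IDs are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"   # assumed HA host
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # create under your HA user profile
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

MEDIA_PLAYERS = [                            # placeholder entity IDs
    "media_player.living_room_tv",
    "media_player.kitchen_speaker",
]

def set_volume(entity_id: str, level: float) -> None:
    """Call the media_player.volume_set service for one player."""
    requests.post(
        f"{HA_URL}/api/services/media_player/volume_set",
        headers=HEADERS,
        json={"entity_id": entity_id, "volume_level": level},
        timeout=5,
    )

def duck_all(level: float = 0.1) -> None:
    """Drop every known player to a low volume while the assistant listens."""
    for entity in MEDIA_PLAYERS:
        set_volume(entity, level)

def restore_all(level: float = 0.5) -> None:
    """Bring the players back up once the voice interaction is done."""
    for entity in MEDIA_PLAYERS:
        set_volume(entity, level)
```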

With 2 mics, unless the array is facing the sound, you cannot do that, as it cannot differentiate front from back.
You need a minimum of 3 mics in a triangle.
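
To make that concrete, here is a minimal sketch (far-field assumption, made-up 70 mm mic spacing) of why a 2-mic pair is front/back blind: the time difference of arrival depends only on the cosine of the angle from the mic axis, which is identical for a source mirrored behind the array.

```python
# Minimal sketch (far-field assumption, placeholder 70 mm spacing) of the
# front/back ambiguity of a 2-mic pair: the time difference of arrival (TDOA)
# depends only on cos(angle from the mic axis), and a source mirrored across
# the line through the two mics has the same angle cosine.
import math

C = 343.0    # speed of sound in air, m/s
D = 0.070    # mic spacing in metres (placeholder value)

def tdoa_seconds(angle_deg: float) -> float:
    """Far-field TDOA between the two mics for a source at angle_deg,
    measured from the axis running through both microphones."""
    return (D / C) * math.cos(math.radians(angle_deg))

front = tdoa_seconds(60.0)   # source 60 deg off-axis, in front of the pair
back = tdoa_seconds(-60.0)   # its mirror image behind the pair
print(front, back)           # identical delays: the pair cannot tell them
                             # apart; a 3rd mic breaks the symmetry
```
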
I hacked together a realtime 2-mic delay-sum that uses an algorithm to get the time delay in samples between the 2 mics.

It was GitHub - robin1001/beamforming that really did the maths work; I just added PortAudio and mangled the C code a bit to make it realtime rather than file based.
The algorithm is called GCC-PHAT (GitHub - FrancoisGrondin/gccphat), so for 3 mics it would be nMic − 1 computations, with the 1st mic always being the reference point.

I did a delay-sum: once you have worked out the delay, you grab that mic’s sample shifted by the delay (+/−) and sum it with the ref mic. It’s not a great method, but that bit is very simple compared with the FFT work needed to match the two signals so you can work out the delay in the first place.
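
For anyone who wants to poke at the idea without the C code, here is a toy NumPy version of the same GCC-PHAT plus integer delay-and-sum approach. This is my own sketch, not the robin1001 or FrancoisGrondin implementations:

```python
# Toy NumPy sketch: GCC-PHAT delay estimation followed by a crude
# integer-sample delay-and-sum. Same idea as described above, in a few lines.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in samples.
    PHAT weighting whitens the cross-spectrum so the peak sharpens."""
    n = len(sig) + len(ref)                   # zero-pad to avoid wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                    # cross-power spectrum
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)  # PHAT: keep phase only
    max_shift = n // 2
    if max_tau is not None:                   # optionally bound the search
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return float(np.argmax(np.abs(cc)) - max_shift)  # +ve means sig lags ref

def delay_and_sum(ref, sig, delay):
    """Shift `sig` back by the estimated delay and average with `ref`.
    np.roll wraps at the edges, which is fine for a toy demo."""
    aligned = np.roll(sig, -int(round(delay)))
    return 0.5 * (ref + aligned)

# Tiny self-test: a noise signal "heard" 12 samples later on the second mic.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(fs)
sig = np.roll(ref, 12)                        # second mic, 12 samples late
d = gcc_phat(sig, ref, fs)
print(d)                                      # ~12.0
summed = delay_and_sum(ref, sig, d)           # coherently sums the two mics
```
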
The XMOS isn’t beamforming though. I think it’s using something similar to GitHub - yluo42/TAC (the transform-average-concatenate (TAC) method for end-to-end microphone permutation and number invariant ad-hoc beamforming), but that is guesswork. Some are evolving TAC to use voice embeddings to steer a similar model and get targeted voice extraction.
It’s sort of beamforming, as the phase information acts as a hint to the model to help it extract voice from a noise source, including its own reverberation.
The current model from XMOS is closed source; you just get the model, and obviously they keep much to themselves and have a pretty good knowledge of what their NN (ML) libs are capable of…
That is, if you were going to use the XMOS; they also provide the adaptive AEC, which supposedly works really well (ignoring whatever it is playing, even if it’s voice).
The XMOS model is something like TAC, but they have managed to get it thin enough to run on a microcontroller, and I don’t even think it’s quantised, as the output is 32-bit.
It’s a guess, but the results are sort of similar to TAC’s, and not the standard beamforming of the early smart speakers.

If it was targeted, it would look for a voice that is in its profile, like the Google Nest Audio does (with a max of 6 individual users). That feature is called Voice Match, based on VoiceFilter-Lite, which unfortunately they keep to themselves…


Here you go :wink:
