I was able to connect an I2S mic to an ESP32 and set everything up so that I can send some voice commands to HAOS.
It sort of works, but it drives me crazy when it understands “cut” instead of “cat”, “China Town” instead of “Chinatown” and so on.
I thought (and tried) setting some aliases to allow variations, which helps, but it’s turning into too many permutations.
I’m thinking the signal/noise levels might be “optimizable”, but I can’t figure out any way to actually listen to the audio that gets processed.
Is there a way? If not, could one be implemented?
Thanks
Is there any way to "listen" to the voice going to the assistant from the ESP32, for debugging purposes?
Separately but related (it’s still CRUCIAL to be able to listen to the audio): is there a way to “connect” similar-sounding words to ONE word? Using the above example, to make it understand “Chinatown” even if it thinks it’s “China town”?
I don’t know how to do this, or if it is even possible, but if you make sure to add the
logger:
component to the ESP YAML, you can see the transcribed words.
That’s how I see “cut” instead of “cat”, for example.
Useful but not nearly enough.
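In case it helps, a minimal sketch of where that sits in the ESPHome device YAML (only the logger: block matters here; the level shown is the default anyway):
logger:
  level: DEBUG  # DEBUG (the default) is enough to show the recognized text in the log stream
Watching the log stream (ESPHome dashboard or esphome logs) then shows the transcribed text as described above.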
Not sure if there is any difference between the Piper language models.
For reference, I am running
en-us-libritts-high
as my Piper TTS model, and
--model small --language en
in Whisper for STT.
Sorry; I’m not sure how this might help. Just wondering if we are using different speech-to-text models and whether there are differences.
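In case it’s useful for comparing setups, this is roughly how those options map onto a Wyoming docker-compose file; just a sketch, since the image tags, ports and volume paths here are assumptions and may differ from your install (add-on users set the same model/voice options in the add-on configuration instead):
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model small --language en    # the STT options above
    ports:
      - "10300:10300"                       # default Wyoming port for whisper
    volumes:
      - ./whisper-data:/data
  piper:
    image: rhasspy/wyoming-piper
    command: --voice en-us-libritts-high    # the TTS voice above
    ports:
      - "10200:10200"                       # default Wyoming port for piper
    volumes:
      - ./piper-data:/data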
@catacluj as the question is a bit older already and I am looking for the same thing: Did you figure out a way to get the raw recording sent to HA from ESP32 to be recognized by whisper? I am quite certain it exists as a temporary file somewhere along the pipeline.
I similarly suffer from assistant sometimes recognizing slightly different words than what I was trying to speak.
This also makes me wonder whether the model is too generic. It does not need to be able to recognize thousands of words; just the words contained in the sentences plus the names of the entities would be good enough.
Maybe there is a way to fine-tune the model? The “small” model still takes 15 seconds to recognize a prompt on my x86 NAS, and with fewer words included in the model this might improve as well.
Before starting such experiments I would then want to have a collection of sample prompts as received by the ESP32.
Put this in your configuration.yaml:
assist_pipeline:
  debug_recording_dir: /share/assist_pipeline
It will put files in the share/assist_pipeline folder.
You'll want to turn this off after testing; it will take up space on your drive.
Thank you @Rich37804; did that and found my audio is quite good.
@stephankn that would be ideal; I don’t know how to do that. I did find that I can run Wyoming separately on a faster PC and it’s much quicker; even faster using a GPU (instant!), which might allow me to use a larger model, maybe with better results.
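For reference, the remote setup is just the same Wyoming Whisper service running on the other machine, roughly like this (the larger model name is an assumption, and as far as I know the stock image is CPU-only, so using the GPU needs a CUDA-enabled build):
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model medium-int8 --language en   # bigger model than on the NAS
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data
Then in Home Assistant you add the Wyoming Protocol integration pointing at that PC's IP and port 10300 and select it as the speech-to-text engine in the Assist pipeline.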
Anyway, this problem is sorted now; I’ll make another post about the next one