Rhasspy offline voice assistant toolkit

Yeah, I am constantly setting timers in the kitchen, so I would love to be able to use voice activation for this instead of getting flour/gravy/meat juice all over my Android screen.

Set a timer for 40 minutes for the dough

Set a timer for 20 minutes for the cake in the oven

Set a reminder for 5:55 to tell me the guests are arriving at 6.

Extend the dough timer, it needs another 10 minutes.

Nice, but there is something similar already :slight_smile:

Snips is already able to wake up Rhasspy.
Set Rhasspy to: Wake up on MQTT message

When a hotword comes from Snips, it activates Rhasspy as well.
If you do not have the addon, you can simply start only the hotword service :slight_smile:

I had a look at OpenSnips, but running an 11 GByte Docker container just to have a remote audio server on my Pi for Rhasspy seemed a bit overkill :wink:

That said, the project is looking awesome, so I’m watching it and maybe there could be some cross-pollination between OpenSnips and Rhasspy.

True, but you can also run just the Snips Audio Server. The services are all separate services :slight_smile:

I have opened two PRs by the way, @synesthesiam, to be able to use Google Wavenet as TTS.
Works perfectly for me locally :slight_smile:

Thanks! It’s funny, because the timer example is specifically what motivated some of the major changes in the upcoming version of Rhasspy. A timer grammar brings the current version to its knees during training because it has to explicitly generate all possible sentences. The new version finishes in milliseconds!

Have you taken a look at Snips NLU? It understands a couple of standard entities, such as datetimes, durations, numbers, temperatures, and so on, based on Duckling. Snips NLU is quite efficient.
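
For anyone curious, a minimal sketch of how the snips-nlu Python library is used, going by its documentation; the timer_dataset.json file here is a hypothetical training dataset in the Snips NLU format, not something that ships with Rhasspy.

```python
import io
import json

from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_EN

# Load a training dataset in the Snips NLU JSON format
# (e.g. one describing a hypothetical "SetTimer" intent).
with io.open("timer_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

engine = SnipsNLUEngine(config=CONFIG_EN)
engine.fit(dataset)

# Builtin entities such as snips/duration are resolved without having
# to enumerate every possible sentence up front.
result = engine.parse("set a timer for 40 minutes for the dough")
print(json.dumps(result, indent=2))
```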

It’s been a while since I’ve looked at it. Looks like it would be a good addition to the list of intent recognizers. It should be pretty close already to the Mycroft Adapt code.

I need to figure out how to get GitHub to e-mail me when a PR is opened. It sends me an e-mail every time I commit, but never bothers to let me know about PRs or Issues…

Thanks a lot for taking the time to dig around the Rhasspy code; hope it wasn’t too painful! I plan to accept the PR, but I have a somewhat philosophical question: given Rhasspy’s stance on being offline/private, do you think there should be a warning if Wavenet is enabled, or do you think it’s obvious enough for users that an internet connection is required and that information will be sent to Google, etc.?

I also need to add some fallback logic like that Snips script had, so it will use a different TTS system if Wavenet isn’t available and the sentence isn’t in the cache.

No, it was ok :slight_smile:

The only thing is that I could (and have) only built and pushed the armhf Docker image in order to use it in the hassio addon; I have a local copy which pulls from romkabouter/rhasspy-server on Docker :slight_smile:
That Docker image was pushed by your Makefile script, with some local changes because not everything worked on my macOS.
My local Rhasspy addon works fine now, and I will try to set up a demo video soon.

I think it is good to mention that when using Google Wavenet, your text will go to the cloud and you need an internet connection for that. But there are two important things to mention about that:

  1. You only send Google exactly the text you want spoken.
  2. The sentences are cached, so the second time that exact sentence needs to be spoken, the cached file is used instead.

Caching is done by MD5-hashing the filename wavenet-voice_gender_samplerate_language, and indeed a fallback system is a good idea.
I have added a fallback to eSpeak for when the system is offline and no cache file is available for the specific sentence.
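
For reference, this is roughly how such a cache-plus-fallback can look. It is only a sketch of the idea, not literally the code from the PR: `synthesize_with_wavenet` is a made-up callable, and the exact cache key scheme may differ.

```python
import hashlib
import os
import subprocess

CACHE_DIR = "tts_cache"  # hypothetical cache directory


def cache_path(text, voice, gender, samplerate, language):
    # Hypothetical naming scheme: MD5 over the text plus the voice
    # parameters, so the same sentence with the same voice always maps
    # to the same cached WAV file.
    key = f"{text}-{voice}_{gender}_{samplerate}_{language}"
    name = hashlib.md5(key.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, f"{name}.wav")


def speak(text, synthesize_with_wavenet, **voice):
    path = cache_path(text, **voice)
    if os.path.exists(path):
        return path  # second time around: serve straight from cache

    try:
        wav_bytes = synthesize_with_wavenet(text, **voice)  # needs internet
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(wav_bytes)
        return path
    except Exception:
        # Offline and not cached: fall back to local eSpeak.
        subprocess.run(["espeak", text], check=False)
        return None
```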

It’s in GitHub’s notification settings under ‘Email notification preferences’.

Thanks for accepting the PR @synesthesiam. The PR from https://github.com/synesthesiam/hassio-addons/pulls also needs to be in there, otherwise it won’t work :slight_smile:

I’m waiting on this one for a specific reason, actually. I’ve made some modifications to your code to bring it in line with the newer version of Rhasspy, and I don’t want anyone to have to change their profile once I publish the new version.

Specifically, I’m removing the RHASSPY_PROFILES environment variable in favor of a command-line option (--profile). I’ve taken out your environment variable for the TTS cache directory and replaced it with a profile setting in the JSON. This can be overridden on the command line via --set <NAME> <VALUE>, which overrides the profile setting <NAME> with <VALUE>, so you can still use an environment variable if you like.
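
To make the idea concrete, here is a rough sketch (not Rhasspy’s actual code) of how --profile plus repeated --set NAME VALUE options could be layered on top of a JSON profile; the profiles/<name>/profile.json path and the dotted-name syntax are assumptions for illustration only.

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--profile", required=True)
parser.add_argument("--set", nargs=2, action="append", default=[],
                    metavar=("NAME", "VALUE"))
args = parser.parse_args()

# Assumed profile location for this sketch.
with open(f"profiles/{args.profile}/profile.json") as f:
    profile = json.load(f)

for name, value in args.set:
    # Support dotted names like "text_to_speech.cache_dir" (assumed syntax).
    node = profile
    *parents, leaf = name.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value

print(json.dumps(profile, indent=2))
```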

All this is happening in a side branch, which should probably have been split into multiple feature branches long ago…

great :slight_smile:

Hi @synesthesiam, I have been working on my Hermes Audio Server, and it has been running for almost a week now without any problems as audio input and output for Rhasspy on another machine.

The only downside of this setup is that it’s continuously streaming audio on the network, so I implemented an initial version of a filter to only stream audio when voice activity is detected. This is using the same py-webrtcvad as you are using in Rhasspy to listen for voice commands.
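
For context, the filtering is essentially this (a simplified sketch; `publish` stands in for whatever sends a frame over MQTT and is not a real function name):

```python
import webrtcvad

SAMPLE_RATE = 16000          # py-webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(1)       # aggressiveness 0 (least) .. 3 (most)


def filter_frames(frames, publish):
    """Only hand frames to `publish` when they contain speech."""
    for frame in frames:
        if len(frame) == FRAME_BYTES and vad.is_speech(frame, SAMPLE_RATE):
            publish(frame)
```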

But with this feature enabled (you can try it in the feat/vad branch of Hermes Audio Server), I’m running into an issue with Rhasspy. The wake word is detected perfectly, but after this, Rhasspy keeps listening for 30 seconds (timeout_sec in the WebRTCVAD settings) when I give a command. I suspect that, because Hermes Audio Server is already filtering out audio frames using VAD, it interacts with the VAD filter in Rhasspy’s command listener in such a way that the latter doesn’t detect the end of speech in the filtered audio frames.

I noticed that you also support the MQTT topics hermes/asr/startListening and hermes/asr/stopListening of the Hermes protocol as cues for the command listener to start and stop recording. But I can’t publish these in the audio server, as this would start the ASR even before the wake word has been detected. The VAD in my audio server should not only work for the ASR component, but also for the wake word component.

The Hermes protocol also supports the hermes/voiceActivity/<siteid>/vadUp and hermes/voiceActivity/<siteid>/vadDown messages, which are published when the Snips hotword detector detects the start or end of voice audio (if the user enables this). I have implemented publication of these messages by my audio recorder.
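
Publishing these looks roughly like the snippet below with paho-mqtt; the topic names come from the Hermes protocol, but the JSON payload (just the siteId) and the example site id are assumptions on my part.

```python
import json

import paho.mqtt.client as mqtt

SITE_ID = "kitchen"  # example site id

client = mqtt.Client()
client.connect("localhost", 1883)


def publish_vad(voice_detected):
    # hermes/voiceActivity/<siteid>/vadUp or .../vadDown
    topic = f"hermes/voiceActivity/{SITE_ID}/" + (
        "vadUp" if voice_detected else "vadDown")
    client.publish(topic, json.dumps({"siteId": SITE_ID}))
```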

Do you see a way to let this setup work with Rhasspy? Hermes Audio Server would only stream audio to Rhasspy when voice activity is detected, and Rhasspy should be able to use this filtered audio stream for both the wake word detection (which seems to be working nice as it is now) and the command listener. I don’t know what’s currently preventing the latter to work. Do I just need to change a certain configuration option for the webrtcvad component in Rhasspy to make it work, does this need some modification in Rhasspy’s code (possibly by listening to the hermes/voiceActivity/# topics), or will it work when I fine-tune the VAD in my audio recorder? Currently I’m just using the is_speech method of webrtcvad, and I’m hesitant to do much more pre-processing because that could prevent the wake word component from working.

Rhasspy’s “command listener” uses WebRTCVAD to inform a small state machine about the beginning and end of a voice command. It needs both the speech and the silence audio frames to know when to transition, and I’m guessing your audio server is not sending the necessary silence frames after the command is spoken.
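
To illustrate why the silence frames matter, here is a toy version of such a state machine (not Rhasspy’s actual implementation); it can only ever report the end of a command if silence frames reach it after the speech.

```python
import webrtcvad


class CommandListener:
    """Toy state machine: wait for speech, then stop after enough silence."""

    def __init__(self, sample_rate=16000, frame_ms=30, silence_after=1.0):
        self.vad = webrtcvad.Vad(2)
        self.sample_rate = sample_rate
        self.frames_needed = int(silence_after * 1000 / frame_ms)
        self.in_command = False
        self.silent_frames = 0

    def process(self, frame):
        speech = self.vad.is_speech(frame, self.sample_rate)
        if not self.in_command:
            if speech:
                self.in_command = True        # command started
        else:
            self.silent_frames = 0 if speech else self.silent_frames + 1
            if self.silent_frames >= self.frames_needed:
                return "command finished"     # only reachable with silence frames
        return None
```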

I think the easiest way to integrate your audio server with Rhasspy would be to (1) send (filtered) audio like you are until the wake word is detected, then (2) send all (unfiltered) audio until the command is complete. For (1), I could have Rhasspy emit the hermes/hotword/<wakeId>/detected event when the wake word is detected. For (2), Rhasspy should already be putting out the hermes/asr/textCaptured event. This is technically a little later than necessary; I could put out some event right when the voice command is finished, but before decoding has started.
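
As a rough sketch of (1) and (2) on the audio-server side, something like this paho-mqtt listener could toggle the filtering; the broker address is a placeholder, and the exact topic for the end-of-command event is still up for discussion as noted above.

```python
import paho.mqtt.client as mqtt

# Stream filtered audio until the wake word is detected, then stream
# everything until the command has been captured.
send_unfiltered = False


def on_message(client, userdata, msg):
    global send_unfiltered
    if msg.topic.startswith("hermes/hotword/") and msg.topic.endswith("/detected"):
        send_unfiltered = True           # wake word heard: stop filtering
    elif msg.topic == "hermes/asr/textCaptured":
        send_unfiltered = False          # command done: filter again


client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("hermes/hotword/+/detected")
client.subscribe("hermes/asr/textCaptured")
client.loop_forever()
```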

Although it’s not documented, Rhasspy also emits events on rhasspy/<profile>/transition/# as each internal actor transitions between states. You could use these events to try stuff out before we nail down which Hermes events to map stuff to :slight_smile:

Awesome, that’s exactly the information I needed! I’ll try this soon.

By the way, I see “Dutch tests not passing yet” in one of your recent commits. How can I help?

Thanks for the offer :slight_smile:

I added some basic tests in test.py to try and make sure I don’t break too much. For each profile, I have a test.wav and a test.json file in etc/test/<profile>. I generated the WAV file with Google’s text to speech and, for some reason, the Dutch one doesn’t get transcribed right. It should be “zet de woonkamerlamp aan”, but it comes back as “zet de woonkamerlamp uit”.

Are you having any problems with transcriptions like this?

The WAV file contains the correct sentence; let’s just say that “aan” has a little pitch bend.