Yeah, I am constantly setting timers in the kitchen, so I would love to be able to use voice activation for this instead of getting flour/gravy/meat juice all over my Android screen.
Set a timer for 40 minutes for the dough
Set a timer for 20 minutes for the cake in the oven
Set a reminder for 5:55 to tell me the guests are arriving at 6.
Extend the dough timer, it needs another 10 minutes.
Thanks! It's funny, because the timer example is specifically what motivated some of the major changes in the upcoming version of Rhasspy. A timer grammar brings the current version to its knees during training because it has to explicitly generate all possible sentences. The new version finishes in milliseconds!
Have you taken a look at Snips NLU? It understands a number of standard entities, such as datetimes, durations, numbers, temperatures, and so on, based on Duckling. Snips NLU is quite efficient.
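Just to illustrate the idea: with Snips NLU the duration is resolved by the built-in snips/duration entity instead of being enumerated in a grammar, so a timer intent can look roughly like this (the dataset file and intent name here are only placeholders, not real Rhasspy code):

```python
import json
from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_EN

# A Snips NLU dataset that declares a SetTimer intent with a
# snips/duration slot (the file name is just an example).
with open("timer_dataset.json") as f:
    dataset = json.load(f)

engine = SnipsNLUEngine(config=CONFIG_EN)
engine.fit(dataset)

# "40 minutes" is parsed by the built-in (Duckling-style) duration entity,
# so no grammar ever has to generate all possible durations.
result = engine.parse("Set a timer for 40 minutes for the dough")
print(result["intent"]["intentName"], result["slots"])
```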
It's been a while since I've looked at it. Looks like it would be a good addition to the list of intent recognizers. It should be pretty close already to the Mycroft Adapt code.
I need to figure out how to get GitHub to e-mail me when a PR is opened. It sends me an e-mail every time I commit, but never bothers to let me know about PRs or Issues...
Thanks a lot for taking the time to dig around the Rhasspy code; hope it wasn't too painful! I plan to accept the PR, but I have a somewhat philosophical question: given Rhasspy's stance on being offline/private, do you think there should be a warning if Wavenet is enabled, or do you think it's obvious enough for users that an internet connection is required and that information will be sent to Google, etc.?
I also need to add some fallback logic like that snips script had, so it will use a different TTS system if Wavenet isn't available and the sentence isn't in cache.
The only thing is that I could (and have) only built and pushed the armhf Docker image in order to use it in the Hass.io add-on; I have a local copy which pulls from romkabouter/rhasspy-server on Docker.
That Docker image was pushed with your Makefile script; I made some local changes because not everything worked on macOS.
My local Rhasspy add-on works fine now; I will try to set up a demo video soon.
I think it is good to mention that when using Google Wavenet, your text goes to the cloud and you need an internet connection for that. But there are two important things to mention about that:
You only send Google exactly the text you want spoken.
The sentences are cached, so the second time that exact sentence needs to be spoken, the cached file is used instead.
Caching is done by MD5-hashing the filename wavenet-voice_gender_samplerate_language, and indeed a fallback system is a good idea.
I have added a fallback to eSpeak in case the system is offline and no cache file is available for the specific sentence.
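Roughly, the caching and fallback idea looks like this (a simplified sketch, not the actual PR code: wavenet_synthesize is a stand-in for the Google call, and the exact cache-key layout may differ):

```python
import hashlib
import os
import subprocess

CACHE_DIR = "tts_cache"  # would come from the profile setting

def cache_path(text, voice, gender, sample_rate, language):
    # Cache key: MD5 of the Wavenet parameters plus the text,
    # so the exact same sentence is only synthesized once.
    key = f"wavenet-{voice}_{gender}_{sample_rate}_{language}_{text}"
    name = hashlib.md5(key.encode("utf-8")).hexdigest() + ".wav"
    return os.path.join(CACHE_DIR, name)

def speak(text, **voice_params):
    path = cache_path(text, **voice_params)
    if os.path.exists(path):
        return path  # already cached: no internet needed

    try:
        # Hypothetical call to Google Wavenet; only the exact text is sent.
        wav = wavenet_synthesize(text, **voice_params)
        with open(path, "wb") as f:
            f.write(wav)
        return path
    except Exception:
        # Offline and not cached: fall back to local eSpeak.
        subprocess.run(["espeak", text], check=False)
        return None
```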
I'm waiting on this one for a specific reason, actually. I've made some modifications to your code to bring it in line with the newer version of Rhasspy, and I don't want anyone to have to change their profile once I publish the new version.
Specifically, I'm removing the RHASSPY_PROFILES environment variable in favor of a command-line option (--profile). I've taken out your environment variable for the TTS cache directory and replaced it with a profile setting in the JSON. This can be overridden on the command line via --set <NAME> <VALUE>, which overrides the profile setting <NAME> with <VALUE>, so you can still use an environment variable if you like.
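To sketch what that override scheme could look like (the argument names match the description above, but the profile path and the dotted-name handling are just illustrative):

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--profile", required=True, help="Name of the profile to load")
parser.add_argument("--set", nargs=2, action="append", default=[],
                    metavar=("NAME", "VALUE"),
                    help="Override a profile setting with a value")
args = parser.parse_args()

# Load the JSON profile, then apply command-line overrides on top.
with open(f"profiles/{args.profile}/profile.json") as f:
    profile = json.load(f)

for name, value in args.set:
    # Dotted names like text_to_speech.cache_dir descend into nested
    # dictionaries (illustrative, not necessarily Rhasspy's exact scheme).
    node = profile
    *parents, leaf = name.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
```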
All this is happening in a side branch, which should probably have been split into multiple feature branches long ago...
Hi @synesthesiam, I have been working on my Hermes Audio Server and it has been running for almost a week now without any problems as audio input and output for Rhasspy on another machine.
The only downside of this setup is that it's continuously streaming audio on the network, so I implemented an initial version of a filter to only stream audio when voice activity is detected. This is using the same py-webrtcvad as you are using in Rhasspy to listen for voice commands.
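The filtering itself is essentially this (a simplified sketch of the idea; frame handling and the MQTT publishing are left out):

```python
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2

vad = webrtcvad.Vad(1)       # aggressiveness 0-3

def filter_frames(frames):
    """Yield only the frames that contain voice activity."""
    for frame in frames:     # each frame is FRAME_BYTES of raw PCM
        if len(frame) == FRAME_BYTES and vad.is_speech(frame, SAMPLE_RATE):
            yield frame      # silence frames are dropped, never streamed
```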
But with this feature enabled (you can try it in the feat/vad branch of Hermes Audio Server), I'm running into an issue with Rhasspy. The wake word is detected perfectly, but after this, Rhasspy keeps listening for 30 seconds (timeout_sec in the WebRTCVAD settings) when I give a command. I suspect that, because Hermes Audio Server already filters out audio frames using VAD, it interacts with the VAD filter in Rhasspy's command listener in such a way that the latter doesn't detect the end of speech in the filtered audio frames.
I noticed that you also support the MQTT topics hermes/asr/startListening and hermes/asr/stopListening of the Hermes protocol as cues for the command listener to start and stop recording. But I can't publish these in the audio server, as this would start the ASR even before the wake word has been detected. The VAD in my audio server should not only work for the ASR component, but also for the wake word component.
Do you see a way to make this setup work with Rhasspy? Hermes Audio Server would only stream audio to Rhasspy when voice activity is detected, and Rhasspy should be able to use this filtered audio stream for both the wake word detection (which seems to be working nicely as it is now) and the command listener. I don't know what's currently preventing the latter from working. Do I just need to change a certain configuration option for the webrtcvad component in Rhasspy to make it work, does this need some modification in Rhasspy's code (possibly by listening to the hermes/voiceActivity/# topics), or will it work when I fine-tune the VAD in my audio recorder? Currently I'm just using the is_speech method of webrtcvad, and I'm hesitant to do much more pre-processing because that could prevent the wake word component from working.
Rhasspy's "command listener" uses WebRTCVAD to inform a small state machine about the beginning and end of a voice command. It needs both the speech and the silence audio frames to know when to transition, and I'm guessing your audio server is not sending the necessary silence frames after the command is spoken.
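In rough terms the state machine works like this (a simplified sketch, not the actual Rhasspy code; the real thresholds come from the webrtcvad command-listener settings such as timeout_sec):

```python
import webrtcvad

vad = webrtcvad.Vad(1)

def listen_for_command(frames, sample_rate=16000,
                       min_speech_frames=5, silence_frames_needed=15):
    """Collect one voice command: wait for speech to begin, then stop after
    enough consecutive silence frames. Needs BOTH speech and silence frames."""
    in_command = False
    speech_count = 0
    silence_count = 0
    command = bytearray()

    for frame in frames:
        speaking = vad.is_speech(frame, sample_rate)
        if not in_command:
            speech_count = speech_count + 1 if speaking else 0
            if speech_count >= min_speech_frames:
                in_command = True        # transition: command started
        else:
            command.extend(frame)
            silence_count = 0 if speaking else silence_count + 1
            if silence_count >= silence_frames_needed:
                return bytes(command)    # transition: command finished

    return bytes(command)                # stream ended (e.g. timeout)
```

If the audio server never forwards silence frames, silence_count never builds up, so the "command finished" transition can only happen via the timeout.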
I think the easiest way to integrate your audio server with Rhasspy would be to (1) send (filtered) audio like you are until the wake word is detected, then (2) send all (unfiltered) audio until the command is complete. For (1), I could have Rhasspy emit the hermes/hotword/<wakeId>/detected event when the wake word is detected. For (2), Rhasspy should already be putting out the hermes/asr/textCaptured event. This is technically a little later than necessary; I could put out some event right when the voice command is finished, but before decoding has started.
Although it's not documented, Rhasspy also emits events on rhasspy/<profile>/transition/# as each internal actor transitions between states. You could use these events to try stuff out before we nail down which Hermes events to map stuff to.
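On the audio-server side, toggling the filter off those events could look something like this (a sketch only; the topic names are the ones discussed above, and the broker host/port are placeholders):

```python
import paho.mqtt.client as mqtt

filtering_enabled = True  # start in filtered mode (wake word phase)

def on_message(client, userdata, msg):
    global filtering_enabled
    if msg.topic.startswith("hermes/hotword/") and msg.topic.endswith("/detected"):
        filtering_enabled = False   # wake word heard: stream everything
    elif msg.topic == "hermes/asr/textCaptured":
        filtering_enabled = True    # command done: back to filtered streaming

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe([("hermes/hotword/+/detected", 0),
                  ("hermes/asr/textCaptured", 0)])
client.loop_start()
```

The audio loop would then check filtering_enabled before deciding whether to drop silence frames or forward everything unfiltered.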
I added some basic tests in test.py to try and make sure I don't break too much. For each profile, I have a test.wav and a test.json file in etc/test/<profile>. I generated the WAV file with Google's text-to-speech and, for some reason, the Dutch one doesn't get transcribed right. It should be "zet de woonkamerlamp aan", but it comes back as "zet de woonkamerlamp uit".
Are you having any problems with transcriptions like this?
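For context, the per-profile check is essentially something like this (a minimal sketch, assuming a transcribe() helper that stands in for whatever the actual test harness calls):

```python
import json
import os

PROFILES_DIR = "etc/test"

def check_profile(profile, transcribe):
    """Compare the WAV transcription against the expected text in test.json.
    `transcribe` is a placeholder for the profile's actual STT call."""
    base = os.path.join(PROFILES_DIR, profile)
    with open(os.path.join(base, "test.json")) as f:
        expected = json.load(f)
    actual = transcribe(os.path.join(base, "test.wav"), profile)
    assert actual == expected["text"], f"{profile}: got {actual!r}"
```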