Rhasspy offline voice assistant toolkit

Thanks @farfade for the shout-out on the Snips forum. I’d welcome motivated developers coming to help with Rhasspy. I think we have a chance here to make something that can help a lot of people.

That’s a good question. I’m definitely motivated to maintain and enhance Rhasspy so it can reach as many people as possible, but I also want to make sure I’m not the only person keeping the project alive. If something happens to me, I want to ensure that someone out there can keep building and releasing versions of Rhasspy. Any thoughts on this are welcome.

It might. I need to understand the tech in snips-nlu better before I can say for sure. I also need to see what’s available when someone exports their Snips data. It should be possible to add snips-nlu as another intent recognition system in Rhasspy. If there’s some way to import your Snips training sentences too, it would save you the trouble of converting them to Rhasspy’s sentences.ini format.
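If that import path pans out, the conversion itself looks fairly mechanical. Here is a minimal Python sketch, assuming the console export uses the snips-nlu dataset JSON shape (intents → utterances → data chunks, with tagged chunks carrying a `slot_name`); the function name and the exact sentences.ini tagging are my own guesses, not anything that exists in Rhasspy today:

```python
import json

def snips_to_sentences_ini(dataset: dict) -> str:
    """Convert a Snips NLU dataset (console export JSON) into
    Rhasspy's sentences.ini format. Tagged chunks become
    (words){slot_name} substitutions."""
    lines = []
    for intent_name, intent in dataset.get("intents", {}).items():
        lines.append(f"[{intent_name}]")
        for utterance in intent.get("utterances", []):
            parts = []
            for chunk in utterance.get("data", []):
                text = chunk["text"].strip()
                if "slot_name" in chunk:
                    # Tag the chunk so Rhasspy emits it as a slot value
                    parts.append(f"({text}){{{chunk['slot_name']}}}")
                else:
                    parts.append(text)
            lines.append(" ".join(p for p in parts if p))
        lines.append("")
    return "\n".join(lines)

# Minimal example in the Snips dataset shape
dataset = {
    "intents": {
        "TurnOnLight": {
            "utterances": [
                {"data": [
                    {"text": "turn on the "},
                    {"text": "kitchen light", "entity": "light",
                     "slot_name": "name"},
                ]}
            ]
        }
    }
}
print(snips_to_sentences_ini(dataset))
# [TurnOnLight]
# turn on the (kitchen light){name}
```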

In the longer term, I’d like to break Rhasspy apart into functional pieces similar to Snips’s, so each piece could be worked on separately. Maybe we (the Rhasspy community) should fork the Hermes protocol and make it our own! I vote we rename it Zoidberg (yes, I know they weren’t talking about that Hermes) :wink:

As a non-dev, I don’t get a vote, but Zoidberg has my vote in any case.

I am in the process of this, but my time is rather limited these days.

As for the name of the Hermes protocol, I don’t really see the need to rename it. It’s not patented or anything :wink:

Have a look at my Snips apps. Yesterday I exported all my data from the Snips Console and put it in the repositories of my apps. For instance snips-app-what-is-happening/console/en at master · koenvervloesem/snips-app-what-is-happening · GitHub.

Have a look at Hermod. I don’t know if it’s still in development, but Steve has put a lot of thought into it.

If you want to build upon Hermes, I want to help, but I prefer to do it component by component, and currently the Rhasspy codebase is not broken apart enough for me to do this.

Moreover, we have to think about the differences between the MQTT API, REST API and other mechanisms: they shouldn’t diverge too much.

This is just the thing I was looking for, thanks! I especially like that the siteId is baked into the MQTT topic, so you don’t have to parse a message just to throw it away.

I was thinking we could do this in a way that makes sense for MQTT, Websockets, and HTTP.

  • For MQTT, everything would go as normal
  • For Websockets, there would be websocket endpoints in the Rhasspy web server for each topic that would send outgoing messages to the client and receive incoming messages
    • Something like /api/hermod/<siteId>/microphone/start
    • An alternative might be a single endpoint that you can subscribe to messages like in Home Assistant
    • Messages would be passed from MQTT to Websocket and back
  • For HTTP, there could be endpoints for relevant message pairs or incoming messages
    • So POST-ing to /api/hermod/<siteId>/nlu/parse will return the JSON payload from hermod/<siteId>/nlu/intent (or fail)
    • POST-ing to something like /api/hermod/<siteId>/microphone/start will inject the message and return immediately
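To make the message pairing above concrete, here is a small Python sketch of the routing-table idea. The topic names follow the examples in this post; `MESSAGE_PAIRS` and `http_route` are hypothetical names, a sketch of the design rather than anything implemented in Rhasspy today:

```python
# Request topics that have a matching reply topic (a POST would wait
# for the reply), vs. fire-and-forget topics (a POST would inject the
# message and return immediately). The exact set is an assumption.
MESSAGE_PAIRS = {
    "nlu/parse": "nlu/intent",   # reply carries the recognized intent
    "microphone/start": None,    # inject and return immediately
}

def http_route(site_id: str, suffix: str):
    """Return (http_endpoint, reply_topic_or_None) for a message type.

    The same table could drive the Websocket endpoints, keeping the
    MQTT, Websocket, and HTTP APIs from diverging."""
    if suffix not in MESSAGE_PAIRS:
        raise KeyError(f"unknown message type: {suffix}")
    endpoint = f"/api/hermod/{site_id}/{suffix}"
    reply = MESSAGE_PAIRS[suffix]
    reply_topic = f"hermod/{site_id}/{reply}" if reply else None
    return endpoint, reply_topic

print(http_route("default", "nlu/parse"))
# ('/api/hermod/default/nlu/parse', 'hermod/default/nlu/intent')
```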

Thoughts?

I was mostly kidding, but I still want to name one thing in my life after Zoidberg :wink: My wife said no to our son, so…

Thank you @thinker for the tips. I’ll try Rhasspy as soon as I’ve got spare time :slight_smile:

One other question for you and @synesthesiam: with Snips, I had one master server doing all the processing (NLU, etc.) of audio sent by two remote Raspberry Pis on my LAN running snips-satellite (hotword detection and audio input). I did that because processing on the Raspberry Pis with Snips was too slow for me, and because I think this architecture helps with identifying sessions (when two mics hear the same voice, since my house is not so large). The satellites and the main processing unit were integrated over MQTT/Hermes.

Does Rhasspy support such an architecture? Or how do you efficiently manage two or more remote mics with Rhasspy?

Came here to ask this - managed to find this:

I think I misunderstood my problem. I am French, and I thought the open-source part of Snips would improve speech-to-text and intent recognition for French. It’s not entirely clear to me, but it seems that French recognition is a speech-to-text problem that Snips solved with a privately held model built on top of Kaldi.

The best free Kaldi ASR models I’ve come across are from zamia-speech. I use their TDNN English and German models in Rhasspy. They don’t have a pre-trained French model yet, but I see they have a few hundred hours of French speech data from various corpora.

If someone could try to get the zamia scripts running and generate a French Kaldi model, I could easily add it to Rhasspy.

Maybe call your wife Zoidberg :rofl:

This guy gets it :smiley:

It works perfectly after your update. Impressively fast support, thank you! :+1:
To focus more on Rhasspy, I’m working with Docker again. I was able to optimize the speed a bit :slight_smile: At a later stage, when Rhasspy’s modularization is further advanced, small units running as services on the Raspberry Pi would of course be great.

Good news. Even though many Rhasspy installations will probably be used for home control, I think the ARM (Raspberry Pi) branch is important for future applications like robots, small mobile systems, etc., and should therefore be maintained in any case.

Not (yet) a finished solution, but thanks to the MQTT support, which can also stream audio, such a solution is already feasible with some effort.

I found a Zamia Speech contributor named Paul Guyot who built an nnet3 French Kaldi model here

This is what I am currently using for my tests to replace Snips ASR with Kaldi.

Hope this helps :blush:

You are correct. The open-source NLU library from Snips supports many languages, but the really hard part is the ASR that transcribes speech to text (for English there are lots of good models already available, but for French they are not there yet). Snips indeed solved this with an in-house trained nnet3 model for Kaldi that they customized for every assistant. Their model seems pretty small for the performance it provides (surely that’s partly why Sonos bought them). I’m not sure open solutions can get on par with it… This is what I’m currently testing.

This is perfect, thank you! I’ve just pushed a new version of Rhasspy with the French Kaldi profile. Big thanks to @fastjacksprt and pguyot.

What do you think of this for slots:
https://kaldi-asr.org/doc/grammar.html

Looks a lot like what Snips is doing for their language model entities.
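For reference, slots in Rhasspy’s sentences.ini currently look roughly like this (syntax from memory, worth double-checking against the docs; `$colors` would be filled from a slot file and `{color}` tags the value in the intent JSON):

```ini
[ChangeLightColor]
light_name = (bedroom | kitchen) {name}
set the <light_name> to ($colors){color}
```

With the grammar approach, a reference like `$colors` could in principle compile to a sub-FST that is swapped in at decode time instead of being flattened into the language model.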

I’ve tested the new French Kaldi model, and for now it is performing really well.
I’ll add more intents to see how it behaves with a more complex LM, and test it from a distance with a far-field mic.

Recognition speed is slow compared to Pocketsphinx, though. I’m sure this can be improved somehow, e.g. by using a single persistent Kaldi instance via py-kaldi-asr instead of executing a CLI tool that has to load the AM+LM on every request (perhaps behind a remote HTTP server?).

Pushing chunks of WAV audio to the ASR (like Snips does/did) will also speed up the process and avoid POSTing a complete WAV file to the remote server.
For example:
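Here is a sketch of that chunking using only the Python standard library. The frame size and sample rate are arbitrary choices for illustration; in practice each chunk would be published somewhere (e.g. to an MQTT audio-frame topic) rather than printed:

```python
import io
import wave

def wav_chunks(wav_bytes: bytes, frames_per_chunk: int = 1024):
    """Yield raw PCM chunks from a WAV file, roughly how audio could
    be streamed to the ASR instead of POSTing one big file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        while True:
            frames = wav.readframes(frames_per_chunk)
            if not frames:
                break
            yield frames

# Build one second of silence at 16 kHz mono, 16-bit, then chunk it
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

chunks = list(wav_chunks(buf.getvalue()))
print(len(chunks))  # 16000 frames / 1024 per chunk -> 16 chunks
```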

If the Kaldi GrammarFst approach mentioned above works, it will allow an even lighter language model for Kaldi, reducing the memory footprint for large slots (artists, songs, cities, etc.)…

Allowing rule-based slots should also help for special slots like date/time, temperature, number, ordinal, etc.

Rhasspy is beginning to look really sexy as a Snips alternative :slight_smile:

Glad to hear the model is working! I’ve already got a start on making Kaldi recognition faster, actually based on the Zamia library you linked :slight_smile: This is currently implemented in Rhasspy’s “sister” project, voice2json (warning: very much in beta). Recognition with the nnet3 Python extension is about 3x faster than the shell script.

I’m very interested in the GrammarFst approach. I’m thinking it would be best to try and fold this into rhasspy-nlu. Perhaps we could include pre-generated FSTs for different slot types in the language-specific profile downloads?