Alternate TTS renderer

Just wanted to pick the brains of some people far more capable than me for advice on a new project that I’m playing with:

I recently started experimenting with Voice Assist on my Android phone using the companion app, in advance of moving over to Voice Assist PE units at some point (depending on future developments regarding wake-words, etc.). I’ve got Whisper and Piper running in Docker containers on a mini-PC and I’m loving it so far – response times are tolerable and the TTS sounds great. My only issue with my current setup is that I’m not completely satisfied with having the responses broadcast over my phone speaker.

So the other day I finally found a temporary solution until I find something more robust – I’ve installed Kodi on my PC and enabled DLNA, then set Kodi to display on a non-existent desktop tied to the Sony receiver connected to that PC. In my automation I set the conversation response to " ", generate the TTS with Piper and send it to Kodi to be played over my HTPC receiver.
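In case it helps anyone replicate it, the automation looks roughly like this – just a sketch, with placeholder entity IDs (tts.piper, media_player.kodi_htpc, sensor.den_temperature) that you’d swap for your own:

```yaml
alias: "Assist reply via Kodi instead of the phone"
trigger:
  # Custom sentence handled by this automation (placeholder wording)
  - platform: conversation
    command: "what's the temperature in the den"
action:
  # Blank spoken response so the phone stays silent
  - set_conversation_response: " "
  # Have Piper render the reply and push it to the Kodi media player
  - service: tts.speak
    target:
      entity_id: tts.piper
    data:
      media_player_entity_id: media_player.kodi_htpc
      message: "The den is {{ states('sensor.den_temperature') }} degrees."
mode: single
```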

Works like a charm, but it’s a little clunky and inconvenient. Does anyone know of a way to direct Assist to use a device specified in a variable (perhaps a helper), so that I can then automate responses being sent to entities based on occupancy? For example, if Den occupancy is true and Loft occupancy is false, send the TTS to Kodi_Den, that sort of thing. If I could bunch this stuff together with a couple of helpers and scripts, that would be perfect.
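To be clearer about what I’m imagining, something along these lines – purely a sketch with made-up entity IDs, assuming the tts.speak service and occupancy binary sensors:

```yaml
script:
  speak_where_i_am:
    alias: "Speak on whichever player matches occupancy"
    fields:
      message:
        description: "Text to speak"
    sequence:
      # Pick the target media player from the occupancy sensors
      - variables:
          target_player: >
            {% if is_state('binary_sensor.den_occupancy', 'on') %}
              media_player.kodi_den
            {% elif is_state('binary_sensor.loft_occupancy', 'on') %}
              media_player.kodi_loft
            {% else %}
              media_player.my_phone
            {% endif %}
      # Render with Piper and play on the chosen device
      - service: tts.speak
        target:
          entity_id: tts.piper
        data:
          media_player_entity_id: "{{ target_player | trim }}"
          message: "{{ message }}"
```

Any automation could then call script.speak_where_i_am with a message and it would land on whichever player matches the occupancy.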
Is this possible, or am I being too idealistic again? TIA

I’ve been looking for something like this for a while - it seems to be a yawning gap in the whole Assist ecosystem.

You can do it with custom sentences and intents, obviously, but the whole voice assistant world seems to be focused on the Amazon Alexa model, with microphone and speaker in the same voice client device. For them it was a commercial decision, but it’s blocked off a lot of alternative development paths.


If I understand you correctly, you can create a copy of your Assist pipeline, but with a different TTS output.
Then you can create an automation that changes the pipeline depending on the conditions.
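For what it’s worth, on satellites that expose their active pipeline as a select entity (ESPHome-based satellites do – I’m not sure the companion app exposes anything similar), the switch from an automation could look roughly like this; the entity ID and pipeline name below are just placeholders:

```yaml
alias: "Use the Kodi-output pipeline when the den is occupied"
trigger:
  - platform: state
    entity_id: binary_sensor.den_occupancy
    to: "on"
action:
  # Change the active Assist pipeline on the satellite
  - service: select.select_option
    target:
      entity_id: select.den_satellite_assistant
    data:
      option: "Kodi pipeline"
mode: single
```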

Sounds interesting, but I haven’t been able to find anything outlining how to change these pipeline settings that you’re talking about.

All I’ve dealt with so far involves specifying the TTS engine/voice to use when defining Voice Assistants in settings and not the “renderer” that will produce the actual audio. Besides, that wouldn’t work either since it needs to be dynamic – I won’t hear a response being played in the loft if I’m in the basement, obviously.

It seems to me that an ideal solution would be a helper that lets you set your desired “responder” to: 1) the origin of the query/command (i.e. the Voice Assist device that heard the wake word/command), 2) any one of your other “satellites” (including the Android companion app), or 3) any of your DLNA devices that you can broadcast TTS to. That way it could change based on occupancy rather than just the listening device.
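As a stop-gap with what exists today, I’m picturing an input_select helper that occupancy automations keep updated – again just a sketch, every name below is made up:

```yaml
# The helper could also be created in the UI instead of YAML
input_select:
  tts_responder:
    name: TTS responder
    options:
      - kodi_den
      - kodi_loft
      - phone

automation:
  - alias: "Point the responder at the den when it becomes occupied"
    trigger:
      - platform: state
        entity_id: binary_sensor.den_occupancy
        to: "on"
    action:
      - service: input_select.select_option
        target:
          entity_id: input_select.tts_responder
        data:
          option: kodi_den
```

A speak script like the one I sketched earlier could then look up states('input_select.tts_responder') to decide which media_player to hand to tts.speak, instead of reading the occupancy sensors directly.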

Now, to go off on a little tangent – I’m sure there are people thinking, “Well, why would Nabu Casa want to shoot themselves in the foot like that and reduce reliance on their Voice Assistant hardware?” But here’s the thing – the focus of these devices should be voice recognition and control. The speaker is just there for convenience, but let’s face it, there will (in all probability) never be audiophile Voice Assist devices – nor should there be. That would be a distraction from where their effort is better focused: building a device that will hear us properly, even with ambient noise and/or music playing. We can’t expect Nabu Casa to be masters at everything, so leave the actual audio rendering to products that specialize in that department.
If I could pepper these things around my house as I’ve done with these %*(&ing Echo pieces of junk, I’d be very happy to buy as many as I need for full coverage and discard the others, especially considering the privacy aspect. Who cares if I have open mics throughout my entire home if my own server is the only one listening? Anyway, just my thoughts on why they may be hesitating on implementing that level of flexibility and my reasoning on why they should. :man_shrugging:

I found another option, in case anyone finds this post while trying to come up with workarounds –
If you install the fork of HASS.Agent, you can send speech to the built-in multimedia player on your PC instead of installing Kodi. It wasn’t working correctly until I discovered the fork version.
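For reference, once the HASS.Agent media player entity shows up in Home Assistant, you can point TTS straight at it – the entity IDs below (media_player.office_pc, tts.piper) are just examples:

```yaml
# Action/service call to speak through the PC's own audio output
service: tts.speak
target:
  entity_id: tts.piper
data:
  media_player_entity_id: media_player.office_pc
  message: "Playing through the PC speakers via HASS.Agent."
```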