xTTS v2 with Home Assistant

I’m wondering if anyone has been able to leverage the XTTS v2 model from coqui-tts… I was finally able to get tts-server working with a fine-tuned model, and fixed some issues that popped up with the MaryTTS endpoint.

It seems like the better route would be to stream the audio in chunks, though (it takes a couple of seconds to generate the full audio), and that is outside of my abilities.

Has anyone been able to run a fine tuned model with audio streaming?

I’ve just run it; it required minor changes to the server.py file. But it exposes a GET /api/tts?text=Blabla API, which I don’t know how to connect to HA yet :slight_smile:
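For anyone who wants to poke at that endpoint before wiring it into HA, here's a minimal sketch (the host, port, and output filename are my assumptions, not something from the thread — tts-server's default port is 5002):

```python
from urllib.parse import quote
from urllib.request import urlopen

def build_tts_url(base: str, text: str) -> str:
    """Build the GET /api/tts URL that tts-server exposes."""
    return f"{base}/api/tts?text={quote(text)}"

# Assumes tts-server is running locally on its default port:
url = build_tts_url("http://localhost:5002", "Hello from Home Assistant")
# with urlopen(url) as resp:
#     open("out.wav", "wb").write(resp.read())  # save the returned WAV
```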

You can use MaryTTS to connect it.
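For reference, the HA side would look something like this in configuration.yaml (the host and port are placeholders — point them at wherever tts-server is running, not at a real MaryTTS install):

```yaml
# configuration.yaml — point HA's MaryTTS integration at
# tts-server's MaryTTS-compatible endpoint
tts:
  - platform: marytts
    host: 192.168.1.50   # machine running tts-server (placeholder)
    port: 5002           # tts-server's default port
```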

Are you using a fine tuned model?

Oh! Thanks, you saved me research time. Mine isn’t fine-tuned yet, but the setup is no different from a tuned one.

It seems I had to modify two files:

  1. TTS/utils/synthesizer.py

In line 183 I brutally hard-coded the config file for xtts_v2:

        #self.tts_config = load_config(tts_config_path)
        self.tts_config = load_config("/home/tts/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/config.json")
  2. TTS/server/server.py

In line 203 I brutally put the parameters I wanted:

        wavs = synthesizer.tts(text, speaker_name="Gitta Nikolina", language_name="en", style_wav=style_wav)

Here you can simply modify it to include your own improvements.

Voila!

Here’s some more info too

I must say that quality of that TTS is astonishing!

Agreed!
Let me know if you try a fine-tuned model. For some reason on my side, the spoken words seem a little slow…

Have you verified CUDA is turned on for the server?

Yeah, it’s running on the GPU. The actual inference is fine; it’s just the pace at which words are spoken that seems slow.

The Coqui TTS docs mention a speed parameter, but I have no idea how to use it.

EDIT: it would be way better to do chunked audio streaming, but I have no idea how to do that. I believe xtts-streaming-api allows for that, but I don’t think it is supported in HA.
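For what it’s worth, the client side of chunked streaming is the easy part. A sketch of consuming an HTTP response chunk by chunk as it arrives (the URL and chunk size are illustrative, and this assumes a server that actually streams its output):

```python
from urllib.request import urlopen

def stream_audio(url: str, chunk_size: int = 4096):
    """Yield audio chunks as they arrive instead of waiting for the full file."""
    with urlopen(url) as resp:
        while chunk := resp.read(chunk_size):
            yield chunk

# Each chunk could then be fed to a player as it arrives, e.g.:
# for chunk in stream_audio("http://localhost:8000/tts_stream?text=hello"):
#     player.feed(chunk)
```

The hard part is the server side: generating and flushing audio incrementally during inference, which is exactly what xtts-streaming-server does and plain tts-server doesn’t.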

Warning: XTTS-streaming-server doesn't support concurrent streaming requests, it's a demo server, not meant for production.

So it’s a big risk for HA where parallel TTS can happen frequently. We need to wait until there is a better server.
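One stopgap (my own sketch, not anything from the thread) is to serialize TTS calls on the client side so the demo server never sees two requests at once:

```python
import threading

_tts_lock = threading.Lock()  # only one TTS request in flight at a time

def synth_serialized(synth_fn, text: str):
    """Run a TTS call under a global lock so parallel requests queue up."""
    with _tts_lock:
        return synth_fn(text)
```

Queued requests add latency, but they won’t step on each other’s streams.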

but we have this!

Yup, I saw that one too; it just doesn’t appear to have a MaryTTS-compatible endpoint, so some integration would be required on the HA side.

OK, I figured out inference speed and made changes to xtts.py.

EDIT:
I was wrong; I’ve been having a really hard time trying to get this to pull “speed” from the model’s config.json… I’m not a programmer and am having difficulty setting a global variable in the TTS.config module.
For whatever reason, the value I’m trying to set, which should be the path of config.json, isn’t accessible in xtts.py.

Finally got it figured out.
It required changes to the config module, server.py, and xtts.py.

But now, at least, it looks for the speed variable in the config.json of the currently loaded model.
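The gist of that lookup is something like this (a simplified sketch — the real changes span the config module, server.py, and xtts.py, and the “speed” key is the one added here, not a stock XTTS config field):

```python
import json

def load_speed(config_path: str, default: float = 1.0) -> float:
    """Read the optional "speed" value from the loaded model's config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    return float(cfg.get("speed", default))

# speed = load_speed(path_to_model_config)  # then pass it through to inference
```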

EDIT:
modified links. I accidentally linked the original files and not the modified ones.

It works perfectly for me; I am still shocked at how good this TTS is.

Try running a fine-tuned model!
I trained via Google Colab and the result is great!


I saw a YouTube video with MaryTTS and it was amazing. It isn’t possible to run all of that on a Raspberry Pi 4 with Home Assistant, is it?

I started researching it that day, but saw on the Coqui homepage that it is discontinued, so I am surprised you guys are still running it.

I’m running all of the coqui stuff (xtts v2 model) locally :slight_smile:

The main reason I’m using the coqui tts-server is that it’s the only one I found with a MaryTTS-compatible endpoint, so it works with Home Assistant.

What hardware are you running on? Is it too demanding for a Raspberry Pi with Hassio on it?