Setting up Coqui's tts-server to use voice cloning via a speaker.wav file, then using it as TTS in HA via marytts

zminer123 · March 31, 2024, 4:40am

Hello all,
Please forgive me if my topic is ignorant or poorly worded; I am not used to posting on forums (I’m usually a lurker), but I’ve finally hit something specific enough that I haven’t been able to find an answer by searching. I am running Coqui’s tts-server on my Linux-based computer, and would like to use its marytts API to provide TTS support to HA. I have an audio file that I would like to use with the xtts_v2 model in order to “clone” someone’s voice. I am able to run tts-server with other models, but not this one. With those other models, everything works as expected (including the marytts integration) and I can use any of the premade voice models for tts-server. That is already a huge thing, but I just can’t figure out how to run the server so that I can specify my “clone voice” wav file and have that be the voice the server uses. I have read several posts by other users, including a few who (seemingly) got it to work, but I couldn’t follow their workflow well enough to make it work myself.
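
For reference, the cloning step itself can be sketched outside the server with Coqui's Python API (`TTS.api.TTS` and its `tts_to_file` method, called with a `speaker_wav` argument). This is only a minimal sketch of what I am trying to get tts-server to do, not a server configuration: the helper name `clone_to_file`, the file paths, and the `"en"` language code are placeholders I made up for illustration.

```python
from pathlib import Path

# Model name as listed by Coqui for XTTS v2.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_to_file(text, speaker_wav, out_path="output.wav", language="en"):
    """Synthesize `text` in the voice of `speaker_wav` (hypothetical helper)."""
    speaker = Path(speaker_wav)
    # Fail early, before the (large) model is loaded, if the reference clip is missing.
    if not speaker.is_file():
        raise FileNotFoundError(f"speaker reference not found: {speaker}")

    # Imported lazily so the check above works even where TTS is not installed.
    from TTS.api import TTS

    tts = TTS(XTTS_MODEL)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(speaker),
        language=language,
        file_path=out_path,
    )
    return out_path
```

Calling `clone_to_file("Hello from Home Assistant", "speaker.wav")` should write `output.wav` in the cloned voice, assuming the TTS package and the model download are available. What I cannot work out is how to get the same `speaker_wav` behaviour from the running server.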

Would anyone be able to walk me through either modifying the tts-server configuration to set the voice clone wav file, or suggest an alternative (free and open source) method to achieve the same goal?

Alternatively, I am able to run the project “alltalk” (GitHub - erew123/alltalk_tts), which is based on the Coqui TTS engine and can be used with third-party software via JSON calls. It is VERY promising, but its API endpoint is different from marytts, and therefore I can’t use it for TTS support in HA.
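
To make the mismatch concrete, here is a rough sketch of the translation a small adapter would have to do: take a MaryTTS-style `/process` request (with the standard `INPUT_TEXT` and `LOCALE` parameters) and turn it into a payload for a different backend. The function name, the port in the example URL, and the output field names (`text_input`, `character_voice_gen`, `language`) are assumptions on my part and would need to be checked against the backend's own API documentation.

```python
from urllib.parse import parse_qs, urlsplit


def marytts_to_backend_payload(url, voice_file="speaker.wav"):
    """Map a MaryTTS-style /process URL to a backend payload (hypothetical fields)."""
    query = parse_qs(urlsplit(url).query)

    # MaryTTS clients send the text as INPUT_TEXT and the locale as e.g. en_US.
    text = query.get("INPUT_TEXT", [""])[0]
    locale = query.get("LOCALE", ["en_US"])[0]

    return {
        "text_input": text,                       # assumed field name
        "character_voice_gen": voice_file,        # assumed field name
        "language": locale.split("_")[0].lower(), # "en_US" -> "en"
    }


example = marytts_to_backend_payload(
    "http://localhost:59125/process?INPUT_TEXT=Hello+world"
    "&INPUT_TYPE=TEXT&LOCALE=en_US&AUDIO=WAVE_FILE&OUTPUT_TYPE=AUDIO"
)
print(example)
```

A real adapter would also need to accept the HTTP request, forward the payload, and return the resulting audio as WAV, which is beyond this sketch.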

Thank you in advance for any help anyone can provide!