HA Voice PE: Add post-processing step between the "Conversation Agent" and "Text to Speech" steps

I’m using “Google Generative AI” as my Conversation agent, and “Piper” for Text-to-Speech. This exceeded my expectations enormously at first, but soon Google started inserting markdown formatting into its answers. I have instructed it not to:

Answer questions about the world truthfully.
Keep it simple and to the point.
Do not use any markdown formatting or asterisks.

Unfortunately, the instruction to desist is either misunderstood or ignored, so Piper receives the markdown and reads it aloud: emphasis becomes “asterisk emphasis asterisk”, which hinders my comprehension. Emoticons come through as markup too, which befuddles me every time. I hear “{something incomprehensible} winking face {something incomprehensible}” or similar.

The simplest solution would be to insert an optional post-processing stage that applies a regular expression to remove such markdown. However, I envisage a need for all manner of post-processing that could enhance rather than hinder comprehension, so I think it would be better to make it configurable. For example, I expect I would benefit from emphasis becoming “{pause} emphasis {pause}”, or “{++volume} emphasis {--volume}”, employing whatever notation the Text-to-Speech stage can use to enhance vocal variety and provide a more human-like rendition. Such flexibility in post-processing would enable any conversation agent to be interfaced to any text-to-speech agent.
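As a sketch of what such a post-processing stage could look like, here is a plain regex pass in Python. The pattern set is illustrative, not exhaustive, and the function name is mine:

```python
import re

def strip_markdown(text: str) -> str:
    """Hypothetical post-processing step: strip common markdown before TTS."""
    # Remove bold/italic markers (**text**, *text*, __text__, _text_)
    text = re.sub(r"(\*\*|\*|__|_)(.+?)\1", r"\2", text)
    # Remove inline code backticks
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Remove markdown heading markers (# Heading)
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    # Strip emoji and other non-ASCII symbols that the TTS engine would voice
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Collapse any whitespace left behind
    return re.sub(r"\s{2,}", " ", text).strip()
```

A configurable version would load these patterns (and the replacements, e.g. pause or volume markers) from the pipeline configuration instead of hard-coding them.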

You need to state in the prompt that you don’t want markdown, HTML, etc. formatting…

Have you tried a different TTS engine? Elevenlabs, for example?

Just for reference, there is an open architectural discussion about this:

The basic problem is that Piper uses espeak-ng, whose dictionaries are not adapted for conversational use. espeak aims to voice all characters, emoticons, numbers and so on, without any normalization.

More work would be required to customize the dictionary for each language before training the voices for Piper. But that’s a huge amount of work for a project with few people, and time was running out.
Now, if I were OHF, I would allocate resources to re-train the base voices with these nuances in mind, if they plan to continue using Piper as their primary tool.

Now we have several solutions to the existing problem.

  • Character replacement can be added to the Piper server code.
  • It is possible to create an intermediate proxy on the Wyoming protocol that handles normalization, which is a more unified solution. Here is an example of a similar project for STT.
  • Or wait for a system solution from the developers.
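For the proxy idea, here is a very rough Python sketch of the shape it could take. The framing is a deliberate simplification I am assuming purely for illustration - events as newline-delimited JSON with an optional `data.text` field, and the real Piper server at `piper-host:10200` - whereas actual Wyoming traffic also carries binary audio payloads, which this ignores:

```python
import asyncio
import json

def rewrite_event(line: bytes) -> bytes:
    """Strip asterisks from the text field of one JSON event, if present."""
    try:
        event = json.loads(line)
    except ValueError:
        return line  # pass unknown/binary frames through untouched
    data = event.get("data") if isinstance(event, dict) else None
    if isinstance(data, dict) and isinstance(data.get("text"), str):
        data["text"] = data["text"].replace("*", "")
        return json.dumps(event).encode() + b"\n"
    return line

async def pump(reader, writer, transform=None):
    """Copy lines from reader to writer, optionally transforming each one."""
    while line := await reader.readline():
        writer.write(transform(line) if transform else line)
        await writer.drain()

async def handle_client(client_r, client_w):
    # Forward client traffic to the real Piper server, normalizing on the way
    piper_r, piper_w = await asyncio.open_connection("piper-host", 10200)
    await asyncio.gather(
        pump(client_r, piper_w, transform=rewrite_event),
        pump(piper_r, client_w),
    )
```

A real implementation would use the Wyoming protocol’s actual event framing rather than this line-based stand-in, and would apply the full normalization rules rather than just deleting asterisks.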

Sketched out a test version of the proxy.


ElevenLabs would take me in the reverse direction. I’m not interested in cloud agents: for me, HA is primarily about energy efficiency and cost reduction, then about high availability and self-sufficiency. Text-to-Speech is an important part of this, as it enables HA to prompt me when intervention is required. It’s specifically the output format from Google that is problematic, and I am open to trying other local TTS agents to achieve better results. I’m making as much as possible local, the conversation agent being the notable exception for the time being, because I don’t want to consume hundreds of watt-hours on a GPU that sees only occasional use. A wake-on-LAN solution for numerically intensive computing / AI may be a future development.

If you’re using Google, that ship has already sailed! :grin:

I mention Elevenlabs because their TTS voices use context to provide a more human delivery - I believe their main business is creating voices for audiobooks. The results are quite impressive. I’m with you on making things local, though. My TTS falls back to a rather robotic Pico when the internet isn’t available.

My inference server is local, and it doesn’t need as many watt-hours as you might think because, as you say, the GPU is idling most of the time. It switches off when I leave the house and wakes again when I return, but I’ve found that any finer wake-on-LAN control is impractical because of the delays introduced.


Therein lies the risk of pruning my explanation for brevity :joy: - I’m well aware that I’m using the cloud for the conversation agent right now, but that was only after experiencing the built-in local processing agent, which - putting it politely - has some way to go to get to where Alexa was a decade ago. Gemini, by contrast, trounces anything that came before, in my experience! It even apologised for turning on the wrong light first whilst turning on the correct one, although it didn’t turn the wrong one off again.

Thanks for sharing your experiences of wake-on-LAN. It doesn’t surprise me that it is slow, since modern dynamic kernels need to do a lot of re-work when resuming from sleep, with a general presumption that the state of peripherals has not been retained. There were performance advantages to having statically rebuilt kernels, as was the norm in early Unix. The last time I recall needing to rebuild a static kernel, though, was with DYNIX/ptx on a Sequent NUMA-Q as part of a Y2K compliance project! I expect I could use a similar approach to yours, but combine it with presence sensors, so that the kit gets woken up and kept awake whilst I’m in the room(s) where I can use it.

IMO, just stripping the markdown formatting from LLM output would already be a great improvement. A feature request is already in place:

I got past your specific problem by simply adding this to my general conversation guidelines. You might give this or something similar a go:

  • While I appreciate emojis to emphasize your comments, they are not effective in TTS. If you would like to provide emphasis to a comment, use words instead.

I have found that a lot of the “post-processing” you might consider is probably more easily avoided by writing better conversation guidelines for your conversation agent to adhere to.

I was also mildly annoyed by the conversation agent saying it was “nine-zero-seven am” when asked the time between minutes 01 and 09 of any given hour. I simply told the conversation agent:

  • When replying with a time that includes a leading 0 in the time format like “9:07” then refer to that zero with the letter “O.” In the example, “9:07” would be “Nine O’Seven.”
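The same fix could equally live in a post-processing stage rather than the prompt. A minimal regex sketch in Python - the function name and pattern are mine, purely illustrative:

```python
import re

def spell_minutes(text: str) -> str:
    """Rewrite "9:07"-style times so TTS reads the leading zero as "oh"."""
    # Only minutes 00-09 are rewritten; \b keeps longer digit runs untouched
    return re.sub(r"\b(\d{1,2}):0(\d)\b", r"\1 oh \2", text)
```
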

Thank you for your suggestions. I have mostly got there with the following instruction:

You are serving a voice pipeline that malfunctions when * symbols are used. Never use * or any other markdown. Only use ASCII alphanumeric characters in your answers.

I have also created a variable that tells HA who is using it, and scripts that the conversation agent invokes to switch between users - this then provides per-user customisation. I’m just scratching the surface of what is possible with this!

If I say “I am Sandvika” run script “Sandvika”. If I say “I am Dog” run script “Dog”. If I say “I am Petal” run script “Petal”.
My name is stored in the entity “User”.
If my name is Sandvika
(I understand English, German and French. Please sprinkle a few German and French expressions into your answers. )
Else If my name is Dog
(I also like being called by my nicknames “Pooch” and “Sweet”. I love everything about dogs. Please sprinkle a few expressions about dogs into your answers.)
Else If my name is Petal
(I also like being called by my nicknames “Flower” and “Sunflower”. I love all flowers and plants. Please sprinkle a few horticultural expressions into your answers.)

I have recently seen a video on YouTube advocating Apple silicon for local LLMs, as it pools memory between the CPU and GPU and draws almost nothing whilst idle. I was intending to get a Mac Studio anyway, so this may well be my answer!