HA Voice PE: Add post-processing step between the "Conversation Agent" and "Text to Speech" steps

I’m using “Google Generative AI” as my Conversation agent, and “Piper” for Text-to-Speech. This exceeded my expectations enormously at first, but soon Google started inserting markdown formatting into its answers. I have instructed it not to:

Answer questions about the world truthfully.
Keep it simple and to the point.
Do not use any markdown formatting or asterisks.

Unfortunately, the instruction to desist is either misunderstood or ignored, so Piper receives the markdown and reads it aloud: emphasis becomes “asterisk emphasis asterisk”, which hinders my comprehension. Emoticons come through as markup too, which befuddles me every time. I hear “{something incomprehensible} winking face {something incomprehensible}” or similar.

The simplest solution would be to insert an optional post-processing stage that applies a regular expression to remove such markdown. However, I envisage a need for all manner of post-processing that could enhance rather than hinder comprehension, so I think it would be better to make it configurable. For example, I expect I would benefit from emphasis becoming “{pause} emphasis {pause}”, or “{++volume} emphasis {--volume}”, employing whatever notation the Text-to-Speech stage can use to enhance vocal variety and provide a more human-like rendition. Such flexibility in post-processing would enable any conversation agent to be interfaced to any text-to-speech agent.
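As a sketch of what such a post-processing stage could look like, here is a plain regex pass in Python. The pattern set is illustrative, not exhaustive, and the function name is mine:

```python
import re

def strip_markdown(text: str) -> str:
    """Hypothetical post-processing step: strip common markdown before TTS."""
    # Remove bold/italic markers (**text**, *text*, __text__, _text_)
    text = re.sub(r"(\*\*|\*|__|_)(.+?)\1", r"\2", text)
    # Remove inline code backticks
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Remove markdown heading markers (# Heading)
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    # Strip emoji and other non-ASCII symbols that the TTS engine would voice
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Collapse any whitespace left behind
    return re.sub(r"\s{2,}", " ", text).strip()
```

A configurable version would load these patterns (and the replacements, e.g. pause or volume markers) from the pipeline configuration instead of hard-coding them.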

You need to state in the prompt that you don’t want markdown, HTML, etc. formatting…

Have you tried a different TTS engine? Elevenlabs, for example?

Just for reference, there is an open architectural discussion about this:

The basic problem is that Piper uses espeak-ng, whose dictionaries are not adapted for conversational use. espeak aims to voice all characters, emoticons, numbers and so on, without any normalization.

More work would be required to customize the dictionary for each language before training the voices for Piper. But that’s a huge amount of work for a project with few people, and time was running out.
Now, if I were OHF, I would allocate resources to re-train the base voices with these nuances in mind, if they plan to continue using Piper as their primary tool.

Now we have several solutions to the existing problem.

  • Character replacement can be added to the Piper server code.
  • It is possible to create an intermediate proxy on the Wyoming protocol that handles normalization, which is a more unified solution. Here is an example of a similar project for STT.
  • Or wait for a system solution from the developers.
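For the proxy idea, here is a very rough Python sketch of the shape it could take. The framing is a deliberate simplification I am assuming purely for illustration - events as newline-delimited JSON with an optional `data.text` field, and the real Piper server at `piper-host:10200` - whereas actual Wyoming traffic also carries binary audio payloads, which this ignores:

```python
import asyncio
import json

def rewrite_event(line: bytes) -> bytes:
    """Strip asterisks from the text field of one JSON event, if present."""
    try:
        event = json.loads(line)
    except ValueError:
        return line  # pass unknown/binary frames through untouched
    data = event.get("data") if isinstance(event, dict) else None
    if isinstance(data, dict) and isinstance(data.get("text"), str):
        data["text"] = data["text"].replace("*", "")
        return json.dumps(event).encode() + b"\n"
    return line

async def pump(reader, writer, transform=None):
    """Copy lines from reader to writer, optionally transforming each one."""
    while line := await reader.readline():
        writer.write(transform(line) if transform else line)
        await writer.drain()

async def handle_client(client_r, client_w):
    # Forward client traffic to the real Piper server, normalizing on the way
    piper_r, piper_w = await asyncio.open_connection("piper-host", 10200)
    await asyncio.gather(
        pump(client_r, piper_w, transform=rewrite_event),
        pump(piper_r, client_w),
    )
```

A real implementation would use the Wyoming protocol’s actual event framing rather than this line-based stand-in, and would apply the full normalization rules rather than just deleting asterisks.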

Sketched out a test version of the proxy.


ElevenLabs would take me in the reverse direction. I’m not interested in cloud agents: for me, HA is primarily about energy efficiency and cost reduction, then about high availability and self-sufficiency. Text-to-Speech is an important part of this, as it enables HA to prompt me when intervention is required. It’s specifically the output format from Google that is problematic, and I am open to trying other local TTS agents to achieve better results. I’m making as much as possible local, the conversation agent being the notable exception for the time being, because I don’t want to consume hundreds of watt-hours on a GPU that sees only occasional use. A wake-on-LAN solution for numerically intensive computing / AI may be a future development.

If you’re using Google, that ship has already sailed! :grin:

I mention Elevenlabs because their TTS voices use context to provide a more human delivery - I believe their main business is creating voices for audiobooks. The results are quite impressive. I’m with you on making things local, though. My TTS falls back to a rather robotic Pico when the internet isn’t available.

My inference server is local, and it doesn’t need as many watt-hours as you might think because, as you say, the GPU is idling most of the time. It switches off when I leave the house and wakes again when I return, but I’ve found that any finer wake-on-LAN control is impractical because of the delays introduced.


Therein lies the risk of pruning my explanation for brevity :joy: - I’m well aware that I’m using the cloud for the conversation agent right now, but that was only after experiencing the built-in local processing agent, which - putting it politely - has some way to go to get to where Alexa was a decade ago. Gemini, by contrast, trounces anything that came before, in my experience! It even apologised for turning on the wrong light first whilst turning on the correct one, although it didn’t turn the wrong one off again.

Thanks for sharing your experiences of wake-on-LAN. It doesn’t surprise me that it is slow, since modern dynamic kernels need to do a lot of re-work when resuming from sleep, with a general presumption that the state of peripherals has not been retained. There were performance advantages to having statically rebuilt kernels, as was the norm in early Unix. The last time I recall needing to rebuild a static kernel, though, was with DYNIX/ptx on a Sequent NUMA-Q as part of a Y2K compliance project! I expect I could use a similar approach to yours, but combine it with presence sensors, so that the kit gets woken up and kept awake whilst I’m in the room(s) where I can use it.

IMO, just stripping the markdown formatting from LLM output would already be a great improvement. A feature request is already in place:

I got past your specific problem by simply adding this to my general conversation guidelines. You might give this or something similar a go:

  • While I appreciate emojis to emphasize your comments, they are not effective in TTS. If you would like to provide emphasis to a comment, use words instead.

I have found that a lot of the “post-processing” you might consider is probably more easily avoided by writing better conversation guidelines for your conversation agent to adhere to.

I was also mildly annoyed by the conversation agent saying it was “nine-zero-seven am” when asked the time between minutes 01 and 09 of any given hour. I simply told the conversation agent:

  • When replying with a time that includes a leading 0 in the time format like “9:07” then refer to that zero with the letter “O.” In the example, “9:07” would be “Nine O’Seven.”
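The same fix could equally live in a post-processing stage rather than the prompt. A minimal regex sketch in Python - the function name and pattern are mine, purely illustrative:

```python
import re

def spell_minutes(text: str) -> str:
    """Rewrite "9:07"-style times so TTS reads the leading zero as "oh"."""
    # Only minutes 00-09 are rewritten; \b keeps longer digit runs untouched
    return re.sub(r"\b(\d{1,2}):0(\d)\b", r"\1 oh \2", text)
```
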

Thank you for your suggestions. I have mostly got there with the following instruction:

You are serving a voice pipeline that malfunctions when * symbols are used. Never use * or any other markdown. Only use ASCII alphanumeric characters in your answers.

I have also created a variable that tells HA who is using it, and scripts that the conversation agent invokes to switch between users - this then provides per-user customisation. I’m just scratching the surface of what is possible with this!

If I say “I am Sandvika” run script “Sandvika”. If I say “I am Dog” run script “Dog”. If I say “I am Petal” run script “Petal”.
My name is stored in the entity “User”.
If my name is Sandvika
(I understand English, German and French. Please sprinkle a few German and French expressions into your answers. )
Else If my name is Dog
(I also like being called by my nicknames “Pooch” and “Sweet”. I love everything about dogs. Please sprinkle a few expressions about dogs into your answers.)
Else If my name is Petal
(I also like being called by my nicknames “Flower” and “Sunflower”. I love all flowers and plants. Please sprinkle a few horticultural expressions into your answers.)

I have recently seen a video on YouTube advocating Apple silicon for local LLMs, as it pools memory between the CPU and GPU and draws almost nothing whilst idle. I was intending to get a Mac Studio anyway, so this may well be my answer!