I’ve gained so much from this amazing community, and I wanted to give something back.
To that end, I’ve created a pre-processor API endpoint for PiperTTS using the NVIDIA NeMo framework. The idea came to me after watching the latest Thorsten-Voice video.
This pre-processor is packaged as a stand-alone Docker image, so it can be easily deployed anywhere. I’ve also made a pre-built package available for convenience.
It’s particularly useful for handling dynamic text input with PiperTTS—think weather reports, stock alerts, and other real-time applications. I’m also planning to enhance the Text Normalization (TN) logic to make it even better for TTS use cases.
If you’re interested, you can check it out on GitHub. Hopefully, this tool adds value to someone else in the community!
@jaycollett : Thanks for your work, I’m happy I could inspire you. Hopefully I can give it a try soon.
@Rudd-O : It modifies text to make it easier for a TTS service (such as Piper) to pronounce: it expands abbreviations, converts numbers to their written form, and so on.
Imagine the following text.
Original (easier for humans to read, but harder for TTS):
Dr. Smith paid $1,234 for 2 items at 3pm after waiting outside at 72°F on may, 15th, 2024. While waiting for the train to arrive at 15:45 he called a support hotline at 1-800-555-0123.
Normalized / Cleaned (better for TTS pronunciation):
Doctor Smith paid one thousand two hundred thirty-four dollars for two items at three p m after waiting outside at seventy-two degrees Fahrenheit on May fifteenth, twenty twenty-four. While waiting for the train to arrive at fifteen forty-five he called a support hotline at one eight hundred five five five zero one two three.
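To make the idea concrete, here is a toy sketch of what such a normalization step does. It is not how NeMo works internally (NeMo uses far more complete grammars); the function names, the abbreviation table, and the limited number range are all assumptions made for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999 in English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("toy example only handles 0..999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def _spell(digits: str) -> str:
    """Spell out a digit string, leaving out-of-range values untouched."""
    value = int(digits.replace(",", ""))
    return number_to_words(value) if value < 1_000_000 else digits


def toy_normalize(text: str) -> str:
    """Expand a few abbreviations, dollar amounts, and bare integers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # "$1,234" -> "one thousand two hundred thirty-four dollars"
    text = re.sub(r"\$(\d{1,3}(?:,\d{3})+|\d+)",
                  lambda m: _spell(m.group(1)) + " dollars", text)
    # Stand-alone integers such as "2" (but not the "3" in "3pm").
    return re.sub(r"\b\d+\b", lambda m: _spell(m.group()), text)
```

Calling `toy_normalize("Dr. Smith paid $1,234 for 2 items")` covers the first half of the example sentence; times, temperatures, dates, and phone numbers are exactly the cases where a real TN framework earns its keep.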
Here’s an audio sample of the original and the normalized version:
This is a great initiative! It’s one of my biggest frustrations with voice and I’ve been trying to resolve it on the prompts side, but without a lot of success.
For example, I’ve told the LLM in the prompt to use 24-hour time instead of 12-hour, with no AM or PM, in its responses, but it still tells me the time in AM and PM. I’ve also added that it shouldn’t use abbreviations and that the text is meant for text to speech, but that doesn’t help either. At least I’ve mostly been able to stop it from using American customary units through the prompt.
So at least if there can be some text normalization in the pipeline, that would be great! How is the image/package supposed to be used in the Home Assistant pipeline? Should the TTS service use this and then it forwards the result to piper? Or does it have to be reimplemented in various automations but not as a general drop-in?
Also, how is the normalization performed? I had a quick look at the NeMo page but it’s not clear to me how the framework is being used for text normalization here. Would you be able to share your approach? It would be great to understand so it would be easier to contribute.
For example, when Piper gets a numbered list (for example, of lights in an area), the number for the next item is spoken together with the previous item, because there is a “full stop” after the number:
1. Left spot light
2. Right spot light
Becomes “One <pause> Left spot light two <pause> Right spot light”
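One possible workaround, sketched here only as an illustration, is to strip the list numbers before the text reaches the TTS engine so that each item ends with its own full stop. The helper names are hypothetical, and this is plain regex pre-processing, not something the NeMo-based API is known to do.

```python
import re

# Matches a leading "1. " or "2) " style marker at the start of a line.
_LIST_PREFIX = re.compile(r"^[ \t]*\d+[.)][ \t]+", flags=re.MULTILINE)


def strip_list_numbers(text: str) -> str:
    """Remove leading list numbers so they are not read aloud."""
    return _LIST_PREFIX.sub("", text)


def list_to_sentence(text: str) -> str:
    """Turn a numbered list into items separated by full stops."""
    items = [line.strip() for line in strip_list_numbers(text).splitlines() if line.strip()]
    return ". ".join(items) + "." if items else ""
```

With the two-light list above, `list_to_sentence` yields one short sentence per item, so the pause falls after each item instead of after each number.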
As you can see, I’m very excited about a comprehensive approach to fixing the TTS pipeline!
Ideally, this would be embedded in the Wyoming Piper TTS solution, perhaps with a setting to apply Text Normalization (TN) or not. But until that’s a reality I’ve decided to make do with this solution. I’m working through understanding the mechanics of the NeMo framework myself, but the results thus far have been outstanding in every scenario I’ve sent its way!
You need to run the Docker container on a machine that your HA instance can reach; this can be the HA host itself, as long as it isn’t a low-power system like a Pi. Then create a rest command on your HA instance and use it to pre-process any dynamic text you want to leverage. As long as you can get at the text, you can send it to the API and get back normalized text to pass to Piper or any other TTS engine; the API isn’t specific to Piper.
So create a rest command, something like this:
# In configuration.yaml, rest commands live under the rest_command: key
rest_command:
  normalize_text:
    url: "http://<YOUR API IP HERE>:5000/normalize"
    method: POST
    headers:
      Content-Type: application/json
    payload: '{"text": "{{ text | replace("\n", " ") | replace("\"", "\\\"") }}"}'
    timeout: 5 # Timeout in seconds
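If you want to test the endpoint outside of Home Assistant first, a small client like the following should do. It is a minimal sketch: the `/normalize` path, port 5000, and the `normalized_text` response key are taken from the examples in this thread, while the function names are made up for illustration. Using `json.dumps` also takes care of escaping quotes and newlines, which the payload template above does by hand.

```python
import json
import urllib.request


def build_normalize_request(text: str, base_url: str) -> urllib.request.Request:
    """Build the POST request for the (assumed) /normalize endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/normalize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(text: str, base_url: str, timeout: float = 5.0) -> str:
    """Send text to the API and return the normalized string."""
    request = build_normalize_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumes the response JSON carries a "normalized_text" field.
    return payload["normalized_text"]
```

For example, `normalize("It is 72°F outside.", "http://<YOUR API IP HERE>:5000")` should return the spoken-form text if the container is running and reachable.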
Once you’ve created and configured the rest_command in HA, you can use it in scripts or automations. In the example automation here, “message” is the text I want converted to speech. I feed it to the API endpoint, which returns the normalized string. Then, in step 5, I take the normalized text from the API response and send it to Piper to convert to speech and play on one of my media_player endpoints:
# Step 2: Call the API and process the response
- action: rest_command.normalize_text
  response_variable: normalized_response
  data:
    text: "{{ message }}"
# Step 3: Check if the API response was successful
- if: "{{ normalized_response['status'] == 200 }}"
  then:
    # Step 4: Log the normalized text
    - service: logbook.log
      data:
        name: "Weather Report"
        message: >
          Normalized Text: {{ normalized_response['content']['normalized_text'] }}
    # Step 5: Play the normalized text using Piper TTS
    - service: media_player.play_media
      data:
        entity_id:
          - media_player.<YOURSPEAKER>
        media_content_id: >
          media-source://tts/tts.piper?message={{ normalized_response['content']['normalized_text'] }}
        media_content_type: "music"
        announce: true
        extra:
          volume_level: "{{ volume_level }}"
  else:
    # Handle API error
    - service: logbook.log
      data:
        name: "Weather Report"
        message: >
          Failed to normalize text. Status: {{ normalized_response['status'] }}
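The status check in steps 3 to 5 can also be expressed as a small function. The sketch below assumes the response has the same shape as `normalized_response` above (a `status` code and a `content` mapping with `normalized_text`). Unlike the automation, which only logs the failure, this variant falls back to the original text so that something is still spoken; that fallback is my own design choice, not part of the example above.

```python
from typing import Any, Mapping


def text_to_speak(response: Mapping[str, Any], original: str) -> str:
    """Return the normalized text on success, otherwise the original text."""
    content = response.get("content")
    if response.get("status") == 200 and isinstance(content, Mapping):
        normalized = content.get("normalized_text")
        # Guard against a missing or empty field in an otherwise OK response.
        if isinstance(normalized, str) and normalized.strip():
            return normalized
    return original
```

Whether to stay silent or to speak the un-normalized text when the API is down depends on the use case; for alerts, falling back is probably the safer option.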
normalize_text:
  url: !secret nvidia_nemo_normalize_url
  method: POST
  headers:
    Content-Type: application/json
  payload: '{"text": "{{ text | replace("\n", " ") | replace("\"", "\\\"") }}"}'
  timeout: 5 # Timeout in seconds
Everything seems legit to me, and my RESTful sensors work too, but when I try to call it I get an error saying the rest_command.normalize_text service doesn’t exist (in fact, it doesn’t show up as a REST device in the first place)…
I have no idea how I overlooked such things; I apologize. I’ve added support for multiple environment variables to control various configurable aspects of the TN process. Specifically, you can now pass in the language you want to use. I’ve also updated the documentation to reflect this. You should now be able to add an environment variable to the Docker run command to set the language (-e LANG_TO_USE=it). You need to pull the latest (0.1.3) version of the Docker image.