Piper TTS Pre-processing

jaycollett · January 10, 2025, 10:22pm

I’ve gained so much from this amazing community, and I wanted to give something back.

To that end, I’ve created a pre-processor API endpoint for PiperTTS using the NVIDIA NeMo framework. The idea came to me after watching the latest Thorsten-Voice video.

This pre-processor is packaged as a stand-alone Docker image, so it can be easily deployed anywhere. I’ve also made a pre-built package available for convenience.

It’s particularly useful for handling dynamic text input with PiperTTS—think weather reports, stock alerts, and other real-time applications. I’m also planning to enhance the Text Normalization (TN) logic to make it even better for TTS use cases.

If you’re interested, you can check it out on GitHub. Hopefully, this tool adds value to someone else in the community!

jaycollett/hass_nemo: Simple Python Docker exposing an API using Nemo to perform text normalization using NVidia NeMo framework.

Rudd-O · January 10, 2025, 11:36pm

What is Text Normalization?

thorsten-voice · January 12, 2025, 12:21am

Guude,

@jaycollett: Thanks for your work. I’m happy I could inspire you. Hopefully I can give it a try soon.

@Rudd-O: Text normalization modifies a text so that a TTS service (such as Piper) can pronounce it more easily: it expands abbreviations, converts numbers to their written form, and so on.

Imagine the following text.

Original (easier for humans to read, but harder for TTS):

Dr. Smith paid $1,234 for 2 items at 3pm after waiting outside at 72°F on may, 15th, 2024. While waiting for the train to arrive at 15:45 he called a support hotline at 1-800-555-0123.

Normalized / cleaned (better for TTS pronunciation):

Doctor Smith paid one thousand two hundred thirty-four dollars for two items at three p m after waiting outside at seventy-two degrees Fahrenheit on May fifteenth, twenty twenty-four. While waiting for the train to arrive at fifteen forty-five he called a support hotline at one eight hundred five five five zero one two three.

Here’s an audio sample of the original and the normalized version:

https://youtu.be/-99WPCIlq-s?t=27 (original)
https://youtu.be/-99WPCIlq-s?t=98 (normalized)
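
To make the idea concrete, here is a minimal Python sketch of rule-based normalization for a few of the cases in the example above. It is only a toy and not how NeMo does it (NeMo's text normalization is built on weighted finite-state transducer grammars and covers far more cases). The function names `number_to_words` and `normalize` are made up for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 (toy implementation)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Apply a handful of hand-written rules; order matters."""
    # Expand one abbreviation.
    text = re.sub(r"\bDr\.", "Doctor", text)
    # "$1,234" -> "one thousand two hundred thirty-four dollars"
    text = re.sub(
        r"\$(\d+(?:,\d{3})*)",
        lambda m: number_to_words(int(m.group(1).replace(",", ""))) + " dollars",
        text,
    )
    # "72°F" -> "seventy-two degrees Fahrenheit"
    text = re.sub(
        r"(\d+)°F",
        lambda m: number_to_words(int(m.group(1))) + " degrees Fahrenheit",
        text,
    )
    # Any remaining standalone integers.
    return re.sub(r"\b\d+\b", lambda m: number_to_words(int(m.group(0))), text)
```

Times, dates, and phone numbers each need their own rules (and their own ambiguities), which is why a dedicated framework is attractive for this job.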

Rudd-O · January 12, 2025, 7:17am

What’s it good for?

thorsten-voice · January 12, 2025, 5:48pm

For more natural-sounding TTS (text-to-speech / speech synthesis).

ScrubberWalloping · January 13, 2025, 2:47pm

This is a great initiative! It’s one of my biggest frustrations with voice and I’ve been trying to resolve it on the prompts side, but without a lot of success.

For example, I’ve told the LLM in the prompt to use 24-hour time instead of 12-hour time, with no AM or PM, in its responses, but it still tells me the time with AM and PM. I’ve also added that it shouldn’t use abbreviations and that the text is meant for text-to-speech, which doesn’t help either. At least I seem to have mostly stopped it from using American customary units through the prompt.

So if there could be some text normalization in the pipeline, that would be great! How is the image/package supposed to be used in the Home Assistant pipeline? Should the TTS service call this and then forward the result to Piper? Or does it have to be reimplemented in individual automations rather than working as a general drop-in?

Also, how is the normalization performed? I had a quick look at the NeMo page but it’s not clear to me how the framework is being used for text normalization here. Would you be able to share your approach? It would be great to understand so it would be easier to contribute.

For example, when Piper gets a numbered list (for example, of lights in an area), the number of the next item is spoken together with the previous item because there is a “full stop” after the number:

1. Left spot light
2. Right spot light

Becomes “One <pause> Left spot light two <pause> Right spot light”

As you can see, I’m very excited about a comprehensive approach to fixing the TTS pipeline!
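
One possible workaround for the numbered-list pause, sketched in Python: rewrite each `1. Item` line so the number is followed by a comma and the item ends with a full stop. This assumes the TTS engine pauses longer at a full stop than at a comma, which I have not verified for Piper, and the helper name `fix_numbered_list` is hypothetical.

```python
import re


def fix_numbered_list(text: str) -> str:
    """Rewrite '1. Item' lines as '1, Item.' so the long pause falls after the item."""
    fixed = []
    for line in text.splitlines():
        match = re.match(r"^\s*(\d+)\.\s+(.*\S)\s*$", line)
        if match:
            number, item = match.groups()
            # Avoid doubling the final punctuation if the item already ends a sentence.
            if item[-1] not in ".!?":
                item += "."
            fixed.append(f"{number}, {item}")
        else:
            # Lines that are not numbered list items pass through unchanged.
            fixed.append(line)
    return "\n".join(fixed)
```

A step like this could run before or after the normalization API call; which order works better depends on how the normalizer treats the list numbers.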

theclue · January 14, 2025, 2:44pm

This is an incredibly valuable effort!

But I’m puzzled… how can I embed it in my HA pipeline?

OK, I can run the Docker container on a different server (side note: should I?) and call the REST API on it, but what about the HA Piper pipeline?

jaycollett · January 14, 2025, 3:02pm

Ideally, this would be embedded in the Wyoming Piper TTS solution, perhaps with a setting to turn Text Normalization (TN) on or off. Until that’s a reality, I’ve decided to make do with this solution. I’m still working through the mechanics of the NeMo framework myself, but the results so far have been outstanding in every scenario I’ve sent its way!

jaycollett · January 14, 2025, 3:07pm

You need to run the Docker container on a machine that your HA instance can reach. That could be the same machine as HA, as long as it’s not a low-power system like a Pi. Then create a rest_command on your HA instance and use it to pre-process any dynamic text you want spoken. If you can get access to the text, you can send it to the API and get back normalized text to send to Piper, or to any other TTS engine for that matter; it’s not specific to Piper.

So create a rest command, something like this:

normalize_text:
  url: "http://<YOUR API IP HERE>:5000/normalize"
  method: POST
  headers:
    Content-Type: application/json
  payload: '{"text": "{{ text | replace("\n", " ") | replace("\"", "\\\"") }}"}'
  timeout: 5 # Timeout in seconds

Once you have configured the rest_command in HA, you can use it in scripts or automations. In the example automation here, “message” is the text I want converted to speech: I send it to the API endpoint, which returns the normalized string. Then, in step 5, I take the normalized text from the API response and send it to Piper to convert to speech and play on one of my media_player endpoints:

  # Step 2: Call the API and process the response
  - action: rest_command.normalize_text
    response_variable: normalized_response
    data:
      text: "{{ message }}"

  # Step 3: Check if the API response was successful
  - if: "{{ normalized_response['status'] == 200 }}"
    then:
      # Step 4: Log the normalized text
      - service: logbook.log
        data:
          name: "Weather Report"
          message: >
            Normalized Text: {{ normalized_response['content']['normalized_text'] }}

      # Step 5: Play the normalized text using Piper TTS
      - service: media_player.play_media
        data:
          entity_id:
            - media_player.<YOURSPEAKER>
          media_content_id: >
            media-source://tts/tts.piper?message={{ normalized_response['content']['normalized_text'] }}
          media_content_type: "music"
          announce: true
          extra:
            volume_level: "{{ volume_level }}"
    else:
      # Handle API error
      - service: logbook.log
        data:
          name: "Weather Report"
          message: >
            Failed to normalize text. Status: {{ normalized_response['status'] }}
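
To check the endpoint outside Home Assistant first, a small Python client along these lines may help. It is a sketch based only on what the rest_command and automation above show: a POST to `/normalize` with a JSON body containing `text`, and a JSON response containing `normalized_text`. The function names and the `base_url` parameter are my own; anything else about the API is an assumption.

```python
import json
import urllib.request


def build_request(text: str, base_url: str) -> urllib.request.Request:
    """Build the POST request; json.dumps takes care of escaping quotes."""
    body = json.dumps({"text": text.replace("\n", " ")}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/normalize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize_text(text: str, base_url: str, timeout: float = 5.0) -> str:
    """Send text to the normalization API and return the normalized string."""
    request = build_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Key name taken from the automation above; assumed to be present on success.
    return payload["normalized_text"]
```

For example, `normalize_text("It is 72°F outside", "http://<YOUR API IP HERE>:5000")` should return the normalized sentence if the container is reachable. Using the same 5-second timeout as the rest_command makes it easy to see whether the first request is slow on your hardware.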

theclue · January 14, 2025, 3:13pm

OK, thanks a lot! Going to install it right now! I’ll let you know my impressions.

But on my OpenMediaVault box for now. It’s easier to manage Docker Compose stuff outside Home Assistant, in my opinion.

Perhaps you could consider creating an HA component for easier integration?

theclue · January 14, 2025, 4:50pm

I’m probably doing something wrong, but I cannot get rest_command working:

In configuration.yaml

rest_command: !include_dir_merge_list includes/rest_commands

In includes/rest_commands/nvidia.yaml:

normalize_text:
  url: !secret nvidia_nemo_normalize_url
  method: POST
  headers:
    Content-Type: application/json
  payload: '{"text": "{{ text | replace("\n", " ") | replace("\"", "\\\"") }}"}'
  timeout: 5 # Timeout in seconds

Everything seems legit to me, and my RESTful sensors work too, but when I try to call it I get an error saying the rest_command.normalize_text service doesn’t exist (in fact, it doesn’t show up as a REST device in the first place)…

jaycollett · January 14, 2025, 5:42pm

You need a named dir merge, because rest_command expects a mapping of command names rather than a list. Try this:

rest_command: !include_dir_merge_named includes/rest_commands
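
Why the named merge matters can be shown with a simplified Python model of the two include directives. This is my understanding of the behavior, not Home Assistant's actual loader code: the list variant concatenates files that contain lists, the named variant merges files that contain mappings, and content of the wrong type is skipped. The function names simply mirror the YAML tags, and the example URL is a placeholder.

```python
def include_dir_merge_list(parsed_files):
    """Model of !include_dir_merge_list: concatenate every file that holds a list."""
    merged = []
    for content in parsed_files:
        if isinstance(content, list):
            merged.extend(content)
    return merged


def include_dir_merge_named(parsed_files):
    """Model of !include_dir_merge_named: merge every file that holds a mapping."""
    merged = {}
    for content in parsed_files:
        if isinstance(content, dict):
            merged.update(content)
    return merged


# Parsed content of includes/rest_commands/nvidia.yaml: a mapping, not a list.
nvidia_yaml = {
    "normalize_text": {
        "url": "http://example.invalid:5000/normalize",  # placeholder URL
        "method": "POST",
    }
}
```

Under this model the list merge of `[nvidia_yaml]` comes back empty, so no `normalize_text` command is ever registered, while the named merge keeps the `normalize_text` key that rest_command needs.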

theclue · January 15, 2025, 9:15am

Thanks! It works now.

I was wondering if there is a way to parameterize the Docker container for a different output language. It outputs English at the moment.

jaycollett · January 15, 2025, 3:39pm

I have no idea how I overlooked that; I apologize. I’ve added support for multiple environment variables to control various configurable aspects of the TN process. Specifically, you can now pass in the language you want to use. I’ve also updated the documentation to reflect this. You should now be able to add an environment variable to the Docker run command to set the language (-e LANG_TO_USE=it). You need to pull the latest (0.1.3) version of the Docker image.