TTS limitations and overcoming them

I have spent far too much time recently stressing over my dissatisfaction with TTS. There are a few clever implementations I have found around here, but none as far as I know that can address two glaring problems that I see.

  1. For complex announcements constructed from more than one element I have not found a way to do this unless the elements are always in the same order. The stumbling block for me is that, as far as I understand it, an input_text (and indeed any input_ entity) has a maximum size of 256 bytes (characters), which precludes the possibility of constructing the text of the announcement outside of the TTS service call.
  2. On a completely different note, I can think of no way to queue announcements. This means that when more than one occurs at the same time, one or more can get lost, and which ones is unpredictable because it depends on timing. This is especially so if more than one announcement is generated by multiple state changes in the same automation.

Has anyone got any ideas to get around either of these? I suspect item 2 especially might be insurmountable but I have my doubts about both as I feel sure someone would have followed the same thought process as me and done them by now.
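
For the first problem, one possible approach is to build the text in a script variable rather than in an input_text. The following is only a minimal sketch, assuming a Home Assistant release that supports the `variables` action in scripts. The script name `announce_status` and the entities `cover.garage_door` and `alarm_control_panel.home` are hypothetical. As far as I know, script variables are not entity states, so the state length cap should not apply to them.

```yaml
script:
  announce_status:          # hypothetical script name
    sequence:
      - variables:
          # Each element is included only when it is relevant, so the
          # announcement does not depend on a fixed set or order of parts.
          parts: >-
            {{ [
              'The garage door is open' if is_state('cover.garage_door', 'open') else '',
              'The alarm is disarmed' if is_state('alarm_control_panel.home', 'disarmed') else ''
            ] | select | list }}
      - service: tts.google_say
        data:
          entity_id: media_player.dining_room_speaker
          # Join whatever parts survived into one sentence-separated string
          message: "{{ parts | join('. ') }}"
```

If none of the conditions match, the message is empty, so a real version would probably want a condition that skips the TTS call in that case.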

You could use the variables custom component. It does not have a char limit…
I use it for building complex google static map urls :slight_smile:

Yes, I’ve looked at that more than a few times but I am strangely resistant to using a custom component for something which is effectively a fundamental feature. It feels like Pandora’s box in that it will undoubtedly find its way into every corner of my system, and I would then worry about what happens if HA ever (as I believe it probably should) implements a native global variable feature.

And just in the interests of clarity this is clearly an endorsement and not a criticism of how good I think the variable custom component probably is!

Maybe I should just bite the bullet…

Ok, so I found this, Pi audio component which appears to be able to create mp3 files and might be a pointer to both issues…

When using TTS, the announcements are stored in the config/tts folder. Is it possible, using the default TTS service (specifically in my case Google), to have the mp3 file created but without it actually being played?

Then, phrases longer than 256 characters could be constructed, stored (queued!?) and then played under some sort of software control.

I can think of clunky ways to do this, i.e. play with a volume of zero, or have a sensor check for changes in the config/tts folder and then stop the TTS script etc., but it would be nice to do it ‘cleanly’.

Did you ever work something out for this issue? I used to have the same problem with TTS announcement collisions. For example when someone came home, opened the garage door, and disarmed the alarm all within a short time frame, all these events trigger a TTS announcement and some announcements would always get skipped.

I recently put together a simple but very effective message queue system that keeps several TTS messages in a queue and plays them in order. It’s all in Home Assistant, nothing external needed. I’m sure it’s not the most elegant solution but I’ve had it going for a few weeks now and it’s working as expected.

If you didn’t come up with anything else I can put it together in a package or something for you if you want to see if it will work for you. I actually came across this thread searching around to see if I should bother posting it…

Well…
I have spent way too many hours on various methods of stopping announcement collisions and only today I decided to spend a few more hours simplifying it. I don’t think my method is anywhere near perfect so I’d be very interested to see how you’ve done it.

Definitely post it please!

If you’re using Google Home devices, you can use the node red node called “node-red-contrib-cast” to send tts to individual devices, use templates to change the message, rate limit, change the volume on the fly etc…

it works really well.

I have to go to work in a bit so I didn’t have time to put a complete package together but here’s the nuts and bolts of it so you can see what I did. I’m pretty sure you can pull something useful out of this.

I know it works well for me. If it works for you and you think it could be useful for anyone else I can put it together in a tidy little package that isn’t so specifically geared towards my configuration.

Looks interesting. I’ve only played around with node red a bit and it was a while ago back when it wasn’t so stable so it was quite the exercise in frustration. Plus I’m already killing my RPi with other stuff. I’m contemplating upgrading my HA setup soon so I’ll probably have another real good look at node red then.

Interesting. A different approach to mine although we do have a few very similar details (perhaps not surprising given the limitations of yaml). Also I use Sonos which presents a few other small challenges.

I don’t have a queue as such, but every announcement waits for the central announcement ‘engine’ to be idle before it attempts to take control of it. I’m not convinced my way is as robust as yours but it seems to work ok. I can see advantages and disadvantages to both methods. Maybe we need some kind of hybrid!!

Anyway I like what you’ve done and may well borrow a few of those ideas. Definitely post a tidied up package if you have the time, it is always useful to have many examples to look at and who knows who’ll be along in the future and want to do this? One of the great things about this forum is discovering what others have done.

I use a simple input boolean and a wait template each time I ask Alexa to speak. The script portion which controls what is being spoken looks like this:

# Make Announcement
- service: input_boolean.turn_on
  entity_id: input_boolean.alexa_speaking
- service: notify.alexa_media
  data_template:
    target:
      - media_player.kitchen_echo
    data:
      type: tts
    message: "Your message goes here"
- delay: '00:00:05'
- service: input_boolean.turn_off
  entity_id: input_boolean.alexa_speaking

The input boolean is turned on just prior to the call to speak the text. Then there is a brief delay - long enough for all the text to be spoken, followed by the input boolean being turned off.

In addition, the automation that triggers the script contains a wait template to check if the input boolean is on or off. That looks like this:

  action:
  - service: input_select.select_option
    data:
      entity_id: input_select.alexa_entity
      option: "media_player.kitchen_echo"
    # Wait for Alexa to be ready
  - wait_template: "{{ is_state('input_boolean.alexa_speaking', 'off') }}"
    timeout: '00:00:20'
    continue_on_timeout: true
  - service: script.turn_on
    entity_id: script.set_alexa_volume
    data_template:
      variables:
        alexa_entity: "{{ states('input_select.alexa_entity') }}"
  - delay: '00:00:03'
  - service: script.turn_on
    entity_id: script.announce_morning_briefing
    data_template:
      variables:
        alexa_entity: "{{ states('input_select.alexa_entity') }}"
  - delay: '00:00:03'
  - service: script.turn_on
    entity_id: script.morning_briefing
    data_template:
      variables:
        recipient: 'Art'
        alexa_entity: "{{ states('input_select.alexa_entity') }}"

Again, you can configure the time delay to encompass any potentially long text. You may also notice that I use a two part briefing announcement. The first part uses the ‘announce’ data type:

announce_morning_briefing:
  alias: "Announce Morning Briefing"
  sequence:
  - service: notify.alexa_media
    data_template:
      target:
        - "{{ alexa_entity }}"
      data:
        type: announce
      message: >-
        Your daily briefing is about to start.

and the second part uses the ‘tts’ data type:

morning_briefing:
  alias: Morning Briefing
  sequence:
  # Read Briefing
  - service: notify.alexa_media
    data_template:
      target:
        - "{{ alexa_entity }}"
      data:
        type: tts
        recipient: "{{ recipient }}"
      message: >-
        {# Salutation #}
        {% if states.sensor.time_of_day.state != '' %}
          Good {{states.sensor.time_of_day.state}} {{ recipient }}.
        {% endif %}

I found there is a slight difference in flow and cadence between the two: the ‘tts’ type seems smoother and more natural than the same text sent with the ‘announce’ type.

This solution works well for me and I never seem to miss spoken notifications any more.

My third or fourth crack at the can was very similar to what you have illustrated here. It works great for 2 short announcements. But if you were to trigger an extended or a third TTS announcement you would run into the issues that @klogg and I were talking about.

What happens is that if an automation is being held in a running (on) state by a delay or wait template and it is triggered again, the currently running wait template or delay ends immediately and the automation continues with the next step in its action sequence.

@pnbruckner explains it very well in this thread

I put together another example below with wait templates to illustrate if you care to try it out. Basically, if you fire all three automations within the 1 minute wait template tts1 and tts2 will play, the third tts will result in a warning in the log that the script is already running and the message will not play.

The key to the way my system works is the scripts holding the messages in the queue cannot be called more than once. Each message is handed off to a separate queue script. If the queue is full it just discards the last message without messing up the current queue. That should never actually happen for me as I can’t think of a time when 5 messages will actually all be triggered all at once, but if that is an issue the size of the queue can be increased to whatever might be required. I just chose 5 because I figured that was more than I’ll ever need.

The other neat thing here is there is no predetermined message length. The queue is cycled by the media player stopping playback, not by a predetermined time limit. One message can be 45 seconds, another 5 seconds and another 20 seconds, and they all play seamlessly without unnecessary delays.

I do have to preface every call to my play_announcement script with a short wait template that checks whether it is already running. There is a small possibility of play_announcement being in a running state for a couple of seconds while the message_queue script runs, so the wait acts as a little buffer for that case.

    - wait_template: "{{ is_state('script.play_announcement', 'off') }}"
      timeout: '0:00:10'
      continue_on_timeout: true

    - service: script.play_announcement
      data_template:
        play_message: "Test"

Wait Template Example

input_boolean:
  test_tts1:
    name: TTS1
  test_tts2:
    name: TTS2
  test_tts3:
    name: TTS3

automation:
  - id: test_tts1
    alias: "Test TTS1"
    trigger:
      - platform: state
        entity_id: input_boolean.test_tts1
        to: 'on'
    action:
      - wait_template: "false"
        timeout: '0:01:00'
        continue_on_timeout: true
      - service: script.play_tts
        data:
          message: "test 1"

  - id: test_tts2
    alias: "Test TTS2"
    trigger:
      - platform: state
        entity_id: input_boolean.test_tts2
        to: 'on'
    action:
      - wait_template: "false"
        timeout: '0:01:00'
        continue_on_timeout: true
      - service: script.play_tts
        data:
          message: "test 2"

  - id: test_tts3
    alias: "Test TTS3"
    trigger:
      - platform: state
        entity_id: input_boolean.test_tts3
        to: 'on'
    action:
      - wait_template: "false"
        timeout: '0:01:00'
        continue_on_timeout: true
      - service: script.play_tts
        data:
          message: "test 3"

script:
  play_tts:
    sequence:
      - service: tts.google_say
        data_template:
          entity_id: media_player.dining_room_speaker
          message: "{{message}}"

Strange this issue is still there after 3 years!!

I’m not sure since when, but you can now implement this using the queued mode of scripts. Instead of calling the TTS service directly, I call the script below with the message and an optional cache flag. It roughly estimates the duration from the message length and adds some extra delay. Because the script is queued, calling it a second time straight away just queues the second call. You can also make the speaker entity a parameter by adding it to fields…

  speak:
    alias: 'Speak'
    mode: queued
    fields:
      message:
        required: true
        description: 'The message that will be spoken'
        example: "hello"
      cache:
        required: false
        default: 'true'
        description: 'Cache the tts sound file, default true'
        example: 'false'
    sequence:
    - service: tts.google_translate_say
      data_template:
        entity_id: media_player.minispeaker
        message: "{{ message }}"
        cache: "{{cache if cache is defined else 'true'}}"
    - delay:
        seconds: >
          {% set l = message|length %}
          {% set speed = 75 %}
          {% set duration_seconds = ((speed * l)/1000)|round(0,method='ceil')|int %}
          {{ duration_seconds }}
    - delay: 00:00:03

To call it just use

service: script.speak
data:
  message: 'Hello world'
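
As a rough illustration of the speaker-as-parameter idea, a variant might look like the sketch below. It assumes a Home Assistant release where templates are accepted directly under `data`. The script name `speak_on` and the field name `speaker` are made up for this example, and it falls back to `media_player.minispeaker` when no speaker is passed.

```yaml
script:
  speak_on:                 # hypothetical script name
    alias: 'Speak on a chosen speaker'
    mode: queued
    fields:
      message:
        description: 'The message that will be spoken'
        example: 'hello'
      speaker:              # hypothetical field name
        description: 'Media player entity to speak on'
        example: 'media_player.minispeaker'
    sequence:
      - service: tts.google_translate_say
        data:
          # Use the passed speaker, or fall back to the original one
          entity_id: "{{ speaker | default('media_player.minispeaker') }}"
          message: "{{ message }}"
      # Same estimate as above: about 75 ms per character, plus 3 seconds
      - delay:
          seconds: "{{ ((75 * (message | length)) / 1000) | round(0, method='ceil') | int + 3 }}"
```

It would then be called with the extra field (`media_player.kitchen_speaker` is a placeholder):

```yaml
service: script.speak_on
data:
  message: 'Hello world'
  speaker: media_player.kitchen_speaker
```

One thing to keep in mind with this design is that the queue belongs to the script, not to the speaker, so messages for different speakers would still wait for each other.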

Why not use a wait template instead?

  speak:
    alias: 'Speak'
    mode: queued
    fields:
      message:
        required: true
        description: 'The message that will be spoken'
        example: "hello"
      cache:
        required: false
        default: 'true'
        description: 'Cache the tts sound file, default true'
        example: 'false'
    sequence:
    - service: tts.google_translate_say
      data_template:
        entity_id: media_player.minispeaker
        message: "{{ message }}"
        cache: "{{cache if cache is defined else 'true'}}"
    - wait_template: "{{ is_state('media_player.minispeaker','playing') }}"
    - wait_template: "{{ is_state('media_player.minispeaker','idle') }}"

That way it waits for the media player to start playing (the TTS) and then for it to stop (becoming idle).

No estimation or guesswork required.

Hey
That was my first attempt, but the media player does not always switch to ‘playing’ and then ‘idle’; sometimes it just remains in the idle state. Maybe because I use a VLC-type local media player?

As the OP I guess I should add my tuppence…
My notification system has morphed over the years but has been very stable and reliable for a long time now.

Why not combine both using a timeout?

    - wait_template: "{{ is_state('media_player.minispeaker','playing') }}"
      timeout:
        seconds: 3

    - wait_template: "{{ states('media_player.minispeaker')  in ['paused', 'idle'] }}"
      timeout:
        seconds: >
          {% if is_number(state_attr('media_player.minispeaker', 'media_duration')) %}
            {{ state_attr('media_player.minispeaker', 'media_duration') }}
          {% else %}
            {% set char_count = (split_message[repeat.index - 1]) | length | float %}
            {% set duration_estimate = char_count / 2 * 0.26 %}
              {{ duration_estimate }}
          {% endif %}

I can’t speak to how using VLC would affect this, I use Sonos and I also used to experience the state not changing to ‘playing’. I don’t know if it still does that, maybe because it is fixed for Sonos now or maybe because my ‘wait/timeout’ works so well? :wink: :grinning:

(I also use a different calculation for duration based on text length. I can’t remember where I got it from but it is not mine)

Yes I was thinking similar to combine them, thanks for posting this! Though I don’t follow from this snippet what is split_message and repeat.index…maybe post the whole script?

I have a hunch that if the message is short, the media player state change happens with a bit of delay and the wait template will not have time to notice it… So combining the two should cover both cases.

For the estimate I took a longer message, downloaded the TTS-generated sound file from Home Assistant and divided its duration by the message length, which came to roughly 75 ms per character. So far it has worked well, and I will probably combine the two approaches at some point.

Yes, sorry… I have a way to send more than one message and have a pause in between which is what the split_message is so you can ignore all of that and just use
{% set char_count = message | length %}
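
Putting the pieces from the last few posts together, a combined version could look something like the sketch below. It is untested and assumes a Home Assistant release that supports `mode: queued`, the `variables` and `choose` actions, and the `wait.completed` variable. The script name `speak_combined` and the variable name `estimate` are made up for this example; the 75 ms per character and 3 second figures are the ones quoted earlier in the thread.

```yaml
script:
  speak_combined:           # hypothetical script name
    alias: 'Speak (wait with fallback)'
    mode: queued
    fields:
      message:
        description: 'The message that will be spoken'
        example: 'hello'
    sequence:
      # Rough duration: about 75 ms per character plus 3 seconds of slack
      - variables:
          estimate: "{{ ((75 * (message | length)) / 1000) | round(0, method='ceil') | int + 3 }}"
      - service: tts.google_translate_say
        data:
          entity_id: media_player.minispeaker
          message: "{{ message }}"
      # Give the player a few seconds to report 'playing'
      - wait_template: "{{ is_state('media_player.minispeaker', 'playing') }}"
        timeout:
          seconds: 3
        continue_on_timeout: true
      - choose:
          # Playback was seen: wait for it to end, capped at the estimate
          - conditions:
              - condition: template
                value_template: "{{ wait.completed }}"
            sequence:
              - wait_template: "{{ states('media_player.minispeaker') in ['paused', 'idle'] }}"
                timeout:
                  seconds: "{{ estimate }}"
                continue_on_timeout: true
        # Playback was never seen: fall back to the estimated delay
        default:
          - delay:
              seconds: "{{ estimate }}"
```

The `choose` step is there because, as far as I can tell, a player that never leaves ‘idle’ would satisfy the second wait template immediately, so the next queued message could start too early. Checking `wait.completed` lets the script fall back to the length-based delay in that case.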

I won’t post the whole script because it is part of a whole package of scripts that handle all sorts of different message types.

That was exactly my thinking!