Alerts, notifications, TTS, but from my own audio clips

Hello everyone,

I am looking for a tutorial where I can learn how to set up a variety of notifications or alerts and have them play specific or random audio clips from my own repository.

Can I have them play the sound through the Google speakers around the house, or do I need to install new speakers?


I love the responsiveness, don’t you? I just started trying a similar approach, but don’t want to have to use a Google/Sonos/etc. speaker. I have an I2S board and a little amplifier, and am interested in putting them all together if possible, with the setup handling only pre-created audio notifications. I have multiple motion sensors and am monitoring multiple doors and gates. I want everything to stay inside my own walls, without using any external TTS system after creating the initial recordings.

Did you have any success in your project? I’m sure parts of what you needed might help me out.

I’ve had success! I’m using a DFPlayer mini MP3 player module, with the speaker connected directly to it. I also have a little class-D audio amplifier, but the DFPlayer is plenty loud, so I never used it. I started with fully-formed sentences as the MP3 files, but realized I want more flexibility, and managing the sentences required a lot of notes and memorization. It sounds a little bit robotic, but I’m happy to take that with the flexibility I’ve managed and the ability to keep it all inside the walls of my house. Google and Amazon have no business knowing when I’m home, or when I receive visitors.

I taught my house to speak. I played around with it. Now, I want it to shut up… Balance will take a little while to find. While I was in the middle of creating this, I had a roommate move in, which increased the frequency of notifications about five-fold. I REALLY love the “garbage day tomorrow” spoken notice I get at 11PM on Tuesdays - I’ve always had a hard time remembering garbage day…

I know, the current HA version doesn’t use services anymore; they’re called actions now. I’m not ready to break other configurations yet, so I haven’t updated.

I used an online text-to-speech site to generate the audio that makes up the vocabulary, and then split each word into its own MP3 file, named 0001.mp3 to 0300.mp3. The DFPlayer requires that all files be … They’re all on the MicroSD card on the DFPlayer, and I’m not even close to 1GB of audio files. The YAML file needs an array that identifies each file by the word it represents. I had to split my 300-word vocabulary into two arrays, because the ESP would enter a panic loop if either array was more than 200 words long. Even with a 300-word vocabulary compiled in, I want to say that I’m using less than half the available flash space on the ESP.

I created a service on the ESP that can be called by HA with a real sentence being passed. The script executed on the ESP parses the “sentence,” looks up the index of each word and then plays the MP3 file named per the index in the array. I added a few tones to let me know a statement was on the way, and used different ones to give the more important ones more of a “kick.” If you specify MP3 file number 0000, it’ll play a random file, so element zero in the first array is just a placeholder. I asked a handful of friends for words to add to the vocabulary, so it drifts into the weeds at times.

On the ESP itself, I can simply execute the script with a full sentence such as

      - script.execute: 
          id: make_speech
          full_speech: 'tonewater laundry room water detected'
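
As a rough sketch of where that call could live, a sensor on the same ESP might trigger the announcement directly. The GPIO pin, sensor name, and device class below are placeholders chosen for illustration and are not part of the original config; only `make_speech` and the sentence come from the post.

```yaml
binary_sensor:
  - platform: gpio
    pin: GPIO4                       # hypothetical pin, adjust to your wiring
    name: "Laundry Room Water"       # hypothetical sensor name
    device_class: moisture
    on_press:                        # fires when the sensor turns on
      then:
        - script.execute:
            id: make_speech
            full_speech: 'tonewater laundry room water detected'
```

Handling the trigger on the ESP itself means the announcement should still play even if Home Assistant is unreachable.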

From Home Assistant, I can call the service just as easily:

  - service: esphome.talker_hall_make_announcement
    data:
      full_speech: 'tonemotion front door motion detected'
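
For example, the "garbage day tomorrow" reminder mentioned earlier (11PM on Tuesdays) could be wired up along these lines. This is only a sketch: the automation alias is made up, and it uses the older `service:` syntax to match the rest of the thread (newer HA versions call this `action:`).

```yaml
automation:
  - alias: "Garbage day reminder"    # hypothetical automation name
    trigger:
      - platform: time
        at: "23:00:00"
    condition:
      - condition: time
        weekday:
          - tue
    action:
      - service: esphome.talker_hall_make_announcement
        data:
          # every word must exist in the vocabulary arrays on the ESP
          full_speech: 'garbage day tomorrow'
```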


globals:
  - id: words
    type: std::array<std::string, 200>
    initial_value: '{
      "nevermore", "tonedub", "tonetrip", "tonequad", "tonewater",
      "tonedoorclose", "tonedooropen", "tonemotion", "tonenotfound", "toneone",
      "tonetwo", "tonethree", "tonefour", "tonefive", "tonesix",
      "toneseven", "toneeight", "tonenine", "toneten", "one",
      "two", "three", "four", "five", "six",
      "seven", "eight", "nine", "ten", "eleven",
      "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
      "seventeen", "eighteen", "nineteen", "twenty", "thirty",
      "fourty", "fifty", "sixty", "seventy", "eighty",
      "ninety", "hundred", "thousand", "monday", "tuesday",
      "wednesday", "thursday", "friday", "saturday", "sunday",
      "january", "february", "march", "april", "may",
      "june", "july", "august", "september", "october",
      "november", "december", "AM", "PM", "day",
      "days", "hour", "hours", "minute", "minutes",
      "second", "seconds", "today", "tomorrow", "yesterday",
      "above", "after", "and", "anniversary", "appointment",
      "back", "bath", "bay", "be", "bed",
      "been", "before", "begin", "begun", "below",
      "birthday", "bottom", "breakfast", "cabinet", "car",
      "ceiling", "clean", "civic", "closed", "dance",
      "daytime", "dentist", "detected", "dining", "dinner",
      "disabled", "dish", "do", "doctor", "door",
      "dryer", "during", "east", "enabled", "end",
      "ended", "family", "finished", "floor", "flower",
      "followup", "for", "front", "garage", "garbage",
      "garden", "gate", "get", "goodbye", "guest",
      "happy", "has", "have", "high", "home",
      "in", "inner", "inside", "interested", "interview",
      "intruding", "is", "jeep", "kitchen", "last",
      "laundry", "leave", "light", "living", "locked",
      "low", "lunch", "machine", "master", "media",
      "medium", "mid", "mode", "motion", "next",
      "nighttime", "noon", "north", "not", "notice",
      "now", "off", "office", "on", "open",
      "outer", "outside", "path", "pathway", "patio",
      "pick", "pie", "porch", "radio", "ready",
      "recharge", "recycle", "refrigerator", "remain", "remaining",
      "replace", "roof", "room", "run", "running",
      "schedule", "set", "shed", "shelf", "side"
      }'

  - id: wordstoo
    type: std::array<std::string, 141>
    initial_value: '{
      "south", "space", "star", "start", "started",
      "stop", "stopped", "sunrise", "sunset", "talking",
      "temperature", "time", "top", "trespassing", "truck",
      "turned", "unlocked", "up", "update", "wall",
      "warrant", "wars", "was", "washer", "washing",
      "waste", "water", "welcome", "west", "will",
      "window", "within", "work", "yard", "you",
      "your", "yours",
      << NAMES HERE >>,
      "wet", "dry", "a", "afternoon", "am",
      "are", "arrived", "at", "away", "call",
      "calling", "contact", "countdown", "cycle", "darryl",
      "departed", "feed", "give", "go", "gone",
      "goodbye", "here", "i", "if", "internal",
      "left", "load", "loaded", "message", "microwave",
      "morning", "night", "nimbus", "note", "please",
      "police", "property", "ready", "right", "take",
      "thank", "thanks", "the", "to", "treat",
      "unload", "unloaded", "welcome", "when", "will",
      "written", "assistance", "baby", "cabinet", "cat",
      "cleared", "dog", "expired", "food", "give",
      "help", "longer", "me", "more", "movie",
      "music", "needed", "no", "over", "prefer",
      "preference", "quiet", "request", "requested", "save",
      "saved", "show", "shut", "silence", "silent",
      "stairs", "stereo", "television", "tell", "time",
      "timer", "treat", "treats", "under", "want"
      }'

  - id: input_string
    type: std::string
    restore_value: no
    initial_value: '""'  # Default value

  - id: word_index
    type: int
    restore_value: no
    initial_value: '9999'  # Default value

uart:
  tx_pin: ${pin_tx}
  rx_pin: ${pin_rx}
  baud_rate: 9600
  id: test_uart

dfplayer:
  uart_id: test_uart
  id: dfplayer_talker_hall
  on_finished_playback:
    then:
      logger.log: 'Playback finished event'

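api:
  encryption:
    key: !secret api_key
  services:
    - service: make_announcement
      variables:
        full_speech: string
      then:
        - logger.log: "make_announcement called"
        - lambda: |-
            id(make_speech).execute(full_speech);  // forward the sentence to the make_speech script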

script:
  # Takes a sentence, splits into words, identifies the track number and sends to process_word
  # The words are queued by process_word to ensure each plays completely before the next begins
  - id: make_speech
    parameters:
      full_speech: string
    mode: queued
    then:
      - lambda: |-
            ESP_LOGI("Making Speech", "Speech: %s", full_speech.c_str());
            std::string current_word;
            std::string speech = full_speech + " "; // Add space to handle the last word
            for (size_t i = 0; i < speech.length(); i++) {
              if (speech[i] == ' ') {
                // A complete word has been found; look it up in the first array
                bool found = false;
                for (size_t j = 0; j < id(words).size(); j++) {
                  if (id(words)[j] == current_word) {
                    found = true;
                    id(word_index) = j;
                    // Queue the matching track; process_word plays it
                    ESP_LOGI("Announcement Maker", "Playing word: %s at index: %d", current_word.c_str(), j);
                    id(process_word).execute(j + 1);
                    break; // Exit the loop once the item is found
                  }
                }
                if (!found) {
                  // Not in the first array, so try the second one
                  for (size_t j = 0; j < id(wordstoo).size(); j++) {
                    if (id(wordstoo)[j] == current_word) {
                      found = true;
                      id(word_index) = j + 201;
                      // Queue the matching track; process_word plays it
                      ESP_LOGI("Announcement Maker", "Playing word: %s at index: %d", current_word.c_str(), j + 200);
                      id(process_word).execute(j + 201);
                      break; // Exit the loop once the item is found
                    }
                  }
                }
                if (!found) {
                  ESP_LOGI("Announcement Maker", "Word '%s' not found.", current_word.c_str());
                }
                current_word.clear(); // Clear the current word for the next iteration
              } else {
                current_word += speech[i]; // Build the current word
              }
            }

  # Takes the track number of the word passed by make_speech and plays it, blocking until the whole word has been played
  - id: process_word
    parameters:
      word: int
    mode: queued
    then:
      - lambda: |-
            if (word > 200){
              ESP_LOGI("Process Word", "Playing word: %d, ==> %s", word - 1, id(wordstoo)[word - 201].c_str() );
              word = word - 1;
            } else {
              ESP_LOGI("Process Word", "Playing word: %d, ==> %s", word, id(words)[word - 1].c_str() );
            }
      - dfplayer.play:
          file: !lambda 'return word;'
      - delay: 100ms
      - wait_until:
          condition:
            lambda: return id(dfplayer_talker_hall).is_playing() == false;

So I recently played with I2S audio using a small amp and speaker. I couldn’t actually get it to play any MP3 files, but that wasn’t my intended goal. I plan to set up more around the house, but for now I only have one at the front door. The idea is to send notifications that people at my door would hear.

If you are looking to stay 100% local, you could use Piper to run text-to-speech. There are different voices to choose from. I live in Mexico, and my setup is in Spanish. The Spanish voices didn’t sound very good; one was very funny because it sounded like he was very, very high when speaking. I tried using Google Translate for text-to-speech in Spanish, and that is what my wife asked me to stay with. For a minute I kept thinking to myself “but it requires the cloud”, but then I realized all 3 of us in the house already use Android phones, so Google pretty much knows everything we are doing.
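
If you want to try the Piper route, a service call along these lines is one way to do it. This is a sketch, not a tested setup: it assumes Piper is already installed and shows up as `tts.piper`, and the media player entity ID and message are made-up placeholders.

```yaml
service: tts.speak
target:
  entity_id: tts.piper                                     # assumed Piper TTS entity
data:
  media_player_entity_id: media_player.front_door_speaker  # hypothetical speaker entity
  message: "Someone is at the front door"
```

The voice is selected in the Piper configuration, so switching between the Spanish voices should not require changing the call itself.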

A bit off topic, but about 3 months back I did try to set up my own local personal assistant using M5Stacks. Not nearly as good as Alexa, but it was “usable” by my standards in English. However, when I tried to set it up in Spanish, it barely worked. If I remember correctly, the main issue was speech-to-text, which I think is run by Whisper (I might be wrong).