Teaching your HA Voice PE to Whisper and Shout - Smart Volume using Built-in Mics

Jaapp · January 20, 2025, 10:19pm

Have you ever been startled by your Voice Assistant being way too loud in the quiet of the night? Or struggled to hear it over the chaos of a busy household? Well, I’ve added a nifty dynamic volume control that automatically adjusts based on ambient sound levels!

The solution is implemented entirely through ESPHome Device Builder. You’ll need to take over your Home Assistant Voice PE device (install the ESPHome Device Builder add-on and find your device there) and use the YAML below instead of the default configuration.

Sensors

Using the existing microphone, I’ve implemented three sensors:

Peak detection (raw values)
Linear scaling (percentage)
Exponential scaling (puts quieter levels in a more useful range)
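
To make the two scaling steps concrete, here’s a small standalone C++ sketch of the same math, outside of ESPHome. The constants (0.000024, 0.9 and the 0.4 exponent) are copied from the sensor lambdas in the YAML further down; the function names linear_percent and exp_percent are made up for illustration, and the moving-average filter on the linear sensor is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Constants copied from the sensor lambdas in the YAML.
constexpr float kMinPeak = 0.000024f;
constexpr float kMaxPeak = 0.9f;
constexpr float kExponent = 0.4f;

// Hypothetical helper: map a normalized peak (0..1) to a linear 0..100 %.
float linear_percent(float peak) {
  if (peak <= kMinPeak) return 0.0f;
  float pct = (peak - kMinPeak) / (kMaxPeak - kMinPeak) * 100.0f;
  return std::min(std::max(pct, 0.0f), 100.0f);
}

// Hypothetical helper: apply the x^0.4 curve to a linear percentage.
// Expects a value in 0..100, which linear_percent() guarantees.
float exp_percent(float linear_pct) {
  assert(linear_pct >= 0.0f && linear_pct <= 100.0f);
  return std::pow(linear_pct / 100.0f, kExponent) * 100.0f;
}
```

With this curve a linear reading of 1% comes out at roughly 15.8%, and 10% at roughly 39.8%, which is why the quiet end of the range becomes much easier to work with.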

Dynamic Volume Control

The system adjusts in real-time to the noise level in your room:

Anchor Volume: Your base volume level when it’s quiet
Strength: How aggressively it scales up in noisy situations
Simple toggle to enable/disable
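
As a rough sketch of how Anchor and Strength interact, this standalone C++ function mirrors the calculation in the update_dynamic_volume script from the YAML below. The function name dynamic_volume is only for illustration, and the "only update when the change exceeds 0.01" check from the script is left out.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the script's math:
//   volume = anchor * (1 + ambient_level * strength), clamped to 0..1
// ambient_pct is the exponential sound level sensor value (0..100 %).
float dynamic_volume(float anchor, float strength, float ambient_pct) {
  assert(ambient_pct >= 0.0f && ambient_pct <= 100.0f);
  float level = ambient_pct / 100.0f;    // convert to 0..1
  float gain = 1.0f + level * strength;  // 1.0 when the room is silent
  return std::min(std::max(anchor * gain, 0.0f), 1.0f);
}
```

With the defaults (Anchor 0.3, Strength 1.0) this gives 0.30 in a silent room, 0.45 at a 50% sound level and 0.60 at 100%. At the maximum settings (Anchor 0.85, Strength 5) the result is simply capped at 1.0.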

Bonus feature: It turns out this is a pretty decent proxy for presence detection - at least if you have young kids! The ambient sound level spikes whenever there’s activity in the room.

Note: This is still a work in progress - I’m still tweaking the scaling to find the sweet spot, but it’s already quite useful. No more heart attacks when asking for the weather at 3 AM!

Installation

Here’s the complete YAML - just copy this into your ESPHome Device Builder configuration:

substitutions:
  name: home-assistant-voice-REPLACE_WITH_YOUR_OWN
  friendly_name: Home Assistant Voice REPLACE_WITH_YOUR_OWN

packages:
  Nabu Casa.Home Assistant Voice PE: github://esphome/home-assistant-voice-pe/home-assistant-voice.yaml

globals:
  - id: dynamic_volume_enabled
    type: bool
    restore_value: yes
    initial_value: 'false'
  - id: last_dynamic_volume_calculation
    type: float
    restore_value: no
    initial_value: '0'

number:
  - platform: template
    name: "Dyn. Vol. Anchor"
    id: dynamic_volume_anchor
    min_value: 0.1
    max_value: 0.85
    step: 0.05
    initial_value: 0.3
    restore_value: true
    optimistic: true
    icon: "mdi:volume-high"
    unit_of_measurement: "x"
    entity_category: config
    
  - platform: template 
    name: "Dyn. Vol. Strength"
    id: dynamic_volume_strength
    min_value: 0
    max_value: 5
    step: 0.1
    initial_value: 1.0
    restore_value: true
    optimistic: true
    icon: "mdi:volume-vibrate"
    unit_of_measurement: "x"
    entity_category: config

switch:
  - platform: template
    name: "Dynamic Volume"
    id: dynamic_volume_switch
    icon: "mdi:volume-vibrate"
    optimistic: true
    restore_mode: RESTORE_DEFAULT_OFF
    entity_category: config
    turn_on_action:
      - lambda: id(dynamic_volume_enabled) = true;
      - script.execute: update_dynamic_volume
    turn_off_action:
      - lambda: |-
          id(dynamic_volume_enabled) = false;
          // Reset to anchor volume when disabled
          id(nabu_media_player)
            ->make_call()
            .set_volume(id(dynamic_volume_anchor).state)
            .perform();

sensor:
  # Peak amplitude sensor
  - platform: template
    name: "Ambient Sound Peak"
    id: ambient_sound_peak
    unit_of_measurement: "max"
    accuracy_decimals: 6
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    lambda: |-
      static const char *const TAG = "ambient_sound";
      static const size_t INPUT_BUFFER_SIZE = 512;
      static int16_t input_buffer[INPUT_BUFFER_SIZE];
      
      // Don't measure when media is playing to avoid feedback
      if (id(nabu_media_player)->state != media_player::MEDIA_PLAYER_STATE_IDLE) {
        ESP_LOGD(TAG, "Media player not idle.");
        return id(ambient_sound_peak).state; // Return previous value
      }
      
      // Check if micro_wake_word is ready
      if (!id(mww).is_ready()) {
        ESP_LOGD(TAG, "Micro wake word not ready yet");
        return 0;
      }

      // Start mic if needed
      if (!id(asr_mic)->is_running()) {
        id(asr_mic)->start();
        delay(50); // Give mic time to start
      }

      size_t bytes_read = id(asr_mic)->read(
        input_buffer, 
        INPUT_BUFFER_SIZE * sizeof(int16_t),
        0
      );
      
      if (bytes_read == 0) {
        memset(input_buffer, 0, INPUT_BUFFER_SIZE * sizeof(int16_t));
        ESP_LOGD(TAG, "No samples read from microphone");
        return 0;
      }
      
      size_t samples_read = bytes_read / sizeof(int16_t);
      
      // Find maximum absolute value
      float max_value = 0;
      for (size_t i = 0; i < samples_read; i++) {
        float normalized = abs(input_buffer[i]) / 32768.0f;
        max_value = max(max_value, normalized);
      }
      
      ESP_LOGD(TAG, "Max amplitude: %.6f", max_value);
      return max_value;

  # Linear scaling sensor using peak amplitude as input
  - platform: template
    name: "Ambient Sound Level"
    id: ambient_sound_level
    unit_of_measurement: "%"
    accuracy_decimals: 1
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    filters:
      - sliding_window_moving_average:
          window_size: 5
    lambda: |-
      float peak = id(ambient_sound_peak).state;
      if (std::isnan(peak)) {
        return 0;
      }
      
      // Simple linear scaling between min/max peak values
      const float MIN_PEAK = 0.000024f;
      const float MAX_PEAK = 0.9f;
      
      float percentage = 0;
      if (peak > MIN_PEAK) {
        percentage = (peak - MIN_PEAK) / (MAX_PEAK - MIN_PEAK) * 100;
        percentage = clamp(percentage, 0.0f, 100.0f);
      }
      
      ESP_LOGD("ambient_sound", "Linear Percentage: %.1f%%", percentage);
      return percentage;

  # Exponential scaling sensor
  - platform: template
    name: "Ambient Sound Level Exp"
    id: ambient_sound_level_exp
    unit_of_measurement: "%"
    accuracy_decimals: 1
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    lambda: |-
      float linear_value = id(ambient_sound_level).state;
      if (std::isnan(linear_value)) {
        return 0;
      }
      
      // Apply exponential curve
      // Using x^0.4 which gives more resolution to lower values while
      // still maintaining a reasonable curve
      constexpr float exp = 0.4f;
      float percentage = pow(linear_value / 100.0f, exp) * 100.0f;
      
      ESP_LOGD("ambient_sound_exp", "Exponential scaling: %.1f%% -> %.1f%%", 
               linear_value, percentage);
      
      return percentage;

script:
  - id: update_dynamic_volume
    mode: single
    then:
      - lambda: |-
          if (!id(dynamic_volume_enabled)) return;
          
          float ambient_level = id(ambient_sound_level_exp).state;
          if (std::isnan(ambient_level)) return;
          
          float anchor = id(dynamic_volume_anchor).state;
          float strength = id(dynamic_volume_strength).state;
          
          // Convert ambient level to 0-1 range
          float normalized_level = ambient_level / 100.0f;
          
          // Calculate gain factor based on ambient level and strength
          float gain = 1.0f + (normalized_level * strength);
          
          // Calculate new volume 
          float new_volume = anchor * gain;
          
          // Clamp to valid range
          new_volume = clamp(new_volume, 0.0f, 1.0f);
          
          // Only update if changed significantly
          if (std::abs(new_volume - id(last_dynamic_volume_calculation)) > 0.01f) {
            id(last_dynamic_volume_calculation) = new_volume;
            id(nabu_media_player)
              ->make_call()
              .set_volume(new_volume)
              .perform();
            
            ESP_LOGD("dynamic_volume", "Ambient: %.1f%%, Gain: %.2f, New Volume: %.2f", 
                     ambient_level, gain, new_volume);
          }

interval:
  - interval: 1s
    then:
      - script.execute: update_dynamic_volume

logger:
  level: DEBUG
  logs:
    dynamic_volume: DEBUG

esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}

api:
  encryption:
    key: YOUR_OWN_SUPER_SECRET_API_KEY

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

After installing (be patient: building the firmware for the first time can take a long time), you’ll find the new controls in your device’s Configuration panel. Play around with the Anchor and Strength values to find what works best for your space!

Initially, I tried using RMS (Root Mean Square) to measure the ambient sound level, but I found that peak detection correlates much better with what we actually experience as “noise level”. The peaks in audio better represent those moments when you think “wow, it’s noisy in here” - exactly when you want your assistant to speak up!
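
For anyone who wants to compare the two measures themselves, here’s a minimal standalone C++ sketch of both on a buffer of 16-bit samples. The function names peak_level and rms_level are made up for illustration; the peak version follows the loop in the Ambient Sound Peak lambda, while the RMS version is just the textbook definition and not necessarily what was tried originally.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: largest absolute sample, normalized to 0..1.
float peak_level(const std::int16_t *samples, std::size_t n) {
  assert(samples != nullptr || n == 0);
  float peak = 0.0f;
  for (std::size_t i = 0; i < n; i++) {
    // Widen to int first so -32768 does not overflow when negated.
    float v = std::abs(static_cast<int>(samples[i])) / 32768.0f;
    if (v > peak) peak = v;
  }
  return peak;
}

// Hypothetical helper: root mean square of the normalized samples.
float rms_level(const std::int16_t *samples, std::size_t n) {
  assert(samples != nullptr || n == 0);
  if (n == 0) return 0.0f;
  double sum_sq = 0.0;
  for (std::size_t i = 0; i < n; i++) {
    double v = samples[i] / 32768.0;
    sum_sq += v * v;
  }
  return static_cast<float>(std::sqrt(sum_sq / n));
}
```

As a constructed example (not a real measurement): a 512-sample buffer that is silent except for one sample at half scale has a peak of 0.5 but an RMS of only about 0.022, because the average is dominated by the quiet samples. Short, loud events like a shout or a slammed door show up much more strongly in the peak value.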

What do you think? This seems like a great function to have in the device by default. It’s practically free.