Have you ever been startled by your Voice Assistant being way too loud in the quiet of the night? Or struggled to hear it over the chaos of a busy household? Well, I’ve added a nifty dynamic volume control that automatically adjusts based on ambient sound levels!
The solution is implemented entirely through ESPHome Device Builder. You’ll need to take over your Home Assistant Voice PE device (install the ESPHome Device Builder add-on and find your device there) and use the YAML below instead of the default configuration.
Sensors
Using the existing microphone, I’ve implemented three sensors:
- Peak detection (raw values)
- Linear scaling (percentage)
- Exponential scaling (puts quieter levels in a more useful range)
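To see how the three stages relate outside of the YAML, here is a minimal standalone C++ sketch. The function names (`peak_amplitude`, `linear_percent`, `curved_percent`) are made up for illustration and are not part of ESPHome; the constants are the ones used in the config further down.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stage 1: peak detection. Largest absolute sample, normalized to 0.0..1.0.
float peak_amplitude(const int16_t *samples, size_t count) {
  assert(samples != nullptr || count == 0);
  float peak = 0.0f;
  for (size_t i = 0; i < count; i++) {
    // Widen to int so that -32768 can be negated safely.
    float normalized = std::abs(static_cast<int>(samples[i])) / 32768.0f;
    peak = std::max(peak, normalized);
  }
  return peak;
}

// Stage 2: linear scaling of the peak onto 0..100 %.
float linear_percent(float peak) {
  const float MIN_PEAK = 0.000024f;  // same constants as the YAML config
  const float MAX_PEAK = 0.9f;
  if (peak <= MIN_PEAK) return 0.0f;
  float pct = (peak - MIN_PEAK) / (MAX_PEAK - MIN_PEAK) * 100.0f;
  return std::min(std::max(pct, 0.0f), 100.0f);
}

// Stage 3: the "exponential" curve, x^0.4, which stretches the quiet end.
float curved_percent(float linear_pct) {
  return std::pow(linear_pct / 100.0f, 0.4f) * 100.0f;
}
```

With this curve a linear reading of 10% comes out at roughly 40%, while 0% and 100% stay where they are. That is what gives quiet rooms more usable resolution.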
Dynamic Volume Control
The system adjusts in real-time to the noise level in your room:
- Anchor Volume: Your base volume level when it’s quiet
- Strength: How aggressively it scales up in noisy situations
- Simple toggle to enable/disable
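As a rough illustration of how Anchor and Strength interact, here is the volume calculation from the script pulled out into a standalone C++ function. The name `dynamic_volume` is hypothetical; the math mirrors the lambda in the YAML below.

```cpp
#include <algorithm>
#include <cassert>

// anchor:      base volume in a quiet room (0.0..1.0)
// strength:    how strongly ambient noise raises the volume
// ambient_pct: ambient sound level in percent (0..100)
float dynamic_volume(float anchor, float strength, float ambient_pct) {
  assert(ambient_pct >= 0.0f && ambient_pct <= 100.0f);
  float normalized = ambient_pct / 100.0f;    // 0..1
  float gain = 1.0f + normalized * strength;  // 1.0 when the room is silent
  // The media player expects a volume between 0.0 and 1.0.
  return std::min(std::max(anchor * gain, 0.0f), 1.0f);
}
```

With the defaults (Anchor 0.3, Strength 1.0) a silent room gives 0.3, an ambient level of 50% gives 0.45, and 100% gives 0.6. At the extreme settings (Anchor 0.85, Strength 5) the result is clamped to 1.0.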
Bonus feature: It turns out this is a pretty decent proxy for presence detection - at least if you have young kids! The ambient sound level spikes whenever there’s activity in the room.
Note: This is still a work in progress - I’m still tweaking the scaling to find the sweet spot, but it’s already quite useful. No more heart attacks when asking for the weather at 3 AM!
Installation
Here’s the complete YAML - just copy this into your ESPHome Device Builder configuration:
substitutions:
  name: home-assistant-voice-REPLACE_WITH_YOUR_OWN
  friendly_name: Home Assistant Voice REPLACE_WITH_YOUR_OWN
packages:
  Nabu Casa.Home Assistant Voice PE: github://esphome/home-assistant-voice-pe/home-assistant-voice.yaml
globals:
  - id: dynamic_volume_enabled
    type: bool
    restore_value: yes
    initial_value: 'false'
  - id: last_dynamic_volume_calculation
    type: float
    restore_value: no
    initial_value: '0'
number:
  - platform: template
    name: "Dyn. Vol. Anchor"
    id: dynamic_volume_anchor
    min_value: 0.1
    max_value: 0.85
    step: 0.05
    initial_value: 0.3
    restore_value: true
    optimistic: true
    icon: "mdi:volume-high"
    unit_of_measurement: "x"
    entity_category: config
  - platform: template
    name: "Dyn. Vol. Strength"
    id: dynamic_volume_strength
    min_value: 0
    max_value: 5
    step: 0.1
    initial_value: 1.0
    restore_value: true
    optimistic: true
    icon: "mdi:volume-vibrate"
    unit_of_measurement: "x"
    entity_category: config
switch:
  - platform: template
    name: "Dynamic Volume"
    id: dynamic_volume_switch
    icon: "mdi:volume-vibrate"
    optimistic: true
    restore_mode: RESTORE_DEFAULT_OFF
    entity_category: config
    turn_on_action:
      - lambda: id(dynamic_volume_enabled) = true;
      - script.execute: update_dynamic_volume
    turn_off_action:
      - lambda: |-
          id(dynamic_volume_enabled) = false;
          // Reset to anchor volume when disabled
          id(nabu_media_player)
            ->make_call()
            .set_volume(id(dynamic_volume_anchor).state)
            .perform();
sensor:
  # Peak amplitude sensor
  - platform: template
    name: "Ambient Sound Peak"
    id: ambient_sound_peak
    unit_of_measurement: "max"
    accuracy_decimals: 6
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    lambda: |-
      static const char *const TAG = "ambient_sound";
      static const size_t INPUT_BUFFER_SIZE = 512;
      static int16_t input_buffer[INPUT_BUFFER_SIZE];

      // Don't measure when media is playing to avoid feedback
      if (id(nabu_media_player)->state != media_player::MEDIA_PLAYER_STATE_IDLE) {
        ESP_LOGD(TAG, "Media player not idle.");
        return id(ambient_sound_peak).state;  // Return previous value
      }

      // Check if micro_wake_word is ready
      if (!id(mww).is_ready()) {
        ESP_LOGD(TAG, "Micro wake word not ready yet");
        return 0;
      }

      // Start mic if needed
      if (!id(asr_mic)->is_running()) {
        id(asr_mic)->start();
        delay(50);  // Give mic time to start
      }

      size_t bytes_read = id(asr_mic)->read(
        input_buffer,
        INPUT_BUFFER_SIZE * sizeof(int16_t),
        0
      );

      if (bytes_read == 0) {
        memset(input_buffer, 0, INPUT_BUFFER_SIZE * sizeof(int16_t));
        ESP_LOGD(TAG, "No samples read from microphone");
        return 0;
      }

      size_t samples_read = bytes_read / sizeof(int16_t);

      // Find maximum absolute value
      float max_value = 0;
      for (size_t i = 0; i < samples_read; i++) {
        float normalized = abs(input_buffer[i]) / 32768.0f;
        max_value = max(max_value, normalized);
      }

      ESP_LOGD(TAG, "Max amplitude: %.6f", max_value);
      return max_value;
  # Linear scaling sensor using peak amplitude as input
  - platform: template
    name: "Ambient Sound Level"
    id: ambient_sound_level
    unit_of_measurement: "%"
    accuracy_decimals: 1
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    filters:
      - sliding_window_moving_average:
          window_size: 5
    lambda: |-
      float peak = id(ambient_sound_peak).state;
      if (std::isnan(peak)) {
        return 0;
      }

      // Simple linear scaling between min/max peak values
      const float MIN_PEAK = 0.000024f;
      const float MAX_PEAK = 0.9f;

      float percentage = 0;
      if (peak > MIN_PEAK) {
        percentage = (peak - MIN_PEAK) / (MAX_PEAK - MIN_PEAK) * 100;
        percentage = clamp(percentage, 0.0f, 100.0f);
      }

      ESP_LOGD("ambient_sound", "Linear Percentage: %.1f%%", percentage);
      return percentage;
  # Exponential scaling sensor
  - platform: template
    name: "Ambient Sound Level Exp"
    id: ambient_sound_level_exp
    unit_of_measurement: "%"
    accuracy_decimals: 1
    update_interval: 1s
    icon: "mdi:microphone-outline"
    state_class: "measurement"
    lambda: |-
      float linear_value = id(ambient_sound_level).state;
      if (std::isnan(linear_value)) {
        return 0;
      }

      // Apply exponential curve
      // Using x^0.4 which gives more resolution to lower values while
      // still maintaining a reasonable curve
      constexpr float exp = 0.4f;
      float percentage = pow(linear_value / 100.0f, exp) * 100.0f;

      ESP_LOGD("ambient_sound_exp", "Exponential scaling: %.1f%% -> %.1f%%",
               linear_value, percentage);
      return percentage;
script:
  - id: update_dynamic_volume
    mode: single
    then:
      - lambda: |-
          if (!id(dynamic_volume_enabled)) return;

          float ambient_level = id(ambient_sound_level_exp).state;
          if (std::isnan(ambient_level)) return;

          float anchor = id(dynamic_volume_anchor).state;
          float strength = id(dynamic_volume_strength).state;

          // Convert ambient level to 0-1 range
          float normalized_level = ambient_level / 100.0f;

          // Calculate gain factor based on ambient level and strength
          float gain = 1.0f + (normalized_level * strength);

          // Calculate new volume
          float new_volume = anchor * gain;

          // Clamp to valid range
          new_volume = clamp(new_volume, 0.0f, 1.0f);

          // Only update if changed significantly.
          // std::fabs avoids accidentally picking the integer abs() overload.
          if (std::fabs(new_volume - id(last_dynamic_volume_calculation)) > 0.01f) {
            id(last_dynamic_volume_calculation) = new_volume;
            id(nabu_media_player)
              ->make_call()
              .set_volume(new_volume)
              .perform();

            ESP_LOGD("dynamic_volume", "Ambient: %.1f%%, Gain: %.2f, New Volume: %.2f",
                     ambient_level, gain, new_volume);
          }
interval:
  - interval: 1s
    then:
      - script.execute: update_dynamic_volume
logger:
  level: DEBUG
  logs:
    dynamic_volume: DEBUG
esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}
api:
  encryption:
    key: YOUR_OWN_SUPER_SECRET_API_KEY
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
After installing (be patient, building the firmware for the first time can take a long time), you’ll find the new controls in your device’s Configuration panel. Play around with the Anchor and Strength values to find what works best for your space!
Initially, I tried using RMS (Root Mean Square) to measure the ambient sound level, but I found that peak detection correlates much better with what we actually experience as “noise level”. The peaks in audio better represent those moments when you think “wow, it’s noisy in here” - exactly when you want your assistant to speak up!
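One way to see the difference, assuming a mostly quiet buffer with a single short spike: RMS averages the spike away, while the peak keeps it. The sketch below is standalone C++ for illustration only; `peak_level` and `rms_level` are hypothetical helpers, not the code I originally tried on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Largest absolute sample, normalized to 0.0..1.0.
float peak_level(const int16_t *samples, size_t count) {
  assert(samples != nullptr || count == 0);
  float peak = 0.0f;
  for (size_t i = 0; i < count; i++) {
    float normalized = std::abs(static_cast<int>(samples[i])) / 32768.0f;
    peak = std::max(peak, normalized);
  }
  return peak;
}

// Root mean square of the normalized samples.
float rms_level(const int16_t *samples, size_t count) {
  assert(samples != nullptr || count == 0);
  if (count == 0) return 0.0f;
  double sum_sq = 0.0;
  for (size_t i = 0; i < count; i++) {
    double s = samples[i] / 32768.0;
    sum_sq += s * s;
  }
  return static_cast<float>(std::sqrt(sum_sq / count));
}
```

For 100 samples that are all zero except one at half scale (16384), `peak_level` returns 0.5 while `rms_level` returns only 0.05. A short bang or shout barely moves the RMS value.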
What do you think? Seems like a great function to have in the device by default. It’s practically free.