Hacking a Bluetooth Speaker into a Voice Assistant

I wanted to join the Year of the Voice, albeit a couple of years late! I have a number of Google Nest speakers around the house, but they can’t control everything that is managed by Home Assistant, plus I am slowly trying to move more to local services.

I had a few options

Not wanting things to be too easy, I thought I’d try the build-it-myself route. I could use my 3D printer to make an enclosure, but these can look a bit amateurish, and as I also needed a speaker I thought why not find a cheap portable speaker and hack that. I had a hunt on AliExpress and found the “Lenovo K3 Pro 5.0 Portable Bluetooth Speaker Stereo Surround Wireless Bluetooth Speakers Music Audio Player Loudspeaker” for USD$6 delivered. Done.

When it arrived I found it looked decent enough, and sounded pretty good as well.


Although initially I didn’t plan to cast music to it, maybe later down the track I will. The thing claims to have a 1200mAh battery in it, and indeed on opening it I found it had an 18650 1200mAh battery. Speaking of which, to open:

  1. Peel off the rubber plate over the buttons
  2. Loosen the three Phillips head screws - they are quite recessed, so you will need a long screwdriver to reach them
  3. The base and speaker cover can then be separated. Be very careful when removing things as a) the wires are very thin and delicate and b) it would be super easy to short circuit the battery and allow the magic smoke to escape
  4. The speaker rests inside the speaker enclosure - no clips or glue. Once assembled it is held in place by friction, with a foam block pressed between the battery and the speaker magnet
  5. The battery is held in by two clips - slowly pull it upwards on one side and then it can be pulled out from the other side
  6. The circuit board is held on by two Phillips head screws, and the USB port is held in with one screw

The battery will provide somewhere from 3.7V to 4.2V (maybe more, maybe less) depending on the state of charge. The button used to turn the speaker on/off has roughly 3.3V available constantly from the bottom left hand pin (looking at the board with the button and antenna facing you) and the top right hand pin. When the power is on, that voltage does drop slightly, but probably will still provide enough power for an ESP device.

Planning
So three options for powering the ESP:

  1. Grab 5V from the USB port, but that would then mean the ESP would not be available if the speaker is running off battery
  2. Grab 3.3V from the power button
  3. Grab the power directly from the battery - but that will need something to manage the power so would be slightly more complicated

I am currently assuming that this will be powered constantly, so I might eventually go with option #1, although maybe will try option #2 as it would be nice for it to be portable occasionally.

The speaker has a microphone - would like to use it, but it is actually located on the base of the speaker so not ideal. Might be better to use a separate microphone and put it either on one side of the speaker (after drilling a hole) or behind the speaker grill.

The speaker amplifier will need some reverse engineering that is currently beyond what I can be bothered worrying about. The speaker is a 4 ohm 5W job, so I might simply bypass the motherboard completely and wire the ESP up to the speaker via something like a MAX98357 - it should be able to pump over 3W into it at 5V. Actually, part of me is tempted to remove the battery & motherboard and just leave the speaker. Hmmm…

Actually, yep. Gonna do that. So the plan is finalised. Going to delete the motherboard and battery - that will free up more room for the ESP and anything else I might want to put in. Will just need to use tape or hot glue to keep the speaker firmly in place. The battery will not be wasted - I can use that for something else, for example an 18650 shield costs less than USD$3 and provides good backup power for an ESP device.
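As a rough sanity check on the "over 3W at 5V" figure for the MAX98357, here is a minimal sketch of the ideal output power into the 4 ohm speaker. It assumes a bridge-tied-load output that can swing the full supply voltage across the speaker and a clean sine wave, and it ignores losses in the amplifier, so treat the numbers as a ballpark only. The function name is just something I made up for illustration.

```python
import math

def btl_max_sine_power(supply_volts, load_ohms):
    """Ideal undistorted sine power for a bridge-tied-load amplifier.

    Assumption: the peak voltage across the speaker equals the supply
    voltage, and there are no losses in the output stage.
    """
    v_rms = supply_volts / math.sqrt(2)  # RMS of a sine with peak = supply
    return v_rms ** 2 / load_ohms

p_5v = btl_max_sine_power(5.0, 4.0)    # powered from the USB 5V rail
p_3v3 = btl_max_sine_power(3.3, 4.0)   # powered from a 3.3V rail instead

print(f"5V supply:   {p_5v:.2f} W")
print(f"3.3V supply: {p_3v3:.2f} W")
```

So under these assumptions roughly 3W is the ceiling at 5V, and running the amplifier from 3.3V would cut that to well under half.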

Would like to have some form of visual feedback when speaking to it, so will want at least two holes - one for the microphone and one for an LED. It is quite possible the glow through the speaker might be enough - will see!

As for the microphone, I found the existing microphone on the motherboard presses up against a rubber tube that goes to a hole in the base. Will try using that - if it doesn’t work then I will drill a hole in the side of the case.

Parts cost (in USD):
$6.00 speaker & enclosure
$3.40 Lolin D32
$2.40 MAX98357 amplifier
$0.25 LED
$0.10 wires & solder
Less credit for the battery - worth maybe $3 - total about $9.15, so still under $10
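For anyone wanting to check the sums, here is a quick sketch that adds up the prices listed above and subtracts the battery credit. The $3 credit is my rough guess at what the salvaged battery is worth, not a measured value.

```python
# Prices in USD, copied from the parts list above.
parts = {
    "speaker & enclosure": 6.00,
    "Lolin D32": 3.40,
    "MAX98357 amplifier": 2.40,
    "LED": 0.25,
    "wires & solder": 0.10,
}
battery_credit = 3.00  # assumed value of the salvaged 18650

gross = round(sum(parts.values()), 2)   # total before the credit
net = round(gross - battery_credit, 2)  # total after the credit

print(f"gross ${gross:.2f}, net ${net:.2f}")
```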

Assembly
Initially I wanted to use an ESP32-C3 Supermini Plus as it is super compact and has an RGB LED built in, and if you get one with the external antenna the coverage is excellent. But then I found out that the C3 only supports one I2S channel. So I can’t have both a speaker AND a microphone without complications I can live without. Dang. Changed to a Lolin D32 (I had one spare). As it’s much bigger it meant I had to directly solder wires to it rather than using the dupont cables I prefer when experimenting, but what the hey. (At a pinch it might have just fitted using dupont, but I might later add in extra sensors so the extra space gained is worth the hassle).

I wired everything up - ran power from the micro USB port to the ESP32, cabled up the microphone (first removing the pins on it), amplifier, and LED, then checked that it all worked. (Note that depending on the ESP you use, you might be able to line up the ESP USB socket with the hole itself and/or you might want to replace the micro USB with a USB-C port.) Then I hot glued (not my proudest moment, but it does the job) the microphone so the hole on the board lined up with the hole in the base, drilled a hole for the LED (the case looks like metal, but is just plastic) and glued that in, then shoved everything else inside. Even with the larger Lolin D32 there was stacks of room left for future sensors like temperature, millimeter wave, etc. I haven’t permanently fixed the speaker in place as I thought I might need to move the microphone etc - will do that later.


Again, not my proudest work - I will tidy up the cabling etc at some point. As an aside, potentially could mount the microphone and LED into the top of the speaker - there is probably just enough room between the cone of the speaker and the grill - but I took the easy way out for the moment.

In hindsight maybe I should have tried to centre the LED under the “k”… ah well… maybe I can put another one in and give it eyes. :wink:

How does it work? Not bad at all. The microphone is surprisingly sensitive - it can pick up commands from across a large room even though the microphone is under the speaker itself. Volume is decent enough, and the speaker sounds good. The only issues I have are really not to do with it but rather the backend. It struggles to accurately understand words. It mis-hears things like “lamp” as “lab”, or joins words together. For example, to get the command “turn on Richard Lamp” to be semi reliable I needed to add in extra aliases under settings/voice/expose/entity such as “richardlamp”, “rigid lamp”, “rigidlamp”, “rigid lab”, “rigidlab”, “Richard’s lamp”, and “Richard’s lab”.
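Typing out every spaced and joined combination by hand gets tedious, so here is a minimal sketch of how a candidate alias list could be generated from the mishearings for each word. The helper name and the word lists are just for illustration, it is not part of Home Assistant, and the results would still need to be pasted into the entity's alias list by hand.

```python
from itertools import product

def alias_variants(word_options):
    """Return spaced and joined combinations of per-word mishearings.

    word_options is a list with one inner list per word position, each
    holding the ways that word tends to be heard.
    """
    variants = []
    for combo in product(*word_options):
        # Speech to text sometimes splits the words and sometimes joins them.
        for candidate in (" ".join(combo), "".join(combo)):
            if candidate not in variants:
                variants.append(candidate)
    return variants

aliases = alias_variants([["richard", "rigid"], ["lamp", "lab"]])
for alias in aliases:
    print(alias)
```

Not every combination will be worth keeping, so it is probably best to only add the ones that actually show up in the voice debug logs.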

Slightly annoying, but I can cope with that by adding in extra aliases for the devices that are frequently used. Another slightly annoying thing is that the i2s_audio media_player is built on the Arduino framework, and is currently not supported on ESP-IDF. This matters because “ESP-IDF is needed to include an audio library called ESP_ADF used in our voice assistant”. There are some ways to get around this (eg GitHub - gnumpi/esphome_audio: Custom audio components for ESPHome), but for now I won’t be casting music to the speaker.

More of an issue is that it frequently stutters on playback - this apparently may be fixed by tweaking Piper and/or moving to OpenAI (as I don’t want to use external services, I might try Llama, which in turn might help with the voice recognition accuracy). For now though it is still a work in progress, and the Google Nest speakers are staying.

Oh, as an aside, I enabled BLE tracking on the thing but found that memory usage was too high and caused issues - so I had to disable that. It may work on different ESP32 devices that have more memory available, but not the one I am using.

FWIW Here’s my current code:

# Compiled and tested on esphome 2025.4.0 and HA 2025.4.4

# Used INMP441 module for microphone - note that:
#  - Mixed info re L/R pin. Some say can leave unconnected, some say needs to be grounded
#  - L/R pin uses low/high to toggle between left or right using gnd or vcc
#
substitutions:
  devicename: jarvis01
  location: study
  ledpin: GPIO16 # ws2811 - legs left to right 1,2,3,4,notch:  1(din), 2(gnd), 3(5v), 4(dout)
  wspin: GPIO26 #WS or Word Select or Left Right clock
  sckpin: GPIO25 #SCK or Serial Clock or Bit Clock
  dinpin: GPIO14 #Data In or SD or Serial Data
  gpiopin: GPIO13 #DIN Pin of the MAX98357A Audio Amplifier
  # might want to also add in a temp sensor eg DS18B20, HDC1080?
  # might want to add in a proximity sensor eg HLK-LD2410B?

esphome:
  name: $devicename
  friendly_name: $devicename
  min_version: 2024.6.0
  name_add_mac_suffix: false
  project:
    name: ninkasi.ble
    version: '0.1'
  comment: Jarvis LOLIN D32 $location
  platformio_options:
    build_flags:
      - "-D CONFIG_ADC_SUPPRESS_DEPRECATE_WARN=1" # Putting this in temporarily to remove the "legacy adc calibration driver is deprecated" warning during compilation - https://github.com/esphome/issues/issues/5153#issuecomment-1847547482

esp32:
  board: esp32dev
  framework:
    type: esp-idf
    version: recommended
    # Custom sdkconfig options
    sdkconfig_options:
      COMPILER_OPTIMIZATION_SIZE: y
    # Advanced tweaking options
    advanced:
      ignore_efuse_mac_crc: false


# Enable logging
logger:
#  baud_rate: 0  # disable serial uart logging to maybe save a little ram
#  logs:
#    component: ERROR

api:
  encryption:
    key: !secret esphome_encryption_key
  on_client_connected:
    then:
      - delay: 50ms
      - micro_wake_word.start:
  on_client_disconnected:
    then:
      - voice_assistant.stop:

ota:
  password: !secret ota_password
  platform: esphome

wifi:
  networks:
  - ssid: !secret wifIoT_ssid
    password: !secret wifIoT_password
    priority: 2
# Backup SSID just in case
  - ssid: !secret wifi_ssid
    password: !secret wifi_password
    priority: 1
  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "$devicename Fallback Hotspot"
    password: !secret ota_password

# Remember to install via cable initially if enabling ble tracker below
# Note - enabling this dropped free memory to below 30 kB and caused instability

#esp32_ble_tracker:
#  scan_parameters:
#  #  continuous: false
#    active: True
#    interval: 211ms # default 320ms
#    window: 120ms # default 30ms
#bluetooth_proxy:
#  active: true


light:
  - platform: esp32_rmt_led_strip
    id: led
    rgb_order: RGB
    pin:
      number: $ledpin
#      ignore_strapping_warning: true # enable this if you need to use a strapping pin
    num_leds: 1
    chipset: ws2811
    name: "Status LED"
    default_transition_length: 0s
    effects:
      - pulse:
          name: "extra_slow_pulse"
          transition_length: 800ms
          update_interval: 800ms
          min_brightness: 0%
          max_brightness: 30%
      - pulse:
          name: "slow_pulse"
          transition_length: 250ms
          update_interval: 250ms
          min_brightness: 50%
          max_brightness: 100%
      - pulse:
          name: "fast_pulse"
          transition_length: 100ms
          update_interval: 100ms
          min_brightness: 50%
          max_brightness: 100%

switch:
  - platform: template
    id: mute
    name: "Mute microphone"
    optimistic: true
    on_turn_on: 
      - micro_wake_word.stop:
      - voice_assistant.stop:
      - light.turn_on:
          id: led           
          red: 100%
          green: 0%
          blue: 0%
          brightness: 30%
      - delay: 2s
      - light.turn_off:
          id: led
      - light.turn_on:
          id: led      
          red: 100%
          green: 0%
          blue: 0%
          brightness: 30%
    on_turn_off:
      - micro_wake_word.start:
      - light.turn_on:
          id: led         
          red: 0%
          green: 100%
          blue: 0%
          brightness: 60%
          effect: fast_pulse
      - delay: 2s
      - light.turn_off:
          id: led
        
i2s_audio:
  - id: i2s
    i2s_lrclk_pin: $wspin
    i2s_bclk_pin: $sckpin 

microphone:
  - platform: i2s_audio
    id: va_mic
    adc_type: external
    i2s_din_pin: $dinpin
    channel: left
    i2s_audio_id: i2s

output:
  - platform: gpio
    pin: 
      number: $gpiopin
      allow_other_uses: true
    id: set_low_speaker

speaker:
    platform: i2s_audio
    id: va_speaker
    i2s_audio_id: i2s
    dac_type: external
    i2s_dout_pin:   
      number: $gpiopin
      allow_other_uses: true    
    channel: mono
    bits_per_sample: 32bit
    sample_rate: 16000

# Can use the following to provide a volume control
# Note that there can be an impact on voice quality
#number:
#  - platform: template
#    name: "Volume"
#    id: volume
#    unit_of_measurement: "%"
#    min_value: 0
#    max_value: 1
#    step: 0.1
#    mode: SLIDER
#    update_interval: never
#    optimistic: true
#    restore_value: true
#    initial_value: 0.5
#    icon: "mdi:knob"
#    entity_category: config
#    on_value:
#      - speaker.volume_set: !lambda "return x;" 

micro_wake_word:
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    - voice_assistant.start:
    - light.turn_on:
        id: led       
        red: 100%
        green: 100%
        blue: 100%
        brightness: 30%
        effect: slow_pulse
    
voice_assistant:
  id: va
  microphone: va_mic
  speaker: va_speaker
  noise_suppression_level: 2.0
  volume_multiplier: 4.0
  on_stt_end:
    then:
      - light.turn_off: led
  on_error:
    - micro_wake_word.start:
  on_end:
    then:
      - light.turn_off: led
      - wait_until:
          not:
            voice_assistant.is_running:
      - micro_wake_word.start:

sensor:
  - platform: uptime
    name: "$devicename Uptime"
  - platform: wifi_signal
    name: "$devicename WiFi Signal"
    update_interval: 60s    
  - platform: template
    name: $devicename free memory
    lambda: return heap_caps_get_free_size(MALLOC_CAP_INTERNAL);
    icon: "mdi:memory"
    entity_category: diagnostic
    state_class: measurement
    unit_of_measurement: "B"
    update_interval: 60s

# Ah. Turns out that media_player is built on the arduino framework. Currently not supported on esp-idf 
# "ESP-IDF is needed to include an audio library called ESP_ADF used in our voice assistant"
# So don't bother with this yet
# Hack here if interested: https://github.com/gnumpi/esphome_audio
#
#
#media_player:
#  - platform: i2s_audio
#    name: Media Player
#    dac_type: external
#    i2s_audio_id: i2s
#    i2s_dout_pin: $gpiopin
#    mode: mono
#    id: i2s_media
#    icon: mdi:speaker-wireless

…and what I will say is I suspect that voice assistant as of May 2025 is still a work in progress unless you have a fairly decent local server to support it and/or want to use an external service to power it.

  • Speech recognition struggles - frequently mishearing words, combining them, not hearing a word, etc. At best this leads to having to repeat a command multiple times, taking great care with enunciation and ensuring clear gaps are left between words. At worst it might run an action on the wrong device - eg turning off a TV rather than a light. I know this is not a microphone problem as I can reproduce the issue using, for example, voice commands with the Home Assistant app on my mobile. The issue is the same whether using local Whisper or Home Assistant Cloud. It can be improved slightly by using aliases for some entities, but is still not perfect. I suspect that it could be further improved by playing with models and using a faster local server with a GPU to accelerate things.
  • Time to understand and then action a command - when it understands it - is about five seconds when using (local) Whisper compared to one second for Google or Home Assistant Cloud. So this should definitely be improvable by using a faster local server with a GPU to accelerate things.

As an aside I have also found that text to speech struggles intermittently - currently it is unclear to me why, but basically the voice breaks up and stutters/buffers. Possibly the ESP32 is underpowered or perhaps it is not getting a strong enough WiFi signal, but for now I will assume that this is solvable.

I may try setting up a local server with a decent GPU (keeping in mind my original goal was to reduce my reliance on external services) for improved speech to text, text to speech, and command handling via something like Ollama, but for now I am putting this back into the ‘maybe in the future’ pile. Instead I will continue to use my Google Nest speakers, and maybe use Home Assistant Cloud to expose those Home Assistant entities that Google doesn’t know about directly.