Speech-to-text speed benchmarks for Voice

Hi all,

After receiving my Voice PE, I found myself wondering which Faster Whisper model I should be using, but I found no good information on the models' performance requirements. To work out which models were worth trying, I wrote a quick benchmark script.

Results

Reducing the beam width is not advised, as it has an insignificant effect on speed.

The default base-int8 model in Home Assistant is a good choice, but I will personally experiment more with the base.en and small-int8 models.

Repository


Test Method

Six voice recordings were captured from a Home Assistant Voice Preview Edition.
To capture voice recordings, add the following to configuration.yaml:

assist_pipeline:
  debug_recording_dir: /share/assist_pipeline

The recordings are spoken in English with a Finnish accent and include the following phrases:

  1. Turn off the lights
  2. Set the lights to maximum brightness
  3. What's the temperature?
  4. What's the temperature of the heat pump?
  5. Turn the lights to half brightness
  6. Turn off the lights

These recordings were processed using the Whisper models available in the Home Assistant Voice pipeline, with various beam width settings. Note that the large and turbo models were too slow to be relevant for the results. The processing time for each configuration was averaged and recorded.
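
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the timing loop described above. It is not the script from the repository, just an illustration: the names benchmark and make_faster_whisper_transcriber are made up for this example, and it assumes the faster-whisper Python package is installed, running on CPU with English audio.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def benchmark(
    transcribe: Callable[[str, int], str],
    recordings: List[str],
    beam_sizes: Iterable[int],
) -> Dict[int, float]:
    """Return the mean wall-clock seconds per recording for each beam size."""
    results: Dict[int, float] = {}
    for beam_size in beam_sizes:
        timings = []
        for path in recordings:
            start = time.perf_counter()
            transcribe(path, beam_size)
            timings.append(time.perf_counter() - start)
        results[beam_size] = statistics.mean(timings)
    return results


def make_faster_whisper_transcriber(model_name: str, compute_type: str = "int8"):
    """Build a transcribe(path, beam_size) callable backed by faster-whisper.

    Hypothetical helper: assumes `pip install faster-whisper` and CPU inference.
    """
    from faster_whisper import WhisperModel  # imported lazily on purpose

    model = WhisperModel(model_name, device="cpu", compute_type=compute_type)

    def transcribe(path: str, beam_size: int) -> str:
        segments, _info = model.transcribe(path, beam_size=beam_size, language="en")
        # segments is a lazy generator; joining it forces the actual decoding,
        # so the decoding time is included in the measurement.
        return " ".join(segment.text.strip() for segment in segments)

    return transcribe
```

To use it, build a transcriber for one model (for example make_faster_whisper_transcriber("base")), pass it to benchmark together with the list of recorded WAV paths and the beam widths to compare, and repeat per model. Loading the model happens outside the timed loop, so only transcription is measured.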


Test Systems

Hardware

  • System 1: Intel i7-1360P
    • 16GB 4800 MT/s dual channel RAM
  • System 2: Intel N100
    • 32GB 3200 MT/s single channel RAM

Host Environment

  • Host OS: Proxmox
  • Virtual Machine: Ubuntu Server 24.04
    • Full allocation of CPU cores
    • CPU type: Host
    • 12GB of RAM

Benchmark Images

Performance Results on Intel i7-1360P

Performance Results on Intel N100

Benchmark Results for Beam Width