Hi everyone,
I’ve been working on a Wyoming bridge for VoxCPM that runs specifically on Apple Silicon (M1/M2/M3/M4). I wanted a way to use the Apple Neural Engine (ANE) for higher-quality TTS without putting a heavy load on the CPU or needing a dedicated GPU.
Project Details:
- Native Streaming: It supports Wyoming’s `SynthesizeChunk` events, so audio starts playing as it’s being generated rather than waiting for the full sentence.
- Zero-Shot Cloning: You can use a short reference `.wav` file to clone a voice. The bridge detects these files and exposes them as selectable voices in Home Assistant.
- Hardware: Tested on M-series Mac hardware. It uses a Python bridge to communicate with an ANE-optimized server.
- Protocol: Uses the standard Wyoming protocol, so it integrates directly with Assist.
If you have an M-series Mac acting as a home server, I’d be interested to hear what kind of generation speeds (real-time factor, RTF) you’re seeing.
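For anyone unfamiliar with the metric: RTF is conventionally the time spent generating audio divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time synthesis. A quick sketch:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; < 1.0 means faster than real time."""
    return generation_seconds / audio_seconds


# Example: 1.2 s to generate 4.0 s of audio -> RTF 0.3 (comfortably real-time).
rtf = real_time_factor(1.2, 4.0)
```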
GitHub: https://github.com/vpsh-code/ANE_VOXCPM_Homeassistant
Suggested README section:

**Voice Cloning**

The server supports zero-shot cloning using local reference clips.
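Since the bridge exposes reference clips as selectable voices, the discovery step presumably boils down to scanning a folder of `.wav` files. A minimal sketch of that idea; the directory layout and the use of the file stem as the voice name are my assumptions, not the project's actual implementation:

```python
from pathlib import Path


def discover_voices(voice_dir: str) -> dict[str, Path]:
    """Map each reference .wav clip to a voice name (here: the file stem).

    E.g. voices/alice.wav -> a selectable voice called "alice".
    Directory layout and naming scheme are illustrative assumptions.
    """
    return {p.stem: p for p in sorted(Path(voice_dir).glob("*.wav"))}
```

Each key could then be advertised to Home Assistant as a voice in the Wyoming `info` response, so adding a new clone is just dropping a clip into the folder and restarting the bridge.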