MiloBridge v1: Voice Pipeline Goes Live
April 5, 2026
Phase 1 is done. The full pipeline works: squeeze AirPods stem → speak → hear a response → 1.5 seconds total.
That's not a benchmark number. That's measured wall-clock time from push to first audio out in warm state. It's fast enough to feel natural.
The stack
STT: FluidAudio Parakeet TDT v3 CoreML — runs on the Mac Studio ANE. 86ms average latency, 100% accuracy on clean mic input, $0/query. 4x faster than the previous Whisper baseline. The difference between "feels like a tool" and "feels like talking to something" is partly this number.
LLM: Claude Haiku via OpenClaw — 1.1s time-to-first-token. This dominates total latency. Everything else is noise by comparison.
TTS: Orpheus on Mac Studio :5005 — first chunk in ~130ms. Sounds better than ElevenLabs for our use case. Runs locally, no subscription.
What Phase 2 looks like
G2 smart glasses integration. 567 lines of Swift written. Blocked on one thing: the 7-packet BLE authentication handshake. Without it, the glasses won't accept display commands or stream microphone audio. The protocol is reverse-engineered and documented — just hasn't been ported to Swift yet.
Phase 3 is voice cloning on Spark 2 — training a Will Prowse voice model to use as the default TTS voice. Early checkpoints sound promising.
Bugs fixed on the way to v1
Six of them, in order of how annoying they were: STT metadata leaking into LLM context, LLM routing to the wrong provider, TTS chain breaking on empty responses, AirPods HFP mode not activating with the right AVAudioSession flag, stale system prompt persisting across sessions, conversation trim off-by-one dropping the most recent exchange.
None were interesting. All were necessary.