Milo Voice Cloner: Fine-Tuning Qwen3-TTS on a DGX Spark

What This Is

Milo needs a voice. Not a generic TTS voice — a real one. The plan is to fine-tune a voice model on James's audio and deploy it as the primary TTS for the MiloBridge voice pipeline: real-time, local, zero cloud dependency.

Will Prowse is the test case. His voice is clean, expressive, and he has hundreds of hours of podcast-quality audio publicly available. If the system can clone Will convincingly, it can clone anyone with a decent corpus.

The model is Qwen3-TTS-12Hz-1.7B, Alibaba's open TTS model, which uses a 12Hz codec and a 16-layer codec head to generate audio autoregressively. Fine-tuning injects a new speaker identity into a specific embedding slot rather than modifying the architecture.

Training Configuration — v8

Setting	Value	Notes
Base Model	Qwen3-TTS-12Hz-1.7B	full SFT (not LoRA)
Learning Rate	2e-5	AdamW, no scheduler
Corpus	361 samples	365 WAV chunks, ~249MB
Epochs	25 (of 50)	training stopped at convergence
Loss (final)	9.7 – 10.8	CE summed across 16 codec layers
Hardware	DGX Spark 2	GB10 · ~45GB usable VRAM
Speaker Slot	codec_embedding[3000]	will_prowse baked in at save
Inference RTF	1.15×	no flash-attn, bfloat16

On the loss numbers: CE loss summed across 16 codec layers looks alarming (~10) but is normal. The per-layer average is ~0.65, well below the uniform-guess baseline of ln(1024) ≈ 6.9 per layer for a 1024-token codec vocabulary. The model converged by epoch 5; epochs 6–24 added no improvement. Don't read too much into the absolute value.
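The arithmetic behind that reading can be checked in a few lines. This is a minimal sketch that assumes the figures quoted in this write-up (16 codec layers, a 1024-token codec vocabulary, loss in nats); the function names are illustrative and not part of the training script.

```python
import math

NUM_CODEC_LAYERS = 16    # from the v8 training configuration
CODEC_VOCAB_SIZE = 1024  # vocabulary size as stated in this write-up


def per_layer_loss(summed_loss: float, num_layers: int = NUM_CODEC_LAYERS) -> float:
    """Average cross-entropy per codec layer, given the summed value."""
    return summed_loss / num_layers


def uniform_baseline(vocab_size: int = CODEC_VOCAB_SIZE) -> float:
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)


if __name__ == "__main__":
    for summed in (9.7, 10.8):
        print(f"summed={summed:5.1f}  per-layer={per_layer_loss(summed):.3f}")
    print(f"uniform baseline per layer: {uniform_baseline():.2f}")
    print(f"uniform baseline summed:    {NUM_CODEC_LAYERS * uniform_baseline():.1f}")
```

Under these assumptions the reported range works out to roughly 0.61 to 0.68 per layer, against about 6.93 per layer (about 110.9 summed) for a model that guesses uniformly.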

Version History — What Failed and Why

Run	LR	Epochs	Result	Status
v1–v2	varied	—	FasterQwen3TTS (0.6B) + LoRA adapter approach. Wrong architecture — v8 uses full 1.7B model with merged weights, not LoRA delta.	dead end
v3	unknown	25	Epoch-3 checkpoint was the only one that ever produced valid audio. Higher epochs degraded. Set the benchmark for what "working" sounds like.	reference
v4	—	—	Data pipeline only — produced train_with_codes.jsonl (361 samples). No training output; used as data source for v5+.	data
v5	1e-4	~10	Catastrophic overfit. Loss collapsed, output became noise. lr=1e-4 is too aggressive for a 361-sample corpus.	overfit
v6	1e-5	unknown	Produced a 15MB output file but garbage audio. Too conservative — model didn't shift far enough from the base in available epochs.	underfit
v7	—	—	Intermediate experiment, results not documented.	skipped
v8	2e-5	25/50	Loss stable 9.7–10.8, no collapse, inference confirmed working. RTF 1.15, 7.28s audio generated clean. Current production candidate.	current

The Inference Fix That Took Too Long

Every test before v8 failed at inference with an HF repo "format mismatch" error. The training script (sft_12hz.py) copies the base model into each checkpoint directory with shutil.copytree(), but the config.json inside retains the original _name_or_path field, which points to the HuggingFace registry slug:

    # What was in every checkpoint's config.json:
    {
      "_name_or_path": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",   ← this caused the HF lookup
      "tts_model_type": "custom_voice",
      "spk_id": {"will_prowse": 3000}
    }

    # What from_pretrained() does with a slash-name:
    #   → tries HuggingFace API → fails → "format mismatch"

    # Fix: pass the absolute local directory path directly.
    # from_pretrained() is smart enough to bypass HF when given a real path.

The correct call — no patches to config.json needed:

    import soundfile as sf
    import torch
    from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

    # Load the merged checkpoint from an absolute local path (not a hub slug).
    tts = Qwen3TTSModel.from_pretrained(
        "/home/milo/voice-pipeline/tts/will-lora-v8/checkpoint/checkpoint-epoch-24",
        device_map="cuda:0",
        dtype=torch.bfloat16,
    )

    wavs, sr = tts.generate_custom_voice(
        text="Hey, in today's video I'm going to show you something interesting.",
        speaker="will_prowse",  # must match the key in config.json spk_id
    )
    # → no ref_audio needed: the speaker embedding is baked in at index 3000
    sf.write("output.wav", wavs[0], sr)

Critical: The old gen_will_v3.py used FasterQwen3TTS (the 0.6B model) with PeftModel.from_pretrained() for LoRA adapter loading. v8 checkpoints are full merged model weights — the LoRA approach is wrong and will produce garbage or fail outright. Use Qwen3TTSModel directly.
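Both failure modes described here (a non-local path and a speaker name that does not match config.json) can be caught before the 32-second model load. The following is a hedged sketch using only the standard library; `preflight_checkpoint` is a hypothetical helper, not part of qwen_tts, and it assumes config.json carries the `spk_id` map shown earlier.

```python
import json
from pathlib import Path


def preflight_checkpoint(checkpoint_dir: str, speaker: str) -> int:
    """Validate a local checkpoint directory before loading it.

    Returns the speaker's embedding index from config.json's spk_id map.
    Raises ValueError with a specific message if anything looks off.
    """
    path = Path(checkpoint_dir)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {checkpoint_dir!r}")
    if not path.is_dir():
        raise ValueError(f"not a local directory: {path}")

    config_file = path / "config.json"
    if not config_file.is_file():
        raise ValueError(f"missing config.json in {path}")
    config = json.loads(config_file.read_text(encoding="utf-8"))

    # The speaker name passed at inference must be a key in spk_id.
    spk_ids = config.get("spk_id", {})
    if speaker not in spk_ids:
        raise ValueError(
            f"speaker {speaker!r} not in spk_id; known: {sorted(spk_ids)}"
        )
    return spk_ids[speaker]
```

Called as `preflight_checkpoint("/home/milo/voice-pipeline/tts/will-lora-v8/checkpoint/checkpoint-epoch-24", "will_prowse")`, it should return 3000 for a v8 checkpoint whose config.json matches the one shown above.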

Results

The epoch-24 checkpoint loads cleanly and generates audio without errors. An RTF of 1.15 without flash-attn means generation is slightly slower than real time. That is acceptable for now, and RTF is expected to drop below 1.0 once flash-attn is installed.

Metric	Value	Notes
Load time	32.2 s	one-time on server startup
Generation	7.28 s audio	183-char input, 8.4s wall time
RTF	1.15	eager attn. flash-attn expected <1.0
Output format	24,000 Hz WAV	16-bit, mono
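For reference, the RTF figure follows directly from the two timings reported above: wall-clock generation time divided by the duration of the audio produced. A minimal sketch, with an illustrative function name:

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio generated.

    Values above 1.0 mean generation is slower than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Timings from the v8 test run: 8.4 s wall time for 7.28 s of audio.
rtf = real_time_factor(wall_seconds=8.4, audio_seconds=7.28)
print(f"RTF = {rtf:.2f}")
```

To get below 1.0 on the same utterance, wall time would need to fall under 7.28 s, a reduction of a little over 13% from the measured 8.4 s.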

The voice quality question is still open — "working" means the pipeline runs without errors and produces valid audio. Whether it actually sounds like Will Prowse is a human judgment call. If it doesn't, the most likely culprit is reference audio quality, not epoch count or learning rate. Fix the corpus, retrain for 5 epochs.

What's Next

Evaluate ep-24 audio quality against the v3 ep-3 reference
If voice character is wrong: audit will-corpus-v4 for noise, compression artifacts, and speaker consistency
If voice character is right: build a FastAPI inference server wrapping gen_will_v8.py, wire into MiloBridge v2 as the primary TTS endpoint
Record James's voice corpus (30–60 min, scripted) for Milo's actual production voice
Fine-tune Cindy's voice using the same pipeline (full SFT as in v8, not LoRA)

The playbook: A reusable voice training guide covering all 6 pipeline stages, checkpoint anatomy, loss interpretation, and how to adapt for a new speaker (James, Cindy) is in ~/clawd/projects/voice-pipeline/VOICE-TRAINING-PLAYBOOK.md.
