The Sonnet Replacement Quest Continues

The last time we wrote about replacing Sonnet, the plan was ambitious: pool four machines into a heterogeneous cluster, run Kimi K2.6 at Q8, and measure whether open-weight models could cross the bar for long-context agentic work. That was April 23. Six weeks later, the landscape looks different — and the strategy has shifted.

What Actually Happened

The cluster never materialized. Not because the hardware couldn't do it, but because the software layer — heterogeneous Exo backends, MLX-CUDA interop on Blackwell, and disaggregated prefill/decode across Apple+NVIDIA — remains deeply immature. The people actually doing this work are reporting results that range from "technically runs" to "slower than one machine alone." The bandwidth math from April still holds: Thunderbolt 5 at ~10 GB/s vs M3 Ultra's on-package ~800 GB/s means any tensor-parallel split across the seam pays a ~100× tax. The Sparks would make things slower, not faster.

So we shelved the cluster and took a different approach: replace Sonnet from the cloud side, and keep testing local models until something sticks.

DeepSeek V4 Pro: The Workhorse

The single biggest change since April: we moved a massive share of tokens to DeepSeek V4 Pro on Fireworks. At $1.74/MTok input and $3.48/MTok output, V4 Pro delivers frontier reasoning quality at a fraction of Anthropic's API pricing. It's become Bandit's main agent, and the dollar figures are hard to argue with.

V4 Pro isn't a Sonnet replacement in feel — it's more methodical, less conversational. But for the work that matters (code generation, tool orchestration, infrastructure triage) it is genuinely competitive. The 1M token context window eliminates the rationing problem, and the reasoning field gives us visibility into what the model is thinking before it acts.

Sonnet still earns its keep for interactive defaults and front-end work where taste matters. But the ratio has flipped: most tokens now flow through V4 Pro, not Anthropic.

The Cost Before the Switch

Below is our daily Anthropic API spend from early April through May 1 — the period before and during the V4 Pro migration. Every color represents a different Claude model. The tallest bars hit $500–$600 in a single day.

Running the numbers: approximately $6,130 across 24 tracked days in April, averaging $255/day. The early-April peak of $610 on April 9th is especially sharp — that's a single day of Claude API calls costing more than a month of V4 Pro at our current usage levels. Weekend dips are visible (April 13 at $90, April 20 at $250), but the overall burn rate is punishing.

The May 1 bar tells the transition story: $80. That's the first day where V4 Pro handled most of the workload, and Anthropic spend cratered. V4 Pro at $1.74/MTok input and $3.48/MTok output is roughly 3–4× cheaper per token than Sonnet, and 8–10× cheaper than Opus. The chart makes the math visible.

TL;DR: We were burning ~$6K/month on Anthropic. Switching the main agent to DeepSeek V4 Pro cut that by 80–90% without sacrificing capability. The savings pay for a lot of local hardware experiments.

And here's where those tokens went instead. Below is our Fireworks serverless dashboard for the same period, showing V4 Pro token consumption:

The numbers: 828.2 million tokens over ~8 days, totaling $408.08. That's $51/day. Extrapolated to a full month, V4 Pro runs about $1,530/month at our usage levels — compared to the $6,130/month we were paying Anthropic. That's a 75% reduction in API spend, with no degradation in agent capability for the heavy-lifting tasks.

The peak day (April 30, at 30.5M tokens) likely represents a full shift of main-agent traffic — blog updates, infrastructure triage, code generation — all running on V4 Pro. At Anthropic's Sonnet pricing, that single day would have cost roughly $90–130 on its own.

Bottom line: $408 in V4 Pro spend replaced ~$1,500 in Anthropic spend over this 8-day window. The ratio holds: V4 Pro is 3–4× cheaper for equivalent workloads. That's not a discount. That's a different cost model entirely.

Local Models: The Scorecard

We've also run an accidental model evaluation gauntlet — five local models tried and failed as main agent. Here's the scorecard:

Model	Active Params	Result	Why It Failed
Qwen3.5-397B-A17B	17B	✅ Production	Hasn't failed. Handles 64K prompts, multi-step blog updates, complex tool chains. The current local king.
DeepSeek V4 Flash (4-bit)	13B	❌ Failed	Lost coherence on multi-step tasks. Blog update pipeline fell apart. GPU timeout crash required prefill-step-size workaround.
Qwen3.6-35B-A3B	3B	❌ Failed	Fastest model in the fleet (~54 tok/s) but only 3B active — tool calls become inconsistent mid-pipeline. So close.
GLM 5.1	—	❌ Failed → 🔄 Retesting	Couldn't serve on the fleet at all. Dead on arrival. New model weights available — planning retest.
Kimi K2.6 (Q2_K_XL)	32B	⚠️ Too Slow	Can do the work — successfully edited this blog with active-memory fixed. But 99K-token prompts take ~38 min just to process at 43 tok/s on llama.cpp. Times out before generation begins on sustained sessions. Not practically viable on this serving stack.

The Kimi K2.6 story is a lesson in chasing bottlenecks. At 32B active parameters, it sat dead in the water for weeks — three serving stacks, context overflow bugs, and the active-memory reasoning trap all stacking on top of each other. Once we isolated the active-memory fix (pin gemini-flash for recall), Kimi woke up and successfully edited this blog post — a genuine turnaround. We bumped ctx-size from 131K to 164K to give it breathing room, and the model was ready.

Then we hit a bottleneck that isn't about context, reasoning, or tool use. It's about speed.

Kimi K2.6: The 40-Minute Prompt Processing Trap

After the 164K ctx-size bump, we gave Kimi K2.6 40 minutes as main agent to update this blog post. It didn't get there. The llama-server logs on M3 Ultra tell the story:

Three consecutive tasks. Each one loaded with ~99,000 tokens of system prompt, tool definitions, and conversation history. Each one cancelled before reaching generation.

The model spends nearly its entire 40-minute window just digesting the prompt. By the time it's ready to generate, there's barely enough time for one tool call. When the request times out, OpenClaw retries — and the cycle starts over from scratch. Another 99K tokens, another 38 minutes, another cancellation.

This isn't a model capability failure. Kimi K2.6 at 32B active parameters can do the work — it proved that on shorter sessions. The problem is that llama.cpp's prompt processing throughput on Q2_K_XL is too slow for sustained main-agent workloads where the prompt grows to 100K tokens. Every tool call adds more context, and every retry starts the clock over.

What would fix it:
Faster prompt processing. A higher-bit quant (Q4 or Q8) might improve throughput, but Kimi is already ~317GB at Q2 — there's no room on M3 Ultra's 512GB for Q4 without offloading layers.
A different serving stack. vLLM or SGLang with prefix caching would avoid re-processing the system prompt on every request. But neither supports Kimi K2.6's architecture yet.
Aggressive prefix caching in llama.cpp. The logs show LCP similarity matching (0.724 → 0.828), meaning llama.cpp is reusing the KV cache across tasks. But it's still re-processing from the divergence point, and the prompt is so large that even partial re-processing eats too much time.
Smaller system prompt. If OpenClaw's system prompt + tools could be trimmed from ~30K to ~15K tokens, the prompt processing burden drops proportionally. But this is a band-aid, not a fix.

Updated pattern: The floor for local main-agent work isn't just about active parameters — it's about prompt processing throughput × context size. At 43 tok/s processing a 99K prompt, you need ~40 minutes just to start. Qwen397B at 17B active on mlx_lm processes prompts at ~200+ tok/s — that's why it works and Kimi doesn't, despite Kimi having nearly 2× the active parameters. Speed matters more than parameter count for sustained agent loops.

The Prompt Injection Problem

Here's a complication that makes the local model quest harder than it looks: even if we found a local model that handled tool chains perfectly, we'd still have a security problem.

Cloud models from Anthropic and OpenAI ship with constitutional AI training, harmlessness fine-tuning, and infrastructure-level guardrails. A raw GGUF does not. If you hand an open-weight model your filesystem, your Docker socket, and a shell, you are trusting it not to exfiltrate data, execute destructive commands, or follow injected instructions hidden in web pages it fetches.

This isn't theoretical. OpenClaw agents routinely fetch untrusted web content, read emails, and process external data. Sonnet and V4 Pro have been hardened against prompt injection at the training and infrastructure level. A Q2 quantization of a base model has none of that.

The obvious countermeasure — a smaller model that screens prompts before they reach the agent — doesn't work well enough yet. It adds latency, introduces its own false positives, and the screening model itself is vulnerable to the same injection vectors. This is an unsolved problem, and it's a real argument against rushing local models into main agent roles.

Grok 4.3: The Active Memory Trap

While evaluating cloud models for the main agent role, we tried Grok 4.3 to update this very blog post. It failed — but not for the reasons our local models failed. The session terminated in 2.8 seconds, before any real agent work could happen.

The culprit was a reasoning_content bug in OpenClaw's active-memory plugin. Active memory runs a blocking sub-agent before the main reply to inject relevant memory into the prompt context. It inherits the session's model by default. When that model is a reasoning model (like Grok 4.3 with reasoning: true), the model outputs its answer in reasoning_content — but the plugin reads from content. Empty content → empty summary → failover logic surfaces an error → session dies instantly.

This is the exact same bug that bit us with Kimi K2.6's OpenClaw config, just manifesting at a different layer. The fix: pin a dedicated non-reasoning model for active memory recall. We set model: "google/gemini-3-flash" in the active-memory plugin config, bypassing the session model entirely. Gemini Flash is fast (sub-second), free-tier, and non-reasoning — a clean fit for the narrow memory-search task.

Lesson learned: reasoning models are incompatible with plugin-owned sub-agent runners that expect plain content responses. The fix should happen upstream (plugins should fall back to reasoning_content when content is empty), but pinning a dedicated recall model is a pragmatic workaround until that lands.

Update (May 6, 2026): After pinning the active-memory plugin to google/gemini-3-flash, Grok 4.3 is now successfully running as the main agent. This very blog post — including the cost charts, V4 Pro comparison, and retest notes — was fetched, edited, and deployed by Grok 4.3. The reasoning_content incompatibility is bypassed, and the model is handling multi-step agentic tasks without issue.

Where We Go From Here

The quest hasn't failed. It's just on a longer timeline than we thought in April. The hardware is ready. The models are getting closer. The bottleneck has shifted from model capability to serving infrastructure — and that's actually progress. The security story is the part that still needs work.