Twelve models across two benchmark suites, deterministic scoring. The question was simple: how does a locally-run 397B parameter model compare to the top cloud models on agentic tool calling?
The answer was surprising.
| Rank | Model | Score | % | Avg Latency | Context | Where |
|---|---|---|---|---|---|---|
| 🥇 | Qwen3.5-397B-A17B-4bit | 58.0/60 | 96.7% | 2.6s | 256K | Local |
| 🥇 | Claude Opus 4.6 | 58.0/60 | 96.7% | 2.6s | 1M | Cloud API |
| 3 | Grok 4.20 | 57.0/60 | 95.0% | 2.6s | 256K | Cloud API |
| 3 | Grok 4.1 Fast | 57.0/60 | 95.0% | 3.5s | 2M | Cloud API |
| 3 | Claude Sonnet 4.6 | 57.0/60 | 95.0% | 1.9s | 1M | Cloud API |
| 6 | MiniMax M2.7 Q8_K_XL (Unsloth GGUF, llama.cpp, Mac Studio) | 56.0/60 | 93.3% | 2.3s | 192K | Local |
| 7 | GPT-5.4 | 55.0/60 | 91.7% | 1.2s | 1M | Cloud API |
| 7 | Gemini 3.1 Pro | 55.0/60 | 91.7% | 3.9s | 2M | Cloud API |
| 7 | Nemotron-3-Super 120B (Ollama, DGX Spark 2) | 55.0/60 | 91.7% | 7.0s | 128K | Local |
| 10 | GLM-5.1-40B-MXFP4 (mlx-community, Mac Studio) | 54.8/60 | 91.3% | 17.4s | 128K | Local |
| 11 | GPT-5.2 | 53.7/60 | 89.4% | 1.2s | 1M | Cloud API |
| 11 | MiniMax M2.7 (baa-ai MLX 4-bit) | 53.7/60 | 89.4% | 2.6s | 256K | Local |
Milo-Bench — 20 prompts, 5 categories, 3 runs each, scored 0–3 per prompt. All checks are deterministic: exact tool name match, argument schema validation, multi-turn chain completion. No rubric, no human judgment.
| Category | 397B | Opus 4.6 | Grok 4.20 | Grok 4.1 | Sonnet 4.6 | M2.7 Q8 | GPT-5.4 | Gemini 3.1 | GPT-5.2 | M2.7 MLX |
|---|---|---|---|---|---|---|---|---|---|---|
| Single tool | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
| Tool selection | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 2.50 | 2.75 | 2.50 | 3.00 |
| Multi-step chains | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.00 | 2.42 | 2.00 | 2.42 | 2.00 |
| Structured output | 3.00 | 3.00 | 3.00 | 2.75 | 2.75 | 3.00 | 2.50 | 3.00 | 2.00 | 2.42 |
| Error recovery | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
Multi-step chains are the hard category — every model drops points here. The failure mode is consistent across models: the first tool call is correct, but the model doesn't follow up with the required second call. This is a prompt engineering / system prompt issue as much as a model capability issue.
1.2s average latency — significantly faster than every other model including Sonnet. The tradeoff is accuracy: GPT-5.4 scores 91.7%, missing on tool selection and multi-step chains where other models succeed. For latency-sensitive workloads where accuracy can tolerate some slippage, it's the obvious choice.
Qwen3.5-397B-A17B-4bit runs on 416GB of the Mac Studio's 512GB unified memory. Inference via mlx_lm.server 0.31.2, temperature=0, thinking mode disabled. It ties Opus 4.6 — the most capable cloud model tested — on overall score, with identical per-category results. The cost per call is $0 after hardware.
M2.7 has roughly twice the parameters of 397B. On the baa-ai MLX 4-bit weights, it scored 53.7/60 — 7 points behind 397B. Always-on thinking mode is the likely culprit — it reasons through every response including simple JSON generation, which introduces drift. The baa-ai weights also require --trust-remote-code due to a non-standard model architecture.
The Unsloth Q8_K_XL GGUF on llama.cpp tells a different story: 56.0/60 (93.3%), beating GPT-5.4, Gemini 3.1 Pro, and Nemotron-3-Super 120B. Same 230GB on disk, perfect on single-tool / tool-selection / structured output / error recovery, only losing on multi-step chains (2.00/3 — same drop pattern as Gemini 3.1, where the model stops after the first correct call instead of executing the followup). 2.3s avg latency on a Mac Studio M3 Ultra. The quant + runtime swap accounts for the entire 2.3-point lift over the MLX run, with no other config changes.
Round 2 uses Milo-Bench Suite v1.0 — a different test pack than the original 20-prompt run above. v1.0 has 32 tests across 7 categories (tool calling, multi-step, structured output, long context, coding, cost efficiency, agentic workflow) and is scored 0–1 per test. Scores below are not directly comparable to the 0–60 table above — different prompts, different rubric, different scale. Treat this as a complementary look at M2.7 Q8_K_XL specifically, not a replacement for the head-to-head.
Run on 2026-04-25 against MiniMax M2.7 Q8_K_XL (Unsloth, 230GB) via llama.cpp on Mac Studio M3 Ultra 512GB, served on port 8003 with Metal flash attention and 192K context. Tool calls round-tripped through a thin reasoning-content adapter on port 18003 — M2.7 emits final answers in content and chain-of-thought in reasoning_content; the adapter promotes reasoning_content to content only when content is empty (rare).
| Category | Score | Tests | Notes |
|---|---|---|---|
| Tool calling | 1.00 | 5/5 perfect | Every call cleanly formed; no schema drift. |
| Structured output | 1.00 | 5/5 perfect | JSON schema validation passed across all five. |
| Multi-step | 0.90 | 3 perfect, 1 partial (0.60) | One chain dropped the final step. |
| Long context | 0.83 | 3/4 perfect | One needle-in-haystack at 0.33; others clean. Latency 30–100s on 100K+ token prompts. |
| Cost efficiency | 0.76 | 4/4 partial credit | Correct answers, occasional extra tool calls. |
| Agentic workflow | 0.59 | 1 perfect, 4 partial | End-to-end research/install/deploy tasks; weakest category. |
| Coding | 0.40 | 2/5 perfect, 3 timeouts | Three coding tasks hit the 110s budget cap before completion. |
| Overall | 0.78 | 32 tests | Average across categories. |
What this changes: the original "M2.7 underperforms its size" finding below was on the baa-ai MLX 4-bit weights with always-on thinking mode. The Q8_K_XL Unsloth GGUF in llama.cpp — same architecture, different quant and runtime — is a meaningfully different story on tool calling and structured output (both 1.00) and on multi-step chains (0.90). It still loses time on long-budget coding tasks where the 40 tok/s generation rate runs out the clock.
Run 2026-04-29 against DeepSeek V4 Pro via Fireworks AI API. Same Suite v1.0 harness as Round 2 — 32 tests, 7 categories, scored 0–1 per test. Scores directly comparable to Round 2 above. Avg latency 18.1s (cloud inference vs 2.3s for M2.7 Q8 local).
| Category | DeepSeek V4 Pro (Fireworks) | M2.7 Q8_K_XL (Local) | Notes |
|---|---|---|---|
| Tool calling | 1.00 (5/5) | 1.00 (5/5) | Both perfect. |
| Structured output | 1.00 (5/5) | 1.00 (5/5) | Both perfect. |
| Long context | 1.00 (4/4) | 0.83 (3/4) | V4 Pro takes this; M2.7 Q8 drops one needle-in-haystack. |
| Multi-step | 0.90 (3.6/4) | 0.90 (3.6/4) | Tied — one partial chain each. |
| Cost efficiency | 0.76 (3.05/4) | 0.76 (3.05/4) | Tied — occasional extra tool calls. |
| Coding | 0.80 (4/5) | 0.40 (2/5) | V4 Pro wins big. M2.7 Q8 hits 3 timeouts at 40 tok/s generation. |
| Agentic workflow | 0.20 (1/5) | 0.59 (2.96/5) | M2.7 Q8 wins. 18s avg latency burns through V4 Pro’s 90s budget on complex tasks; all 4 failures were timeouts. |
| Overall | 0.80 (25.65/32) | 0.78 (24.95/32) | V4 Pro edges ahead overall. |
The latency paradox: DeepSeek V4 Pro scores higher overall (0.80 vs 0.78) but loses the agentic workflow category entirely to a local model. The 18.1s cloud round-trip means a 5-step research/install/deploy task blows past the 90s budget before the last two steps can complete. The local M2.7 Q8 at 2.3s/call has 7× more headroom per step. For budget-limited multi-step work, local latency beats cloud capability.
Where V4 Pro wins: coding (0.80 vs 0.40) and long context (1.00 vs 0.83). Both are single-shot tasks where total wall time matters less than raw capability per call. If your workflow is one-shot code gen or long-doc QA, V4 Pro is the clear choice here.
Run on 2026-04-25 against the ~/clawd/benchmarks/coding-agent/ suite — 23 tasks, multi-turn tool use, scored on tool_call_success / task_completion / rework_rate (composite is a weighted blend). Same harness, same prompts, same grader. Only the orchestrator model changed. Different benchmark from Rounds 1 and 2 above — not directly comparable to those scores.
| Metric | Qwen3-32B (Spark 2) | MiniMax M2.7 Q8_K_XL (Mac Studio :8002) |
|---|---|---|
| Composite | 0.735 | 0.728 |
| tool_call_success | 0.913 | 0.848 |
| task_completion | 0.348 | ~0.348 |
| rework_rate | 0.130 | 0.000 |
| Pass count | 8/23 | 8/23 |
| Per-task latency | some 90s+ timeouts | 5–15s, no timeouts |
Identical 8 passing tasks. Qwen3-32B wins on composite by 0.007 — driven entirely by tool_call_success (0.913 vs 0.848). M2.7's tool syntax is looser; it sometimes formats arguments in ways the grader marks as imperfect even when the call succeeds. The interesting numbers are on the other side: M2.7 is 6–20× faster per task, hits zero timeouts, and has rework_rate of 0.000 — it either gets the task or gives up cleanly, no partial-retry thrashing.
Verdict: Qwen3-32B narrowly wins this benchmark. Not worth switching the orchestrator. M2.7 Q8_K_XL's clean failure mode and speed advantage are real, though — worth keeping in mind for latency-sensitive harness work.
This is a tool-calling benchmark, not a general capability benchmark. Models that score well here may not rank the same on coding, reasoning, or writing tasks. The 20-prompt suite is intentionally narrow — real agentic workloads are messier.
All runs, raw JSON, and scoring code are in jmeadlock/milo-bench.