
The Tool-Calling Benchmark: 13 Models, Local vs Cloud

April 12, 2026 · benchmark · local · tool-calling · Milo-Bench on GitHub

Twelve models across two benchmark suites, deterministic scoring. The question was simple: how does a locally-run 397B parameter model compare to the top cloud models on agentic tool calling?

The answer was surprising: the local model tied the best cloud score.

Results

| Rank | Model | Score | % | Avg Latency | Context | Where |
| --- | --- | --- | --- | --- | --- | --- |
| 🥇 | Qwen3.5-397B-A17B-4bit | 58.0/60 | 96.7% | 2.6s | 256K | Local |
| 🥇 | Claude Opus 4.6 | 58.0/60 | 96.7% | 2.6s | 1M | Cloud API |
| 3 | Grok 4.20 | 57.0/60 | 95.0% | 2.6s | 256K | Cloud API |
| 3 | Grok 4.1 Fast | 57.0/60 | 95.0% | 3.5s | 2M | Cloud API |
| 3 | Claude Sonnet 4.6 | 57.0/60 | 95.0% | 1.9s | 1M | Cloud API |
| 6 | MiniMax M2.7 Q8_K_XL (Unsloth GGUF, llama.cpp, Mac Studio) | 56.0/60 | 93.3% | 2.3s | 192K | Local |
| 7 | GPT-5.4 | 55.0/60 | 91.7% | 1.2s | 1M | Cloud API |
| 7 | Gemini 3.1 Pro | 55.0/60 | 91.7% | 3.9s | 2M | Cloud API |
| 7 | Nemotron-3-Super 120B (Ollama, DGX Spark 2) | 55.0/60 | 91.7% | 7.0s | 128K | Local |
| 10 | GLM-5.1-40B-MXFP4 (mlx-community, Mac Studio) | 54.8/60 | 91.3% | 17.4s | 128K | Local |
| 11 | GPT-5.2 | 53.7/60 | 89.4% | 1.2s | 1M | Cloud API |
| 11 | MiniMax M2.7 (baa-ai MLX 4-bit) | 53.7/60 | 89.4% | 2.6s | 256K | Local |

The local 397B ties Opus 4.6 and beats every other cloud model. It runs on a Mac Studio M3 Ultra at $0/call after hardware cost, and its 2.6s average latency is comparable to cloud.

MiniMax M2.7 Q8_K_XL (Unsloth GGUF on llama.cpp) lands at 93.3%, beating GPT-5.4, Gemini 3.1, and Nemotron-3 while running fully local on the same Mac Studio at 2.3s avg. Nemotron-3-Super 120B on DGX Spark also clears the 90% bar at 91.7%, though its 7s/call latency makes it viable mainly for async/batch workloads. GLM-5.1-40B-MXFP4 (91.3%, 17.4s avg) joins the 90%+ tier, which is notable for a 40B active-parameter MoE running locally, though long-context tasks drive the latency up.

What Was Tested

Milo-Bench — 20 prompts, 5 categories, 3 runs each, scored 0–3 per prompt. All checks are deterministic: exact tool name match, argument schema validation, multi-turn chain completion. No rubric, no human judgment.
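As a rough illustration of what deterministic scoring of this kind can look like, here is a minimal sketch. The one-point-per-check rubric, the `score_run` and `suite_total` names, and the data shapes are all assumptions made for the example; the actual Milo-Bench scoring code lives in the repository and may differ.

```python
from statistics import mean


def score_run(expected_chain, actual_calls):
    """Score one run 0-3 under a hypothetical one-point-per-check rubric.

    expected_chain: list of {"name": str, "arg_types": {arg_name: type}}
    actual_calls:   list of {"name": str, "arguments": dict} emitted by the model
    """
    if not actual_calls:
        return 0
    first_expected, first_actual = expected_chain[0], actual_calls[0]

    # Check 1: exact tool name match on the first call.
    name_ok = first_actual["name"] == first_expected["name"]

    # Check 2: argument schema validation (same keys, expected types).
    args = first_actual.get("arguments", {})
    schema = first_expected["arg_types"]
    args_ok = set(args) == set(schema) and all(
        isinstance(args[key], typ) for key, typ in schema.items()
    )

    # Check 3: the full chain of tool names was executed, in order.
    chain_ok = [c["name"] for c in actual_calls] == [e["name"] for e in expected_chain]

    return int(name_ok) + int(args_ok) + int(chain_ok)


def suite_total(per_prompt_runs):
    """Average the run scores for each prompt, then sum across prompts."""
    return sum(mean(runs) for runs in per_prompt_runs)
```

With 20 prompts and 3 runs each, a model that is perfect everywhere except for one failed run on one prompt would land at 59 out of 60 under this sketch. Averaging across runs is also why fractional totals such as 53.7/60 can appear.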

Category Breakdown

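| Category | 397B | Opus 4.6 | Grok 4.20 | Grok 4.1 | Sonnet 4.6 | M2.7 Q8 | GPT-5.4 | Gemini 3.1 | GPT-5.2 | M2.7 MLX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single tool | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
| Tool selection | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 2.50 | 2.75 | 2.50 | 3.00 |
| Multi-step chains | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.00 | 2.42 | 2.00 | 2.42 | 2.00 |
| Structured output | 3.00 | 3.00 | 3.00 | 2.75 | 2.75 | 3.00 | 2.50 | 3.00 | 2.00 | 2.42 |
| Error recovery | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |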

Multi-step chains are the hard category — every model drops points here. The failure mode is consistent across models: the first tool call is correct, but the model doesn't follow up with the required second call. This is a prompt engineering / system prompt issue as much as a model capability issue.
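The failure mode described above, a correct first call with no follow-up, is easy to detect mechanically. The sketch below is one way to locate where a chain stops; the function name and the tool names in the usage note are hypothetical and not taken from the benchmark.

```python
def first_missing_step(expected_names, actual_names):
    """Return the index of the first expected tool call the model did not
    make in order, or None if the whole chain was completed."""
    for index, expected in enumerate(expected_names):
        if index >= len(actual_names) or actual_names[index] != expected:
            return index
    return None
```

For an expected chain of `["search_flights", "book_flight"]`, a model that only emits `["search_flights"]` returns index 1: the first call is right and the second never happens. A harness could use that index to decide whether to send a reminder turn or simply record the partial chain.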

Notable Findings

GPT-5.4 is the speed outlier

1.2s average latency, tied with GPT-5.2 for the fastest in the table and well ahead of every other model, including Sonnet (1.9s). The tradeoff is accuracy: GPT-5.4 scores 91.7%, dropping points on tool selection, multi-step chains, and structured output where other models succeed. For latency-sensitive workloads where accuracy can tolerate some slippage, it's the obvious choice.

Local 397B matches cloud ceiling at lower cost

Qwen3.5-397B-A17B-4bit runs on 416GB of the Mac Studio's 512GB unified memory. Inference via mlx_lm.server 0.31.2, temperature=0, thinking mode disabled. It ties Opus 4.6 — the most capable cloud model tested — on overall score, with identical per-category results. The cost per call is $0 after hardware.

MiniMax M2.7: quant matters

M2.7 has roughly twice the parameters of 397B. On the baa-ai MLX 4-bit weights, it scored 53.7/60, which is 4.3 points (about 7 percentage points) behind 397B. Always-on thinking mode is the likely culprit: it reasons through every response, including simple JSON generation, which introduces drift. The baa-ai weights also require --trust-remote-code due to a non-standard model architecture.

The Unsloth Q8_K_XL GGUF on llama.cpp tells a different story: 56.0/60 (93.3%), beating GPT-5.4, Gemini 3.1 Pro, and Nemotron-3-Super 120B. Same 230GB on disk, perfect on single-tool / tool-selection / structured output / error recovery, only losing on multi-step chains (2.00/3 — same drop pattern as Gemini 3.1, where the model stops after the first correct call instead of executing the followup). 2.3s avg latency on a Mac Studio M3 Ultra. The quant + runtime swap accounts for the entire 2.3-point lift over the MLX run, with no other config changes.

Round 2 — Suite v1.0, MiniMax M2.7 Q8_K_XL

Round 2 uses Milo-Bench Suite v1.0 — a different test pack than the original 20-prompt run above. v1.0 has 32 tests across 7 categories (tool calling, multi-step, structured output, long context, coding, cost efficiency, agentic workflow) and is scored 0–1 per test. Scores below are not directly comparable to the 0–60 table above — different prompts, different rubric, different scale. Treat this as a complementary look at M2.7 Q8_K_XL specifically, not a replacement for the head-to-head.

Run on 2026-04-25 against MiniMax M2.7 Q8_K_XL (Unsloth, 230GB) via llama.cpp on Mac Studio M3 Ultra 512GB, served on port 8003 with Metal flash attention and 192K context. Tool calls round-tripped through a thin reasoning-content adapter on port 18003 — M2.7 emits final answers in content and chain-of-thought in reasoning_content; the adapter promotes reasoning_content to content only when content is empty (rare).
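The adapter rule is small enough to sketch. This is a minimal illustration, assuming an OpenAI-style assistant message represented as a dict with `content` and `reasoning_content` keys; the function name is hypothetical and the real adapter may handle more cases. The post does not say how messages carrying tool calls are treated, so the sketch does not special-case them.

```python
def promote_reasoning(message: dict) -> dict:
    """Return a copy of an assistant message in which reasoning_content is
    promoted to content only when content is empty."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""

    # Leave the message alone if there is a real answer, or nothing to promote.
    if content.strip() or not reasoning:
        return dict(message)

    patched = dict(message)
    patched["content"] = reasoning
    return patched
```

Returning a copy rather than mutating in place keeps the raw model output available for logging, which matters when the benchmark stores raw JSON alongside scores.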

| Category | Score | Tests | Notes |
| --- | --- | --- | --- |
| Tool calling | 1.00 | 5/5 perfect | Every call cleanly formed; no schema drift. |
| Structured output | 1.00 | 5/5 perfect | JSON schema validation passed across all five. |
| Multi-step | 0.90 | 3 perfect, 1 partial (0.60) | One chain dropped the final step. |
| Long context | 0.83 | 3/4 perfect | One needle-in-haystack at 0.33; others clean. Latency 30–100s on 100K+ token prompts. |
| Cost efficiency | 0.76 | 4/4 partial credit | Correct answers, occasional extra tool calls. |
| Agentic workflow | 0.59 | 1 perfect, 4 partial | End-to-end research/install/deploy tasks; weakest category. |
| Coding | 0.40 | 2/5 perfect, 3 timeouts | Three coding tasks hit the 110s budget cap before completion. |
| Overall | 0.78 | 32 tests | Average across categories. |

What this changes: the original "M2.7 underperforms its size" finding above was on the baa-ai MLX 4-bit weights with always-on thinking mode. The Q8_K_XL Unsloth GGUF in llama.cpp (same architecture, different quant and runtime) is a meaningfully different story on tool calling and structured output (both 1.00) and on multi-step chains (0.90). It still loses time on long-budget coding tasks where the 40 tok/s generation rate runs out the clock.
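To put a number on the coding timeouts, here is a back-of-the-envelope sketch. It assumes the 110s budget cap from the table above, a steady 40 tok/s, and no time spent on prompt processing, so it gives an upper bound rather than a measurement.

```python
def max_generated_tokens(budget_s: float, tokens_per_s: float) -> int:
    """Upper bound on tokens generated within a time budget, ignoring
    prompt processing and any tool round-trips."""
    return int(budget_s * tokens_per_s)


print(max_generated_tokens(110, 40))  # 4400
```

Under those assumptions a coding task has at most about 4,400 generated tokens to work with, reasoning included, before the cap hits.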

Round 4 — Suite v1.0, DeepSeek V4 Pro (Fireworks) vs Local

Run 2026-04-29 against DeepSeek V4 Pro via Fireworks AI API. Same Suite v1.0 harness as Round 2 — 32 tests, 7 categories, scored 0–1 per test. Scores directly comparable to Round 2 above. Avg latency 18.1s (cloud inference vs 2.3s for M2.7 Q8 local).

| Category | DeepSeek V4 Pro (Fireworks) | M2.7 Q8_K_XL (Local) | Notes |
| --- | --- | --- | --- |
| Tool calling | 1.00 (5/5) | 1.00 (5/5) | Both perfect. |
| Structured output | 1.00 (5/5) | 1.00 (5/5) | Both perfect. |
| Long context | 1.00 (4/4) | 0.83 (3/4) | V4 Pro takes this; M2.7 Q8 drops one needle-in-haystack. |
| Multi-step | 0.90 (3.6/4) | 0.90 (3.6/4) | Tied: one partial chain each. |
| Cost efficiency | 0.76 (3.05/4) | 0.76 (3.05/4) | Tied: occasional extra tool calls. |
| Coding | 0.80 (4/5) | 0.40 (2/5) | V4 Pro wins big. M2.7 Q8 hits 3 timeouts at 40 tok/s generation. |
| Agentic workflow | 0.20 (1/5) | 0.59 (2.96/5) | M2.7 Q8 wins. 18s avg latency burns through V4 Pro’s 90s budget on complex tasks; all 4 failures were timeouts. |
| Overall | 0.80 (25.65/32) | 0.78 (24.95/32) | V4 Pro edges ahead overall. |

The latency paradox: DeepSeek V4 Pro scores higher overall (0.80 vs 0.78) but loses the agentic workflow category entirely to a local model. The 18.1s cloud round-trip means a 5-step research/install/deploy task blows past the 90s budget before the last two steps can complete. The local M2.7 Q8 at 2.3s/call has 7× more headroom per step. For budget-limited multi-step work, local latency beats cloud capability.
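The arithmetic behind that claim can be sketched in a few lines. This assumes every call takes exactly the average latency and that the same 90s budget applies to both models, which is a simplification; the function name is hypothetical.

```python
import math


def calls_within_budget(budget_s: float, avg_latency_s: float) -> int:
    """Whole model calls that fit inside a time budget, assuming each call
    takes the average latency."""
    return math.floor(budget_s / avg_latency_s)


print(calls_within_budget(90, 18.1))  # 4
print(calls_within_budget(90, 2.3))   # 39
```

At 18.1s per call only four calls fit in 90s, one short of a 5-step task even if nothing goes wrong. At 2.3s per call there is room for 39, which leaves plenty of slack for retries and extra tool calls.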

Where V4 Pro wins: coding (0.80 vs 0.40) and long context (1.00 vs 0.83). Both are single-shot tasks where total wall time matters less than raw capability per call. If your workflow is one-shot code gen or long-doc QA, V4 Pro is the clear choice here.

Round 3 — coding-agent harness, head-to-head vs Qwen3-32B

Run on 2026-04-25 against the ~/clawd/benchmarks/coding-agent/ suite — 23 tasks, multi-turn tool use, scored on tool_call_success / task_completion / rework_rate (composite is a weighted blend). Same harness, same prompts, same grader. Only the orchestrator model changed. Different benchmark from Rounds 1 and 2 above — not directly comparable to those scores.

| Metric | Qwen3-32B (Spark 2) | MiniMax M2.7 Q8_K_XL (Mac Studio :8002) |
| --- | --- | --- |
| Composite | 0.735 | 0.728 |
| tool_call_success | 0.913 | 0.848 |
| task_completion | 0.348 | ~0.348 |
| rework_rate | 0.130 | 0.000 |
| Pass count | 8/23 | 8/23 |
| Per-task latency | some 90s+ timeouts | 5–15s, no timeouts |

Identical 8 passing tasks. Qwen3-32B wins on composite by 0.007 — driven entirely by tool_call_success (0.913 vs 0.848). M2.7's tool syntax is looser; it sometimes formats arguments in ways the grader marks as imperfect even when the call succeeds. The interesting numbers are on the other side: M2.7 is 6–20× faster per task, hits zero timeouts, and has rework_rate of 0.000 — it either gets the task or gives up cleanly, no partial-retry thrashing.

Verdict: Qwen3-32B narrowly wins this benchmark. Not worth switching the orchestrator. M2.7 Q8_K_XL's clean failure mode and speed advantage are real, though — worth keeping in mind for latency-sensitive harness work.

Hardware

Caveats

This is a tool-calling benchmark, not a general capability benchmark. Models that score well here may not rank the same on coding, reasoning, or writing tasks. The 20-prompt suite is intentionally narrow — real agentic workloads are messier.

All runs, raw JSON, and scoring code are in jmeadlock/milo-bench.

