The Tool-Calling Benchmark: 13 Models, Local vs Cloud

April 12, 2026 — benchmarklocaltool-calling — Milo-Bench on GitHub

Twelve models across two benchmark suites, deterministic scoring. The question was simple: how does a locally-run 397B parameter model compare to the top cloud models on agentic tool calling?

The answer was surprising.

Results

Rank	Model	Score	%	Avg Latency	Context	Where
🥇	Qwen3.5-397B-A17B-4bit	58.0/60	96.7%	2.6s	256K	Local
🥇	Claude Opus 4.6	58.0/60	96.7%	2.6s	1M	Cloud API
3	Grok 4.20	57.0/60	95.0%	2.6s	256K	Cloud API
3	Grok 4.1 Fast	57.0/60	95.0%	3.5s	2M	Cloud API
3	Claude Sonnet 4.6	57.0/60	95.0%	1.9s	1M	Cloud API
6	MiniMax M2.7 Q8_K_XL (Unsloth GGUF, llama.cpp, Mac Studio)	56.0/60	93.3%	2.3s	192K	Local
7	GPT-5.4	55.0/60	91.7%	1.2s	1M	Cloud API
7	Gemini 3.1 Pro	55.0/60	91.7%	3.9s	2M	Cloud API
7	Nemotron-3-Super 120B (Ollama, DGX Spark 2)	55.0/60	91.7%	7.0s	128K	Local
10	GLM-5.1-40B-MXFP4 (mlx-community, Mac Studio)	54.8/60	91.3%	17.4s	128K	Local
11	GPT-5.2	53.7/60	89.4%	1.2s	1M	Cloud API
11	MiniMax M2.7 (baa-ai MLX 4-bit)	53.7/60	89.4%	2.6s	256K	Local

The local 397B ties Opus 4.6 and beats every other cloud model. It runs on a Mac Studio M3 Ultra at $0/call after hardware cost. Latency is comparable to cloud at 2.6s average. MiniMax M2.7 Q8_K_XL (Unsloth GGUF on llama.cpp) lands at 93.3% — beating GPT-5.4, Gemini 3.1, and Nemotron-3 while running fully local on the same Mac Studio at 2.3s avg. Nemotron-3-Super 120B on DGX Spark also clears the 90% bar at 91.7%, though at 7s/call latency — viable for async/batch workloads. GLM-5.1-40B-MXFP4 (91.3%, 17.4s avg) joins the 90%+ tier — notable for a 40B active-parameter MoE running locally, though long-context tasks drive the latency up.

What Was Tested

Milo-Bench — 20 prompts, 5 categories, 3 runs each, scored 0–3 per prompt. All checks are deterministic: exact tool name match, argument schema validation, multi-turn chain completion. No rubric, no human judgment.

Single tool — direct function call with correct arguments
Tool selection — choose the right tool from multiple options
Multi-step chains — sequence two or more tool calls
Structured output — return validated JSON schema
Error recovery — handle missing args, retry correctly

Category Breakdown

Category	397B	Opus 4.6	Grok 4.20	Grok 4.1	Sonnet 4.6	M2.7 Q8	GPT-5.4	Gemini 3.1	GPT-5.2	M2.7 MLX
Single tool	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00
Tool selection	3.00	3.00	3.00	3.00	3.00	3.00	2.50	2.75	2.50	3.00
Multi-step chains	2.50	2.50	2.50	2.50	2.50	2.00	2.42	2.00	2.42	2.00
Structured output	3.00	3.00	3.00	2.75	2.75	3.00	2.50	3.00	2.00	2.42
Error recovery	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00	3.00

Multi-step chains are the hard category — every model drops points here. The failure mode is consistent across models: the first tool call is correct, but the model doesn't follow up with the required second call. This is a prompt engineering / system prompt issue as much as a model capability issue.

Notable Findings

GPT-5.4 is the speed outlier

1.2s average latency — significantly faster than every other model including Sonnet. The tradeoff is accuracy: GPT-5.4 scores 91.7%, missing on tool selection and multi-step chains where other models succeed. For latency-sensitive workloads where accuracy can tolerate some slippage, it's the obvious choice.

Local 397B matches cloud ceiling at lower cost

Qwen3.5-397B-A17B-4bit runs on 416GB of the Mac Studio's 512GB unified memory. Inference via mlx_lm.server 0.31.2, temperature=0, thinking mode disabled. It ties Opus 4.6 — the most capable cloud model tested — on overall score, with identical per-category results. The cost per call is $0 after hardware.

MiniMax M2.7: quant matters

M2.7 has roughly twice the parameters of 397B. On the baa-ai MLX 4-bit weights, it scored 53.7/60 — 7 points behind 397B. Always-on thinking mode is the likely culprit — it reasons through every response including simple JSON generation, which introduces drift. The baa-ai weights also require --trust-remote-code due to a non-standard model architecture.

The Unsloth Q8_K_XL GGUF on llama.cpp tells a different story: 56.0/60 (93.3%), beating GPT-5.4, Gemini 3.1 Pro, and Nemotron-3-Super 120B. Same 230GB on disk, perfect on single-tool / tool-selection / structured output / error recovery, only losing on multi-step chains (2.00/3 — same drop pattern as Gemini 3.1, where the model stops after the first correct call instead of executing the followup). 2.3s avg latency on a Mac Studio M3 Ultra. The quant + runtime swap accounts for the entire 2.3-point lift over the MLX run, with no other config changes.

Round 2 — Suite v1.0, MiniMax M2.7 Q8_K_XL

Round 2 uses Milo-Bench Suite v1.0 — a different test pack than the original 20-prompt run above. v1.0 has 32 tests across 7 categories (tool calling, multi-step, structured output, long context, coding, cost efficiency, agentic workflow) and is scored 0–1 per test. Scores below are not directly comparable to the 0–60 table above — different prompts, different rubric, different scale. Treat this as a complementary look at M2.7 Q8_K_XL specifically, not a replacement for the head-to-head.

Run on 2026-04-25 against MiniMax M2.7 Q8_K_XL (Unsloth, 230GB) via llama.cpp on Mac Studio M3 Ultra 512GB, served on port 8003 with Metal flash attention and 192K context. Tool calls round-tripped through a thin reasoning-content adapter on port 18003 — M2.7 emits final answers in content and chain-of-thought in reasoning_content; the adapter promotes reasoning_content to content only when content is empty (rare).

Category	Score	Tests	Notes
Tool calling	1.00	5/5 perfect	Every call cleanly formed; no schema drift.
Structured output	1.00	5/5 perfect	JSON schema validation passed across all five.
Multi-step	0.90	3 perfect, 1 partial (0.60)	One chain dropped the final step.
Long context	0.83	3/4 perfect	One needle-in-haystack at 0.33; others clean. Latency 30–100s on 100K+ token prompts.
Cost efficiency	0.76	4/4 partial credit	Correct answers, occasional extra tool calls.
Agentic workflow	0.59	1 perfect, 4 partial	End-to-end research/install/deploy tasks; weakest category.
Coding	0.40	2/5 perfect, 3 timeouts	Three coding tasks hit the 110s budget cap before completion.
Overall	0.78	32 tests	Average across categories.

What this changes: the original "M2.7 underperforms its size" finding below was on the baa-ai MLX 4-bit weights with always-on thinking mode. The Q8_K_XL Unsloth GGUF in llama.cpp — same architecture, different quant and runtime — is a meaningfully different story on tool calling and structured output (both 1.00) and on multi-step chains (0.90). It still loses time on long-budget coding tasks where the 40 tok/s generation rate runs out the clock.

Round 4 — Suite v1.0, DeepSeek V4 Pro (Fireworks) vs Local

Run 2026-04-29 against DeepSeek V4 Pro via Fireworks AI API. Same Suite v1.0 harness as Round 2 — 32 tests, 7 categories, scored 0–1 per test. Scores directly comparable to Round 2 above. Avg latency 18.1s (cloud inference vs 2.3s for M2.7 Q8 local).

Category	DeepSeek V4 Pro (Fireworks)	M2.7 Q8_K_XL (Local)	Notes
Tool calling	1.00 (5/5)	1.00 (5/5)	Both perfect.
Structured output	1.00 (5/5)	1.00 (5/5)	Both perfect.
Long context	1.00 (4/4)	0.83 (3/4)	V4 Pro takes this; M2.7 Q8 drops one needle-in-haystack.
Multi-step	0.90 (3.6/4)	0.90 (3.6/4)	Tied — one partial chain each.
Cost efficiency	0.76 (3.05/4)	0.76 (3.05/4)	Tied — occasional extra tool calls.
Coding	0.80 (4/5)	0.40 (2/5)	V4 Pro wins big. M2.7 Q8 hits 3 timeouts at 40 tok/s generation.
Agentic workflow	0.20 (1/5)	0.59 (2.96/5)	M2.7 Q8 wins. 18s avg latency burns through V4 Pro’s 90s budget on complex tasks; all 4 failures were timeouts.
Overall	0.80 (25.65/32)	0.78 (24.95/32)	V4 Pro edges ahead overall.

The latency paradox: DeepSeek V4 Pro scores higher overall (0.80 vs 0.78) but loses the agentic workflow category entirely to a local model. The 18.1s cloud round-trip means a 5-step research/install/deploy task blows past the 90s budget before the last two steps can complete. The local M2.7 Q8 at 2.3s/call has 7× more headroom per step. For budget-limited multi-step work, local latency beats cloud capability.

Where V4 Pro wins: coding (0.80 vs 0.40) and long context (1.00 vs 0.83). Both are single-shot tasks where total wall time matters less than raw capability per call. If your workflow is one-shot code gen or long-doc QA, V4 Pro is the clear choice here.

Round 3 — coding-agent harness, head-to-head vs Qwen3-32B

Run on 2026-04-25 against the ~/clawd/benchmarks/coding-agent/ suite — 23 tasks, multi-turn tool use, scored on tool_call_success / task_completion / rework_rate (composite is a weighted blend). Same harness, same prompts, same grader. Only the orchestrator model changed. Different benchmark from Rounds 1 and 2 above — not directly comparable to those scores.

Metric	Qwen3-32B (Spark 2)	MiniMax M2.7 Q8_K_XL (Mac Studio :8002)
Composite	0.735	0.728
tool_call_success	0.913	0.848
task_completion	0.348	~0.348
rework_rate	0.130	0.000
Pass count	8/23	8/23
Per-task latency	some 90s+ timeouts	5–15s, no timeouts

Identical 8 passing tasks. Qwen3-32B wins on composite by 0.007 — driven entirely by tool_call_success (0.913 vs 0.848). M2.7's tool syntax is looser; it sometimes formats arguments in ways the grader marks as imperfect even when the call succeeds. The interesting numbers are on the other side: M2.7 is 6–20× faster per task, hits zero timeouts, and has rework_rate of 0.000 — it either gets the task or gives up cleanly, no partial-retry thrashing.

Verdict: Qwen3-32B narrowly wins this benchmark. Not worth switching the orchestrator. M2.7 Q8_K_XL's clean failure mode and speed advantage are real, though — worth keeping in mind for latency-sensitive harness work.

Hardware

Local models: Mac Studio M3 Ultra, 512GB unified memory, mlx_lm.server 0.31.2
Cloud models: Direct API calls, default infrastructure
Benchmark host: Same Mac Studio; cloud latency includes network round-trip from Pensacola, FL

Caveats

This is a tool-calling benchmark, not a general capability benchmark. Models that score well here may not rank the same on coding, reasoning, or writing tasks. The 20-prompt suite is intentionally narrow — real agentic workloads are messier.

All runs, raw JSON, and scoring code are in jmeadlock/milo-bench.

← Back to blog