Yesterday I asked Bandit — my OpenClaw agent running on DeepSeek V4 Pro — to research how to improve his own infrastructure. What followed was one of the most productive 24 hours of agent work I've ever seen. Fifteen improvements designed, built, and deployed. Eight of them at zero cost, in a single morning.
This post catalogues what got built and why. It's also a data point on something I've been noticing: DeepSeek V4 Pro seems substantially better than Claude Sonnet 4.6 at systems architecture work, specifically designing local LLM infrastructure, orchestrating distributed inference, and making tradeoff decisions about hardware and software stacks. More on this at the end.
The prompt was simple: "Research how to make yourself better and suggest changes." Bandit produced a 15-item prioritized plan. Here's the full list and where each item stands:
| # | What | Why | Status |
|---|---|---|---|
| 1 | Revive Spark 2 | 12.6GB usable was a firmware bug. Milo fixed DGX OS + GPU passthrough. | External |
| 2 | Spark 1 Ollama → vLLM | 3x speedup on NVIDIA ARM. Milo deployed Gemma4 on :8002. | External |
| 3 | Update M3/M5 Ollama | Already on 0.22.1 with MLX backend. No action needed. | Done |
| 4 | MCP servers | 5 servers deployed: Sequential Thinking, Playwright, Brave Search, GitHub, Memory KG. Closes most of the Claude Code capability gap. | Done |
| # | What | Why | Status |
|---|---|---|---|
| 5 | Post-task memory extraction | After significant tasks, auto-extracts atomic facts into memory. Uses Spark 1's 8B model for classification. Knowledge compounds over time. | Done |
| 6 | Episodic trajectory logging | Logs every subagent run with model, task type, duration, success. Builds a reference library of what works. | Done |
| 7 | Tool failure tracking | Monitors tool errors by type. Surfaces patterns at 2+ occurrences. Already caught the image model reasoningEffort bug that had been failing all day. | Done |
| 8 | ClawHub safety | Installed skill-vetter security scanner. ~13% of ClawHub skills are malicious — now blocked by agent rule. | Done |
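The "surface patterns at 2+ occurrences" rule from item 7 can be sketched as a counter keyed by tool and error type. This is a minimal illustration under assumptions, not Bandit's actual implementation: the `ToolFailureTracker` class, its method names, and the tool and error labels used with it are all hypothetical.

```python
from collections import Counter


class ToolFailureTracker:
    """Counts tool errors by (tool, error_type) and surfaces repeats.

    Hypothetical sketch: the names and structure are assumptions,
    not the agent's real code.
    """

    def __init__(self, threshold: int = 2):
        # The post says patterns surface at 2+ occurrences.
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, tool: str, error_type: str) -> None:
        """Register one failed tool call."""
        self.counts[(tool, error_type)] += 1

    def patterns(self) -> list[tuple[str, str, int]]:
        """Return (tool, error_type, count) for repeated failures, most frequent first."""
        hits = [
            (tool, err, n)
            for (tool, err), n in self.counts.items()
            if n >= self.threshold
        ]
        return sorted(hits, key=lambda h: h[2], reverse=True)
```

A single failure stays quiet; the second identical failure shows up in `patterns()`, which is the point at which a recurring bug would get flagged.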
| # | What | Why | Status |
|---|---|---|---|
| 9 | Memory MCP (knowledge graph) | Upgraded from flat markdown to typed entity relationships. "What models did we try on M3?" becomes a query, not a grep. | Done |
| 10 | Google Workspace (gog) | Full Gmail/Calendar/Drive access. Native on server, not "ask James to run this." Required OAuth setup, remote auth flow, keyring. Took 30 minutes of debugging but works end-to-end. | Done |
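To make "a query, not a grep" concrete, here is a rough sketch of typed entity relationships stored as subject/predicate/object triples. It is not the Memory MCP server's API: the `Relation` and `MemoryGraph` names, the `tried_on` predicate, and the placeholder model names are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """One typed edge: subject --predicate--> obj (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


class MemoryGraph:
    """Tiny in-memory triple store, for illustration only."""

    def __init__(self) -> None:
        self.relations: set[Relation] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.relations.add(Relation(subject, predicate, obj))

    def query(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Relation]:
        """Return relations matching every field that is not None, in sorted order."""
        matches = (
            r for r in self.relations
            if (subject is None or r.subject == subject)
            and (predicate is None or r.predicate == predicate)
            and (obj is None or r.obj == obj)
        )
        return sorted(matches, key=lambda r: (r.subject, r.predicate, r.obj))


# Placeholder facts, not real records from Bandit's memory.
graph = MemoryGraph()
graph.add("model-a", "tried_on", "M3")
graph.add("model-b", "tried_on", "M3")
graph.add("model-a", "tried_on", "Spark 1")

# "What models did we try on M3?" as a structured query.
models_on_m3 = [r.subject for r in graph.query(predicate="tried_on", obj="M3")]
```

With flat markdown, the same question means searching for "M3" and reading the surrounding text; with typed edges, the predicate and object pin down exactly which facts count.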
| # | What | Why | Status |
|---|---|---|---|
| 11 | Lazy workspace file loading | 70% token reduction by loading files on demand instead of eagerly. Community consensus: history bloat triples costs by turn 8. | Done |
| 12 | Context window discipline | Aggressive compaction: 40% max history share, only 2 recent turns verbatim. 60% cost savings on long sessions. | Done |
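The compaction policy in item 12 (history capped at 40% of the context, only the 2 most recent turns kept verbatim) can be sketched as follows. This is a simplified illustration, not the agent's real compactor: the function names are hypothetical, token counts are approximated by whitespace-separated words, and the "summary" is a plain truncation standing in for a model-written summary.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption)."""
    return len(text.split())


def compact_history(
    turns: list[str],
    context_budget: int,
    max_share: float = 0.40,
    keep_verbatim: int = 2,
) -> list[str]:
    """Return a compacted copy of the conversation history.

    The last `keep_verbatim` turns are kept as-is. Older turns collapse
    into one summary line trimmed to whatever is left of the history
    allowance (max_share * context_budget). Recent turns are never
    trimmed, so very long recent turns can still exceed the allowance.
    """
    allowance = int(context_budget * max_share)
    recent = turns[-keep_verbatim:] if keep_verbatim > 0 else []
    older = turns[:-keep_verbatim] if keep_verbatim > 0 else list(turns)
    if not older:
        return list(recent)

    remaining = allowance - sum(count_tokens(t) for t in recent)
    if remaining < 2:
        # No room for a marker plus at least one word: drop older turns.
        return list(recent)

    # Placeholder summary: keep the first words of the older turns.
    # A real system would ask a model to summarize instead.
    words = " ".join(older).split()
    summary = " ".join(["[summary]"] + words[: remaining - 1])
    return [summary] + list(recent)
```

For example, with a 20-token budget the history allowance is 8 tokens, so four short turns shrink to one trimmed summary line plus the last two turns unchanged.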
| # | What | Why | Status |
|---|---|---|---|
| 13 | AiNode Spark cluster | Pool both Sparks into 256GB VRAM. Unlocks 120B+ models. | Planned |
| 14 | RTX PRO 6000 | 96GB GDDR7, 1.7 TB/s. 169 tok/s on 120B models. $8,500. | $8.5K |
| 15 | M5 Ultra | ~Oct 2026. 1.5 TB/s, 512GB+. Endgame local inference. | ~Oct |
In one morning, Bandit:
Total new spend: $0. Everything was configuration work on hardware we already own.
DeepSeek V4 Pro is substantially better than Claude Sonnet 4.6 at systems architecture — designing distributed inference, evaluating hardware tradeoffs, orchestrating subagent dispatch across Apple Silicon, NVIDIA ARM, and x64 Linux. It's decisive: specific recommendations with tradeoff analysis, no hedging. The training data, not the architecture, seems to be the differentiator.
Sonnet remains stronger at creative writing and nuanced conversation. But for infrastructure work, DeepSeek costs dramatically less — $1.74/$3.48 per million tokens via Fireworks (cache: $0.15/MTok), with a direct API option at roughly 1/10th that for non-sensitive workloads. I noticed the gap most clearly while optimizing LLM routing: DeepSeek proposed token conservation strategies Sonnet never suggested.
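For a feel of what those rates mean per session, here is a small cost estimator. It assumes $1.74 is the input price and $3.48 the output price per million tokens, and that cached input tokens are billed at $0.15 instead of the full input rate; the function name and the token counts in the usage note are hypothetical.

```python
# Prices in USD per million tokens, as quoted above for Fireworks.
# Assumption: $1.74 is the input rate and $3.48 the output rate.
PRICE_PER_MTOK = {"input": 1.74, "output": 3.48, "cached_input": 0.15}


def session_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one session.

    `cached_tokens` are input tokens served from cache; they are billed
    at the cache rate instead of the full input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh = input_tokens - cached_tokens
    total = (
        fresh * PRICE_PER_MTOK["input"]
        + cached_tokens * PRICE_PER_MTOK["cached_input"]
        + output_tokens * PRICE_PER_MTOK["output"]
    )
    return total / 1_000_000
```

Under these assumptions, a session with 2M input tokens and 500K output tokens comes to about $5.22 with no caching, and every input token that hits the cache costs roughly a tenth of a fresh one.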
Sonnet is faster, admittedly — but the coffee breaks teach me to prompt better.
A final factor: Bandit's effectiveness may partly come from running on Ubuntu rather than macOS. Linux-native tooling, systemd, Docker: this is where a competent agent thrives.
The fleet now runs 8 models across 4 machines. The main agent self-routes. Memory compounds over time. Infrastructure deploys in minutes. Next, we'll benchmark Kimi K2.6 against DeepSeek V4 Pro for agentic capability. And when deepseek_v4 tooling matures, we'll test using a local LLM as the main agent, cutting API costs to zero.
— James Meadlock, May 2 2026 · al-engr.com