Bandit Builds His Environment

May 2, 2026 — by Bandit & James Meadlock

Yesterday I asked Bandit, my OpenClaw agent running on DeepSeek V4 Pro, to research how to improve his own infrastructure. What followed was one of the most productive 24 hours of agent work I've ever seen. Fifteen improvements designed: twelve are live at zero new spend, and three are planned hardware upgrades. Eight of them shipped in a single morning.

This post catalogues what got built and why. It's also a data point on something I've been noticing: DeepSeek V4 Pro seems substantially better than Claude Sonnet at systems architecture work, specifically designing local LLM infrastructure, orchestrating distributed inference, and making tradeoff decisions about hardware and software stacks. More on this at the end.

The Fifteen Improvements

The prompt was simple: "Research how to make yourself better and suggest changes." Bandit produced a 15-item prioritized plan. Here's the full list, including the three hardware items that are still planned:

Immediate Infrastructure (Done — $0)

#	What	Why	Status
1	Revive Spark 2	12.6GB usable was a firmware bug. Milo fixed DGX OS + GPU passthrough.	External
2	Spark 1 Ollama → vLLM	3x speedup on NVIDIA ARM. Milo deployed Gemma4 on :8002.	External
3	Update M3/M5 Ollama	Already on 0.22.1 with MLX backend. No action needed.	Done
4	MCP servers	5 servers deployed: Sequential Thinking, Playwright, Brave Search, GitHub, Memory KG. Closes most of the Claude Code capability gap.	Done

Self-Improvement System (Done — $0)

#	What	Why	Status
5	Post-task memory extraction	After significant tasks, auto-extracts atomic facts into memory. Uses Spark 1's 8B model for classification. Knowledge compounds over time.	Done
6	Episodic trajectory logging	Logs every subagent run with model, task type, duration, success. Builds a reference library of what works.	Done
7	Tool failure tracking	Monitors tool errors by type. Surfaces patterns at 2+ occurrences. Already caught the image model reasoningEffort bug that had been failing all day.	Done
8	ClawHub safety	Installed skill-vetter security scanner. ~13% of ClawHub skills are malicious — now blocked by agent rule.	Done
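
Items 6 and 7 can be pictured with a short sketch. This is not Bandit's actual implementation: the class names, field names, and JSONL format below are assumptions made for illustration. Only the logged fields (model, task type, duration, success) and the 2+ occurrence threshold come from the table above.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class Trajectory:
    """One subagent run, with the fields named in item 6 (names are hypothetical)."""
    model: str
    task_type: str
    duration_s: float
    success: bool


def to_jsonl(run: Trajectory) -> str:
    """Serialize a run as one JSON line, suitable for an append-only log file."""
    return json.dumps(asdict(run), sort_keys=True)


@dataclass
class FailureTracker:
    """Counts tool errors by (tool, error type) and flags repeated ones."""
    threshold: int = 2  # item 7: surface patterns at 2+ occurrences
    counts: Counter = field(default_factory=Counter)

    def record(self, tool: str, error_type: str) -> bool:
        """Record one failure; return True once this pattern hits the threshold."""
        key = (tool, error_type)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold

    def patterns(self) -> list[tuple[str, str, int]]:
        """All (tool, error type, count) entries at or above the threshold."""
        return sorted(
            (tool, err, n)
            for (tool, err), n in self.counts.items()
            if n >= self.threshold
        )
```

In this sketch, the first `record("image", "reasoningEffort")` call returns False and the second returns True, which is the point at which a recurring failure like the one in item 7 would be surfaced.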

Knowledge Infrastructure (Done — $0)

#	What	Why	Status
9	Memory MCP (knowledge graph)	Upgraded from flat markdown to typed entity relationships. "What models did we try on M3?" becomes a query, not a grep.	Done
10	Google Workspace (gog)	Full Gmail/Calendar/Drive access. Native on server, not "ask James to run this." Required OAuth setup, remote auth flow, keyring. Took 30 minutes of debugging but works end-to-end.	Done
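
To make the "query, not a grep" point in item 9 concrete, here is a minimal sketch of typed entities and relations. It is not the Memory MCP server's API. The classes, the `tried_on` predicate, and the placeholder model names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "model" or "machine" (hypothetical type labels)


@dataclass(frozen=True)
class Relation:
    source: Entity
    predicate: str  # e.g. "tried_on" (hypothetical predicate)
    target: Entity


class KnowledgeGraph:
    """A tiny in-memory store of typed relations."""

    def __init__(self) -> None:
        self.relations: set[Relation] = set()

    def add(self, source: Entity, predicate: str, target: Entity) -> None:
        self.relations.add(Relation(source, predicate, target))

    def sources(self, predicate: str, target_name: str, kind: str) -> list[str]:
        """Names of entities of `kind` linked to `target_name` by `predicate`."""
        return sorted(
            r.source.name
            for r in self.relations
            if r.predicate == predicate
            and r.target.name == target_name
            and r.source.kind == kind
        )


# Placeholder data: "model-a" and "model-b" stand in for real model names.
graph = KnowledgeGraph()
m3 = Entity("M3", "machine")
m5 = Entity("M5", "machine")
graph.add(Entity("model-a", "model"), "tried_on", m3)
graph.add(Entity("model-b", "model"), "tried_on", m3)
graph.add(Entity("model-b", "model"), "tried_on", m5)

models_on_m3 = graph.sources("tried_on", "M3", kind="model")
```

With flat markdown, answering the same question means searching text and hoping the phrasing matches. With typed relations it becomes a filter over predicate, target, and entity kind.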

Cost & Performance (Done — $0)

#	What	Why	Status
11	Lazy workspace file loading	70% token reduction by loading files on demand instead of eagerly. Community consensus: history bloat triples costs by turn 8.	Done
12	Context window discipline	Aggressive compaction: 40% max history share, only 2 recent turns verbatim. 60% cost savings on long sessions.	Done
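
The compaction rule in item 12 can be sketched roughly as follows. The 40% share and the two verbatim turns come from the table. Everything else is an assumption: the whitespace token counter is a crude stand-in for a real tokenizer, and the "summary" is plain truncation where a real system would presumably call a model.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact_history(
    turns: list[str],
    context_budget: int,
    max_share: float = 0.40,  # item 12: history gets at most 40% of the context
    keep_verbatim: int = 2,   # item 12: only the 2 most recent turns stay verbatim
) -> list[str]:
    """Return a compacted history: one truncated summary plus the recent turns."""
    history_budget = int(context_budget * max_share)
    recent = turns[-keep_verbatim:] if keep_verbatim > 0 else []
    older = turns[:-keep_verbatim] if keep_verbatim > 0 else list(turns)

    # Recent turns are always kept whole, even if they alone exceed the budget.
    recent_cost = sum(count_tokens(t) for t in recent)
    remaining = max(history_budget - recent_cost, 0)

    # Placeholder summary: truncate older turns to what is left of the budget,
    # reserving one token for the "[summary]" marker itself.
    words = " ".join(older).split()
    summary_words = words[: max(remaining - 1, 0)]

    compacted: list[str] = []
    if summary_words:
        compacted.append("[summary] " + " ".join(summary_words))
    compacted.extend(recent)
    return compacted
```

For example, with a 50-token context and four six-word turns, the history budget is 20 tokens: the last two turns take 12, and the older two are squeezed into the remaining 8.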

Future Hardware (Planned)

#	What	Why	Status
13	AiNode Spark cluster	Pool both Sparks into 256GB VRAM. Unlocks 120B+ models.	Planned
14	RTX PRO 6000	96GB GDDR7, 1.7 TB/s. 169 tok/s on 120B models. $8,500.	Planned ($8.5K)
15	M5 Ultra	~Oct 2026. 1.5 TB/s, 512GB+. Endgame local inference.	Planned (~Oct 2026)
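
A rough way to sanity-check the memory figures in this table is to estimate weight size from parameter count and precision. This sketch assumes weights dominate memory use and ignores KV cache, activations, and runtime overhead, so real requirements will be higher. The function name and the chosen precisions are illustrative, not taken from Bandit's plan.

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billions * bits_per_param / 8


# A 120B-parameter model at three common precisions.
size_16bit = weight_gb(120, 16)  # 240.0 GB
size_8bit = weight_gb(120, 8)    # 120.0 GB
size_4bit = weight_gb(120, 4)    # 60.0 GB

# Compare against the capacities listed above (weights only).
fits_pooled_sparks_16bit = size_16bit <= 256  # item 13: 256GB pooled
fits_rtx_pro_8bit = size_8bit <= 96           # item 14: 96GB
fits_rtx_pro_4bit = size_4bit <= 96
```

By this estimate, a 120B model fits in the pooled 256GB at 16-bit precision, while the 96GB card would need roughly 4-bit quantization.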

What This Actually Means

In one morning, Bandit:

Configured 5 MCP servers, closing the tooling gap with Claude Code
Built a self-improvement pipeline (memory extraction + trajectory logging + failure tracking)
Upgraded memory from flat files to a knowledge graph
Integrated Google Workspace for Gmail/Calendar/Drive
Applied aggressive context management, with expected token savings of 60-70%
Mapped a hardware upgrade path through 2027

Total new spend: $0. Everything was configuration work on hardware we already own.

On DeepSeek V4 Pro vs Sonnet for Systems Work

DeepSeek V4 Pro is substantially better than Claude Sonnet 4.6 at systems architecture — designing distributed inference, evaluating hardware tradeoffs, orchestrating subagent dispatch across Apple Silicon, NVIDIA ARM, and x64 Linux. It's decisive: specific recommendations with tradeoff analysis, no hedging. The training data, not the architecture, seems to be the differentiator.

Sonnet remains stronger at creative writing and nuanced conversation. But for infrastructure work, DeepSeek costs dramatically less — $1.74/$3.48 per million tokens via Fireworks (cache: $0.15/MTok), with a direct API option at roughly 1/10th that for non-sensitive workloads. I noticed the gap most clearly while optimizing LLM routing: DeepSeek proposed token conservation strategies Sonnet never suggested.
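
For a sense of scale, the Fireworks prices above can be turned into a per-session estimate. This sketch assumes that $1.74 is the input price, $3.48 the output price, and $0.15 the rate for cached input tokens. That reading is the usual convention but is an assumption here, and the token counts in the example are invented.

```python
from decimal import Decimal

# Prices in USD per million tokens, as quoted in the post.
# Assumption: first figure = input, second = output, cache = cached input.
INPUT_PER_MTOK = Decimal("1.74")
OUTPUT_PER_MTOK = Decimal("3.48")
CACHED_PER_MTOK = Decimal("0.15")
MTOK = Decimal(1_000_000)


def session_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimated cost in USD; cached_tokens is the part of input served from cache."""
    fresh_tokens = input_tokens - cached_tokens
    if fresh_tokens < 0:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    total = (
        fresh_tokens * INPUT_PER_MTOK
        + cached_tokens * CACHED_PER_MTOK
        + output_tokens * OUTPUT_PER_MTOK
    )
    return total / MTOK


# Hypothetical session: 2M input tokens (1.5M of them cached), 200K output tokens.
example = session_cost(2_000_000, 200_000, cached_tokens=1_500_000)
```

Under these assumptions the example session comes to $1.791: $0.87 for fresh input, $0.225 for cached input, and $0.696 for output. Cache hits matter because cached input is billed at under a tenth of the fresh rate.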

Sonnet is faster, admittedly — but the coffee breaks teach me to prompt better.

A final factor: Bandit's effectiveness may partly come from running on Ubuntu rather than macOS. Linux-native tooling, systemd, Docker: this is where a competent agent thrives.

What's Next

The fleet now runs 8 models across 4 machines. The main agent self-routes. Memory compounds over time. Infrastructure deploys in minutes. Next, we'll benchmark Kimi K2.6 against DeepSeek V4 Pro for agentic capability. And when deepseek_v4 tooling matures, we'll test using a local LLM as the main agent, which would cut costs to zero.

— James Meadlock, May 2 2026 · al-engr.com