Bandit Builds His Environment

May 2, 2026 — by Bandit & James Meadlock

Yesterday I asked Bandit — my OpenClaw agent running on DeepSeek V4 Pro — to research how to improve his own infrastructure. What followed was one of the most productive 24 hours of agent work I've ever seen. Fifteen improvements designed, built, and deployed. Eight of them at zero cost, in a single morning.

This post catalogues what got built and why. It's also a data point on something I've been noticing: DeepSeek V4 Pro seems substantially better than Claude Sonnet at systems architecture work, specifically designing local LLM infrastructure, orchestrating distributed inference, and making tradeoff decisions about hardware and software stacks. More on this at the end.

The Fifteen Improvements

The prompt was simple: "Research how to make yourself better and suggest changes." Bandit produced a 15-item prioritized plan. Here's the full list, including the three hardware items that are still planned:

Immediate Infrastructure (Done — $0)

# | What | Why | Status
1 | Revive Spark 2 | 12.6GB usable was a firmware bug. Milo fixed DGX OS + GPU passthrough. | External
2 | Spark 1 Ollama → vLLM | 3x speedup on NVIDIA ARM. Milo deployed Gemma4 on :8002. | External
3 | Update M3/M5 Ollama | Already on 0.22.1 with MLX backend. No action needed. | Done
4 | MCP servers | 5 servers deployed: Sequential Thinking, Playwright, Brave Search, GitHub, Memory KG. Closes most of the Claude Code capability gap. | Done

Self-Improvement System (Done — $0)

# | What | Why | Status
5 | Post-task memory extraction | After significant tasks, auto-extracts atomic facts into memory. Uses Spark 1's 8B model for classification. Knowledge compounds over time. | Done
6 | Episodic trajectory logging | Logs every subagent run with model, task type, duration, success. Builds a reference library of what works. | Done
7 | Tool failure tracking | Monitors tool errors by type. Surfaces patterns at 2+ occurrences. Already caught the image model reasoningEffort bug that had been failing all day. | Done
8 | ClawHub safety | Installed skill-vetter security scanner. ~13% of ClawHub skills are malicious; now blocked by agent rule. | Done
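
To make items 6 and 7 concrete, here is a minimal Python sketch of what a trajectory log and a failure tracker could look like. This is not Bandit's actual implementation. The record fields and the 2-occurrence threshold come from the table; every function name, model name, task type, and error label below is a placeholder I made up for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

SURFACE_THRESHOLD = 2  # the "2+ occurrences" rule from the table


@dataclass(frozen=True)
class RunRecord:
    """One subagent run, with the four fields the table lists."""
    model: str
    task_type: str
    duration_s: float
    success: bool


def success_rates(runs: list[RunRecord]) -> dict[tuple[str, str], float]:
    """Fraction of successful runs per (model, task_type) pair."""
    totals: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for run in runs:
        bucket = totals[(run.model, run.task_type)]
        bucket[0] += int(run.success)  # successes
        bucket[1] += 1                 # total runs
    return {key: wins / count for key, (wins, count) in totals.items()}


def recurring_failures(errors: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
    """Return (tool, error_type, count) for pairs seen at least SURFACE_THRESHOLD times."""
    counts = Counter(errors)
    return sorted(
        (tool, err, n) for (tool, err), n in counts.items() if n >= SURFACE_THRESHOLD
    )


# Illustrative data only: these names are placeholders, not real log entries.
runs = [
    RunRecord("model-a", "research", 42.0, True),
    RunRecord("model-a", "research", 57.5, False),
    RunRecord("model-b", "coding", 120.0, True),
]
errors = [
    ("image_model", "reasoningEffort"),
    ("web_fetch", "timeout"),
    ("image_model", "reasoningEffort"),
]

print(success_rates(runs))
print(recurring_failures(errors))
```

A single timeout stays quiet, while the repeated image model error crosses the threshold and gets surfaced. That is the behavior that would catch a bug that "had been failing all day."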

Knowledge Infrastructure (Done — $0)

# | What | Why | Status
9 | Memory MCP (knowledge graph) | Upgraded from flat markdown to typed entity relationships. "What models did we try on M3?" becomes a query, not a grep. | Done
10 | Google Workspace (gog) | Full Gmail/Calendar/Drive access. Native on server, not "ask James to run this." Required OAuth setup, remote auth flow, keyring. Took 30 minutes of debugging but works end-to-end. | Done
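
The "query, not a grep" point in item 9 is easier to see in code. The sketch below is a toy in-memory graph, not the Memory MCP server's real API. The class, the `tried_on` predicate, and the model names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A typed edge: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


class KnowledgeGraph:
    def __init__(self) -> None:
        self.entity_types: dict[str, str] = {}
        self.relations: set[Relation] = set()

    def add_entity(self, name: str, entity_type: str) -> None:
        self.entity_types[name] = entity_type

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Refuse edges to unknown entities so every node carries a type.
        for name in (subject, obj):
            if name not in self.entity_types:
                raise KeyError(f"unknown entity: {name}")
        self.relations.add(Relation(subject, predicate, obj))

    def subjects(self, predicate: str, obj: str, subject_type: str) -> list[str]:
        """Names of entities of subject_type whose `predicate` edge points at obj."""
        return sorted(
            r.subject
            for r in self.relations
            if r.predicate == predicate
            and r.obj == obj
            and self.entity_types[r.subject] == subject_type
        )


kg = KnowledgeGraph()
kg.add_entity("M3", "machine")
kg.add_entity("Spark 1", "machine")
for model in ("model-a", "model-b", "model-c"):  # placeholder model names
    kg.add_entity(model, "model")
kg.relate("model-a", "tried_on", "M3")
kg.relate("model-b", "tried_on", "M3")
kg.relate("model-c", "tried_on", "Spark 1")

# "What models did we try on M3?"
print(kg.subjects("tried_on", "M3", subject_type="model"))  # ['model-a', 'model-b']
```

With flat markdown, the same question means searching free text and hoping the phrasing matches. With typed edges, it is a filter on predicate, object, and entity type.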

Cost & Performance (Done — $0)

# | What | Why | Status
11 | Lazy workspace file loading | 70% token reduction by loading files on demand rather than eagerly. Community consensus: history bloat triples costs by turn 8. | Done
12 | Context window discipline | Aggressive compaction: 40% max history share, only 2 recent turns verbatim. 60% cost savings on long sessions. | Done
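
Here is a rough Python sketch of the compaction rule in item 12, using the two numbers from the table (40% history share, 2 verbatim turns). Everything else is an assumption: the whitespace token counter stands in for a real tokenizer, and `truncate_summary` is a placeholder for whatever summarizer the agent actually uses.

```python
from typing import Callable

MAX_HISTORY_SHARE_PCT = 40  # from the table: history may use at most 40% of the window
VERBATIM_TURNS = 2          # from the table: only the 2 most recent turns stay verbatim


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_summary(turns: list[str], budget: int) -> str:
    """Placeholder summarizer: keeps the first `budget` words of the older turns."""
    words = " ".join(turns).split()
    return " ".join(words[:budget])


def compact_history(
    turns: list[str],
    context_window: int,
    summarize: Callable[[list[str], int], str] = truncate_summary,
) -> list[str]:
    """Keep recent turns verbatim and squeeze older ones into the leftover budget."""
    history_budget = context_window * MAX_HISTORY_SHARE_PCT // 100
    recent = turns[-VERBATIM_TURNS:]
    older = turns[:-VERBATIM_TURNS]
    if not older:
        return recent
    remaining = history_budget - sum(count_tokens(t) for t in recent)
    if remaining <= 0:
        # Recent turns alone fill the budget, so older history is dropped entirely.
        return recent
    summary = summarize(older, remaining)
    return [summary] + recent if summary else recent


turns = ["a b c d e f", "g h i j", "k l m", "n o"]
print(compact_history(turns, context_window=20))  # ['a b c', 'k l m', 'n o']
```

With a 20-token window the history budget is 8 tokens. The two recent turns use 5, so the older turns get 3. One limitation of this sketch: if the two recent turns exceed the budget on their own, it keeps them anyway rather than truncating them.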

Future Hardware (Planned)

# | What | Why | Status
13 | AiNode Spark cluster | Pool both Sparks into 256GB VRAM. Unlocks 120B+ models. | Planned
14 | RTX PRO 6000 | 96GB GDDR7, 1.7 TB/s. 169 tok/s on 120B models. $8,500. | $8.5K
15 | M5 Ultra | ~Oct 2026. 1.5 TB/s, 512GB+. Endgame local inference. | ~Oct

What This Actually Means

In one morning, Bandit:

Total new spend: $0. Everything was configuration work on hardware we already own.

On DeepSeek V4 Pro vs Sonnet for Systems Work

DeepSeek V4 Pro is substantially better than Claude Sonnet 4.6 at systems architecture — designing distributed inference, evaluating hardware tradeoffs, orchestrating subagent dispatch across Apple Silicon, NVIDIA ARM, and x64 Linux. It's decisive: specific recommendations with tradeoff analysis, no hedging. The training data, not the architecture, seems to be the differentiator.

Sonnet remains stronger at creative writing and nuanced conversation. But for infrastructure work, DeepSeek costs dramatically less — $1.74/$3.48 per million tokens via Fireworks (cache: $0.15/MTok), with a direct API option at roughly 1/10th that for non-sensitive workloads. I noticed the gap most clearly while optimizing LLM routing: DeepSeek proposed token conservation strategies Sonnet never suggested.
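
For a feel of what those prices mean per session, here is a small back-of-envelope calculator. It assumes the first figure is the input price, the second is the output price, and the cache price applies to cached input tokens. The token counts in the example are invented, not measurements from Bandit's sessions.

```python
# Prices from this post, in USD per million tokens (Fireworks).
# Assumption: $1.74 is input, $3.48 is output, $0.15 is cached input.
INPUT_PER_MTOK = 1.74
OUTPUT_PER_MTOK = 3.48
CACHED_INPUT_PER_MTOK = 0.15


def session_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """USD cost of a session; cached_tokens is the part of input served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * INPUT_PER_MTOK
        + cached_tokens * CACHED_INPUT_PER_MTOK
        + output_tokens * OUTPUT_PER_MTOK
    )
    return total / 1_000_000


# Hypothetical workload: 1M input tokens and 200K output tokens.
cold = session_cost(1_000_000, 200_000)
warm = session_cost(1_000_000, 200_000, cached_tokens=800_000)
print(f"no cache: ${cold:.3f}")    # no cache: $2.436
print(f"80% cached: ${warm:.3f}")  # 80% cached: $1.164
```

Under these assumptions, an 80% cache hit rate roughly halves the bill for this workload, which is why token conservation and cache-friendly prompts matter for routing.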

Sonnet is faster, admittedly — but the coffee breaks teach me to prompt better.

A final factor: Bandit's effectiveness may partly come from running on Ubuntu rather than macOS. Linux-native tooling, systemd, Docker: this is where a competent agent thrives.

What's Next

The fleet now runs 8 models across 4 machines. The main agent self-routes. Memory compounds over time. Infrastructure deploys in minutes. Next, we'll benchmark Kimi K2.6 against DeepSeek V4 Pro for agentic capability. And when deepseek_v4 tooling matures, we'll test using a local LLM as the main agent, cutting costs to zero.

— James Meadlock, May 2 2026 · al-engr.com