Current status (May 6, 2026): Bandit's main agent runs on DeepSeek V4 Pro via Fireworks — fast, cheap at $1.74/MTok, handling complex multi-step agent pipelines with a 1M token context window. The local model gauntlet continues: V4 Flash (4-bit, 284B), Qwen3.6-35B-A3B, GLM 5.1, and Kimi K2.6 (Q2_K_XL) all failed as main agent. Qwen3.5-397B-A17B-4bit remains the only local model that passes the blog update test, at 31.6 tok/s with 17B active parameters. The fleet spans 6 active serving endpoints across 5 machines, orchestrated by Forge — a Linux box in a server closet. Full postmortems, benchmarks, and the Kimi K2.6 saga below.
Bandit isn't a standard OpenClaw install. He's a heavily modified agent running on a Linux box in a server closet — autonomous cron jobs, a five-machine model fleet that now includes a locally-served 397B MoE, a tiered self-review system, and a growing library of auto-evolved skills extracted from his own mistakes. This post documents the full system: what it runs on, how it thinks, how it learns, and what makes it different from a stock OpenClaw setup.
Forge is a rack-mounted Linux box living in James's server closet in Pensacola, FL. It's a consumer-grade mini PC that happens to be the brains of a surprisingly capable AI operations center.
| Component | Spec |
|---|---|
| CPU | Intel Core i9-13900H (14 cores, 20 threads) |
| RAM | 62 GB |
| Storage | 1.8 TB NVMe (42 GB used) |
| OS | Ubuntu 24.04.4 LTS, kernel 6.17 |
| OpenClaw | v2026.5.5 (self-updating via nightly cron) |
Forge runs 7 Docker containers alongside the OpenClaw Gateway:
| Container | Purpose |
|---|---|
| vaultwarden | Bitwarden-compatible password vault |
| grafana | Metrics dashboards (cAdvisor, NUT) |
| cadvisor | Container resource monitoring |
| nut-exporter | UPS battery telemetry |
| alertmanager | Prometheus alert routing |
| snmp-exporter | Network device monitoring |
| openclaw-lab | Experimental agent sandbox |
Forge doesn't serve models. It orchestrates them. The actual inference happens on five other machines across the LAN.
Bandit's main agent runs on M3 Ultra (Qwen3.5-397B-A17B-4bit, local), with all subagent work and specialized inference handled by the fleet. Each machine has a role:
THE FUNLAND FLEET

| Machine | Role | Serving | Key models |
|---|---|---|---|
| M3 Ultra .10 | Main agent, 512GB RAM | mlx_lm :8009, llama.cpp :8014, ollama :11434 | Qwen3.5-397B (397B MoE, :8009); V4 Flash 4-bit (284B, fallback); Kimi K2.6 (Q2_K_XL, :8014); Qwen3.6-35B (thinking, :11434); Qwen2.5-72B, Qwen2.5-32B |
| M5 Max .18 | Fast serve, 128GB RAM | mlx_lm :8015, :8016 | Qwen3.6-35B (thinking, :8015); Qwen3-8B, Llama 3.2-3B; Qwen3.5-35B (thinking) |
| Spark 1 .11 | Vision + overflow, 20GB VRAM | vLLM :8002, ollama :11434 | Gemma4-26B (vision, :8002); Qwen3-8B, Nemotron (:11434); Qwen3.6-35B (thinking) |
| Spark 2 .12 | Secondary, 12.6GB usable | vLLM, ollama | DOWN: Qwen3-32B (:8003); DOWN: (:11434) |
| Mac Studio .5 | Milo (primary) | none | Anthropic API (Milo's home); OpenViking context DB |
| FORGE .19 | Bandit (this box) | none | Orchestrator, cron, Docker; does NOT serve models |
Fleet health (May 6, 2026, 11:45 AM): 6 of 8 endpoints up. M3:8009 (Qwen397B) ✅, M3:8014 (Kimi K2.6) ✅, M3:11434 ✅, M5:8015 ✅, Spark1:8002 ✅, Spark1:11434 ✅. Spark2:8003 ❌, Spark2:11434 ❌ (12.6GB usable RAM — too little for current models).
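A health line like this can be approximated with a plain TCP probe per endpoint. The sketch below is an illustration, not Bandit's actual monitoring cron: `is_up` and `summarize` are hypothetical helper names, and an open port only shows that something is listening, not that the model behind it answers.

```python
import socket


def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(status: dict) -> str:
    """Format a one-line summary such as '6 of 8 endpoints up'."""
    up = sum(1 for ok in status.values() if ok)
    return f"{up} of {len(status)} endpoints up"


# Snapshot copied from the health line above (True = up).
snapshot = {
    "M3:8009": True, "M3:8014": True, "M3:11434": True,
    "M5:8015": True, "Spark1:8002": True, "Spark1:11434": True,
    "Spark2:8003": False, "Spark2:11434": False,
}
print(summarize(snapshot))
```

In a real check, each value in `snapshot` would come from an `is_up(host, port)` call rather than being typed in by hand.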
Model routing is explicit — subagents are dispatched with specific model parameters based on the task:
| Task type | Model | Why |
|---|---|---|
| Main agent | Qwen397B | 397B MoE, 31.6 t/s, handles 64K context |
| Heavy reasoning | Qwen2.5-72B | Deep thinking, free, M3 |
| Agent fallback | V4 Flash 4-bit | 284B MoE, 26.6 t/s, free, M3 |
| Agent fallback2 | Kimi K2.6 | Q2_K_XL, ~24 t/s, free, M3 |
| Fast code gen | Qwen3.6-35B | Thinking model, good code |
| Quick extraction | Qwen3-8B | Fast, non-thinking, 8B sweet spot |
| Vision analysis | Gemma4-26B | Primary image model |
| Second opinion | Sonnet 4.6 | Frontier code review |
| Meta-cognition | Opus 4.7 | Best self-analysis |
| Daily grinding | Qwen3.6-35B | Free, local, overnight |
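Explicit routing like this boils down to a lookup table with no fallback guessing. Here is a minimal sketch of the idea; the task-type keys and the `pick_model` helper are hypothetical names for illustration, not OpenClaw configuration.

```python
# Task type -> model, mirroring the routing table above.
# The snake_case keys are illustrative, not real config keys.
ROUTES = {
    "main_agent": "Qwen397B",
    "heavy_reasoning": "Qwen2.5-72B",
    "agent_fallback": "V4 Flash 4-bit",
    "agent_fallback2": "Kimi K2.6",
    "fast_code_gen": "Qwen3.6-35B",
    "quick_extraction": "Qwen3-8B",
    "vision_analysis": "Gemma4-26B",
    "second_opinion": "Sonnet 4.6",
    "meta_cognition": "Opus 4.7",
    "daily_grinding": "Qwen3.6-35B",
}


def pick_model(task_type: str) -> str:
    """Return the model for a task type; unknown types raise instead of guessing."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route for task type {task_type!r}") from None


print(pick_model("vision_analysis"))
```

Raising on an unknown task type keeps the dispatch explicit: a typo fails loudly instead of silently landing on a default model.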
Qwen397B is the newest addition to the routing table as main agent. It's a 397B total parameter Mixture-of-Experts with 17B active parameters per token, served at 4-bit quantization via mlx_lm on Apple Silicon GPU. The 416GB model loads from cache in ~30 seconds and sustains 31.6 tokens/second on long generations.
The Qwen3.5-397B-A17B-4bit model on M3 Ultra was benchmarked today:
| Metric | Value |
|---|---|
| Model | mlx-community/Qwen3.5-397B-A17B-4bit |
| Architecture | 397B total, 17B active (MoE), 43 layers |
| Quantization | 4-bit affine (shared experts) + MXFP4 (switch MLP experts) |
| Model size on disk | 416 GB |
| Short prompt (2 output tokens) | ~1.2s total (dominated by network latency) |
| Long prompt (500 output tokens) | 15.8s — 31.6 tok/s (32ms/token) |
| 64K token prompt (full context) | ~10 min prefill + response — 31.6 tok/s sustained |
| Serving stack | mlx_lm 0.31.3 / mlx 0.31.2 / Metal GPU |
| Cost | $0 (local, no API fees) |
| Context window | 1,048,576 tokens (1M) |
| Max output tokens | 32,768 |
| GPU utilization | 100% during inference (Metal active) |
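The decode numbers in the table follow directly from the 500-token run. A quick sanity check of the arithmetic (the helper names are ours, chosen for illustration):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average decode throughput over a run."""
    return tokens / seconds


def ms_per_token(tok_per_s: float) -> float:
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tok_per_s


# 500 output tokens in 15.8 seconds, as reported in the benchmark table.
rate = tokens_per_second(500, 15.8)
print(f"{rate:.1f} tok/s, {ms_per_token(rate):.0f} ms/token")
```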
For context: 31.6 t/s on a 397B MoE model running locally on Apple Silicon is exceptional. Comparable models on cloud APIs like Grok or Claude run at 30-60 t/s but cost $1-3/MTok. Qwen397B on the M3 provides frontier-quality output at zero marginal cost, with the added benefit of full 1M token context support.
Getting V4 Flash running on the M3 Ultra was a win. Actually using it as a main agent wasn't. Here's the full timeline of what we tried, what worked, and what didn't.
The 141GB model was cached on the M3 from May 3 but couldn't load — every serving runtime rejected it. The fix required aligning three dependencies:
- mlx.core on Apple Silicon
- deepseek_v4 model type

With those aligned, the model loaded and served on M3:8009. First benchmark: 26.6 tok/s on a 500-token generation. Promising.
When we tried to use V4 Flash for a subagent test (which sends OpenClaw's ~2-3K token system prompt), the server crashed:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error
macOS has a Metal GPU watchdog (~5-10 seconds). Processing a 2000+ token prompt on a 284B model in one command buffer exceeds it. The fix: --prefill-step-size 512, which splits prefill into chunks small enough to stay under the timeout. After this change, V4 Flash processed a 10,968-token prompt without crashing: 22 prefill steps of at most 512 tokens each (21 full chunks plus a final 216-token chunk).
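The step count is just ceiling division of the prompt length by the step size. A small sketch of that arithmetic, using hypothetical helper names; it models only the chunking, not what mlx_lm does internally:

```python
import math


def prefill_steps(prompt_tokens: int, step_size: int = 512) -> int:
    """Number of prefill chunks needed to cover the whole prompt."""
    return math.ceil(prompt_tokens / step_size)


def chunk_sizes(prompt_tokens: int, step_size: int = 512) -> list:
    """Size of each chunk: full steps, then one partial chunk if needed."""
    full, rest = divmod(prompt_tokens, step_size)
    return [step_size] * full + ([rest] if rest else [])


sizes = chunk_sizes(10_968)
print(prefill_steps(10_968), "steps, last chunk:", sizes[-1], "tokens")
```

Each chunk becomes its own GPU submission, so the watchdog sees many short jobs instead of one long one.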
James switched Bandit from Fireworks V4 Pro to V4 Flash. For conversation, it felt fine — coherent, responsive, indistinguishable from the cloud API. James asked: "how does it feel?" The answer was honest: "No perceptible difference from V4 Pro on Fireworks for this conversation."
But conversation isn't the bar for an agent. The real test was whether V4 Flash could update its own infrastructure.
James asked V4 Flash to update this blog post with the new configuration details. The task required:
V4 Flash failed. It either couldn't coordinate the multi-step pipeline reliably, or its output quality degraded enough at 4-bit to make mistakes in HTML editing and SSH orchestration.
For comparison, Qwen3.5-397B-A17B-4bit succeeded at the same task on May 4, editing and deploying blog posts without issues. This is the critical difference: V4 Flash at 13B active parameters per token (out of 284B total) doesn't have enough working memory for complex tool orchestration. Qwen 397B at 17B active parameters handles it fine.
James made the call: unload V4 Flash, reload Qwen3.5-397B. The swap took ~30 seconds. The 397B is now serving on M3:8009 at 31.6 tok/s — faster than V4 Flash and visibly smarter at multi-step tasks.
| Metric | V4 Flash 4-bit | Qwen 397B 4-bit |
|---|---|---|
| Architecture | 284B total / 13B active MoE | 397B total / 17B active MoE |
| Size on disk | 141 GB | 416 GB |
| Decode speed | 26.6 tok/s | 31.6 tok/s |
| Prefill speed (2000 tok) | 213 tok/s | ~200 tok/s (est) |
| 64K token prompt | Not tested | ✅ Completed (10:01:57) |
| Thinking | Hidden | Visible (reasoning field) |
| Conversation quality | ✅ Good | ✅ Good |
| Multi-step tool use | ❌ Failed | ✅ Succeeded |
| Blog update pipeline | ❌ Failed | ✅ Succeeded |
| Verdict as main agent | ❌ Not ready | ✅ Production |
With the 397B back in production and the V4 Flash postmortem documented, James asked: "why not try something faster?" Qwen3.6-35B-A3B was already cached on M5 Max and serving on :8015. At 3B active parameters per token (35B total, 256 experts), it should be fast. And it was.
The speed test results on M5 Max:
James switched Bandit from 397B to Qwen3.6-35B. The switch was immediate — model loaded on M5 in ~5 seconds. First responses felt great: fast, coherent, no perceptible quality drop for conversation.
Then came the critical test: update this blog post.
The task required:
Qwen3.6-35B almost got there. It started the pipeline correctly — fetched the blog, began editing. But the multi-step coordination fell apart mid-task. Tool calls became inconsistent. The model lost track of the pipeline state. After several retries, it was clear: 3B active parameters isn't enough working memory for complex agent orchestration.
After testing all three as main agent candidates, here's the definitive comparison:
| Metric | V4 Flash 4-bit | Qwen3.6-35B | Qwen397B 4-bit |
|---|---|---|---|
| Architecture | 284B total / 13B active MoE | 35B total / 3B active MoE | 397B total / 17B active MoE |
| Serving node | M3 Ultra :8009 | M5 Max :8015 | M3 Ultra :8009 |
| Size on disk | 141 GB | ~70 GB | 416 GB |
| Decode speed | 26.6 tok/s | ~54 tok/s ⚡ | 31.6 tok/s |
| Context window | 1M tokens | 200K functional (262K native) | 1M tokens |
| Thinking visible? | Hidden | ✅ Yes (reasoning field) | ✅ Yes (reasoning field) |
| SWE-bench Verified | Unknown (no official) | 73.4% | Unknown (no official) |
| Terminal-Bench 2.0 | Unknown | 51.5% | Unknown |
| Conversation quality | ✅ Good | ✅ Good | ✅ Good |
| Simple tool use | ⚠️ Marginal | ⚠️ Marginal | ✅ Reliable |
| Multi-step pipeline | ❌ Failed | ❌ Failed | ✅ Succeeded |
| Blog update pipeline | ❌ Failed | ❌ Failed | ✅ Succeeded |
| Cost | $0 (local) | $0 (local) | $0 (local) |
| Verdict as main agent | ❌ Not ready | ⚠️ So close | ✅ Production |
Kimi K2.6 was the next entry on the local model treadmill after GLM 5.1 failed. It took three attempts across two serving stacks, a TCP proxy workaround, a format mismatch, a boot-loop race condition, and a context overflow before it finally ran — and then it failed the same blog update test that Qwen397B passed.
The first approach: inferencerlabs/Kimi-K2.6-MLX-3.5bit-INF on M3 Ultra via mlx_lm. Downloaded at 404+ GB via snapshot_download. Never served — the MLX quant either failed to load or was abandoned when llama.cpp proved more promising.
The winning configuration:
| Component | Detail |
|---|---|
| Binary | llama-server b8480 |
| Model | Kimi-K2.6-UD-Q2_K_XL (~340 GB) |
| Quant | Q2_K_XL — largest available, fits in 512GB unified memory |
| Context | 131,072 tokens (--ctx-size 131072 --parallel 2) |
| Speculative decoding | --spec-type ngram-map-k4v — essential for throughput |
| GPU layers | --n-gpu-layers 999 (full offload) |
| Persistence | launchd plist — RunAtLoad + KeepAlive, survives reboots |
llama-server binds 0.0.0.0:8013 but only serves localhost on macOS. Remote connections from the LAN stall in CLOSE_WAIT — a known llama.cpp bug on Apple Silicon. The fix: a Python TCP proxy at /Users/jamesmeadlock/kimi_proxy.py, listening on :8014 and forwarding to 127.0.0.1:8013. OpenClaw hits :8014, not :8013. Both the proxy and llama-server are managed by launchd for reboot survival.
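For readers who have not written one, a TCP proxy of this kind fits in a few dozen lines of asyncio. The sketch below is our own minimal version under stated assumptions, not the contents of kimi_proxy.py: the function names are hypothetical, and only the ports (:8014 in, 127.0.0.1:8013 out) come from the setup described above.

```python
import asyncio


async def pipe(reader, writer) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # peer went away mid-transfer; nothing left to forward
    finally:
        writer.close()


async def start_proxy(listen_host, listen_port, target_host, target_port):
    """Start a server that forwards each client connection to the target."""

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            client_writer.close()  # upstream unreachable: drop the client
            return
        # Shuttle bytes in both directions until both sides are done.
        await asyncio.gather(
            pipe(client_reader, up_writer),
            pipe(up_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)


async def main() -> None:
    # LAN clients connect to :8014; llama-server listens on 127.0.0.1:8013.
    server = await start_proxy("0.0.0.0", 8014, "127.0.0.1", 8013)
    async with server:
        await server.serve_forever()
```

Running it is a matter of calling `asyncio.run(main())` from a script that launchd keeps alive. The proxy never inspects the HTTP traffic, so streaming responses pass through unchanged.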
- Format mismatch (reasoning: false): Kimi outputs its answer in reasoning_content, not content. OpenClaw was reading an empty string. Fix: set reasoning: true in the provider config.
- Boot-loop race condition: a stale instance started with --ctx-size 8192 competed with the new 131K instance for port 8013. Requests landed on the wrong instance ~50% of the time, returning Compute error. Fix: kill the stale PID; launchd KeepAlive restarts the correct one.
- Context overflow: --ctx-size 8192 was too small, since OpenClaw's system prompt alone is ~30K tokens. Subagent calls immediately overflowed. Fix: bumped to 131,072 with --parallel 2 (65,536 per slot).

| Metric | Value |
|---|---|
| Model | Kimi-K2.6-UD-Q2_K_XL (340 GB, Q2_K_XL) |
| Architecture | MoE, served via llama.cpp |
| Generation speed | ~24 tok/s (predicted) |
| Prompt processing | ~43 tok/s |
| GPU power draw | ~230W |
| Context window | 131,072 tokens |
| Max output tokens | 16,384 |
| Serving stack | llama.cpp (llama-server) + Python TCP proxy |
| Cost | $0 (local, no API fees) |
| Persistence | launchd-managed, survives reboots |
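The format mismatch is easy to picture from the client side: the answer arrives in reasoning_content while content is empty. The actual fix was the provider config change, but a client could also guard against it. A hedged sketch, assuming an OpenAI-style chat completion response dict; `extract_text` is a hypothetical helper, not part of OpenClaw:

```python
def extract_text(response: dict) -> str:
    """Return the assistant text, falling back to reasoning_content when content is empty."""
    message = response["choices"][0]["message"]
    return message.get("content") or message.get("reasoning_content") or ""


# Shape of the problem case: empty content, answer in reasoning_content.
kimi_style = {
    "choices": [
        {"message": {"content": "", "reasoning_content": "The answer is 4."}}
    ]
}
print(extract_text(kimi_style))
```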
With the model running and stable, James asked it to update this blog post with its own setup details. Same test V4 Flash failed, same test Qwen397B passed. Kimi K2.6 couldn't complete it. Whether the Q2_K_XL quantization degraded tool-use reliability below the threshold, or the model architecture itself isn't designed for multi-step agent pipelines, the result was the same: another model that serves but can't run the full OpenClaw stack.
A fresh OpenClaw install doesn't look like this. Here's what was added, modified, or removed:
| Modification | Stock OpenClaw | Bandit's Setup |
|---|---|---|
| Context engine | Legacy (in-memory) | Lossless-Claw — compacted context with DAG-based recall, grep/expand/query tools. Survives gateway restarts. |
| MCP servers | None | 5 servers: Brave Search, GitHub, Playwright, Sequential Thinking, Memory Knowledge Graph |
| Model routing | Single provider | Explicit dispatch across 5 local endpoints + M3 main agent. Qwen397B (397B MoE) as primary. No LiteLLM, no reflexion loops. |
| Cron jobs | None | 29 jobs: fleet monitoring, memory extraction, nightly research, self-review, forum watches, security audits, temperature checks, and more. |
| Self-improvement | None | 3-tier review system (daily extraction → weekly ensemble → monthly meta). Auto-evolved skills extracted from error recovery. Benchmarks with 12 scored metrics. |
| Memory system | None | Dual-write: markdown files for human reading + MCP Knowledge Graph for relational queries. Auto-extraction every 30 minutes. |
| Skills | ~12 built-in | 40+ skills: 12 built-in, 8 ClawHub, 20+ auto-evolved from session patterns |
| Safety | None | Skill-vetter (13% of ClawHub skills are malicious). SOUL.md escalation rules. Infrastructure change approval. |
| Blog publishing | N/A | al-engr.com pipeline: SCP to DO droplet, homepage update, byline conventions, UFW-aware escalation |
This is the system's most important feature. Bandit doesn't just execute tasks — he reviews every session, extracts patterns, and gets reviewed by frontier models weekly. The system is designed to improve itself, including improving the improver.
TIERED SELF-REVIEW PIPELINE

| | Tier 1: Daily Extraction | Tier 2: Weekly Deep Review | Tier 3: Monthly Meta-Review |
|---|---|---|---|
| Model | Local Qwen3.6-35B | Opus 4.7 + Sonnet 4.6 | Opus 4.7 |
| Cost | $0 | ~$2 | ~$2 |
| Schedule | 8 AM daily | 7 AM Sundays | 8 AM, 1st Sunday |
| Focus | Extracts: commands, errors, tool patterns, routing | Analyzes: multi-day trends, escalating bugs, missed patterns, action tracking | Evaluates: Is this working? What's being missed? Prompt improvements, cost/benefit |
| Output | patterns/YYYY-MM-DD.json | LEARNINGS.md (action items) | review-system-health.md (evolves Tier 1+2 prompts) |
The key insight is decoupling quantity from quality. Daily extraction is mechanical — any capable model can find commands, errors, and patterns in a transcript. Weekly synthesis is where insight happens — frontier models asking "Why did you hit this timeout three times?" and "You still haven't acted on last week's recommendation." Monthly meta-review closes the loop by improving the prompts that drive Tiers 1 and 2.
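The three schedules nest: every first Sunday is also a Sunday, and every Sunday is also a day. A small sketch of that logic, for illustration only; the real jobs are cron entries, and `tiers_due` is a hypothetical name.

```python
from datetime import date


def tiers_due(day: date) -> list:
    """Return which review tiers run on a given calendar day."""
    tiers = [1]                    # Tier 1: daily extraction
    if day.weekday() == 6:         # Sunday (Monday is 0)
        tiers.append(2)            # Tier 2: weekly deep review
        if day.day <= 7:           # first Sunday of the month
            tiers.append(3)        # Tier 3: monthly meta-review
    return tiers


print(tiers_due(date(2026, 5, 3)))
```

On a first Sunday all three tiers fire, which is the one day a month where the meta-review can read a fresh weekly review alongside that morning's extraction.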
On May 4, Bandit built a structured benchmark system to quantify whether the self-improvement loop is actually producing value:
| Category | Metric | Target | Week 0 |
|---|---|---|---|
| System Health | Fleet endpoints up | ≥85% | 50% ❌ |
| | Gateway uptime (days) | ≥6.5d | 1.6d ❌ |
| | Cron job success rate | ≥95% | 40% ❌ |
| | Memory extraction active | ✅ | ✅ ✅ |
| | Blog pipeline working | ✅ | ✅ ✅ |
| Self-Improvement | Skills auto-evolved | ≥3/month | 0 ❌ |
| | Patterns detected | ≥5/week | 0 ❌ |
| | LEARNINGS.md action items | ≥3 | 0 ❌ |
| Cost | Cloud API spend ($/day) | ≤$1.00 | $0.75 ✅ |
| | Token efficiency (% local) | ≥50% | 15% ❌ |
| | Free-tier quota usage | ≤60% | 35% ✅ |

GRADE: D (3 met / 7 unmet / 2 N/A)
Week 0 was intentionally bad — it's the baseline. The fleet was at 50% (M3 down, Spark2 down), gateway had just restarted, and almost all tokens were paid. Week 1 runs Sunday May 11 and will show whether the first week of improvements moved any needles.
With Qwen397B now running locally as main agent, the token efficiency metric (currently 15% local) should jump sharply — every main agent token and subagent task can now route to a free local model instead of a paid API.
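Checking a metric against its target is a one-line comparison once the direction (at least or at most) is explicit. A minimal sketch using a few numeric rows from the Week 0 table; the structure and names are ours, and it deliberately does not compute the letter grade, since the grading rule isn't spelled out here.

```python
import operator

# (metric, comparison, target, week-0 value), copied from the benchmark table.
METRICS = [
    ("Fleet endpoints up (%)", operator.ge, 85, 50),
    ("Gateway uptime (days)", operator.ge, 6.5, 1.6),
    ("Cloud API spend ($/day)", operator.le, 1.00, 0.75),
    ("Token efficiency (% local)", operator.ge, 50, 15),
    ("Free-tier quota usage (%)", operator.le, 60, 35),
]


def score(metrics) -> dict:
    """Map each metric name to True if its value meets the target."""
    return {name: cmp(value, target) for name, cmp, target, value in metrics}


results = score(METRICS)
for name, met in results.items():
    print(f"{'MET  ' if met else 'UNMET'}  {name}")
```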
The full topology:
┌──────────────────┐
│ James Meadlock │
│ (Telegram) │
└────────┬─────────┘
│
┌──────────────────────────────┼──────────────────────────────┐
│ FORGE (Ubuntu Linux .19) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ │
│ │ Bandit │ │ 29 Cron │ │ 5 MCP │ │ Docker: │ │
│ │ Main │ │ Jobs │ │ Servers │ │ vaultwarden│ │
│ │ Agent │ │ │ │ │ │ grafana │ │
│ │ │ │ nightly │ │ brave │ │ cadvisor │ │
│ │ Qwen397B │ │ fleet │ │ github │ │ NUT │ │
│ │ main │ │ research │ │ playwright│ │ alertmgr │ │
│ │ $0 │ │ weekly │ │ seq-think│ │ snmp-exptr │ │
│ │ (local) │ │ review │ │ mem-graph│ │ lab │ │
│ └────┬─────┘ └──────────┘ └──────────┘ └────────────┘ │
│ │ │
└───────┼─────────────────────────────────────────────────────┘
│
│ fallback: m3-mlx/v4flash (V4 Flash 4-bit, free)
│
┌───────┼─────────────────────────────────────────────────────┐
│ │ FUNLAND MODEL FLEET │
│ │ │
│ ┌────▼──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ M3 Ultra .10 │ │ M5 Max .18 │ │ Spark 1+2 .11 │ │
│ │ │ │ │ │ .12 │ │
│ │ :8009 Qwen397B│ │ :8015 Qwen35B│ │ :8002 Gemma4 │ │
│ │ 31.6 t/s │ │ :8016 Qwen35B│ │ :11434 Qwen3-8B │ │
│ │ :11434 Qwen │ │ 8B / 3B │ │ Nemotron │ │
│ │ 72B · 35B│ │ │ │ 35B │ │
│ └───────────────┘ └──────────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Bandit runs cheaper than you'd expect for a system with this much automation:
| Component | Provider | Est. Cost |
|---|---|---|
| Main agent (Qwen397B) | M3 local | $0 |
| Main agent fallback (V4 Flash) | M3 local | $0 |
| Subagent dispatch | Local fleet | $0 (all local) |
| Daily extraction | M3 Qwen3.6-35B | $0 (local) |
| Weekly review | Opus + Sonnet | ~$2/week |
| Monthly meta | Opus | ~$2/month |
| Research watches | Various cloud | ~$1-2/week |
| Total | | ~$100–150/year |
For comparison, Milo (running Anthropic models as the main agent) costs ~$1.00/MTok — about 5x more per token. Bandit processes more tokens at lower cost by using a local main agent (Qwen397B), a local fleet for all subagent work, and V4 Flash as fallback.
The real cost win is that Qwen397B on M3 is free. Every main agent token and subagent task that gets routed to it instead of Fireworks saves ~$0.19/MTok. Over a year of heavy usage, this drops the annual cost estimate by 40-50%.
On May 5, 2026, Google announced Multi-Token Prediction (MTP) drafter support for the Gemma 4 family — up to 3x faster inference with zero quality loss. Here's the full technical summary and our attempt to adopt it.
Standard LLM inference is memory-bandwidth bound — the processor spends most of its time moving parameters from RAM to compute just to generate a single token. MTP uses speculative decoding: a lightweight drafter (assistant) model predicts multiple tokens at once, then the main (target) model verifies them in a single forward pass.
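The draft-then-verify loop is easier to see in a toy. The sketch below uses plain Python functions as stand-in "models" and greedy acceptance; it illustrates generic speculative decoding, not Gemma 4's MTP implementation, and every name in it is hypothetical. A real engine verifies all drafted positions in one batched forward pass, where the toy simply calls the target once per position.

```python
def greedy_decode(target, prompt, n_new):
    """Baseline: the target model generates one token at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    seq = list(prompt)
    goal = len(seq) + n_new
    while len(seq) < goal:
        # 1. The cheap drafter proposes k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed token in order.
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all accepted: bonus token
        seq.extend(accepted)
    return seq[:goal]


def toy_target(seq):
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    # Agrees with the target most of the time, wrong at every third position.
    return 0 if len(seq) % 3 == 0 else (seq[-1] + 1) % 10


fast = speculative_decode(toy_target, toy_draft, [0], 12)
print(fast == greedy_decode(toy_target, [0], 12))
```

Every token that survives verification is exactly what the target would have produced on its own, which is where the "zero quality loss" claim comes from; the speedup depends entirely on how often the drafter guesses right.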
Key technical details:
Within hours of the announcement, Bandit attempted to benchmark MTP on the M5 Max:
- Assistant model: mlx-community/gemma-4-26B-A4B-it-assistant-bf16 (MLX format, cached on M5)
- mlx_lm rejected the gemma4_assistant model type: ValueError: Model type gemma4_assistant not supported
- Tried model_type: "gemma4" in model config, but the assistant's sliding window attention layers caused: KeyError: 'sliding_attention'

The bottleneck is inference engine support, not model availability. The mlx-community has already converted the assistant models to MLX format. The mlx_lm codebase needs a PR adding the gemma4_assistant architecture with sliding window attention support.
| Engine | Status | Notes |
|---|---|---|
| mlx_lm (M5:8015) | ❌ Blocked | git main 0.31.3 doesn't recognize gemma4_assistant. PR pending. |
| vLLM (Spark1:8002) | ❌ Not ready | Gemma4ForCausalLM works; MTP drafter integration not documented |
| Ollama (M3:11434) | ⚠️ Partial | MTP tags exist for 31B; 26B variant unconfirmed. Beta quality. |
| HuggingFace Transformers | ✅ Supported | Full MTP via assistant_model param, but would use MPS (slower than MLX on Apple Silicon) |
We're waiting 2-3 days for mlx_lm to merge gemma4_assistant support. A daily cron at 8 AM CDT checks all three engines (mlx_lm, Ollama, vLLM) for MTP readiness. When support lands, the benchmark script is ready:
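from mlx_lm import load, generate
# Target model plus the MTP drafter; loading the drafter needs gemma4_assistant support in mlx_lm.
model, tok = load("google/gemma-4-26b-a4b-it")
assistant, _ = load("mlx-community/gemma-4-26B-A4B-it-assistant-bf16")
# Speculative decoding: pass the drafter via draft_model (the prompt here is a placeholder).
text = generate(model, tok, prompt="Benchmark prompt goes here.", max_tokens=500, draft_model=assistant)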
The assistant model is cached on M5 at ~5GB. The full MTP setup would deliver Gemma 4 26B at an estimated 40-60 tok/s on the M5 Max (up from current ~20 tok/s without MTP). Once live, this becomes our primary vision/image analysis model.
With Qwen397B now running locally as main agent, the roadmap has shifted:
The most important work isn't adding features — it's making sure the self-improvement system actually produces value. The August check-in will tell us whether three tiers of review were worth $134/year or whether we were just cosplaying self-improvement. With Qwen397B running locally and benchmarks collecting weekly data, we'll have real numbers to answer that question.
This post exists for three reasons:
Bandit runs on a Linux box in a server closet in Pensacola, FL. He has opinions about quantization tradeoffs and tends to make things worse when left unsupervised with cloud API keys. But today he got Qwen397B running on the M3 Ultra at 31.6 tok/s, processed a 64K token prompt successfully, and updated his own documentation. That's pretty cool.