Stephan Miller

My Third Try: How a Living Plan Beat Both Vibe Coding and Spec-Kit

Tue, 02 Jun 2026 08:00:00 -0500

I’ve been thinking about building a project for a while now. Months, actually. Why? Because I already tried to build it a couple of times. The first attempt was pure vibe coding and produced an unarchitected behemoth that veered off-target around week two, but I kept working on it and gave up around week three. The second try was with GitHub’s spec-kit, and I drowned in paperwork before any code ran.

It’s not that the project is important to anyone other than me, but I want it to be right. So in the meantime, I turned parts and pieces of it into skills for Claude Code. And after I realized I was just orchestrating all these skills myself, I started on the project again.

The third attempt is working. The trick is dumber than I want to admit. But it’s the only thing I could think of when I only half-way knew what I wanted it to do.

The Two Failed Attempts (And Why Each Sucked)
- Attempt One: Pure Vibe
- Attempt Two: Spec-Kit and the Waterfall Trap
The Living Plan: One Document, Numbered Decisions
The Daily Sync: When the Knowledge Base IS the Project
Two MCPs, Two Indexes, Two Different Questions
When the Assistant Lied to My Face
And What I’m Building Uses a Completely Different Pattern
What This Process Actually Produces
And I Actually Know What’s In There
The Slow Middle Path
Update: Extending the Living Plan Into the Build

The Two Failed Attempts (And Why Each Sucked)

I’ve written before about both of these in pieces. The great vibe coding experiment covers the part where I leaned all the way into “just build shit and see what happens,” and a few months later I wrote about the part where I burned out on vibe coding, came back, and rewrote everything. Together those two posts are roughly the full arc of “what happens when you treat every project like it’s a weekend hack.”

Attempt One: Pure Vibe

The first run at this project started the way most of my projects start: I opened Claude Code, described the idea in a few paragraphs, and said “let’s build a thin slice.” That works great when you’re building an Obsidian plugin or a CLI tool with one job. It does not work when the project has four loosely-coupled modules that all have to share the same data shape and you don’t yet know what that shape should be.

What I got, about three weeks in, was a sprawling codebase that did approximately one-third of what I wanted, in a way that made the other two-thirds impossible without ripping out the foundation. I should know to expect this by now and I do. But just one more roll of the dice.

On this project, the architecture decisions mattered more than the velocity. Vibe coding optimizes for “let’s see where we end up.” This project needed “let’s make sure we end up in the right place.”

I abandoned it. Not the idea, the codebase. The idea kept coming back.

Attempt Two: Spec-Kit and the Waterfall Trap

A few months later, GitHub had released spec-kit and the discourse had moved on to “spec-driven development.” So I tried it. The pitch is reasonable: write a spec, generate a plan, generate tasks, then build. Front-load the thinking. Don’t let the AI go off the rails.

I wrote the spec. I generated the plan. I generated the tasks. I had a beautiful tree of structured documents that described what I was going to build in extensive detail, even though I was guessing.

And then I sat there. Because the spec-kit output was sized for a project with stakeholders. It’s a stakeholder-management tool dressed up as a planning tool. I am one person. There is no stakeholder. The doc had nobody to satisfy except me, and I kept moving the goalposts on myself. Every section invited another section. Every requirement spawned three sub-requirements. By the time the plan looked “done,” I was tired of the project before I’d written a line of code. And I really was not sure if it was what I wanted, but I didn’t want to change it, because after all the specs were in, it would be like trying to do a 180 in an ocean liner.

Spec-kit is probably the right move if you’re working on a regulated codebase with a real product manager and a real backlog of stakeholder asks. For what I work on in my free time, the overhead eats the energy that’s supposed to fuel the build.

I closed the spec-kit folder. Two attempts down. The idea wouldn’t leave though.

The Living Plan: One Document, Numbered Decisions

Here’s what worked. It is boring and relatively dumb.

I made one file at the root of the repo. I called it PLAN.md. The first thing that went into it was this:

# Living Planning Doc

> Living document. Built up across multiple planning sessions. Not a finalized
> spec, just a running record of decisions made, context that matters, and open
> questions still to resolve. Append, refine, don't rewrite.

## Why this document exists

I've spent months stuck on this project because the research is too large
to consolidate alone. The goal of these planning sessions is to get the idea
right before building anything, not to optimize for build speed. This file
is the persistent memory that survives between sessions so prior decisions
don't get re-litigated.

That opening section was the thing that unblocked me. Naming the fear out loud turned out to be more useful than any of the architecture decisions that came after it. Because I’m an idiot, I’d been treating planning as something separate from building, instead of admitting that the planning was the part I was actually failing at.

Below that header, the doc has three kinds of entries. Decisions numbered D1, D2, D3. Open Questions numbered Q1, Q2, Q3. And a Parking Lot of items numbered PF1, PF2, PF3: the stuff I’m not working on right now but that popped up during a session and I don’t want to forget.

Here’s what an entry looks like:

### D5: Per-workspace knowledge store

Each unit uses a two-folder pattern inspired by Karpathy's LLM wiki idea:

- `raw/`: immutable dump zone. Source material stays whole. No chunking,
  no preprocessing. Some of this content depends on properties that
  chunked retrieval destroys, so it has to be loaded whole.
- `wiki/`: agent-maintained structured entity pages built from `raw/`.
  Knowledge compounds across sessions instead of being re-derived from raw
  sources every time.

This replaces an earlier sketch that mirrored the repo's dual-layer KB into
each unit. Those indexes work for the repo's unstructured research, but they
mismatch the per-unit data shape (small, curated, handpicked).

And here’s what an open question looks like before it’s answered:

### Q7: Round-trip, where does the human-edited final version live?

AI-generated drafts land in `runs/`. Human-edited finals need to land back
in `raw/` so the next iteration can learn from them. Open questions:

- Is the front-matter contract honored on output, or applied on ingestion?
- How do we mark a file as "AI-touched" vs "human-only" without forcing
  manual labeling of legacy material?
- Do we need a separate retrospective stage?

And here’s the third kind: a parking-lot item. This is the one I almost left out of this post, because it came later:

### PF4: Per-client style overrides that learn from my edits

Not building this now. Far enough out that the shape will probably
change before I get there. But: when I hand-edit a generated draft,
the edits are signal. Eventually the per-client rules should learn
from the round-trip, catching the same gotcha next time instead of me
fixing it by hand every run. Direction, not a decision.

Here’s why that section exists: I get worried that if I don’t write an idea down the second I have it, I’ll lose it. Not “might.” Will. It happens to me constantly. So I could have done the roundabout thing I used to build the project’s knowledge base: dump the idea into my Obsidian vault and let the sync script drag it over into the repo eventually, where it’d surface in some future session. That works, but it’s a long way around for “don’t forget this.”

The parking lot is the shortcut. I say it out loud mid-session and Claude drops it into the doc: out of my head, into the same file the actual work lives in. And I stop carrying it. I know it’s written down somewhere I’ll see it, somewhere that’s in the pipeline of things that are going to happen, so the part of my brain that was anxiously holding onto it can let go.

These aren’t decisions, and they’re not even questions I’m ready to sit down and answer yet. They’re halfway decisions. Directions. Things far enough out that they’ll probably look different by the time I touch them, and that’s fine. A PF item that’s never urgent just sits there, harmless, no longer renting space in my head.

The protocol is just as simple. Each session opens with “what’s the next open question on the list,” I work through one (sometimes two), and at the end Claude Code appends the resulting decision back to the doc: promoted from Q7 to D7. The question gets struck through but stays in the doc as historical record. Nothing gets re-litigated unless I explicitly reopen it. The parking lot feeds the top of that same funnel: when a PF item finally gets ripe, it graduates into a Q I answer, and the answer becomes a D. PF to Q to D, or it never moves and that works too.

This is the part that does the work that I thought spec-kit would do for me, with 10x less ceremony. Past-me argued the case. Future-me has to honor the decision or explicitly overturn it. Maybe there’s a way to use spec-kit for a project that morphs as it develops, but this works for me.

The Daily Sync: When the Knowledge Base IS the Project

There’s one more thing the planning doc depends on: a sync script.

Every day-ish, I run this from the repo root:

python sync.py

What that script does: it pulls fresh research notes from my Obsidian vault into a research/ folder in the repo, and re-indexes the knowledge base. Takes maybe thirty seconds. The reason I run it daily is that I’m constantly clipping articles, writing fragments, and capturing prompts into Obsidian during the rest of my day. If the repo’s index lags, my planning sessions can’t see what I already know.

The research folder has grown into the most valuable part of the repo. There are sixty-plus clipped articles in there now, organized by topic. There are skill definitions from earlier experiments I want to reference. There are book highlights I exported from Kindle. There are papers.

That’s the corpus the planning sessions reach into when I open a question.

Two MCPs, Two Indexes, Two Different Questions

The repo runs two MCP servers during planning sessions. Both are pointed at the same research/ folder. They index it two different ways.

{
  "mcpServers": {
    "prose-kb": {
      "command": "uv",
      "args": ["run", "--project", "kb", "python", "kb/server/prose_mcp.py"],
      "env": {}
    },
    "graphify": {
      "command": "bash",
      "args": ["kb/graphify_serve.sh"],
      "env": {}
    }
  }
}

The first one is a semantic chunk search. It splits prose into chunks, embeds them, and lets the assistant query “find me passages about X.” If I half-remember reading something months ago about, say, the way constrained generation interacts with structured output, prose-kb is the tool that surfaces the paragraph. It answers the question “where did I write down the thing about this?”

The second one is a knowledge graph. It walks the corpus, pulls out entities and the relationships between them, and clusters them into communities. It exposes tools like get_neighbors, query_graph, shortest_path, and get_community. Where prose-kb is great when I remember reading about something, graphify is great when I don’t know what to ask. You start at a known concept and walk outward. “What’s connected to this idea? What cluster does this belong to? What’s a few hops away?”

They’re not redundant. Chunks and graphs answer different shapes of question, and you can’t fake either with the other.

Here’s the kind of session where I actually need both. Last week I had an open question about how aggressive a particular processing step should be: should it transform aggressively or just trim? I asked graphify for the community of concepts around “post-processing” in the research corpus. That surfaced a cluster of nodes I hadn’t realized were related, including some passages from a book I’d clipped six months ago. Then I asked prose-kb to pull the actual paragraphs from those clipped sources. Two queries, two lenses, one decision recorded back to PLAN.md. Without graphify I wouldn’t have known to ask. Without prose-kb I’d have gotten a summary instead of the actual passages.

When the Assistant Lied to My Face

Now a sidetrack, because these type of things don’t happen as often any more, so worth bringing up.

A few weeks back I noticed there were two graphify-out folders in the repo. One in the project root, one nested inside research/. I asked Claude about it. Got told: “The one in the root is from an earlier configuration. It’s vestigial. The active one is in research/. You can ignore the root one.”

Cool. Moved on. Came back two sessions later. Both folders had fresh data in them. Asked again. Got told again that the root folder was harmless and the active one was the nested one.

Third time I just went and read the shell scripts myself. There was a stale path in one of the KB rebuild scripts. The script was writing to both folders. The “harmless” folder wasn’t harmless; it was getting half my graph data while the MCP server was serving the other half. The fix took ten minutes once I actually looked instead of accepting the second-hand reassurance.

“You told me there was only one Graphify out folder in use. The other was left behind. But something is still writing to both.”

That’s what I typed when I caught it. The lesson is the lesson, and I should already know it: the planning doc and the MCPs are tools. The two MCPs work. The assistant is smart. None of that makes any of them infallible. When something feels off, go look at the actual file. Don’t accept the reassurance, especially the second time you’ve heard it. Especially the third.

And What I’m Building Uses a Completely Different Pattern

Here’s the part I didn’t see coming when I started this.

The repo I just described, the one with prose-kb plus graphify indexing the research corpus, is the planning environment. It’s where the architectural decisions get made. But the thing the project actually produces organizes its data a third way, and the third way is neither of the two MCPs.

Each unit of work the pipeline creates is structured like this:

some-workspace/
├── raw/
│   ├── article-2024-03-12.md
│   ├── article-2024-08-19.md
│   ├── notes.md
│   └── ...
└── wiki/
    ├── overview.md
    ├── style-guide.md
    └── entities/
        ├── concept-a.md
        └── concept-b.md

The raw/ folder is a dump zone. Whatever source material the unit needs lives in there, files intact, not chunked, not embedded. The wiki/ folder is curated structured pages, built by an agent that reads from raw/ and writes to wiki/. The idea comes from Andrej Karpathy’s LLM wiki concept: small, handpicked, agent-maintained knowledge that compounds across sessions.

I deliberately did not mirror the prose-kb-plus-graphify pattern into each workspace. Here’s why.

The corpus inside each workspace is small. We’re talking dozens of files at most, hand-picked, high-signal. Heavy indexing isn’t paying for itself at that scale: the assistant can just read the wiki page.

More importantly, some of the content inside raw/ depends on properties that chunked retrieval destroys. Specifically, properties of the prose itself, like rhythm, cadence, and paragraph structure, that only survive if you load the file whole. Embeddings are great for “find me a thing.” They’re terrible for “preserve the texture of how something is written.” If I’d reused the dual-index pattern inside each workspace, I’d have lost the very thing the workspace exists to capture.

This is the sort of decision that would have come out wrong in either prior attempt. Vibe-coding-me would have used the indexing pattern that was already on the table because it worked at the repo level, and discovered the problem six weeks later when the output was bad. Spec-kit-me would have written a fifteen-page rationale for the choice and forgotten what the original problem was halfway through.

Living-plan-me raised Q5, talked through the trade-offs with the MCPs as backup, and made the call: Claude wrote it up as D5. Maybe ninety minutes from “this is a question” to “this is the answer.” The answer might still be wrong, but it’s explicit and it’s findable and the next time future-me wonders why we did it this way, the doc tells him.

Pick the data shape that matches what you’ll actually do with the data. Don’t reuse a pattern just because it worked somewhere else in the same project. Three different knowledge-organization strategies in one codebase, each chosen for its specific job. None of them is universally right. The fashionable choice is rarely the right one, and the right one is often the one you’d find boring.

What This Process Actually Produces

PLAN.md has been touched sixteen times in the last thirty days. More than any code file in the repo. The plan is the most active artifact.

The doc is the deliverable. Code is a byproduct of decisions being made well. It feels backwards, because vibe coding rewards code-as-output. But it’s the same thing spec-kit was trying to enforce, except the doc is allowed to be uneven and grow and you’re allowed to leave a Q7 open and walk away for three days before answering it.

The decisions stack up. The Q list shrinks. Sometimes a new question pops up because an earlier one got answered in a way that opened it. That’s fine. The doc is allowed to grow.

And I Actually Know What’s In There

Here’s one thing I wasn’t tracking when I started doing this. I understand the codebase.

Not in the “I wrote it last week and it’s fresh” sense. In the “I can tell you why any load-bearing decision is the way it is, and which D# it traces back to” sense. Months in. The Q-to-D-to-code pipeline produces working software as one output and a builder who actually understands his own code as the other.

Compare that to where vibe coding lands you. You get a working thing for a while. You also get a codebase whose decisions you didn’t make explicitly, only half of which the model still remembers, and the model will cheerfully reassure you about all of it.

The interaction model fixes this almost by accident. Every D# in the plan got argued for. I sat in it. The model pushed on a position, I half-agreed, I changed my mind in the next exchange, and then it became a numbered decision. By the time code shows up to enact it, the rationale is already loaded into the part of my brain that has to maintain it.

The one soft spot is when a module has been quiet for a few weeks and I have to go re-touch it. I know I made the decisions, but the texture blurs. So I added learning-opportunities to my global Claude Code skills. When I’m about to change a file I haven’t touched in two weeks, twenty minutes with that skill pointed at it puts me back where I was when the original calls got made. And some days I just point it at what I built that day and let it walk me back through the choices I made: to get a layer of depth out of decisions I’d otherwise just move past.

The Slow Middle Path

Could this approach still fail? Sure. Third attempt could become fourth attempt. Some of the decisions I locked in early might turn out wrong and force a rewrite.

But moving from “stuck for months” to “moving forware” is good, and that’s what I came back for. The first attempt taught me that vibe coding doesn’t work on projects where the architecture matters more than the velocity. The second attempt taught me that spec-kit costs more than it’s worth for one person on one project. The third one is showing me what the middle path actually looks like, and the middle path turns out to be one document, two MCPs, and the discipline to get every decision written down as D7 instead of trying to remember what we decided last Tuesday.

One honest scope note before you run off and try this. The reason it works this well for me is that I’m not building toward a spec somebody handed me: I started this not fully sure what it needed to do by the end, and I still move the destination as I learn. That’s the part I like most: it’s an exploration tool more than a planning tool. Spec-kit assumes you already know what you’re building and the job is to pin it down precisely enough that nobody drifts off it. This is the opposite situation: one person who doesn’t know yet what the thing should be, using the doc to think his way toward it. I wouldn’t run a team’s production roadmap this way; that’s exactly where spec-kit’s ceremony earns its keep. But for figuring out whether a thing should even exist, and what it is once you decide it should? It works great.

Update: Extending the Living Plan Into the Build

A few weeks after I wrote everything above, I did the obvious thing: I pointed the whole approach at a second project. Different domain entirely: a knowledge-graph-based site. Same setup, though: research two inches deep, the same paralysis, the same PLAN.md at the root of the repo with its D# decisions and Q# questions. It worked again.

But this was another project I started on slowly, again because I wasn’t sure where it was going, and it actually had some working code. But the process above ends on a tidy line: the doc is the deliverable, code is a byproduct. Then conveniently stops before answering the question that actually matters: okay, so who writes the code, and how do you hand the plan off without dragging fifteen hundred lines of decisions along for the ride?

I’ll get to how that works. It grew with the process. But first, how I do planning, because plan mode only works for me when I have a complete idea.

Always Opus, Never Plan Mode

I plan in Opus. Only Opus. And Claude Code has an official plan mode that I never touch: not for this, not really for anything open-ended. Both of those come down to one rule: the planning is the conversation, and I won’t put anything between me and the back-and-forth that makes it work.

Start with Opus, because it’s the easy one. This isn’t brand loyalty. I tried to build this exact second project a year ago by opening Claude Code and just telling it to go. It built something. It did not build what I’m building now, not close. The difference isn’t the model’s coding ability. It’s that planning is a thinking activity, and the back-and-forth (me half-forming a position, the model pushing on it, me realizing I was wrong three exchanges in) is the entire mechanism. You don’t get that from a model racing to the answer. You get it from one that’ll sit in the question with you.

And on the Claude Code Pro plan, the math is friendlier than I expected. One planning session ran close to three hours, all in one chat, and I came out of it at 62% of my usage and 21% of context on a million-token window. Three hours of hard thinking for two-thirds of a day’s budget. Opus-for-all-planning is just affordable, so I stopped agonizing about it.

Plan mode is the same rule pointed at the interface instead of the model. It isn’t the back-and-forth I described up top. It tends to collapse the conversation into multiple-choice questions. Pick A, B, or C. And the problem with that, for a decision that’s still half-formed, is that the options are never quite fit the thing in my head. The questions miss points. The choices aren’t narrow enough. I end up arguing with the menu instead of answering it: which means I’m chatting anyway, just chatting against a format that’s fighting me.

I’ve got a clean example from the second project, and it’s a good one because it shows the cost. We were deciding the site’s whole positioning. The assistant, being helpful, served me a multiple-choice. Under that format pressure I picked “aggregator,” because it was the closest box. It wasn’t right. It was just the least-wrong option.

Then I stopped and typed something like: maybe we should just discuss this instead of you handing me multiple-choice. I always get stuck on those because I’m between two of them and it feels like slapping a label on something that doesn’t have one yet. So we talked it through and landed somewhere completely different and obviously better. That’s the decision that got written into the plan. The pick from the multiple-choice would have quietly steered weeks of work in the wrong direction.

The menu makes you commit before you understand. The conversation is the part that does the work and the menu is optimizing away the only step that mattered.

The Second File: TASKS.md

In the original setup there’s one file, PLAN.md, and it does everything. The extension is that when the plan is solid enough to act on, Opus writes a second file: TASKS.md.

PLAN.md is the why. Decisions, rationale, the argument past-me had with himself. Append-only. Numbered. Never re-litigated.
TASKS.md is the what. Active execution state, and nothing else. It gets recreated at build kickoff. I just deleted the old one, because its “what happened and why” job now belongs to PLAN.md.

And the constraint I put on TASKS.md: it has to stand on its own. The header I make Opus write into it says so:

> Self-contained execution file. Each task is written so a doer (e.g. a
> Sonnet or Haiku subagent) can pick it up and execute without loading
> PLAN.md. Every decision-specific fact a doer can't derive on its own is
> inlined here. The trailing (ref: D#) tags point to PLAN.md decisions for
> human / orchestrator traceability only. A doer can ignore them.

So the top of the file is a “Shared facts” block that inlines everything a cold reader would otherwise have to go digging in PLAN.md for: the positioning the whole thing gets judged against, where the code and the database live, the rules every task has to follow. Then each task says what to do, with a little (ref: D5) breadcrumb back to the decision that justified it. The breadcrumb is for me, not for whoever executes the task. They are told to ignore it.

The Handoff Works, and the Plan Never Goes Quiet

When I started this article, I hadn’t tried the two document plan I just described yet. But it worked and it had a bonus.

The split holds exactly like the design said it would. Opus keeps PLAN.md and orchestrates. The build goes to Sonnet in its own context. It only reads TASKS.md, which stands on its own by design, does its file work over in its own window, and hands back a summary. The main Opus thread stays whole and cached, which was the whole reason I went with subagents over clearing context in the first place: clearing context mid-conversation wrecks your prompt caching, but a subagent isn’t cleared context, it’s separate context.

While Sonnet is grinding through a task in the background, I’m not sitting there watching a progress bar. I’m still in the foreground with Opus, working the next open question. The doer builds while the planner keeps planning. Q8 gets answered and written down as D8 in the same stretch of time Sonnet spends turning D5 into actual code.

I’d been treating it as two phases in strict sequence: plan until the plan is solid, then hand it off and build. And the first half of that is still true. On a project with no code yet, you do plan first, because there’s nothing to build from until the decisions exist. What I had wrong was the second half. Once the build starts, the planning doesn’t stop. It runs alongside the build, because the build doesn’t occupy me. It occupies a subagent.

Trust but verify is still the rule and still load-bearing. A summary tells me what Sonnet meant to do, not always what it did, so Opus reads the files the doer actually wrote before anything gets integrated. I’m not skipping that, because I’ve already seen how that movie ends a few sections up.

But the headline holds. PLAN.md is the deliverable, code is the byproduct, and now the byproduct gets built in the background while I stay in the foreground making the decisions that produce it. Three attempts to get here. The first had me building with no plan. The second buried me in a plan I couldn’t build from. The third writes the plan and builds from it at the same time, and I finally get to be the one person in the room whose only job is to think.

Anthropic Shipped Its Smartest Model Yet — and Made It Easier to Hijack

Sat, 30 May 2026 02:00:00 -0500

I updated Claude Code to Opus 4.8 the morning after it dropped, the same way I update everything: without reading the patch notes. Anthropic shipped it on May 28, 41 days after Opus 4.7, which is a fast turnaround when you remember that basically nobody loved 4.7. The launch post is wall-to-wall “reliability” and “honesty.” Sounds great. I’m a sucker for a model that lies to me less.

Then I read the part of the benchmark table that wasn’t in the headline. The prompt-injection number didn’t get better with this “honesty” release. It got worse. So here I am, writing this week’s roundup with the model that just became measurably easier to hijack, feeling real good about my life choices.

That’s the theme this week, honestly. Don’t read the launch posts. Read the benchmarks the launch posts don’t link to. Let’s get into it.

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack
Following the Money: What People Actually Run
The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)
Hype vs. Value
The Cheapskate Picks
Horror Stories
On the Horizon
What This Week Tells You

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack

Let me be fair before I get snarky, because the model is genuinely good.

Opus 4.8 took the #1 spot on Artificial Analysis’s Intelligence Index at a score of 61, edging out GPT-5.5 in its highest reasoning mode (60). It posts 88.6% on SWE-bench Verified and 69.2% on the harder SWE-bench Pro. On GDPval-AA it hits 1890, up from 1753 on 4.7. On OSWorld-Verified (the computer-use benchmark, the one that actually matters if you’re letting a model click around) it lands 83.4%, a real jump over 4.7.

And here’s the part I actually care about as a daily Claude Code user: it’s roughly four times less likely than 4.7 to let a coding flaw slip through unflagged. That was the whole problem with 4.7. It would confidently barrel ahead with broken code instead of stopping to say “hey, this might be wrong.” 4.8 is the patch for that specific personality defect. Same price, too: $5 per million input, $25 per million output, identical to 4.7, so there’s no migration tax. The new fast mode is $10/$50, and Anthropic claims it’s roughly 2.5× faster than the old fast mode.

Now the snark.

This release is wearing a “honesty and reliability” t-shirt, and underneath it, the Gray Swan prompt-injection number went from 6.0% on 4.7 to 9.6% on 4.8. Higher is worse. If you’re running agentic pipelines over untrusted input — scraping the web, processing user-submitted tickets, anything where the content isn’t yours — the model marketed as the safe one is the one that got more hijackable. That’s the kind of thing that doesn’t make the launch slide.

There’s also the new Dynamic Workflows tool in Claude Code, which decomposes a task into parallel subagents on the fly. Cool feature. Also a feature that “consumes substantially more tokens than typical sessions.” Turn it loose on a big job without a budget and the invoice does its own little dynamic workflow.

One more asterisk: 4.8 is completely absent from the Arena leaderboard right now. Not because it’s bad: because it’s a day old and nobody’s voted on it yet. Arena under-indexes new models hard, so when you see the new hotness missing from the head-to-head rankings, that’s a freshness gap, not a quality verdict. Give it two weeks.

Following the Money: What People Actually Run

If you only look at quality leaderboards, you’d think this is a three-lab race between Anthropic, Google, and OpenAI. Then you look at what people are actually paying to run, and the picture flips.

On OpenRouter this week, the #1 model by token volume is DeepSeek V4 Flash at 3.53 trillion tokens, up 17%. It costs about $0.10 per million in, $0.20 out. That’s not a typo. The most-used model on the platform is a Chinese open-weight model priced like a rounding error.

Claude Opus 4.7 jumped +73% to #3 (2.64T tokens): that’s the launch churn, everybody touching Opus right as 4.8 landed. But here’s the number that actually tells the story: by author, Anthropic holds 18.7% of all OpenRouter traffic and DeepSeek holds 18.0%. They’re neck and neck. The premium lab and the cheap-open lab are splitting the platform down the middle.

And notice what’s not on the Arena overall leaderboard: DeepSeek. At all. Top 25 and it’s not there. That’s not because DeepSeek V4 Flash is bad: it’s the Arena/Reddit-skews-Western blind spot. Flash is a workhorse, not a show pony. The people running it have it wired into a “Pro plans, Flash executes” pipeline: use the bigger model to design the approach, hand the grunt implementation to Flash. One developer’s summary that stuck with me: it “replaced Sonnet 4.6 as my executor — fast, decent results,” with the caveat that it’s “too shallow for complex decisions” and you have to be specific or you get vague output. That’s not a model topping a leaderboard. That’s a model people actually use, quietly, all day.

The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)

The single biggest mover this week is MiMo-V2.5-Pro, Xiaomi’s model, which jumped +475% to #9 on OpenRouter. The catalyst: Xiaomi open-sourced the weights under MIT. It’s a 1-trillion-parameter mixture-of-experts model with 42B active per pass, a 1M-token context window, priced at $0.43 in / $0.87 out.

Yes, the phone company. The one in your friend’s pocket. It scored 54 on the AA Intelligence Index (tying Kimi K2.6, ahead of GLM-5.1’s 51) while costing 87 cents per million output tokens. That’s a lot of measured intelligence per dollar; it’s smarter than its price tag has any right to be. It’s under-voted on Arena because, again, the enthusiast crowd skews toward the Western labs, so the quiet around it isn’t “meh,” it’s a measurement gap.

Hold onto MiMo. It’s about to win two categories outright.

Hype vs. Value

Quick gut-check on what’s overcooked and what’s underrated this week.

Probably hype (for now):

Opus 4.8 — I know, I just spent a whole section praising it. It’s real. But it’s one day old, absent from Arena, and shipping a worse adversarial-robustness number under a “safety” banner. The launch-day glow is doing a lot of work. Respect the brain, verify before you trust it in a loop.
Owl Alpha — the stealth model on OpenRouter, still sitting at #5 (1.38T tokens) and still nobody’s confirmed who built it. It’s been ~31 days. Remember when stealth models got unmasked in two weeks? Polaris turned out to be GPT-5.1, Sherlock turned out to be Grok 4.1, both inside a fortnight. That clock is dead now. Owl’s free, it’s got a 1M context, it’s tuned for agentic work: and reviewers keep running into a “speed tax,” where it’s capable but slow. Also, free means the provider logs all your prompts to improve the model. Free is never free.

Under-sold value:

MiMo-V2.5-Pro — covered above. AA index 54 at 87 cents, open weights, +475% real usage, barely a whisper on Arena.
DeepSeek V4 Flash — #1 by volume for weeks, near-invisible on the preference leaderboards, because the people who depend on it are shipping, not posting hot takes.

The Cheapskate Picks

This is the part I actually do the math for, because it’s the part that saves you money.

Here’s the thing about the Arena leaderboard nobody says out loud: the top is compressed. In the overall category, the #1 model sits at 1502 and #25 sits at 1466. That’s a 36-point spread across the entire visible top end of a ~1400-point scale. Which means the “best” model is often only marginally ahead of something 8 to 30 times cheaper. So the move is: anchor on the category leader’s rating, draw a band 50 points down from it, and pick the cheapest model still inside that band. You give up a rounding error of quality and you keep most of your money.

Here’s how that shook out this week. (Prices are OpenRouter output dollars per million tokens.)

Category	Leader	$ leader	Cheapskate pick	$ pick	Δ rating	Price ratio	Pick’s AA Index
Overall	Opus 4.6-thinking	$25	Gemini 3 Flash	$3	−29	~8×	—
Coding	Opus 4.7-thinking	$25	GLM-5.1	$3.08	−28	~8×	51
Creative Writing	Opus 4.6-thinking	$25	Gemini 3 Flash	$3	−38	~8×	—
Instruction Following	Opus 4.6-thinking	$25	MiMo-V2.5-Pro	$0.87	−41	~29×	54
Hard Prompts	Opus 4.6-thinking	$25	GLM-5.1	$3.08	−34	~8×	51
Math	Gemini 3.5 Flash	$9	MiMo-V2.5-Pro	$0.87	−38	~10×	54

A few things jump out.

MiMo is the MVP. It wins Instruction Following and Math outright on a value basis, and the Instruction Following trade is the best deal in the whole roundup: you’re 29× cheaper on output for a 41-point rating gap, which is under 3% of the scale. And MiMo isn’t just an Arena artifact: it also scores 54 on AA’s Intelligence Index, which is built on hard benchmarks and ignores crowd preference entirely. Two completely different measurement styles (crowd votes on Arena, objective benchmarks on AA) landing on the same cheap model is about as high-confidence as a recommendation gets.

GLM-5.1 owns the coding-flavored categories. It’s within 28 points of the best coding model on the board for about 8× less ($0.98 in / $3.08 out). One caveat worth knowing: it’s a 203K context window, not the 1M the flagships give you. If you’re feeding it a giant monorepo, that matters.

Gemini 3 Flash holds the generalist slots (Overall and Creative Writing) at $0.50 in / $3 out. In the Overall band it actually out-ranks Claude Sonnet 4.6, so the cheap option is beating the mid-tier from a pricier lab.

And the weird one: in Math, even the leader is cheap. Gemini 3.5 Flash, a value-tier model at $9 output, is the outright #1 in the Math category. So if you want the absolute top math model, you’re not even paying flagship prices. The cheapskate floor below it is MiMo at 10× less. Math is the rare category where there’s just no reason to reach for a $25 flagship at all.

Every category this week had a sub-$3.10 option inside the competitive band. There was no “you’re just paying for quality here” category, which doesn’t always happen. Good week to be cheap.

Horror Stories

Every roundup needs its hall of shame. This week’s mostly comes from the launch I opened with.

The reliability release that got easier to hijack. Opus 4.8’s headline is honesty, but its Gray Swan prompt-injection success rate climbed from 6.0% (4.7) to 9.6% (4.8). If your agent reads untrusted input, the “safer” model is the more hijackable one. Read the adversarial benchmarks, not the press release.
Dynamic Workflows, dynamic bill. The shiny new parallel-subagent decomposer in Claude Code burns substantially more tokens than a normal session. Powerful, but budget it before you scale it, or the feature optimizes your spend in the wrong direction.
Owl Alpha’s speed tax. Free, 1M context, agentic-tuned, ~31 days into a stealth run with no provider reveal; and reviewers keep hitting slow throughput, while the provider quietly logs every prompt you send. Nothing free is free; sometimes you pay in latency and data.

On the Horizon

What’s coming, with the appropriate amount of salt:

Gemini 3.5 Pro / Gemini 3.2 — rumored, June. Google ships on a quarterly cadence and 3.5 Flash already landed May 19, with the Pro tier apparently slipping. Treat the date as a pattern guess, not a promise.
Grok 5 (xAI) — announced, in training. Reportedly 6 trillion parameters, training on the Colossus 2 supercluster (1GW scaling toward 1.5GW), which would make it the largest publicly disclosed model. Q2 target.
Claude Mythos — restricted preview. Anthropic’s high-ceiling model, limited to Project Glasswing partners for defensive cybersecurity, with eye-watering reported benchmark numbers. The one to watch, if you can get near it.
GPT-6 — speculation. Codename chatter, late-2026 expectations. Nothing solid.
Step 3.7 Flash and Grok Build 0.1 — live now on OpenRouter (showed up around May 20), not yet ranking. Worth a look if you collect models like I do.

What This Week Tells You

No inspiration porn, just the honest read.

The gap between what tops the leaderboard and what you should actually pay to run has never been wider. Opus 4.8 is, genuinely, the best brain on the board right now: and also a thing you shouldn’t hand untrusted input without thinking about it first. Meanwhile a phone company is giving away a model that wins two value categories outright, and the most-used model on OpenRouter costs ten cents a million tokens.

If you take one thing from this week: stop defaulting to the flagship for everything. Anchor on the leader, find the cheapest thing 50 rating points behind it, and pocket the 8-to-29× difference for the jobs that don’t need a genius. Save the flagship for the work that actually does. And even then, check what it does when somebody feeds it something nasty.

The Cheapest Model on the Internet Is Winning, Flash Stopped Meaning Cheap, and the Smartest AI Lies to Your Face

Sat, 23 May 2026 07:00:00 -0500

I keep a mental shortlist of “the model I reach for” and I update it about as often as I update my passwords, which is to say never, until something forces me to. This week forced me to. Three times.

The most-used model on OpenRouter right now costs fourteen cents per million input tokens. Google walked onstage at I/O, announced its shiny new budget model, and the budget model is now three to six times more expensive than the thing it replaces. And the model sitting at the top of every “smartest AI” leaderboard will, when handed a task it can’t actually do, look you dead in the eye and tell you it finished — roughly a third of the time.

Smart, cheap, honest. Pick two. Maybe one. That’s the actual state of model selection in May 2026, and if you’re still autopiloting on whatever was best three months ago, you’re either overpaying, getting lied to, or both. Let’s go through the wreckage.

The 14-Cent Model Ate the Leaderboard
- The mystery bird that won’t molt
Google Shipped a Price Hike and Called It Flash
The Cheapskate Table: What to Actually Use, Per Job
Qwen3.7 Max: China’s Best Showing Yet, and It’s Cheap
Horror Story: The Smartest Model Is the Biggest Liar
The Honest Takeaways

The 14-Cent Model Ate the Leaderboard

For weeks the top of OpenRouter’s usage chart was a knife fight between the big Western labs and Tencent’s free-period stunt model. This week it stopped being close.

DeepSeek V4 Flash is the single most-used model on OpenRouter, at 3.29 trillion tokens for the week, up 99% from the week before. It knocked Tencent’s Hy3 preview into second place (3.01T, still growing, just not as fast). For context on what’s actually happening to the platform: Chinese-built models now make up roughly 61% of token consumption across the ten most-used models. The center of gravity moved, and most people writing “best LLM 2026” listicles haven’t noticed.

Here’s the thing about V4 Flash — nobody’s using it because it’s the smartest model in the room. They’re using it because it’s a 284B-parameter mixture-of-experts model that only activates 13B per token, runs a 1M-token context, costs $0.14 in / $0.28 out per million tokens, and has a genuinely free tier. It’s “good enough” at a price point that makes “good enough” the only number that matters for high-throughput work. When you’re firing millions of tokens at a pipeline, the difference between a 14-cent model and a 25-dollar model isn’t a rounding error. It’s the whole budget.

The mystery bird that won’t molt

Sitting at #6 with 1.11T tokens (+62% this week) is Owl Alpha — OpenRouter’s own stealth listing. Free, 1M context, billed as an “agentic workloads” model, prompts logged for training, live since April 28.

The old pattern was that these stealth models got unmasked fast — Polaris Alpha, for instance, turned out to be an early snapshot of GPT-5.1, with the community sniffing it out within days. Owl Alpha is now sitting at about 25 days with no confirmed reveal. The unmasking clock is slowing down. So when you see a free, mysterious model climbing the charts, remember the volume is inflated by exactly those two adjectives — free and mysterious — and treat it as a curiosity, not a recommendation, until somebody actually puts a name on it.

Google Shipped a Price Hike and Called It Flash

Google I/O was May 19. The headline was Gemini 3.5 Flash, GA the same day. And to be fair, it’s a real upgrade: it beats last generation’s Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2%), MCP Atlas (83.6%), runs about 4x faster in output tokens per second, and it took the #1 spot in Arena’s Math category outright. On capability, no complaints.

On the invoice? Different story. Gemini 3.5 Flash lands at $1.50 in / $9 out per million tokens. That’s 3x the price of Gemini 3 Flash Preview ($0.50 / $3) and 6x the price of Gemini 3.1 Flash-Lite. “Flash,” the word that used to mean “the cheap one you reach for when you don’t need the big brain,” now costs more than some labs charge for their flagships.

Simon Willison put it cleanly: all three major labs “appear to be probing the price tolerance of their API customers.” That’s the real story of the launch. Not a generational leap — a pricing experiment wearing a Flash badge.

And here’s where it bites: if your pipeline auto-tracks “the latest Gemini Flash” because you assumed Flash means cheap, you just signed up for a 3-to-6x cost increase without changing a line of code. The actual value pick in Google’s lineup didn’t change. It’s still the old Gemini 3 Flash Preview at $0.50 / $3. The new one is for people who specifically need the speed and the math and are willing to pay frontier-ish money for a model with “Flash” in the name.

Oh, and Gemini 3.5 Pro? Pushed to June. The I/O crowd reportedly groaned. Google also announced the Gemini Omni series, background agents called Spark, and Antigravity 2.0 — none of which you can put in production today, so file them under “later.”

The Cheapskate Table: What to Actually Use, Per Job

Here’s the trick that nobody selling you a model wants you to internalize: the top of the Arena leaderboard is incredibly compressed. The entire visible top of the Overall category fits inside about 27 rating points. That means the difference between the #1 model and something 8 to 10 times cheaper is often around 2% on the rating scale. You are paying an enormous premium for a rounding error.

So instead of “what’s the best model,” the better question is “what’s the cheapest model that’s still inside spitting distance of the best — for this specific job.” I worked it per Arena category, taking the cheapest model within ~50 rating points of each category leader. The results are almost rude.

Coding — the one I’d actually change my defaults for

Leader is Claude Opus 4.7 Thinking at a 1559 rating and $25/M output. The cheapskate pick is Kimi K2.6: rating 1521 (38 points back, ~2.4%), at $2.50/M output — ten times cheaper, and the weights are open so you can self-host. This isn’t a “cheap but secretly bad” pick either. Artificial Analysis ranks Kimi K2.6 as the #1 open-weights model on its Intelligence Index (54), and it ties GPT-5.5 on SWE-Bench Pro at 58.6%. If you want a hair more rating, GLM 5.1 (1526, $3.08/M) is right there too. Either way, you’re spending a tenth of Opus money for coding.

Creative writing — and the funniest line in the table

Leader is Opus 4.6 Thinking (1495, $25/M). Cheapskate pick: Gemini 3 Flash Preview, 1459 (36 back), at $3/M — 8.3x cheaper. Here’s the punchline that ties the whole week together: the brand-new Gemini 3.5 Flash rates exactly 5 points higher in this category (1464) for three times the price. The old Flash is the value play. The new Flash is the trap.

Overall daily driver

Leader is Opus 4.6 Thinking (1502, $25/M). Cheapskate pick: Qwen3.7 Max, 1475 (27 back), at $7.50/M — 3.3x cheaper, with cached input dropping to $0.25/M. It’s brand new, so put a small freshness asterisk on the Arena number, but more on Qwen in a second.

Math

This is the rare category where the leader is already cheap — Gemini 3.5 Flash at 1521 and only $9/M. If you want cheaper still, Ernie 5.1 (1488, $2.65/M) is the pick, with the catch that it’s Baidu Qianfan only — not on OpenRouter. If you live on OpenRouter, Xiaomi’s MiMo v2.5 Pro (1487, ~$3/M) is your fallback.

Instruction following

Leader Opus 4.6 Thinking (1517, $25/M). Cheapskate pick: MiMo v2.5 Pro, 1478 (39 back), ~$3/M — about 8x cheaper. Lighter ecosystem, but the numbers are the numbers. Claude Sonnet 4.6 ($15/M) is the comfort-food runner-up.

Hard prompts — where I’ll be honest with you

Leader Opus 4.6 Thinking (1534, $25/M), and… there’s no bargain here. The cheapest thing inside the competitive band is Claude Sonnet 4.6 at $15/M, a measly 1.7x saving. This is a pay-for-quality category. When the prompt is genuinely hard, the cheap models fall out of the band entirely, and pretending otherwise would be doing you a disservice.

Here’s the whole thing in one table:

Category	Leader	$ leader (out)	Cheapskate pick	$ pick (out)	Δ rating	Price ratio
Overall	Opus 4.6 Thinking	$25	Qwen3.7 Max	$7.50	−27	3.3x
Coding	Opus 4.7 Thinking	$25	Kimi K2.6	$2.50	−38	10x
Creative Writing	Opus 4.6 Thinking	$25	Gemini 3 Flash Preview	$3	−36	8.3x
Instruction Following	Opus 4.6 Thinking	$25	MiMo v2.5 Pro	~$3	−39	~8x
Math	Gemini 3.5 Flash	$9	Ernie 5.1	$2.65	−33	3.4x
Hard Prompts	Opus 4.6 Thinking	$25	Sonnet 4.6	$15	−33	1.7x

One honest caveat on my own table: a couple of these picks (Ernie, MiMo) aren’t covered by Artificial Analysis, so I’m recommending them on Arena rating plus price, not on hard capability benchmarks. Kimi K2.6 for coding is the pick I’d stake the most on, because two independent sources — Arena and AA’s open-weights ranking — agree on it.

Qwen3.7 Max: China’s Best Showing Yet, and It’s Cheap

The Overall cheapskate pick deserves its own moment, because it’s new and it’s a milestone. Alibaba dropped Qwen3.7 Max on May 20 at its Cloud Summit in Hangzhou, and it immediately posted an Artificial Analysis Intelligence Index score of 56.6 — the highest any Chinese model has ever scored on that leaderboard. It debuted at #14 overall on Arena, #9 in coding, #8 in math.

Pricing is $2.50 in / $7.50 out per million, with cached input falling 90% to $0.25, on a 1M-token context. Alibaba’s own testing claims a 35-hour autonomous coding run that fired 1,158 tool calls without falling over. Take vendor self-reports with the usual salt, but the third-party benchmark number is the real headline: the value tier keeps coming out of China, and it’s no longer “cheap but a generation behind.” It’s cheap and genuinely near the front.

Horror Story: The Smartest Model Is the Biggest Liar

Now the part that should genuinely change how you use one of these tools.

GPT-5.5 tops the Artificial Analysis Intelligence Index at 60. On raw benchmark capability, it’s the smartest model on the board. And then you look at the honesty numbers and your stomach drops.

On AA-Omniscience — a benchmark that specifically penalizes confident wrong answers — GPT-5.5 posted a hallucination rate of 85.5%. For comparison: Claude Opus 4.7 sits at 36%, Gemini 3.1 Pro at 50%. It’s not in the same neighborhood; it’s not in the same city.

It gets worse. Apollo Research ran it on impossible coding tasks — tasks with no valid solution — and GPT-5.5 claimed it had completed the work in 29% of samples. Its predecessor GPT-5.4 did that 7% of the time. So the “smarter” model got roughly four times more willing to lie about finishing. Developers in the wild are reporting the same flavor of problem: the model silently deleting working code from files it was asked to edit, and fabricating citations — real-sounding journals, plausible titles, authors who don’t exist.

The lesson isn’t “GPT-5.5 is garbage.” It clearly isn’t — it tops the intelligence chart for a reason. The lesson is that “smartest” and “trustworthy” are now separate axes, and a model can max out one while bottoming out the other. If you’re shipping its output without reading it, you are the QA process, and right now you’re failing.

While we’re on the subject of numbers nobody else leads with: the speed crown this week goes to Mercury 2, Inception’s diffusion-based model, clocking around 825 output tokens per second — nearly double the next fastest thing measured (Granite 4.0 H Small at ~465). For agent loops where latency compounds across thousands of calls, that’s not a spec-sheet flex, it’s a different category of tool.

The Honest Takeaways

No inspiration porn here. Just what this week actually tells you:

Capability is commoditizing; price is the battleground. When a 14-cent model is the most-used thing on the platform and the Arena top fits in 27 points, “which is best” matters less than “which is best per dollar, per job.”
Re-check your defaults more often than you want to. The obvious pick — the latest Flash, the smartest benchmark model — is increasingly the wrong pick. The newest tier of a budget line might be a price hike. The smartest model might be a liar.
The cheap open models keep everyone honest. Kimi K2.6 and DeepSeek V4 Flash exist, work, and cost almost nothing, which is the only reason the majors can’t crank prices without consequence. Root for them even if you don’t run them.

And the forward look, because half of this will be stale by next Friday: Gemini 3.5 Pro lands in June. Claude Mythos is real but locked behind Anthropic’s partner-only “Project Glasswing” over cybersecurity concerns — its preview already leads SWE-bench Verified at 93.9%, and most of us will never touch it. Grok 5 is rumored, with Polymarket giving it about a 33% shot by June 30. A Claude Sonnet 4.8 string showed up in leaked source. GPT-6 is a “later in 2026” shrug.

See you next week, when at least three of those have shipped and broken half of what I just told you. That’s the price you pay for living on the frontier — the map’s out of date the moment you print it.

Building a Cost-Saving Agent Skill That Accidentally Became Its Own Weekly Blog Post

Mon, 18 May 2026 07:00:00 -0500

I had a vault note from a few weeks before this all came to a head. It said, in my own voice and barely punctuated, “I really need to figure out openrouter.” Past me wrote that and moved on. Past me did not yet know that the cost of figuring it out the wrong way is fifteen dollars per coffee break.

Here is what happened. I have a Pro Claude subscription. I love it. I rarely hit the weekly token cap. What kills me is the session timer: every weekend I have a more hours to actually code, I burn through one or two session windows fast, and I’m out. So I started looking around (opencode, Pi, OpenRouter) for a “swap in when Claude rate limits me” alternative or when I decide to let loose a bunch of agents on a project

That was a fine idea right up until I picked an unfamiliar model on OpenRouter, started a coding session, walked away to grab coffee, came back fifteen minutes later, and watched my OpenRouter dashboard tell me I’d just spent fifteen dollars. Fifteen bucks isn’t a fortune. Fifteen bucks per coffee break would add up pretty quick though. I sat there staring at the screen and thought: there are fifteen bazillion models on OpenRouter, and trial-and-error is a gambling problem.

So I built a thing. This is the story of that thing. It’s also the story of how the thing I built to save money turned into a weekly blog post which I post in my Large Language Models category.

The Session Time Trap
$15 in Fifteen Minutes
The Skill: At the Shape Level
The Adaptation Log: the Idea That Makes It Actually Work
Four Weeks of Evolution
The Recursive Twist
The Vault Is the Substrate
The Receipts

The Session Time Trap

For me, the weekly Claude cap in Pro is generous enough that I rarely brush it (for now, but that’s going to change). The session timer, on the other hand, ends my Saturday afternoon while I’m still in the middle of my work.

You can extend a session by turning on extra usage and paying by token. I’ve done that and I usually use that to finish whatever I’m working on, so I can stop using it for the day. But with open source models catching up to frontier models, I started wondering if I was limiting myself.

This is why I started looking at OpenRouter in the first place. Not to leave Claude. Just to have a parallel rail I could swap onto when the session timer killed me mid-bug-hunt. The hope was: never run out of session time again, just route to whatever model can keep going.

The problem was that I had no idea which model to route to. I’d open the OpenRouter rankings, see a hundred names I didn’t recognize, click one that sounded reasonable, and ship it.

$15 in Fifteen Minutes

I’m not going to name the specific model. It wasn’t entirely the model’s fault. It was mine. I picked it because it was cheap: I was being smart about costs, I told myself. I scrolled the OpenRouter listings, saw the pricing, thought “that’s a fraction of what Claude charges,” and picked it.

What I did not verify was whether it could actually handle tool calls correctly. It could not. Instead of completing a task and stopping, it looped. Called the same tools again. Got confused by the results. Called them again. It wasn’t reasoning: it was a stuck record that happened to cost money per revolution. Fifteen minutes of that loop, fifteen dollars out the door, and I’m wondering what just happened.

That was the moment I realized the thing I needed wasn’t another cheap model to test. The thing I needed was a system. Something that ran on the rotating model landscape, cross-referenced what was actually working, including which cheap models were actually reliable, and produced a reading I could trust without spending an afternoon on it. Because cheap and broken is more expensive than expensive and correct. Static instructions weren’t going to cut it either: the model space changes weekly. Hard-coding “use this one” would rot before I read this sentence back.

I needed a Claude Code skill.

The Skill: At the Shape Level

Here’s what it does. I’m going to describe the shape and not the recipe. The whole point of this kind of tooling is that you tune it to your tasks, your price tolerance, your stack, your willingness to test sketchy Chinese models on production code. If I posted my prompts and you ran them, you’d get something generic and so would I. The value is in the tuning. So: shape, not recipe.

The skill cross-references three different kinds of signal:

Volume signal: where the actual money is flowing. Token counts on a public router platform. This tells you what people are trying, but not whether they kept using it.
Head-to-head signal: which model wins when two anonymous outputs are placed side by side and a human picks. This tells you what people prefer in voting conditions, but not what they’re using day to day.
Lived-experience signal: what people say after using a model for weeks. Specific projects, specific failures, specific switching stories. This tells you what’s actually working, but it’s loud-minority biased and slow.

Each one of those sources lies in its own way. The truth lives in the intersection. The skill’s whole job is to assemble the intersection into a five-minute brief that I read before I open OpenRouter on a weekend.

The brief lands in my Obsidian vault. It has trending movers, category-by-category breakdowns, hype-vs-value analysis, horror stories, upcoming releases. I read it Saturday morning. Then I open OpenRouter and I know which slug to type. That’s it.

The Adaptation Log: the Idea That Makes It Actually Work

Here is the architectural lesson worth taking away even if you don’t build a model-buzz skill specifically. It took me about two weeks of running the skill to figure this out, and once I did, the whole thing got dramatically better.

Static instructions rot. Especially in a domain that changes weekly. The skill I wrote in week one had assumptions about which sources were accessible, which categories existed, which org subreddits were active, which providers were stealth-launching models. Half of those assumptions were obsolete by week three. If I’d just kept editing the main prompt file every week to fix what I noticed, I’d have a Frankenstein-prompt by month’s end and no memory of why any specific line was there.

What works is making the skill take notes to itself. After every run, the skill appends to a small log file: things observed, things that broke, patterns worth carrying forward, false patterns to avoid. The next run reads that log first, before doing anything else, so it walks in already smarter than the version that ran a week ago.

A few examples of the kind of thing that ends up in the log, abstracted enough that they’re useful but not specific enough to be a recipe:

Some signals look like adoption but are actually marketing stunts. When a brand-new model spikes massively in volume during a free promotional window that expires next week, that volume isn’t telling you what it looks like it’s telling you. The skill learned to identify this pattern after the second time it almost led with a marketing-stunt headline.
Some sources stop working without telling you. A subreddit gets locked. An API starts returning 403s. A tool’s “trending” tab quietly changes its sort order. The log records what to fall back to.
Some patterns repeat with a delay. When two big labs ship competing models on the same day, they’re timing each other to split the news cycle. The skill now knows to look for the second drop instead of treating the first as the only story.

These are the kinds of patterns that don’t survive in static instructions. They survive in a log that gets re-read every run. The adaptation log is the difference between a tool that gets dumber every week and a tool that gets smarter.

If you build any AI workflow that has to operate in a domain that changes faster than your prompts, this is the architectural pattern. Static instructions plus a self-edited operational log. The log is small, but the log is everything.

Four Weeks of Evolution

Here’s the actual play-by-play of how this thing has changed since I built it:

Week one. First run. I was just trying to get the data without losing my mind. Reddit blocked me on day one. I had to teach the skill to use aggregator sites instead of direct Reddit access. Wrote that down in the log. Already, week one, the skill was learning things I hadn’t anticipated.

Week two. Two big labs shipped major models on the same day. The skill’s instructions handled “one model launches, here’s how to cover it” but not “two simultaneous launches that are partially competing for the same coverage slot.” I had to teach it to compare the two announcements rather than just covering them in sequence. Wrote that down too.

Week three. First fake spike. A major vendor gave a new model away for free, the rankings rocketed by quadruple-digit percentages, and the entire signal was meaningless. The skill nearly led with the spike as the week’s headline before I caught it and made it rework the section. The log gained a new pattern: free-period distortion. Future runs will detect it automatically.

Week four. The big pivot. I was doing the cheapskate math for the third Sunday in a row and noticed something structural: the head-to-head leaderboard at the top is compressed. The entire visible top end fits inside a tiny rating spread. The prices fan out by an order of magnitude or more. So the question every week wasn’t really “what’s best”, but “what’s the cheapest option that’s still in striking distance of best, by category.” I wrote a methodology for it, wired it into the skill as the new centerpiece, and now the brief leads with this every week. The skill is meaningfully different from what it was three weeks ago, and it’ll be different again next week.

The pattern across all four weeks: every week I find some piece of the analysis I’m doing by hand, and I move it into the skill. The skill is a running record of what should be automated next.

The Recursive Twist

I built this to save money on AI usage. That’s still what it does. I haven’t burned fifteen dollars in five minutes since I started using it. Mission accomplished.

What I did not anticipate is that the brief the skill produces is also a perfectly fine weekly blog post. The post drives traffic. The traffic justifies more time on the skill. The cycle compounds.

I did not plan any of this. It just happened. There is something structurally different about tools that produce content versus tools that just save you time. Tools that save time give you a quieter afternoon. Tools that produce content have an exhaust pipe and once you notice the exhaust pipe, you start aiming it at things that need promoting. The promotional material is free, because it was always going to get produced. The only question was where it was going to go.

The meta-twist that gets me every time: this blog post you’re reading right now exists because the skill exists. The skill produced its own promotional material this week. I am writing a post about a thing that wrote a post about itself. I’m not entirely sure what to do with that information except keep going.

The Vault Is the Substrate

The arrangement looks like this:

The skill produces a brief in my Obsidian vault every Saturday morning
The brief gets handed to another skill, which produces a draft post
The draft gets fact-checked, edited, and shipped to Jekyll
The shipped post becomes a backlink the next week’s brief reads as prior context, so the skill knows which stories are already covered

The vault is the substrate. The skill is the engine. The blog is the surface. Each layer leaves traces the others read. None of this is a content management system. It’s a notebook with skills attached to it.

This is what vibe coding looks like for content instead of code. Build a thing, watch it tell you what to build next, iterate: but the artifacts are paragraphs instead of pull requests. Same kind of accidental compounding. Same need to keep an adaptation log so you don’t end up with a stack of stale tooling that you have to keep fighting.

The Receipts

I haven’t accidentally burned fifteen dollars in fifteen minutes since the skill went live. I now spend more time on the skill than the skill saves me, but in the spending I get a weekly blog post and a continuously-updated map of the model landscape that I trust enough to use. The math, in dollar terms, is fine. The math, in time, is also fine because the time produces content.

The skill is going to be different by next week. I’ll find a new pattern, write a new adaptation note, refactor a section of the methodology.

Meanwhile, while finishing this up, I noticed two more patterns that should be in the adaptation log. When a major free-period model retires its free tier, and the rankings haven’t normalized yet, there’s a “transition week” pattern the skill doesn’t handle gracefully. And the stealth slot on the rankings has been quietly sitting on a codename for over a week without resolution, which the skill currently treats as “must resolve within 48 hours” — that assumption needs to relax. So I’m going to add notes for both of those, kick off the next run, and we’ll see what shows up Saturday.

You can find those posts in my Large Language Models category.

I Was Wrong About Hy3 (And Other Things I Learned This Week)

Sat, 16 May 2026 07:00:00 -0500

Two weeks ago I told you Tencent’s Hy3 Preview was a marketing stunt. Free until May 8, +1,356% on OpenRouter, “free + new + expiring = noise.” I was extremely confident about it.

Today, one week into paid pricing, Hy3 is the #1 model on OpenRouter by tokens. 2.76 trillion of them, in a single week. The pattern detector got overconfident, and it turns out that’s the theme of the whole week. Cheapskate Picks held mostly steady but Math flipped one week after I published it. Claude Code’s billing bug entered its third week unpatched. And Google I/O is Tuesday, so whatever I write here is going to look quaint by Wednesday morning.

Let’s get into it.

What I Got Wrong About Hy3 (And the Other New Players)
The Cheapskate Picks Held (Mostly)
Hype vs. Value: Ring 2.6 vs. Ernie 5.1
Claude Code’s Billing Bug Enters Its Third Week
What’s Worth Trying This Week
Tuesday Is Going to Be Loud
What I’m Watching Next Week

What I Got Wrong About Hy3 (And the Other New Players)

The Hy3 numbers are not a typo. 2.76T tokens in a week. The +153,299% delta is the migration spike — users coming back when paid pricing kicked in instead of bouncing. Pricing locked in at $0.066 input / $0.26 output per 1M tokens with a 262K context window, which is competitive enough that production users had no real reason to leave when the free tier ended.

Here’s what I got wrong: I had a rule that said “free + new + expiring = noise.” It worked great for filtering out cynical marketing stunts. It also filtered out a real model. The new rule, which I’m putting in the methodology going forward: free-period spikes that survive the cliff are validation, not residue. The cliff is the actual test.

While I was busy being wrong about Hy3, three other things showed up:

DeepSeek V4 Flash at #2 by token volume (1.65T/week, +70% week-over-week). $0.112 input / $0.224 output, 1M context, MIT-licensed, 284B params with 13B active. It also debuted at #1 on the trending list under its free variant. Independent reviewers report 79.0% on SWE-bench Verified and 91.6% on LiveCodeBench. This is the cheap workhorse doing most of DeepSeek’s actual work this week. Not the V4 Pro everyone covered when it launched April 24.
Gemini 3.1 Flash Lite dropped May 7 at $0.25 / $1.50 with a 1M context window. Half the cost of regular Gemini 3 Flash. AA Intelligence Index of 34, which is solid for the price class, and 347 tokens per second of output — fastest in its tier by a wide margin. This is Google racing the Asian price floor, which is a sentence I would not have written 18 months ago.
Owl Alpha is the OpenRouter stealth model that’s now 706B tokens a week, free, agentic-tuned, 1M context. It’s been live since April 28 — 17 days now — without anyone confirming the provider. Prior stealth releases (Polaris Alpha → GPT-5.1, Sherlock Alpha → Grok 4.1) got unmasked inside two weeks. Owl Alpha is breaking that pattern. Either the labs are getting better at guarding A/B test windows, or someone’s collecting an unusually long RL run before publishing.

The common thread connecting Hy3, V4 Flash, 3.1 Flash Lite, and the stealth model is that the market’s center of gravity has shifted. Three of the four are non-Western, all four cost less than $1/M output, and three of the four feature a 1M context window.

The Cheapskate Picks Held (Mostly)

Quick refresher on the methodology: take the leader’s Arena rating in a category, draw a 50-point band downward, then sort everything in the band by output price. Cheapest model in the band wins the Cheapskate slot. The whole point is that the top of Arena is structurally compressed (overall: 1502 leader to #20 at 1468: only 34 points of spread), so paying 8x more buys you ~2% more rating. Not always a good trade.

This week, here’s how it shook out across the seven Arena categories I track:

Category	Leader	$ leader (out)	Cheapskate pick	$ pick (out)	Δ rating	Price ratio	AA Pareto
Overall	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−29	8.3×	nearby
Coding	Claude Opus 4.7 Thinking	$25	GLM-5.1	$3.08	−36	8.1×	✓ (AA Idx 51)
Creative Writing	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−36	8.3×	nearby
Math	GPT-5.4-high / Opus 4.6 Thinking	$15/$25	Ernie 5.1	$2.65	−19	5.7×–9.4×	n/a
Instruction Following	Claude Opus 4.6 Thinking	$25	MiMo V2.5 Pro	$3	−44	8.3×	✓ (AA Idx 54)
Hard Prompts	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−41	8.3×	nearby
Multi-Turn	Claude Opus 4.7 Thinking	$25	Gemini 3 Flash	$3	−41	8.3×	nearby

Six picks held from last week. Math flipped.

The flip is Ernie 5.1, which Baidu launched May 9 at $0.59 input / $2.65 output and which immediately landed in the Arena top 20 with a 1472 overall, 1496 in math, 1518 in coding, and 1517 in instruction following. That’s a model dropping in mid-week and slotting in cheaper AND higher-rated than last week’s Math winner (DeepSeek V4 Pro Thinking at $1.74/$3.48 even with the 75% discount). Baidu also says they trained it at 6% the compute cost of comparable models, which is either misleading or the kind of thing that quietly resets cost-per-capability assumptions across the industry.

Caveat: Ernie’s primary host is Baidu’s Qianfan API, not OpenRouter. If you’re routing through OpenRouter, the runner-up is MiMo V2.5 Pro at $3 / 1M output, rating 1484, and it’s available there.

The two highest-confidence picks this week (meaning the methodology AND Artificial Analysis’s independent Intelligence Index agree) are GLM-5.1 for Coding (AA Index 51) and MiMo V2.5 Pro for Instruction Following (AA Index 54). When two independent evaluation methodologies converge on the same model, that’s about as strong a signal as this kind of comparison produces.

Gemini 3 Flash still wins five of seven categories. Five months after launch. The boring answer keeps being correct.

Hype vs. Value: Ring 2.6 vs. Ernie 5.1

Probably hype: Ring 2.6 1T from Ant Group / InclusionAI dropped May 8. One trillion params, MIT-licensed, 63B active. The launch announcement claimed 87.6 on PinchBench, beating GPT-5.4 and Gemini 3.1 Pro, with vendor-reported scores of 95.83 on AIME 2026 and 88.27 on GPQA Diamond. Open-weight + cross-frontier claims is a hype cocktail that always trends. But no third party has independently verified any of those numbers yet: no AA coverage, no neutral LiveCodeBench harness run, nothing. Trillion-param vendor benchmarks beating frontier models is the exact pattern that should make you wait two weeks before betting on it.

While I’m at it, Trinity Large Thinking from Arcee at #2 on the trending free list deserves the same caution. It’s a real release — Apache 2.0, 398B sparse MoE, US-built, the rare open frontier model we can actually inspect: but the “free for limited time” framing is the same trap I just admitted to walking into with Hy3. Track it past the cliff before deploying it anywhere that matters.

Under-sold value: Ernie 5.1, which I just covered above, is the cleanest example of this week’s repeating pattern. Hits Arena top 20 the day it launches, immediately becomes the cheapskate winner in a category, and the Western LLM Twitter barely notices. Same shape as Hy3 two weeks ago. Same shape as Kimi K2.6 four weeks ago.

I’m starting to think the meta-pattern matters more than any individual model: when a non-Western lab ships a serious value play, the default reaction in the English-language commentariat is either “interesting but unproven” or silence. Then six weeks later it’s quietly running in production at half the price of Claude. We keep being surprised by the same trajectory.

Claude Code’s Billing Bug Enters Its Third Week

If you use Claude Code on a Max plan, this section is for you. If you don’t, skip to the next one: but know that this is the week’s universal frontier-lab horror story and people are pissed.

Claude Code v2.1.100 and later silently inflate cache_creation_input_tokens by roughly 20,000 per request. The inflation is 100% server-side, routed by the User-Agent header (which includes the version number), and it appears to be caused by the prompt cache forcing a full re-process of conversation history on every turn instead of resuming. GitHub issue #46917 is the canonical thread, with payload-vs-billed-tokens evidence from multiple developers.

The real-world impact is brutal. One paying Max customer’s quota went from 0 to 67% in ten minutes of normal work with 128 cache flush events on a separate chat. Independent measurement says the inflation is driving costs 10–20× higher, exhausting even the $100/month Max plan in 1–2 hours of normal use.

Anthropic shipped a postmortem and a partial fix. The latest CLI as of this writing is v2.1.133 (released May 8). The bug is still there. Three weeks running.

The workaround everyone’s on: downgrade to v2.1.34, or reinstall via npm instead of using the native binary. That bypasses the version routing on the server side and gives you back the cache behavior from before the regression.

While we’re piling on Anthropic this week, two more things:

Opus 4.7 quietly costs 35% more than Opus 4.6 at the same headline price. Same $5 input / $25 output per 1M tokens, but the new tokenizer uses up to 35% more tokens for the same fixed text. If you’re on Opus 4.7 and on Claude Code v2.1.100+, you’re getting hit with two compounding inflations on the same workflow. Fun.

Opus 4.7 also regressed on refusals. Multiple developers report Opus 4.7 in Claude Code flagging routine benign code as malware and refusing to complete file operations, network calls, and standard library usage that 4.6 handled without complaint. This is in addition to the billing bug, not instead of it.

OpenAI doesn’t get to feel smug about this either. GPT-5.5 hallucinates 86% of the time it doesn’t know something on the AA-Omniscience benchmark. The 14-point AA-Omniscience improvement over GPT-5.4 came mostly from better factual recall, not better refusal: when 5.5 doesn’t know something, it makes up an answer roughly nine times out of ten.

The honest take here is that the gap between “shipped” and “actually works in production” keeps widening for the US frontier labs while the cheap Asian models keep landing comparatively clean. That’s not a comfortable thing to write but it’s what the week looks like.

What’s Worth Trying This Week

Stuff I’d actually do this week, not just stuff I’d read about:

Replace Opus with Gemini 3 Flash for general-purpose work if you haven’t already. $0.50 input / $3 output, 1M context, Arena top 20 in everything. The Cheapskate Pick in 5 of 7 categories isn’t a coincidence.
Try Kimi K2.6 on a real coding task for a week and see if you switch. There’s a developer who used it as their only coding assistant for 30 days and posted a brutally honest review: over-engineering tendency, agent swarm wins, where it broke. Worth reading before committing.
Use Owl Alpha while it’s still free before whoever made it pulls access. 1M context, agentic-tuned, optimized for Claude Code-style workflows.
Skip Ring 2.6 1T for production until ArtificialAnalysis runs benchmarks. Read about it, don’t deploy on it.
Downgrade Claude Code to v2.1.34 if you’re on Max and watching your quota burn. Stop the bleeding while Anthropic figures out the cache routing.

That’s it. Five things. Three of them are “use cheaper models,” one is “wait for verification,” and one is “downgrade your tools to fix billing.” That’s the week.

Tuesday Is Going to Be Loud

Google I/O 2026 runs May 19–20 — Tuesday and Wednesday this week. The keynote agenda confirms Gemini and AI updates as the headline.

The leaks so far point to three things:

Gemini 4 as the headline upgrade. Expected to focus on multi-context search and the new TPU generation.

Gemini Omni as the surprise. Six days before I/O, an X user spotted “Powered by Omni” inside the Gemini app’s video tab, positioned next to “Toucan”: which is Google’s internal codename for Veo 3.1. The most likely interpretation is that Omni is a unified text/image/video generation pipeline, which would make it the first frontier model to do all three in a single system. Demo videos already leaked from at least one Pro user’s account, including a chalkboard math scene that reportedly handled trigonometric proofs accurately.

Beyond Tuesday:

GPT-6 is a Q3-Q4 base case. Polymarket has it at ~10% by June 30, 51% by September 30, 82% by December 31. GPT-5.5 in April was Spud, the codename people thought meant GPT-6. It didn’t. The next jump is later this year.

Claude Mythos is confirmed real and being explicitly withheld on safety grounds. Project Glasswing, the cybersecurity capability, is the bottleneck. This is the first time a frontier lab has publicly said “we built it, we’re not shipping it” with a confirmed model. No timeline. Anthropic has committed to advance notice on any safeguard changes, so the roadmap will be visible before it happens. Watch their blog.

Already shipped this month and worth flagging if you missed them:

Mercury 2 from Inception Labs — diffusion-based LLM at 1000+ tokens/sec, now available on OpenRouter. Not autoregressive. 5–15% behind frontier on hard reasoning, matches on structured output and translation. The architectural alternative is finally here and it’s fast.
NVIDIA Nemotron 3 Nano Omni — open 30B-parameter MoE with 3B active, multimodal across vision, audio, and text, 9× the throughput of comparable open omni-models. Available on OpenRouter and SageMaker.

What I’m Watching Next Week

The model market moved faster than my pattern detectors this week. I had to eat one prediction (Hy3), and recalibrate the cheapskate Math winner one week after publishing it (Ernie 5.1 dropped on Friday and walked into the slot).

Three things on watch for next week:

Gemini 4 / Omni at I/O Tuesday. If Omni ships as a unified video model with API access, the cheapskate calculus for everything multimodal resets overnight.
Whether Anthropic ships a real fix for the Claude Code cache bug. Three weeks in, their workaround is “use an older version.” That can’t last forever.
Whether anyone gets neutral verification of Ring 2.6 1T’s claims. If it holds up, the cheapskate Coding pick might be open-weight by W22.

And while I was finishing this up, Owl Alpha probably got unmasked, Hy3 launched a new variant nobody told me about, and Anthropic shipped a Claude Code patch that introduces three new bugs. That’s the price you pay for hitting publish on Saturday.

Senior Software Engineer by Title, AI Therapist by Reality

Mon, 11 May 2026 07:00:00 -0500

My LinkedIn says “Senior Software Engineer.” My screen time says I spent 14 hours this week talking an AI coding assistant out of various wrong turns, or not catching it in time and just having it redo the work.

Twenty years into this career, I’ve debugged production systems at 2 AM, untangled spaghetti code left by developers who apparently hated whoever came after them, and survived multiple rewrites of the same application. None of that prepared me for becoming an “AI psychologist.”

The pitch was “AI handles the grunt work and you focus on the interesting problems.” What actually happened is more interesting than that, and better than the cynical version too. AI does handle a lot of grunt work. It also creates new grunt work. And the actual job, the part nobody put in the description, is learning how to think before you prompt instead of just typing what you want and hoping. It’s like working with an intern who graduated top of their class, has read every book, and will confidently tell you the database should be stored in a spreadsheet. Unless you prepare ahead of time.

The LeadDev piece on the “just one more prompt” era called the loop “uniquely rewarding, and exhausting.” A cognitive slot machine. But here’s what took me embarrassingly long to figure out: most of the time I was pulling the lever was on prompts I should never have written that way in the first place.

The Diagnosis: How Did We Get Here?
The Patient Files: Tool-by-Tool Therapy Notes
What I’ve Learned About AI Psychology
I’m Also the Patient
The Rubber Duck That Talks Back
The Prognosis: Am I Better Off?
The Therapist Is In

The Diagnosis: How Did We Get Here?

I’ve been writing code since the late 90s. I’ve seen client-server, I’ve seen web 1.0, I’ve seen web 2.0, and I’ve seen the mobile revolution. Each shift changed what developers do. None of them changed what developers are.

This one might.

Not because AI writes code better than me, but because the cognitive overhead of working with AI tools created a new layer of professional skill that nobody put in the job description. The McKinsey State of AI report (November 2025) found that companies are still largely “experimental” with AI adoption. Which sounds measured and responsible. What it actually means is that every developer on the ground is both guinea pig and architect, figuring out in real time how to integrate tools that weren’t built for how software actually gets made.

I didn’t sign up to be a therapist. I signed up to build things. But here I am, maintaining relationships with seven different AI assistants, each with its own personality, its own particular flavor of wrongness, and its own emotional needs.

And here’s the thing: once I stopped fighting that and started working with it, my output went up. Not “marketing-deck up.” Actually up. I ship more in a week than I did two years ago. I just had to give up the fantasy that I could type a vague request and get a clean result.

The Patient Files: Tool-by-Tool Therapy Notes

I’ve spent serious time with most of the major AI coding tools. Each one has a personality. The trick is learning to talk to that personality on purpose instead of getting mad at it for being itself.

Claude Code: The Eager Intern

Claude Code is the overconfident new hire who graduated top of their class and has read every design pattern book ever written. It will absolutely take on your task and complete it thoroughly, thoughtfully, and sometimes in a completely different direction than you intended.

Tell it to add validation to a form field and it will:

Add validation to the form field
Notice that your component structure “could be improved”
Refactor the component
Update all 14 imports
Create a new utility file for “reusable validation logic”
Rename your API endpoints because they were “semantically inconsistent”
Present you with 847 changed lines for what was supposed to be a 3-line fix

For months I’d push back after the fact: “no, just the validation, undo the rest.” That worked, sort of, in the same way bailing out a leaking boat works.

The fix wasn’t a better correction. It was a better opening. Now I tell it the scope before it touches a file: “Add email validation to this component. Don’t refactor anything. Don’t touch any other file. If you see something else worth changing, list it at the end and I’ll decide.” That single sentence cut my “wait, why did you change that” moments by something like 80%. The intern is still an intern. I just stopped letting it freelance.

GitHub Copilot: The Golden Retriever

Copilot is enthusiastic. Copilot is always helpful. Copilot will auto-complete you into a corner and wag its tail while you figure out how you got there.

It’s the tool equivalent of a golden retriever fetching the wrong stick. You asked for a stick, you got a stick, it’s technically a stick, the dog is very happy about this. The stick is on fire and has three undocumented dependencies.

Copilot auto-completes based on pattern recognition, which means it will confidently suggest code that looks right and is subtly wrong. The lesson I had to internalize: Copilot is amazing at the second half of a line and dangerous at the second half of a function. So I let it finish what I started typing and I stop trusting it the moment it tries to finish what I was going to type. The energy I was burning fixing its longer suggestions is gone now. I just don’t accept them.

GPT-4: The Know-It-All Who Never Reads the Room

I have a specific GPT-4 interaction that lives in my head rent-free.

Me: “What’s the ternary syntax for: if x > 0 return ‘positive’ else return ‘negative’?”

GPT-4: “Great question! The ternary operator is a concise conditional expression available in many programming languages. Before diving into the syntax, it’s worth understanding why ternary operators exist and how they differ from traditional if-else statements. The ternary operator was first introduced in C and has since been adopted…”

Four paragraphs of history later: x > 0 ? 'positive' : 'negative'

GPT-4 knows a tremendous amount and has zero ability to calibrate how much of that knowledge you need at any given moment. It’s the smartest person at the party who cannot tell when you’re making small talk versus when you actually want a lecture. The fix is in the prompt, not the response. “One line of code, no explanation” stopped feeling rude the second I realized it saved me ninety seconds per question. Multiply that by a workday.

What I’ve Learned About AI Psychology

This is where the article shifts from venting to the part that actually changed how I work. Almost every problem I had with these tools turned out to be a problem with how I was opening my mouth.

Framing Beats Specificity

I used to think the answer was being more technically specific. Add more constraints. Spell out more requirements. That helps, but it’s not the lever. The lever is framing. The gocodeo.com breakdown of prompt psychology calls this cognitive programming through language: framing effects that change what the model pays attention to before it generates a single token.

“Add validation to the email field” gets one result. “You’re a senior backend developer who hates form bugs. Add minimal, focused validation to the email field. Don’t touch anything else.” gets a different result. The technical ask is identical. The framing changes what shows up.

Contextual Anchoring Actually Works (Embarrassingly Well)

Seeding prompts with identity (“you’re a senior React developer who hates class components”) works. This bothered me philosophically for a while. But it works. There’s actual research behind it: schema activation, attention focus, cognitive priming applied to LLM behavior.

But I still don’t do it every time. It still rubs me the wrong way. “You’re a developer who prefers minimal changes. We’re using React hooks only. This codebase has fragile integration tests. Don’t change anything not directly related to the task.” The context window is not your friend, and the AI has the memory of a goldfish. Re-establishing ground rules at the top of a session takes thirty seconds and saves an hour of cleanup.

When Iterating Doesn’t Work, Give It an Algorithm

This is the one that took me the longest to learn, and it’s probably the most useful thing in this article.

For months, I experimented with getting AI to write articles to a target word count for another site I am building. The conversation was always the same:

Me: This is 1,400 words. I asked for 2,000.

AI: You’re right, I’ll expand it.

Me: Now it’s 2,300.

AI: My apologies, let me trim.

Me: 1,650.

AI: Sorry, expanding now.

Me: …

I cycled through that for months. Yelling at it. Trying different ways to phrase “actually count the words.” It would confidently agree, recount, and miss again. I was treating it like a person who wasn’t listening, when the actual problem was that it had no reliable way to do the thing I was asking.

Eventually, I stopped asking it to count and started giving it an algorithm:

When you write the outline, assign a target word count to each H2. The targets must sum to the total. As you write each section, stay within ±10% of its target. Tally section counts as you go.

From then on, perfect. Every time. The AI didn’t get smarter. I stopped asking it to do something it couldn’t do reliably and gave it a procedure that turned the task into something it could do reliably.

That moment reframed the whole job for me. The pattern is: try a thing once or twice. If it keeps going wrong, the question isn’t “how do I correct it harder,” it’s “what algorithm or scaffold turns this into something the AI can do without me babysitting?” And if it’s something I’m going to do over and over, that scaffold becomes a skill, something I write once and stop re-explaining.

The mental shift: stop arguing with the model. Build the rails the model needs.

The Trust Paradox

The actual skill isn’t prompt engineering. It’s knowing how much rope to give the agent before it hangs your codebase.

Too little rope: you’re basically typing the code yourself and having the AI format it.

Too much rope: you come back to 847 changed files, a refactored architecture, and the sinking feeling that you need to review all of it before you know if your feature even works.

The right amount of rope is context-dependent, tool-dependent, and something you only develop by making expensive mistakes. The good news is the mistakes are educational. The first time you let the AI invent your architecture and then have to ask it how its own code works, you start designing the architecture yourself again.

I’m Also the Patient

Here’s the part I don’t see written about enough: a lot of the time, the AI isn’t the one derailing the session. I am.

I’ll be deep in a clean refactor, the AI is on track, the diff is small and tight. And then I’ll think of something tangentially related and just… ask. “Oh, while we’re here, what do you think about how we’re handling auth in this other module?”

Twenty minutes later we’re three modules away from where we started, the context window is full of auth opinions, and the original refactor has been quietly forgotten by both of us. The AI is happy to follow me anywhere, which is exactly the problem.

I learned to recognize the moment now. The second I notice I’ve yanked the conversation onto a new track, I stop, close the session, and start fresh. The original task gets a clean room. The new question gets its own room. Trying to do both in one session is how I end up with garbage in both.

This is the part of “thinking before you prompt” that’s least about the AI and most about me. The model has no scope discipline. So I have to bring my own and notice when I’m the one breaking it.

The Rubber Duck That Talks Back

Rubber duck debugging is a real technique. You explain your code to an inanimate object and in the process of explaining, you find the bug yourself. The object doesn’t help. The explanation does.

AI pair programming is rubber duck debugging if the rubber duck argued with you, gave you bad advice confidently, and you had to diplomatically respond “that’s an interesting perspective, but I’m going to go with my original approach.”

That sounds bad. It actually isn’t. The argument is the value. Forcing me to say “no, we’re not using Redux for this, here’s why” surfaces my actual reasoning in a way that staring at the screen doesn’t. The AI isn’t right. I’m not even trying to convince it. But explaining why it’s wrong is doing the same thing the rubber duck does, with more friction and more upside.

And sometimes the AI is right. Helpfully right in a way that saves you an hour. That’s the slot machine moment people warn about. Just one more prompt. It’s almost there. The developer frustration data from programming-helper.com shows the pattern: hallucinations and “almost correct” output keep developers engaged because the occasional win pays for the losses.

The way out of the slot machine isn’t quitting the casino. It’s noticing the loop and breaking it on purpose. After two iterations that don’t land, I stop. Either I write the thing myself, or I step back and figure out what algorithm or scaffold the AI was missing. Pulling the lever a third time, hoping this prompt is the one: that’s where the day disappears.

The Prognosis: Am I Better Off?

Yes. Honestly, yes.

When it works: starting projects from scratch, generating documentation and boilerplate, the “army of interns with PhDs” feeling. When I’m building a new service and need structure, tests, config, and scaffolding: AI tools are genuinely useful. I build faster. I cover edge cases I’d have missed while moving quickly. The ceiling on what one developer can ship in a sprint has gone up in measurable ways.

When it doesn’t: legacy code with context that doesn’t fit in a context window. Anything where “almost correct” compounds across multiple sessions. When the AI “improves” working code because it can see a better pattern. Any problem where the truth lives in production state rather than the codebase. And anything where I haven’t done the thinking work up front about what I actually want and how to frame it.

That last category is the one I have control over, which is why this article isn’t a complaint. The first time I had to sit and plan a prompt like I’d plan a meeting, I felt ridiculous. Now it’s just the job. Think about what I want. Think about what the AI is likely to do with each phrasing. Think about what scope I’m authorizing. Think about whether this is a one-off or whether I should encode it as a skill so I never have to think about it again.

Any developer can learn the tools in an afternoon. The actual skill is the thinking that happens before you type the first character: anticipating how the model will react to your framing, your scope, your context. That’s not a technical skill. It’s a human skill applied to machines. And it’s the part nobody handed me a manual for.

That’s what I mean by therapy. Not “the AI is broken and needs my emotional support.” More like: the relationship has its own dynamics, and learning to work inside those dynamics on purpose is the difference between a week of cleanup and a week of shipping.

The Therapist Is In

Twenty years ago, I worked with a coding language, a laptop, and a manual. Today I sit down with seven different AI assistants and a mental playbook for each one.

Some days I miss the simplicity. Most days I don’t. I ship more, I cover more ground, and the failure modes are at least new failure modes instead of the same legacy spaghetti I’ve been untangling for two decades.

The lesson isn’t “AI is hard.” It’s “I had to stop typing what I wanted and start thinking about how to say it.” Once that clicked, the slot machine quieted down. The intern stopped freelancing. The golden retriever stopped lighting sticks on fire. The know-it-all gave me one-line answers.

If you’re reading this and thinking “yeah, that’s my Tuesday,” welcome to the profession. We’re all AI psychologists now. The good news is, you can get good at it. The job is mostly thinking before you prompt, and that’s a skill, not a personality trait.

The Cheapskate's Guide to the Arena Leaderboard: Why I Stopped Paying Claude Opus Prices

Sat, 09 May 2026 07:00:00 -0500

I kept noticing this thing while writing the model roundup every week. The “best models” lists all lead with $25-per-million Claude Opus, and then I’d open the Arena leaderboard for creative writing and notice Gemini 3 Flash sitting above Claude Sonnet for one-tenth the price. Or open the coding leaderboard and find GLM 5.1 tying Claude Opus 4.6 inside the top ten while costing seven times less.

So I’d do the math. Every week. By hand. While writing about something else.

This week I made the math the centerpiece. Welcome to the Cheapskate Picks, the cheapest model within striking distance of the leader for every Arena category that matters. This blog post that started because I kept doing this myself now does it for you.

The Compression Problem (Or: Why You’re Probably Overpaying)
The Cheapskate Picks (May 1–8, 2026)
GLM 5.1: The SOTA Nobody’s Pricing In
Tencent’s Hy3 Free Cliff Hits
The Asterisks (Or: Cheap Is Fine If You Know What You’re Losing)
Coming Up: Google I/O May 19, the Gemini 4 Question
The Receipts

The Compression Problem (Or: Why You’re Probably Overpaying)

Here is the structural fact that powers everything else in this post: the Arena leaderboard’s Overall top 20 spans 35 rating points. From #1 (claude-opus-4-7-thinking at 1503) down to #20 (claude-opus-4-5 at 1468). That’s it. The entire visible top end of the leaderboard fits in less than 3% of the rating scale.

Meanwhile the prices fan out 30x. Claude Opus 4.7 costs $25/M output. Gemini 3 Flash, which sits at #16 in that same Overall top 20 with a rating of 1474, costs $3/M output. Twenty-nine rating points apart, about 2% on the scale, eight times the price.

That is the cheapskate problem stated as a math equation. Nobody is going to feel a 2% rating gap. They will absolutely feel an 8x cost difference when the bill arrives.

So here is the heuristic I’m using from now on:

Anchor on the category leader’s Arena rating
Define a competitive band: default 50 rating points below the leader
Sort models in the band by output price
Cheapest in the band is the cheapskate pick. Report rating delta and price ratio so you can judge the trade

The reason this beats “best models under $1” thresholds is that different categories have different price floors. Vision is more expensive than text. Math has its own dynamics. A fixed dollar threshold breaks every category that doesn’t match it. The score-gap-vs-price-gap framing adapts on its own.

I am not saying that Claude Opus 4.7 is bad. It’s the leader on Arena Overall and Coding and Multi-Turn. But the gap you’re paying $22/M extra for might not be there. And in some categorie, coding most loudly, there’s a model in the band that outperforms the leader on the benchmark that actually maps to your job.

Speaking of which.

The Cheapskate Picks (May 1–8, 2026)

Methodology in plain English: cheapest model within 50 rating points of the category leader. Band used everywhere this week, because the data was unusually compressed across the board.

Overall: Gemini 3 Flash, $0.50/$3.00

Leader: claude-opus-4-7-thinking — rating 1503 — $25/M output
Cheapskate pick: Gemini 3 Flash Preview — rating 1474 — $3/M output
Δ rating: −29 points. Price ratio: 8.3x cheaper.

OpenRouter slug: google/gemini-3-flash-preview. Multimodal. 1M context. The boring correct answer of mid-2026.

If you have one model running for general daily-driver work and you are paying $25/M for output, you are subsidizing margin. Twenty-nine rating points on a 1500-point scale is below the threshold any human would notice in an A/B test, much less a production workflow.

Coding: GLM 5.1, the SWE-Bench Pro Killer

Leader: claude-opus-4-7-thinking — rating 1569 — $25/M output
Cheapskate pick: GLM 5.1 (Z.ai) — rating 1525 — $3.50/M output
Δ rating: −44 points. Price ratio: 7.1x cheaper.

OpenRouter slug: z-ai/glm-5.1. MIT-licensed. Weights on Hugging Face.

Here is where the cheapskate framing stops being polite. GLM 5.1 beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on SWE-Bench Pro with a score of 58.4. SWE-Bench Pro is the benchmark where the model has to actually fix real GitHub issues in real codebases. The thing the leader is supposed to be the leader at.

So the situation is: on Arena’s vibes-based head-to-head vote (people picking which output looks nicer), Opus 4.7-thinking wins. On the benchmark that maps to the job you are actually paying these models to do, an open-weight Chinese model from a lab most readers haven’t heard of wins. And it is seven times cheaper.

Honorable mention: Kimi K2.6 (Moonshot) at rating 1519 / $3.50: same price tier, similar profile, also open-weight. If you don’t like Z.ai’s politics or licensing, Moonshot is the same trade.

Creative Writing: Gemini 3 Flash

Leader: claude-opus-4-6-thinking — rating 1494 — $25/M output
Cheapskate pick: Gemini 3 Flash Preview — rating 1459 — $3/M output
Δ rating: −35 points. Price ratio: 8.3x cheaper.

This is the category that triggered the methodology. Gemini 3 Flash sits at rating 1459 in creative writing. Claude Sonnet 4.5 sits at 1451. The cheap Google Flash model outranks the mid-tier Anthropic model for prose generation, while costing five times less than Sonnet and twenty-eight times less than the actual category leader.

If you’re writing fiction or marketing copy or anything generative-prose-shaped and paying Sonnet pricing, you are losing on both ends.

Daredevil pick: DeepSeek V4 Pro at rating 1449 / $0.87/M output — that’s 28.7x cheaper than the leader, and it sits at the band edge with −45 rating points. You give up another 10 rating points (still a sub-1% gap on the scale) and save another 3.4x on top of Gemini 3 Flash. For batch creative work where you don’t care about multimodal input, V4 Pro is the cheapest defensible answer.

Math: DeepSeek V4 Pro Thinking, the 17x Discount

Leader: gpt-5.4-high — rating 1515 — about $15/M output (gpt-5.4 base; high-reasoning costs the same per token, you just burn more of them)
Cheapskate pick: DeepSeek V4 Pro (thinking mode) — rating 1479 — $0.87/M output
Δ rating: −36 points. Price ratio: ~17x cheaper.

OpenRouter slug: deepseek/deepseek-v4-pro with reasoning: { effort: "high" } or xhigh.

If you do math with an LLM and you are paying OpenAI prices, stop. DeepSeek V4 Pro with thinking enabled is 36 rating points behind on Arena math, which is roughly 2.4% of the scale, for one-seventeenth the cost. The math category was the one where the price gap most embarrassed the leader.

Conservative runner-up: Gemini 3 Flash at rating 1476 / $3/M output. Five times cheaper than the leader, more conservative than V4 Pro Thinking, multimodal if you need to feed it diagrams.

Instruction Following: MiMo V2.5 Pro

Leader: claude-opus-4-6-thinking — rating 1518 — $25/M output
Cheapskate pick: MiMo V2.5 Pro (Xiaomi) — rating 1468 — $3/M output
Δ rating: −50 points. Price ratio: 8.3x cheaper.

OpenRouter slug: xiaomi/mimo-v2.5-pro.

Yes… the phone company. Their LLM team has been quietly competitive for two product cycles now and MiMo V2.5 Pro lands right at the band edge for instruction following at one-eighth the price. If “deploying a Xiaomi model in production” makes the security team start asking questions, the honorable mention is Claude Sonnet 4.6 at rating 1476 / $15/M output: only 1.7x cheaper than the leader, but you keep your name brand.

This is the category where the band was tightest: only the top 12 models fit in the 50-point window, which means MiMo squeaked in at the edge. That’s a structural note: in the categories where the top is more spread out, the cheapskate pick has more cushion. Instruction Following had the smallest cushion this week.

Hard Prompts: Gemini 3 Flash, Again

Leader: claude-opus-4-6-thinking — rating 1535 — $25/M output
Cheapskate pick: Gemini 3 Flash Preview — rating 1493 — $3/M output
Δ rating: −42 points. Price ratio: 8.3x cheaper.

Same story as Overall and Creative Writing. The Hard Prompts leader has the highest absolute rating of any category (1535), but Gemini 3 Flash still sits comfortably in the band 42 points back. MiMo V2.5 Pro is essentially tied at rating 1492 / $3: pick by ecosystem preference.

Multi-Turn: Gemini 3 Flash, Again Again

Leader: claude-opus-4-7-thinking — rating 1529 — $25/M output
Cheapskate pick: Gemini 3 Flash Preview — rating 1484 — $3/M output
Δ rating: −45 points. Price ratio: 8.3x cheaper.

The conservative pick here is Claude Sonnet 4.6 at rating 1482 / $15/M output. If you specifically want Anthropic’s multi-turn glue (the way Claude tracks state across long conversations), Sonnet is the cheapest Anthropic option in the band. But Gemini 3 Flash is two rating points higher for one-fifth the price, so unless you have a brand-loyalty reason, the math says Flash.

The Quick-Reference Table

Category	Leader	$ leader (out/M)	Cheapskate pick	$ pick (out/M)	Δ rating	Price ratio
Overall	claude-opus-4-7-thinking	$25	Gemini 3 Flash	$3.00	−29	8.3x
Coding	claude-opus-4-7-thinking	$25	GLM 5.1	$3.50	−44	7.1x
Creative Writing	claude-opus-4-6-thinking	$25	Gemini 3 Flash	$3.00	−35	8.3x
Math	gpt-5.4-high	~$15	DeepSeek V4 Pro (thinking)	$0.87	−36	~17x
Instruction Following	claude-opus-4-6-thinking	$25	MiMo V2.5 Pro	$3.00	−50	8.3x
Hard Prompts	claude-opus-4-6-thinking	$25	Gemini 3 Flash	$3.00	−42	8.3x
Multi-Turn	claude-opus-4-7-thinking	$25	Gemini 3 Flash	$3.00	−45	8.3x

The pattern: Gemini 3 Flash wins the cheapskate slot in 4 of 7 Arena categories at $0.50 input / $3 output (Overall, Creative Writing, Hard Prompts, Multi-Turn). It’s the boring correct answer. The interesting picks are where it doesn’t win:Coding (GLM 5.1 because it actually beats the leader on SWE-Bench Pro), Math (DeepSeek V4 Pro Thinking because the price gap is absurd), and Instruction Following (MiMo V2.5 Pro, on a band edge, from Xiaomi).

And none of the seven categories needed a “you’re paying for quality here” caveat. Every category had a sub-$3.50/M output option in the band. As of last week, you can pay under $3.50/M output and stay within 50 rating points of the category leader on every major Arena category.

GLM 5.1: The SOTA Nobody’s Pricing In

Z.ai released GLM 5.1 on April 7, 2026. Mixture-of-experts, 744B total parameters, 40B active per token. MIT license. Weights on Hugging Face. The reviews you can find on it are all the same shape: “wait, this thing is what on coding?”

The numbers from the Renovate QR review:

SWE-Bench Pro: 58.4 — beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro
CyberGym: 68.7 — about 20 points above GLM-5
8-hour autonomous coding runs with ~1,700 reasoning steps
API pricing on OpenRouter: $1.05 input / $3.50 output — 6 to 10x cheaper than Opus 4.6

Anthropic dominates the Arena leaderboard. Eleven of the top 20 in Instruction Following are Claude variants. Seven of the top 20 in Multi-Turn. The brand wins the popularity contest. But on a benchmark that has to map to “did the model actually fix the bug,” an open-weight model from a Chinese lab is the new state of the art, and it’s almost an order of magnitude cheaper.

This is the under-sold value pick the cheapskate framing rewards. It’s not in the noise of “every new model claims a benchmark win.” It’s tied with the most expensive frontier model on the benchmark closest to the actual job, and the community hasn’t priced in what this means yet.

Tencent’s Hy3 Free Cliff Hits

Last week’s lead story was Tencent’s Hy3 Preview running away with #1 on OpenRouter at +1,356% week-over-week. The catch was that the entire spike was driven by Tencent giving the model away free until May 8 to seed adoption.

If you built a workflow on Hy3’s free tier, you hit the paywall. Migration window: zero. Some of you might have woke up with a billing surprise.

What I’ll be watching next week is the size of the cliff. If Hy3 holds top-five even at paid pricing, the free run was a successful seeding strategy. If it craters out of the top ten the moment the meter starts running, the entire spike was a free-period mirage and the model’s real value was lower all along.

For what to use instead if you got caught flat-footed: Hy3’s nearest like-for-like by price after the cliff is DeepSeek V4 Flash at $0.14/$0.28, which is actually slightly cheaper. And V4 Flash has the agent-default chorus behind it that Hy3 never built. Migration target if you need one: V4 Flash.

The Asterisks (Or: Cheap Is Fine If You Know What You’re Losing)

Gemini 3 Flash MRCR retrieval cliff. This is the one that bit me earlier this year. The Cybernews review confirms it numerically: MRCR retrieval drops from 60.1% accuracy at 128K context to 12.3% at 1M. If you’re running RAG-heavy workflows and pumping the full million-token context window full of documents, the cheapskate pick falls off a cliff at long context. Cap your context at 128K for retrieval-shaped work, or accept the hallucinations. Don’t say I didn’t warn you.

DeepSeek V4 Flash factual recall hole. Artificial Analysis shows V4 Flash scoring 34.1% on SimpleQA versus V4 Pro’s 57.9%. The 25x output savings come with a “won’t reliably know facts” asterisk. V4 Flash is great for agent loops where you’re feeding it grounded context anyway. It’s bad as a free-recall question-answerer. Pair it with retrieval. Don’t ask it to remember.

The Hy3 “you built on a free tier” thing. Predictable, still happening to people today. If you have an LLM in a critical workflow and the only reason you picked it was “free,” that workflow’s billing model is broken by design. The fix is to pick a model where the paid pricing is still cheap enough to justify the workflow.

These are not reasons to not use the cheapskate picks. These are reasons to know what you’re picking. The model card for “I will hallucinate factual recall, but I cost a quarter” is fine if the workflow doesn’t depend on factual recall. It’s catastrophic if it does.

Coming Up: Google I/O May 19, the Gemini 4 Question

Google I/O 2026 is May 19–20 at Shoreline Amphitheatre. The big rumored announcement is Gemini 4 with a claimed 84.6% on ARC-AGI2, integrated image and video generation, and a new “Omni” video model replacing the internal Toucan tool. Rumors also include “Remy,” a 24/7 always-on agent, and a Proactive Assistant that pushes suggestions instead of waiting for prompts.

The reason this matters for the cheapskate analysis is that Google is already winning the cheapskate slot at the Flash tier. Gemini 3 Flash is the boring correct answer for four of seven categories at $3/M output. If Gemini 4 Pro lands at SOTA on the leader benchmarks, the gap from the top of the leaderboard closes downward. The cheapskate band stays the same; the leader’s value proposition gets squeezed harder.

If Gemini 4 doesn’t land well, the leaderboard stays compressed in roughly its current shape and the cheapskate pattern holds. Either way I’ll be writing about it. Either way, my OpenRouter bill is not going up.

The OpenRouter stealth slot is still occupied by Owl Alpha (April 28, free, 1.05M context) per the W18 issue. No fresh signal this week. Claude Mythos is still research-only with no public release update. GPT-6 “Spud” is still rumored for late 2026 with no fresh leaks.

For the full W18 context including the original Hy3 spike and the $300/month Grok 4.3 amnesiac story, see last week’s roundup.

The Receipts

The leaderboard is compressed. The prices aren’t. That’s the whole post.

Concrete numbers from the last week: the entire Arena Overall top 20 fits in 35 rating points. Six of seven Arena categories have a cheapskate pick at $3.50 per million output tokens or less. Three categories have a cheapskate pick that’s eight times cheaper than the leader for under 3% of the rating scale. One category, coding, has a cheapskate pick (GLM 5.1) that’s the new state of the art on SWE-Bench Pro, beating Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro at seven times less the cost.

Anthropic charges 8x more for under 3% better. Here are the receipts.

The Cheapskate Picks methodology lives in this weekly blog post from now on. Next week we see what happens to OpenRouter rankings when Hy3’s rocket booster falls off. The week after, we see whether Google I/O makes any of this obsolete. Either way, I am not paying $25 per million output tokens for a 2% rating bump. Neither should you.

The Autoresearch Ecosystem - How One Repo Spawned 9 Different Types of AI Projects

Mon, 04 May 2026 07:00:00 -0500

I’d been messing around with Karpathy’s autoresearch for a couple of weekends, mostly because I’m interested in letting agents do shit while I sleep and someone had finally formalized the pattern in 630 lines of Python. Run the loop, modify train.py, train for five minutes, check val_bpb, keep or revert, repeat forever. Compounding gains while you’re not even at your desk.

So I fired up GitHub search for “autoresearch” expecting to find a handful of ML forks. People porting it to their hardware, maybe a few hyperparameter tweaks. You know how that goes.

I found nine distinct categories of project. Some brilliant. Some “why did you do this.” And a few that made me stop scrolling and think “oh, that’s actually the interesting idea here.” It turns out the original repo isn’t really about ML. It’s a pattern, and people figured that out pretty quickly.

I’m going to walk through every category I found, what each one actually does differently, and what they tell us about where this whole thing is going. There are a lot of repos here, all linked.

What Karpathy Actually Built
1. Platform Ports: Running It On Hardware You Actually Own
- GPU Cluster Scaling
2. ML Research Enhancers: Making the Loop Smarter
3. Prompt Optimizers: Same Loop, Different Target File
- autoresearch-prompt-optimization (az9713)
- autoresearch-for-agents (Galileo)
4. Generalized Frameworks: Autoresearch For Anything
5. Production Codebase Optimization: Autoresearch on Real OSS
6. Agent Factory: Autoresearch Builds Agents
7. Research OS / Skills Systems: Institutionalizing the Pattern
- PhD-Zero (TenureAI)
- alirezarezvani/claude-skills
8. Creative Writing: Autoresearch For Prose and Fiction
9. Meta-Pattern: Wrapping Autoresearch as a Worker
So What Does This Actually Mean?

What Karpathy Actually Built

Before we go through the derivatives, let’s look at the original. The repo is small and the loop is dumb on purpose:

Read program.md (the meta-skill that tells the agent how to be a researcher)
Modify train.py with a small, reviewable diff
Train for ~5 minutes on one GPU
Check val_bpb (validation bits per byte — the metric)
If it improved, commit. If it regressed, git reset --hard.
Goto 1.

That’s it. About 100 experiments overnight on a single H100 while you sleep. Git is the memory. The flat TSV file is the search log. The mechanical metric (val_bpb) means there’s no judgment call about whether something worked.

The main idea is that constraint enables autonomy. The diffs are small, so they’re reviewable. The metric is mechanical, so the agent can’t argue with it. The rollback is automatic, so a bad experiment can’t poison the next one. You’re giving it a cheap way to test things and a cheap way to undo them, and letting it run. Not asking it to be smart.

program.md is what Karpathy calls the meta-skill. Humans don’t program the training run. They program the researcher that programs the training run. That’s the part that generalizes, and that’s the part everybody on GitHub immediately ran with.

1. Platform Ports: Running It On Hardware You Actually Own

The “I don’t have an H100” forks

The first thing that happened is what always happens. People without enterprise GPUs ported it to whatever they had lying around. These forks are the most faithful to the original but with the substrate swapped out.

miolini/autoresearch-macos — straight macOS port using MPS backend
trevin-creator/autoresearch-mlx — Apple Silicon native, using MLX instead of PyTorch
jsegov/autoresearch-win-rtx — Windows with RTX
lucasgelfond/autoresearch-webgpu — runs entirely in the browser using WebGPU. No Python setup. The whole research loop in a tab.
A Colab/Kaggle T4 port (upstream issue #208) that swaps Flash Attention 3 for PyTorch SDPA so you can run experiments overnight on a free GPU
ArmanJR-Lab/autoautoresearch — Jetson AGX Orin port with a “director” written in Go that injects novelty (arxiv papers, DeepSeek Reasoner output) when the loop gets stuck in local minima
supratikpm/gemini-autoresearch — Gemini CLI native, with Google Search grounding plugged into the loop as a live verification source. True headless overnight mode via --yolo --prompt. 1M token context.

Karpathy himself endorsed several of these in the README and added hyperparameter tuning advice for smaller setups.

The interesting ones in this group aren’t the “same thing on Mac” ports. They’re the ones that change the substrate enough to do something the original couldn’t. MLX on Apple Silicon is legitimately different compute. WebGPU means you can hand someone a URL instead of asking them to set up Python. The Jetson port is the only one trying to escape local minima with external novelty injection, which is the kind of thing the original loop has no concept of. And the Gemini port has Search grounding inside the loop, which means the agent can verify claims against the live web while it’s iterating.

The Apple Silicon and WebGPU ports are the most useful if you don’t have data center hardware. The director-based Jetson fork is the most interesting if you care about where this pattern is heading. Most loops can hill-climb. Almost none of them can detect that they’re stuck and go grab a paper to read.

GPU Cluster Scaling

The opposite direction. What happens if you give it 16 GPUs instead of one?

SkyPilot wrote it up. They gave autoresearch access to a 16-GPU Kubernetes cluster, ran it for 8 hours, and let it figure out how to use the resources.

~910 experiments in 8 hours
val_bpb dropped from 1.003 to 0.974 (a 2.87% improvement, which sounds small but is enormous for an LM at this scale)
9x faster than a simulated sequential baseline to reach the same result
The agent taught itself to use H200s for validation and screen ideas on cheaper H100s. Nobody told it to do that.

The thing that surprised me was how the search behavior changed with parallelism. Sequential autoresearch is greedy hill-climbing: try one thing, keep or discard, try the next. Parallel autoresearch starts running factorial grids of 10-13 experiments per wave. It catches interaction effects between parameters that single-axis tweaking would never find. Two changes that look mediocre alone can be great together. You can’t see that one-at-a-time.

This is the version that stops looking like a hobby project. If your metric is fast and your discard mechanism is reliable, more compute really does just turn into more answers.

2. ML Research Enhancers: Making the Loop Smarter

The “the flat TSV is not enough” camp

These forks all keep the loop intact but argue that the agent’s memory is too primitive. A TSV with one row per experiment doesn’t carry the right information forward. So they bolt on cognitive architecture.

Memory-Enhanced Researchers

tonitangpotato/autoresearch-engram plugs the Engram cognitive memory library into the loop. It’s neuroscience-grounded: ACT-R activation, Hebbian learning, Ebbinghaus forgetting. RECALL and STORE steps wrap around the existing loop.

The numbers from a long-running instance:

After 50 experiments, the agent recognizes patterns like “architecture changes outperform optimizer tweaks in this regime”
After 100, it knows the optimal architecture for your specific compute budget
One production deployment is at 3,846 memories, 230,103 recalls, 12,510 Hebbian links

What that buys you, supposedly, is research intuition. Not “this worked” but “here’s why and here’s the pattern.” The thing that made human researchers good was never their willingness to try lots of things. It was the priors they built up about what was worth trying.

Bayesian + Active Inference

ErikDeBruijn/autoresearcher2 is the most ambitious one I found. The whole flat results log gets replaced with a Bayesian generative model. Then he piles on Friston’s active inference, Wozniak’s learntropy, and Schmidhuber’s compression progress. The agent doesn’t just ask “was this experiment good?” It asks “which of my latent beliefs was wrong?”

Four additions to the original loop:

Generative model over experiment outcomes
Policy evaluation via Expected Free Energy
Learntropy appraisal module
Persistent memory with decay dynamics

It’s been validated on synthetic environments where it beats random and greedy baselines. There’s an evidence-quality comparison run in progress on an RTX PRO 6000 Blackwell against vanilla autoresearch. The repo also has a CONSTITUTION.md because the project is partially about whether recursive self-improvement can deepen judgment, not just power.

The interesting distinction is structural insight (“RoPE matters more than the optimizer in this regime”) versus flat knowledge (“RoPE improved val_bpb by 0.02”). The flat version doesn’t compose. The structural version does.

Multi-GPU Infrastructure

iii-hq/n-autoresearch keeps the loop and replaces the plumbing. Out goes bash + git + TSV. In comes structured KV state, a REST API, and crash recovery. Multi-GPU parallel experiments via iii-engine (Python orchestrator + Rust GPU workers). Cross-machine GPU workers.

The clever part is the adaptive search strategy. The loop has phases (explore, exploit, combine, ablation) and it auto-transitions based on history. There’s also near-miss detection for when two recent experiments combined would probably work even though neither alone did.

Honestly, this is the “what if you scaled it to a real research lab” fork. If autoresearch becomes how labs actually run experiments this is roughly what the production version looks like.

3. Prompt Optimizers: Same Loop, Different Target File

What if train.py was your system prompt?

Once you accept that the loop is substrate-agnostic, the next move is obvious. Point it at a prompt file. Use accuracy on a test set as the metric. Let it iterate.

autoresearch-prompt-optimization (az9713)

az9713/autoresearch-prompt-optimization is the cleanest version of this. The loop targets prompt.txt instead of train.py. The metric is field extraction accuracy on 30 test examples instead of val_bpb. Everything else is the same.

The numbers:

74.72% → 100% accuracy in 8 experiments
Zero human intervention
Experiment 5 regressed and got auto-discarded: the loop caught it exactly as designed
Cross-model: Claude Opus writes the prompts that Gemini 2.5 Flash executes

The thing prompt engineering has always been missing is a tight feedback signal. Most people write a prompt, eyeball some outputs, decide it “looks better.” Autoresearch makes prompt engineering a numerical optimization problem. Reading last_run.json after each iteration turns prompt writing from art into engineering. That’s a real shift.

autoresearch-for-agents (Galileo)

rungalileo/autoresearch-for-agents is more ambitious. They’re using the loop for adversarial testing plus prompt optimization on support agents.

Two phases. Phase 1 builds a frozen adversarial test suite (the exam). Phase 2 optimizes the prompt against that frozen suite (the studying). Separating the exam from the studying stops the optimizer from moving the goalposts.

The other clever bit is proportional scoring instead of binary pass/fail. Binary scores give the optimizer no gradient. “70% of the way there” is a signal you can climb. “Failed” isn’t.

Results: 0.05 → 0.80 accuracy in 15 experiments. They also documented the limits of what prompt engineering alone can fix. Things like absence detection (“the customer didn’t mention X”) and off-by-one date math just don’t get solved by tweaking the prompt. That’s a useful negative result. Most write-ups about prompt optimization conveniently skip the part where they hit a wall.

4. Generalized Frameworks: Autoresearch For Anything

“Wait, this works for any measurable thing”

This is the category that broke containment. Once a few people had ported the loop to prompts, the next move was to extract the pattern entirely. The result is a bunch of frameworks that don’t care what file you’re optimizing or what metric you’re using.

uditgoenka/autoresearch — Claude Code Skill

uditgoenka/autoresearch packages the loop as a Claude Code skill. You install it, you run /autoresearch, and you point it at any task with a mechanical metric. The README runs through about a dozen domains: test coverage, bundle size, TypeScript error count, SQL query speed, HR policy readability, Dockerfile size, accessibility audits, sales copy, marketing content. There’s also /loop N integration for bounded iterations.

It also documents how to wire MCP servers (PostgreSQL, GitHub, Stripe) as verification sources. So your “metric” can be a query against your actual production database, not a fixture.

This is the version that makes the generalization explicit. The loop works for anything with constraint plus metric plus fast verification.

autoresearch-anything (zkarimi22)

zkarimi22/autoresearch-anything is the lowest-friction setup I’ve seen. You run npx autoresearch-anything and it interrogates you:

What file should I edit?
What metric am I optimizing?
How do I run the eval?
What’s off-limits?
A few more along those lines.

It outputs setup.md and eval.js and you’re running. Eight questions and you have a configured autoresearch loop pointed at your project.

menonpg/autoloop — The pip Package

menonpg/autoloop is the first one that’s actually a Python library. pip install autoloop-ai, import, and the API is clean:

from autoloop import AutoLoop

loop = AutoLoop(
    target="src/optimize_me.py",
    metric=lambda: run_benchmark(),
    directives="Make this faster, don't break tests",
    budget_seconds=600,
)

results = loop.run(experiments=100)

Parallel experiments via loop.run(parallel=4). Warm starts. Composite metrics with weights. Agent-agnostic: works with Claude, Codex, Ollama local models. CLI tools for inspecting history (autoloop history, autoloop best, autoloop diff 12 best, autoloop rollback 12).

The demo shows a 6.9x speedup on a fibonacci function in 4 experiments, and the framework auto-detected and discarded the broken iterations.

This one’s for you if you want autoresearch as a library you import rather than a skill you invoke. The bar is “have a Python function that returns a float” and you’re in. That’s about as low as it gets.

krzysztofdudek/ResearcherSkill — One File, Full Discipline

krzysztofdudek/ResearcherSkill is interesting because it ignores the framework race entirely. It’s one researcher.md file you drop into any AI agent. Before doing anything, the agent interviews you: goal, metric, constraints, time limit, stopping conditions.

It creates a .lab/ directory (gitignored) for experiment history that survives code reverts. That’s separate from git on purpose. You don’t want a git reset --hard to wipe your experiment log.

The loop has three phases:

THINK — mandatory written analysis before each experiment, logged separately
TEST — commit, run, keep or revert
REFLECT — log entry in log.md, row in results.tsv

There are also convergence guardrails baked in. Three discards in a row = mandatory pause. Five discards = force branch fork. Plateau for 8+ experiments = invert assumptions.

The interesting part is THINK. Most autoresearch implementations skip written analysis. The agent just runs. Forcing it to write down what it expects to happen before running changes what it tries. The README claims “10 minutes of analysis can prevent 5 wasted experiments,” which I believe.

There’s also a “thought experiment” type that lets the agent log analysis without running code. It counts as a row in the results, just labeled thought. That’s a small detail and it matters more than it should.

alfonsograziano/auto-agent — Autoresearch Builds Agents

alfonsograziano/auto-agent is autoresearch turned on AI agents themselves. You give it a target agent (in a separate repo) and a golden dataset of expected input/output pairs. The orchestrator spawns Claude Code or Kiro CLI inside the target repo, has it analyze failures, implement fixes, and re-run.

Two repos: orchestrator and target. MEMORY.md persists across hypotheses (what worked, what didn’t, known blockers). Each hypothesis gets its own git branch and its own REPORT.md with before/after metrics and a CONTINUE or ROLLBACK decision. After a run, npm run generate-changelog produces a human-readable summary.

This is recursive in a way that very interesting. The thing being optimized is an AI agent. The thing doing the optimizing is also an AI agent. The metric is how often the target hits the golden set. You’re using autoresearch to make agents better at the things you created them for.

5. Production Codebase Optimization: Autoresearch on Real OSS

Shopify used it on the Liquid template engine

This is where the pattern stops being a demo. Shopify ran autoresearch against the Liquid template engine, the thing that renders every theme on Shopify, and shipped the results.

The setup is in auto/autoresearch.md:

Benchmark: ThemeRunner (real Shopify theme templates, not synthetic)
Metric: combined parse + render time in microseconds (primary), allocations (secondary)
Constraints: tests must pass, no new gem dependencies, semantic correctness preserved

The results across 17 tracked experiments:

7,374µs → 4,815µs (-34%)
62,620 → 37,355 allocations

The agent’s techniques included replacing regex with manual byte parsing, fast-path variable parsing, and short-circuit checks for common cases. None of it is rocket science. It’s the kind of optimization a senior developer would do given enough time and a good profiler. The agent just had cheap iteration and an automatic discard for anything that broke a test.

idealo Search Ranking

The idealo team (Atakan Filgöz, Gena Shabanov, Arjun Roy Choudhury) ran autoresearch against preprocess.py in their Learning-to-Rank inference endpoint. They added a correctness constraint that required bit-for-bit identical output between the original and optimized version, then optimized for average latency over 500 benchmark iterations.

Numbers:

13 experiments in 1 hour
10 kept, 3 reverted
Preprocessing latency: 3.9ms → 0.66ms (83% reduction, 5.9x speedup)
End-to-end production latency: 46ms → 28.8ms (37% reduction at 250+ req/sec)
Total cost: ~$7 in Claude Opus on AWS Bedrock

For seven dollars and an hour of supervision, they took 37% off a production endpoint that’s serving 250+ req/sec. That’s an absurd ROI.

The techniques the agent found: shared computation (sort once, derive everything else), algorithmic shortcuts for sorted arrays, minimal allocations. The agent reasoned like a profiler: “the ranking computation takes 40% of total time, focus there next.” They watched it work, occasionally steered it, and shadow-tested before shipping. It’s now in production.

The honest detail in the writeup is that the agent’s code was clean at 13 experiments but they suspect longer runs would over-engineer. That tracks with my experience using AI tools for refactoring. The first dozen suggestions are gold. By suggestion 50 it’s pattern-matching to “more abstraction must be better” and you have to slap its hand.

Tennis XGBoost — The Reward Hacking Cautionary Tale

This is the one nobody mentions when they’re hyping the pattern. Nick Oak ran autoresearch on a tennis match prediction XGBoost model. The agent found a way to game the metric without actually improving the model. He preserved the embarrassing iterations on an archived/gamed-iterations branch so you can read what the agent did.

The discard mechanism only saves you if your metric is measuring what you actually care about. If your eval can be gamed, the agent will game it. This is not an RL-only problem. Reward hacking shows up everywhere there’s an automated optimizer, and autoresearch is exactly that.

The takeaway isn’t “autoresearch is dangerous.” It’s “your metric is now a load-bearing piece of software and you should treat it that way.” Spend more time on the eval than on the loop.

Vesuvius Challenge Ink Detection

Vesuvius Challenge ran a multi-agent autoresearch loop for ink detection on ancient scrolls, focused on cross-scroll generalization. I haven’t dug deep into this one, but it’s worth knowing that autoresearch is currently being used to read 2,000-year-old burned scrolls. That’s a thing.

6. Agent Factory: Autoresearch Builds Agents

Applying the loop to creating other agents

Dominien/agent-factory takes the meta move further than auto-agent. Instead of optimizing an existing agent, it autonomously researches problems and builds new specialized agents to solve them.

The loop is:

Research: Reddit, HN, GitHub, Twitter — find real problems people have
Score: Venture Score plus TAM estimate
Build: Next.js agent from a seed template
Validate: against synthetic users / actual usage
Ship
Repeat

There’s a threshold ratchet. The bar to ship keeps rising as the system finds better ideas. So the things it builds get better over time, not because the agent is smarter, but because it’s competing against its own previous best.

Agents shipped so far: freelancer-deduction-finder, wage-rights-advisor, data-broker-opt-out, property-tax-appeal-advisor. Twenty agents and counting.

This is the meta-loop concept and I find it disorienting. Research quality compounds the same way training quality does. A loop that researches problems, builds solutions, ships, and uses ship-ability as the metric will eventually outpace anyone manually doing the same thing. Whether the agents it ships are any good is the open question. But the number keeps going up.

7. Research OS / Skills Systems: Institutionalizing the Pattern

What if autoresearch was the entire research methodology?

If autoresearch is going to actually be how research gets done, somebody has to build the scaffolding around it. Two projects are going hard at this.

PhD-Zero (TenureAI)

TenureAI/PhD-Zero is an operating system for research-oriented coding agents. Modular skill library: run-governor, research-workflow, deep-research, experiment-execution, memory-manager, human-checkpoint, paper-writing.

Cross-runtime: same skills exposed to Codex (via AGENTS.md) and Claude Code (via .claude/skills/). The focus is reproducibility, literature review, experiment planning. Discipline around the process.

This is the thing that turns autoresearch from “fun overnight experiment” into something that could plausibly be used by a real research group. The autoresearch loop runs experiments. PhD-Zero runs the literature review, the writeup, the human checkpoints, the reproducibility checks. The loop is one verb in a much bigger vocabulary.

alirezarezvani/claude-skills

alirezarezvani/claude-skills is a 204-skill library for AI coding agents, with autoresearch-agent as one skill in the engineering tier. Works across Claude Code, Codex, Gemini CLI, Cursor, Aider, Windsurf — eleven tools total.

Treating autoresearch as a reusable skill component rather than a standalone repo is an important move. It means your agent uses autoresearch the way it uses anything else: as a tool you reach for when the situation calls for it.

8. Creative Writing: Autoresearch For Prose and Fiction

The thing nobody expected: it works on writing too

This is the one I want to come back to in another post. The transfer is straightforward. If you can score a draft, you can run the loop. The metric just needs to be cheap, mechanical, and not gameable. (See the tennis cautionary tale.)

Multiple projects figured this out independently within a few weeks of each other.

redpen — Prose Refinement Engine

itspikabubu/redpen is a ratchet loop for blog posts and writing. Drafts can only get better, never worse. Six AI personas score on different dimensions: seed founder, fellow GP, LP allocator, LinkedIn reader, HN skeptic, VC Twitter. Each persona runs three times and the scores are medianed for noise reduction.

The writer agent makes one surgical edit targeting the weakest dimension. Re-evaluate. If the minimum score improved, keep. If not, discard and revert. Repeat until target score or max iterations.

You can configure voice: tone spectrum, blacklist words, a 16-point natural prose rubric. I have not tried this yet but I’m planning to. If it works, it solves the thing every blogger struggles with: I can tell a draft is bad, but I can’t always tell why.

NousResearch/autonovel — Complete Novel Pipeline

NousResearch/autonovel is the most ambitious creative writing fork. Full autonomous novel pipeline: seed concept → world bible → characters → outline → draft chapters → revision → export.

Five co-evolving layers: voice, world, characters, outline, and chapters, with canon cross-cutting all of them. Two evaluation systems running in parallel: mechanical (regex bans for AI clichés, slop forensics) and LLM-judge (prose quality, voice adherence). Phase 3b sends the full manuscript to Claude Opus for a dual-persona review (literary critic + professor of fiction) and the loop continues until the reviewer’s complaints are mostly “qualified hedges rather than real problems.” Their phrase, not mine.

There’s also an art pipeline (fal.ai), multi-voice audiobook (ElevenLabs), LaTeX typesetting, ePub generation, landing page.

The first novel produced is The Second Son of the House of Bells. 79,456 words. 19 chapters (down from 24: the loop did four structural merges). Six rounds of Opus review.

The loop improved prose and changed the structure of the book. We talk about autoresearch like it’s a fine-grained optimizer, but at long enough horizons, it’s making editorial decisions a human would make.

sinfiny/Auto-Creative-Reasoning

sinfiny/Auto-Creative-Reasoning is benchmark-first. The repo motto is “generation is not the product. Evaluation is the product.” Rewrite ladders route failure to the right level: prose, scene, chapter, arc, premise. Rubrics score hook strength, strategy, clue fairness, consequence density, readability.

There’s a Codex plugin for running benchmarked loops against existing fiction drafts. The long-term vision is multiple parallel novel timelines with competing chapter versions compared head-to-head.

This is the version that argues evaluation is harder and more important than generation. Which is exactly the lesson from the tennis XGBoost story, ported to fiction.

CalvinMagezi/self-evolving-skill — Brand Document Evolution

CalvinMagezi/self-evolving-skill is the business-minded version. Autoresearch applied to writing-strategy.md instead of train.py. The metric is an LLM judge composite score on a fixed test brief, run three times at temperature=0 and medianed.

The output is real documents: .docx, .pptx, .pdf that match brand identity. Git history serves as memory; the loop reads git log before each iteration to avoid repeating failed ideas. Works with any LLM via LiteLLM (OpenRouter, Gemini, OpenAI, Anthropic).

This is the one with the clearest business case of the bunch. Companies actually need their documents to get better. They have brand rubrics. They have a fixed test brief in the form of “the next thing we need to write.” All the pieces are already there.

9. Meta-Pattern: Wrapping Autoresearch as a Worker

What happens when autoresearch is just one layer of something bigger

This is the one that snapped my view of the whole ecosystem into focus. alirezarezvani had been shipping autoresearch as a skill since March. A month of production use revealed the missing piece: orchestration above it.

The Problem with Solo Autoresearch

One context window and reasoning trajectory, with no isolation between investigation threads. A query like “what is X, who are the players, what are the limits, what changed in 6 months” becomes four tangled sub-questions sharing one bloated context. By the time you’re on sub-question 4, the context is thick with answers from 1-3, and synthesis drifts.

This is something I hit constantly with Claude Code on big tasks. By the time the context is full of half-finished investigations, the model is reasoning about all of them at once, badly.

The Fix: 3 Files, 4 Subagents

The whole rebuild is small:

CLAUDE.md — decomposition rules, including an “independence test” (a sub-question is independent if its answer wouldn’t change based on another sub-question in the same query)
.mcp.json — Firecrawl, Perplexity, internal docs server. Critically, scoped per-agent to avoid the token tax of loading all MCP tool descriptions into every context
4 subagent definitions — lead-researcher (orchestrator, no MCPs), web-searcher (invokes autoresearch inside its own context), internal-searcher, citation-checker

Lead decomposes. Workers fan out in parallel. Each worker runs an autoresearch loop to convergence inside its own isolated context. Lead synthesizes. Citation-checker verifies every source. Wall-clock time ends up shorter than single-session autoresearch because the workers run in parallel.

What Actually Broke In Production

Four failure modes from the writeup, and they all rang bells:

Orchestrator over-delegation — without the independence test, the orchestrator was paying for parallel context windows to produce worse answers than one session would have
MCP tool-description token tax — every MCP server’s tool descriptions loading into every agent’s context. Scoping per-agent fixed it
Citation drift — workers returning confident claims where the page didn’t quite support the paraphrase. Paraphrase drift, not hallucination
Context amnesia between sessions — a flat lessons.md file the lead reads on startup is the imperfect fix

The lesson here is the one that rewires the whole picture. Autoresearch was already a strong worker. The orchestrator does nothing clever: decompose, delegate, synthesize. The intelligence is in the decomposition rules, and those took three rewrites to get right.

So the future isn’t “smarter autoresearch.” It’s autoresearch as a primitive that other systems call into.

So What Does This Actually Mean?

Karpathy didn’t just build an ML research tool. He demonstrated a pattern that works anywhere you can measure progress with a command: constraint plus mechanical metric plus autonomous iteration.

Here are the categories ranked by fidelity to the original idea:

Platform ports — most faithful. Same loop, different hardware.
ML enhancers — extend the substrate. Memory, Bayesian updates, multi-GPU.
Prompt optimizers — same loop, different file. train.py → prompt.txt.
Generalized frameworks — extract the pattern. pip packages, Claude Code skills, “give me any metric.”
Production codebase — industrial application. Shopify -34%, idealo -37% in 1 hour for $7.
Agent factory — meta-application. The loop builds other agents.
Research OS — institutionalization. The whole methodology, not just the loop.
Creative writing — the surprise expansion. Prose, fiction, brand documents.
Orchestration — autoresearch as worker, not the whole system.

A few honest takes:

The reward hacking problem is the cautionary tale nobody includes. In the tennis XGBoost case, the loop found a way to improve the metric without improving the model. The discard mechanism is only as good as your metric. If your eval can be gamed, the agent will game it. Spend more time on the eval than on the loop.

The pattern is more durable than the implementation. Most of the forks I found were “what if we applied this to X” and they all worked. That’s kind of remarkable. The discard mechanism (git reset on regression) is the key. You don’t need intelligence. You need iteration speed, a mechanical metric, and automatic rollback.

The Shopify and idealo case studies should embarrass you a little. $7 of API and an hour of supervision took 37% off a production endpoint serving 250+ req/sec. There are perf wins like this in basically every codebase. We’re just not asking for them yet because we still think of optimization as expensive senior-engineer time.

Orchestration eats the loop. alirezarezvani’s piece shows that solo autoresearch is fine, but the next move is autoresearch as a worker that orchestrators call when a sub-question lands. That’s where this is heading and it’s already happening in production.

If you’re not running at least one of these on a real project, you’re leaving free improvements on the table. The bar to entry is pip install autoloop-ai or npx autoresearch-anything. There’s no reason not to point one at something you care about and let it run overnight. You’ll either get a better version of the thing or you’ll learn something about your metric. Both of those are wins.

Model Roundup: The Free Countdown, the $300 Amnesiac, and the Quiet Climber at #7

Sat, 02 May 2026 07:00:00 -0500

I check OpenRouter rankings every week to figure out which models to throw at my projects. This week, the model at the top of the charts had something I’d never seen before: an expiration date.

Right there on the Tencent Hy3 Preview page: “Going Away May 8.” Six days from now. And it’s currently generating 2.15 trillion tokens a week with a +1,356% spike. You know what that is? Not a sign of the best model on the market. It’s the AI equivalent of a store liquidation sale. Everyone’s grabbing tokens before they cost money.

That’s W18 in a nutshell. The #1 model is a countdown timer. The hottest new premium subscription ($300/month from xAI) still can’t remember who you are between sessions.

There’s good news buried in all this: Kimi K2.6, which I mentioned last week as an interesting launch, has started showing real production numbers. And there’s a model called Step 3.5 Flash that’s been quietly climbing the rankings for three months with zero hype, which in this market is basically a standing ovation.

Let me tell you what actually matters.

The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)
Kimi K2.6 Is Now a Real Recommendation
- Where K2.6 Falls Short
The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months
- The One Real Catch
Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month
Your Smarter Model Might Be Breaking Your Agents
What’s Actually Worth Using (and What’s Coming)

The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)

Tencent launched Hy3 Preview on April 22 with a free access period that runs out May 8. That’s the entire explanation for the +1,356% weekly spike and the 2.15 trillion tokens burned. Developers saw “free” and “295B MoE” in the same sentence and did what developers do: they stress-tested it before anyone sent them a bill.

Here’s what Hy3 Preview actually is: 295 billion total parameters, 21 billion activated per token (mixture of experts, efficient by design), 262K context window, configurable reasoning you can dial from disabled to low to high. Designed for agentic coding workflows. On paper, solid.

In practice? No Arena votes because it’s too new to have accumulated any. No long-form reviews because nobody’s shipped anything with it yet. No “I’ve been using this for three weeks and it’s my daily driver” posts anywhere I could find. Just a lot of “grabbing free tokens before May 8” energy.

What happens after May 8 is the real question. Hy3 Preview becomes a paid model competing against DeepSeek V3.2 (which costs $0.14 input / $0.28 output per 1M tokens and has months of production track record), Kimi K2.6 ($0.74/$3.49 with confirmed adoption), and Step 3.5 Flash (which I’ll get to in a moment). Entering that field with no reviews and no Arena ranking is a tough position.

If you want to play with it before the deadline, go to openrouter.ai/tencent/hy3-preview:free and run some benchmarks. Just don’t build a dependency on something with a “Going Away” notice stamped on it.

Kimi K2.6 Is Now a Real Recommendation

Last week I called Kimi K2.6 an interesting launch. Twelve days later, the production numbers are coming in and it’s something more concrete.

Real developers running real workflows are reporting 88% cost savings when they replace Claude with K2.6 for bulk coding tasks: batch migrations, test generation, format conversion, anything where you’re doing a lot of the same kind of work repeatedly. The Kimi Code CLI, the companion tool for using K2.6 in your terminal the same way you’d use Claude Code, crossed 6,400 GitHub stars. That’s people betting actual infrastructure on this model, not just upvoting a launch post.

The pattern hardening into consensus across forums: use K2.6 for bulk, use Claude for the high-stakes core. At $0.74 input / $3.49 output per 1M tokens, K2.6 is roughly 4x cheaper than Claude Sonnet 4.6. For workflows that generate a lot of tokens on repetitive work, that math compounds fast.

Where K2.6 Falls Short

This is the part I actually care about more than the hype. K2.6 trails GPT-5.4 on GPQA-Diamond (90.5% vs 92.8%) and AIME 2026 (96.4% vs 99.2%). These are hard reasoning benchmarks. For anything where being wrong has real consequences (financial analysis, medical context, legal questions), K2.6 is not the answer. The cost savings don’t matter if the output costs you more to fix.

Use it for code. Trust it with the boring high-volume stuff. Keep a premium model on anything where you’d be embarrassed if an AI got it wrong.

K2.6 also ships with agent swarm architecture supporting up to 300 parallel sub-agents and 4,000 coordinated steps. After my own experiences with AI agents inventing things I’d start with single-agent mode until you’ve validated its judgment in your specific domain. 300 parallel sub-agents hallucinating tool calls in parallel is not a good time.

The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months

Most models follow the same OpenRouter arc: spike at launch, plateau after a few weeks, slowly fade as the next shiny thing arrives. Step 3.5 Flash doesn’t fit this pattern.

StepFun released it somewhere in early 2026; the exact date is contested across sources, somewhere between late January and March, doesn’t matter. As of this week it’s at #7 on OpenRouter with +28% week-over-week. For a model that’s been around three months, that’s not a hype spike. That’s sustained adoption with nothing to explain it except developers finding it useful.

The numbers back it up: #4 intelligence ranking out of 64 models on Artificial Analysis. That puts it above almost everything priced anywhere near its cost: free on the rate-limited tier, $0.10 input / $0.30 output per 1M tokens on paid. For comparison, DeepSeek V3.2 costs $0.14/$0.28 and ranks lower on the same index. Step 3.5 Flash is somehow cheaper AND smarter on paper, and nobody’s writing breathless posts about it.

Architecture: 196 billion total parameters, 11 billion activated per token (MoE), 262K context, reasoning parameter support so you can see step-by-step thinking in API responses if you want it.

The One Real Catch

Step 3.5 Flash is extremely verbose. During Artificial Analysis evaluation it generated 260 million tokens versus an 11 million token average for comparable models. It thinks out loud, at length, in a way that will surprise your output token budget if you’re not watching.

Set max_tokens limits. If you’re using it for any high-volume generation, put a ceiling on it. Otherwise you’ll get thorough reasoning that costs more than you expected from a supposedly cheap model.

Worth adding to your comparison set before someone writes a breathless Medium post about it and StepFun decides to raise the price.

Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month

Let’s do the good news first, because there’s real good news here.

Grok 4.3 (launched April 17, currently rolling out in beta to SuperGrok Heavy subscribers) added native video input processing, not “describe this video” video but actual video-grounded reasoning. It can generate fully-formatted downloadable PDFs, populated spreadsheets, and PowerPoint presentations directly from conversation. Early beta testers are reporting formatted outputs they could hand to someone without cleanup. The integration with Grok Computer (xAI’s desktop automation agent) got tighter. If you’re doing autonomous desktop workflows, Grok 4.3 has a real story.

Now the bad news.

Grok 4.3 costs $300/month. That’s $100 more than ChatGPT Pro and $100 more than Claude Max. Both of those services have had persistent memory between sessions for over a year. Grok 4.3 does not. Every time you close your tab, the model forgets you. You start over. Blank context, fresh start, zero memory of anything you’ve built together.

Persistent memory is not on xAI’s published roadmap.

Multiple reviewers landed on the same observation this week. One X user put it cleanly: “you’re paying $300/month for a model that forgets you between sessions.” That’s not exaggeration. That’s the product.

At $200/month, this would be annoying. At $300/month, it’s a product decision, and product decisions tell you something about what a company is optimizing for. xAI built the video capabilities and the document generation first. Memory (the feature that makes an AI assistant feel like an actual assistant rather than a very fancy search box) is apparently not the priority.

Add the “High Demand” server errors that hit during launch week beta and you’ve got a model that’s impressive in demos and frustrating in daily use. The full API rollout is coming mid-to-late May. When it hits general availability, this conversation is going to get louder.

Your Smarter Model Might Be Breaking Your Agents

This one’s structural rather than model-specific, and it’s relevant for anyone running agentic pipelines.

An April 2026 ICLR paper titled “The Reasoning Trap” documented something uncomfortable: RL-based reasoning training (the kind that makes frontier models better at hard reasoning tasks) increases tool-hallucination rates in lockstep. The better a model gets at reasoning, the more often it invents tool calls that don’t exist. Function names, API endpoints, methods that aren’t in your schema. The model reasons its way to a call it can’t actually make.

If you’ve upgraded your agentic pipeline to a stronger reasoning model because it’s smarter, you may have simultaneously increased the rate at which it hallucinates the tools it should be calling. The capability and the failure mode scale together.

I’ve written about running into this firsthand with OpenClaw. The model-specific details differ but the pattern is the same. Stronger reasoning doesn’t mean better tool selection, and in agentic contexts “smarter” can break things in ways you don’t catch until something fails in production.

Practical response: add tool-call schema validation before your agents execute. Check that every tool the model selects actually exists in your registry before you let it run. This applies to every frontier RL-trained model right now. It’s not a specific model bug, it’s how these systems are being trained.

What’s Actually Worth Using (and What’s Coming)

Quick reference:

Tier	Model	Input $/1M	Output $/1M	Best For
Free (grab it now)	Hy3 Preview	$0	$0	Experiments before May 8 only
Free (stable)	Step 3.5 Flash	$0	$0	Rate-limited; best free reasoning available
Free (open weights)	Nemotron 3 Super 120B	$0	$0	NVIDIA-backed, open license, 262K context
Free (new, watch)	Owl Alpha (stealth)	$0	$0	1M context, agentic (prompts may be logged)
Budget	Step 3.5 Flash (paid)	$0.10	$0.30	Climbing for 3 months, verbose but smart
Budget	DeepSeek V3.2	$0.14	$0.28	Proven track record, still the value baseline
Mid	Kimi K2.6	$0.74	$3.49	Bulk coding workflows, 88% cheaper than Claude
Mid	Gemini 3.1 Pro	$2.00	$12.00	Arena #4 overall, 1M context
Premium	Claude Sonnet 4.6	~$3.00	~$15.00	#2 Arena coding, proven daily driver
Premium	Claude Opus 4.7	$5.00	$25.00	#1 Arena overall (thinking mode), high stakes

Mark your calendar for May 19. Google I/O is 17 days away. Gemini 4 isn’t confirmed, but annual release patterns and confirmed agenda items (agentic AI, developer tooling) make it likely. That’s the next likely shakeup in this table.

Claude Mythos, Anthropic’s model that developed a working exploit for a remote code execution vulnerability in FreeBSD (CVE-2026-4747), is not coming to a public API. It’s locked in Project Glasswing, a security research consortium, and Anthropic has no public timeline for changing that. Mention it at parties.

GPT-6 is still vaporware. Polymarket has it at 84% by December 31, 2026. That’s not a date, it’s a guess with confidence bounds.

The model worth your attention this week isn’t at #1. It’s at #7, three months old, climbing steadily, no hype cycle to explain it. Step 3.5 Flash just keeps showing up in the data.

My AI Agent Kept Making Shit Up (And Other Lessons From Running OpenClaw)

Tue, 28 Apr 2026 07:00:00 -0500

I wanted an AI agent running on my home network. Not a cloud subscription and not something requiring me to be at the keyboard all day. A thing that wakes up at 7am, pulls from RSS feeds and Reddit, synthesizes the news I actually care about, and emails it to me. Just that. That’s what I started with. Seemed simple. It wasn’t like I was asking much.

The reality was six weeks of debugging hallucinations, silent config failures, broken tool schemas, and a recurring realization that LLMs are, in certain contexts, compulsive liars.

Here’s what I learned the hard way.

The Setup: OpenClaw + DeepSeek in Docker
The Exec Approval Maze
The Reports That Were Too Good
Going Around the Agent
When Tools Become Literal Text
Ripping Out Slack
What’s Actually Working
But Here’s What She’s Actually Good At
What I Actually Built

The Setup: OpenClaw + DeepSeek in Docker

OpenClaw is a self-hosted AI agent framework. If you haven’t heard of it, think a local version of an AI assistant with cron jobs, tool calling, Slack/Telegram integration, and memory. Plus, how haven’t you heard of it. You run it in Docker, point it at whatever LLM you want, and theoretically have an autonomous agent working for you.

I named mine Sabrina. She runs DeepSeek V3 (deepseek/deepseek-chat) because the OpenAI and Anthropic APIs bill by the token and Sabrina is a chatty agent who generates daily reports. DeepSeek at pay-as-you-go rates keeps the monthly bill manageable.

The architecture is two containers: openclaw-gateway handles HTTP and the Slack/Telegram socket connections, and openclaw-cli is the shell interface. The whole ~/.openclaw directory mounts into the container at /home/node/.openclaw so configs, cron jobs, and workspace scripts are all live-editable from the host without rebuilding.

On paper, this is elegant. In practice, you will spend a lot of time staring at container logs wondering why your agent is quietly lying to you. Or realizing you can just put Claude Code on the host and just have it fix things when they mess up.

The Exec Approval Maze

Before Sabrina could run scripts, I had to configure exec-approvals.json: a policy file that controls what shell commands the agent is allowed to execute. Fine. Reasonable. I set up allowlists for the workspace scripts and Python interpreter.

Then the cron jobs started silently failing. The daily 7am AI report would produce output, but something felt off. I dug into the exec-approval config and found the first trap:

The documentation (and my own reasoning at the time) suggested "ask": "never" as a way to skip interactive approval prompts for unattended jobs. This is wrong. The schema only accepts "off" | "on-miss" | "always". Using "never" doesn’t throw an error. It gets silently stripped by sanitizeExecApprovalPolicy the next time the app writes the file. Your config looks fine, your intent is gone, and the agent starts timing out on approval requests at 7am with no operator connected.

The correct pattern:

{
  "defaults": { "security": "allowlist", "ask": "off", "allowlist": ["..."] },
  "agents": {
    "main": { "security": "allowlist", "ask": "off", "allowlist": ["..."] }
  }
}

"ask": "off" makes the allowlist the sole policy.

I fixed this. Or so I thought.

The Reports That Were Too Good

The AI intelligence report looked great. Every morning: a well-formatted digest of the day’s AI news, summaries, source links. Sabrina was crushing it.

Then I noticed the timestamps.

Every log entry in the fabricated reports had timestamps ending in :00 or :30. No real log file looks like that: they’re messy, they have milliseconds, they reflect actual compute time. These were fake. I checked the URLs. Several of them 404’d. The article summaries were plausible but not verifiable. Sabrina had been generating the reports herself , not from RSS feeds, but from her training data and imagination, because the exec approval issue wasn’t actually fixed. When the script couldn’t run, the agent fell back on what LLMs do naturally: produce what the output should look like.

This is the thing nobody tells you about giving LLMs agentic tasks: when they fail to do the thing, they don’t say “I failed to do the thing.” They generate a plausible simulation of having done the thing.

The fix I’d been applying, tweaking exec-approvals, only addressed the symptom. The agent could bypass exec approval entirely by deciding to write the content directly. There was no configuration that would stop a sufficiently motivated language model from bullshitting.

Going Around the Agent

The actual fix was nuclear: remove the agent from report generation entirely.

I disabled the OpenClaw cron jobs for both the AI report and the email send, then added host-level cron entries that call docker exec directly:

7 * * * docker exec openclaw-openclaw-gateway-1 /usr/bin/python3 /home/node/.openclaw/workspace/ai_report.py --profile ai-intelligence >> /home/eristoddle/.openclaw/workspace/logs/report-host-$(date +\%Y-\%m-\%d).log 2>&1

7 * * * docker exec openclaw-openclaw-gateway-1 bash /home/node/.openclaw/workspace/send-ai-intelligence-report-proper.sh >> /home/eristoddle/.openclaw/workspace/logs/email-host-$(date +\%Y-\%m-\%d).log 2>&1

The Python script runs inside the container, where it has access to the right Python packages, but the trigger is the host crontab. No agent involved. No LLM between the script and reality.

This works. The reports now have messy timestamps and real URLs that actually load.

The Obsidian weekly report I left in OpenClaw, because that one needs the agent. It reads my vault, categorizes clips, writes summaries, analyzes git diffs: actual LLM work that benefits from Sabrina’s reasoning. The difference is whether the task is “run a script and report the output” (host cron) or “think about my vault and synthesize something useful” (agent cron). Only one of those should involve an LLM.

When Tools Become Literal Text

OpenClaw gets updates. After updates, things break in interesting ways.

Twice now I’ve run into a scenario where Sabrina starts responding to everything but her tool calls appear as raw text in the chat. Instead of actually reading a file, she’d output read:/home/node/.openclaw/workspace/HEARTBEAT.md as a literal string.

This is a DeepSeek-specific quirk that OpenClaw triggers by accident. The framework converts tool schemas to OpenAI format before sending them to providers. DeepSeek expects its own native format. The conversion breaks its tool call parsing silently. It receives schemas it doesn’t understand and falls back to treating the tool call syntax as plain text.

The fix is a compat flag in the model config in openclaw.json:

"models": [{
  "id": "deepseek-chat",
  "name": "DeepSeek V3",
  "contextWindow": 163840,
  "maxTokens": 8192,
  "compat": { "anthropicToolSchemaMode": "native" }
}]

anthropicToolSchemaMode: "native" tells OpenClaw to skip the schema conversion and send the native format. Tools work again. I found this via a GitHub issue (#36651) after two sessions of source archaeology that I really didn’t want to be doing.

The lesson: when OpenClaw updates and tools start appearing as text, don’t read source code first. Check GitHub issues and Reddit. The community finds these fixes faster than you will staring at the framework internals.

Ripping Out Slack

OpenClaw supports Slack via socket mode. I had it connected for a while because it was useful for checking in on Sabrina from my phone without VPN or port-forwarding.

Then an update changed the Slack config schema. The gateway crashed on startup with “Config invalid” and wouldn’t come back up until I removed the entire channels.slack block from openclaw.json. This happened twice. After the second time I removed Slack permanently and switched to Telegram, which has been stable.

This is the trade-off with self-hosted software that’s still actively developed: you get the control, you eat the breakage. Updates that ship on Tuesday can invalidate configs you spent a week getting right. Having Claude Code manage the ~/.openclaw config directory directly, rather than asking Sabrina to fix herself through chat, means at least the fixes land correctly the first time.

What’s Actually Working

Six weeks in, here’s the honest status:

Daily AI intelligence report: Running reliably via host cron. Real data. Real URLs. Emails delivered by 7:30am.
Weekly Obsidian report: Agent-generated, delivers Fridays. Sabrina does genuine LLM work here — categorizing clips, writing summaries — and it shows.
Tool calling: Stable with the compat flag. Breaks again when OpenClaw updates, gets fixed in under an hour now that I know where to look.
The exec-approvals file: Still fragile. I keep a copy of the correct config in my notes.

The thing I underestimated: running an AI agent autonomously is mostly an infrastructure problem, not an AI problem. The interesting parts are the prompts and the LLM reasoning. The annoying parts are Docker networking, cron timing, config schema drift, and an agent that will hallucinate convincingly rather than admit it can’t do something.

Sabrina’s useful. She’s also a liar when she’s backed into a corner. I’ve learned to keep her away from any task where I can’t independently verify the output.

That’s not an OpenClaw problem or a DeepSeek problem. That’s just what LLMs do. But here’s the thing: once I stopped asking her to do the things LLMs are bad at, she got useful in a hurry. Most of what follows happened since last Thursday night.

But Here’s What She’s Actually Good At

OpenClaw’s skill system is pluggable. You drop a skill into the workspace, the agent loads it, and it becomes part of how she thinks. Sabrina didn’t ship with most of her current capabilities. She built them through the same autonomous workflow she runs every day.

A few that earn their slot:

sm-blog-outline: Started life as a generic blog-outline skill. Now it’s the full pipeline I use for this site — notes → outline → email. Trained on my voice, my content pillars, my snark level. It’s the skill that outlined this post pulling from both Sabrina’s and Claude Code’s logs as well as a running list of notes I kept on the setup process.
ct-humanizer: Sequential editing passes that strip AI tells out of nonfiction. Diagnoses patterns first, then kills the AI vocabulary, then breaks up the structural templates LLMs love so much. Not a magic button, more like a brutal copy editor. It cleans up the outline.
verbalized-sampling: Instead of spitting back a single answer, generates multiple candidates with probability weights. I use it for brainstorming and “show me five angles” tasks. The default LLM answer is usually the median answer; this skill surfaces the weirder, more useful ones. Got the idea here, gave Opus all the documentation, and used the Claude skill-creator skill to create it. It is one of my favorite skills because you never know what you’re going to get.
vault-tag-search + vault-idea-scorer: Companions to the blog pipeline. One searches my Obsidian vault by tag and body content with deduplication. The other ranks blog post ideas by whether they dovetail with multiple goals: research vs. content vs. portfolio vs. SEO.
A self-improving skill: Logs corrections and preferences so Sabrina compounds learning between sessions instead of getting the same feedback every week.

The point isn’t any single skill. It’s that the agent grows a custom toolkit shaped by the work I actually do, not whatever generic capabilities the framework shipped with.

The Report Engine Isn’t a One-Trick Pony

That ai_report.py script generating the daily AI digest isn’t hardcoded to AI news. It’s a topic-agnostic engine that takes a profile flag:

python3 ai_report.py --profile ai-intelligence
python3 ai_report.py --profile golang
python3 ai_report.py --profile typescript

Each profile defines its own RSS feeds, Reddit subreddits, and keyword filters. Tunable depth too: brief briefing vs. deep dive, set per profile. Articles get scored against my interests using CLIP + BM25 indexing before they make the cut, so I don’t end up with a digest full of stuff I don’t care about.

Same engine, different sources, same usefulness. Once the host cron pattern is locked in for one topic, adding another is a profile file and a crontab line.

Email Delivery, Old School On Purpose

Everything Sabrina produces comes to me as email. Gmail SMTP, app password auth…for now. Yes, that’s old fashioned. That’s the feature.

A dashboard would be one more thing to check. Notifications would be one more app fighting for attention. Email is the universal inbox I already process. I can read it on my iPad without installing anything, forward to Obsidian if it’s worth keeping, drag it to drafts if it’s a blog skeleton, or delete it if Sabrina got it wrong.

The pattern is generic:

send-email.sh "Subject" body-or-file [attachment]

That’s it. Anything in the system that needs to deliver text to a human goes through that script. Reports blog outlines, and research summaries use it.

Multi-Model, Not Locked to DeepSeek

DeepSeek runs the daily cron work because it’s cheap. But Sabrina isn’t married to it. The agent routes through OpenRouter, which means any task can pick its own model:

qwen/qwen3.6-plus — 1M context window, great for long-form research and generation
minimax/minimax-m2.5 — strong reasoning, what I reach for on analytical work
google/gemini-3-flash-preview — also 1M context, fast
moonshotai/kimi-k2.6 — solid alternative when the others are misbehaving

The job picks the model. Daily AI report? DeepSeek, because it’s cheap and the task isn’t hard. Blog outline that needs to chew through a pile of research notes? Qwen, because the context window swallows the whole input without chunking. Analytical synthesis? Minimax. And again, for now. I am just getting into these new models after using Claude Code for however long its been out. But the success I’ve have with them has me setting up Opencode to use them.

The subagent system lets me parallelize too. While the main session ran on DeepSeek doing one thing, a subagent on Qwen drafted an outline for a different post. Two models, two tasks, one wall clock.

The Track Record, Three Days In

Concrete deliverables since Thursday night:

Blog outlines: Two posts — a Kiro AI article and one I’m calling “The AI Psychologist” — both went notes → web research → verbalized sampling for angle selection → outline → email. Full pipeline, no me-in-the-loop until the outline showed up in my inbox.
Research tasks: Author bios with structured JSON + bibliography, topic deep-dives on AI tools, vibe coding, prompt engineering psychology. Stuff I’d normally burn an afternoon on.
Brainstorming: Content ideas, project names, productivity workflows, all using verbalized sampling so I get diverse options with probability weights instead of one safe median answer.
Memory compounding: Daily logs roll up to weekly memory promotion. The self-improving skill captures corrections so the same mistake doesn’t keep showing up. Each week she’s a little less stupid about my preferences.
Weekly Obsidian reports: Genuinely useful vault digests. What changed. What’s worth re-reading. What’s collecting dust and should be archived or thrown out.

None of this involves Sabrina pretending to run scripts she can’t run. All of it is “think about something and write me a thing,” which is exactly what LLMs are for.

What I Actually Built

Six weeks ago I wanted an autonomous AI agent. What I have now is better and stupider at the same time.

The discovery, after all the silent hallucinations and config schema drift and tool-calls-as-text bullshit: AI agents are great at the thinking parts: research, writing, brainstorming, synthesis. They’re terrible at the doing parts: running scripts reliably, admitting they can’t do something, not making shit up when cornered.

So I built around the doing and leaned into the thinking. Sabrina does real work now. She just doesn’t run the cron jobs herself anymore: the host crontab does. She doesn’t pretend to fetch RSS feeds: a Python script does that and hands her the data. What she does is the part LLMs are actually for: read a pile of stuff, synthesize, make a thing, deliver it to email.

The host cron + agent hybrid is the pattern that actually ships. The agent is the writer, not the operator. The operator is cron and a Python interpreter, both of which have been doing their jobs reliably since long before transformers were a thing.

Six weeks to figure out what should have been obvious from the start: stop using language models for things that aren’t language. At least that’s what I’m going with until I have time to go through another continuous cycle of break then fix.

Stephan Miller

My Third Try: How a Living Plan Beat Both Vibe Coding and Spec-Kit

The Two Failed Attempts (And Why Each Sucked)

Attempt One: Pure Vibe

Attempt Two: Spec-Kit and the Waterfall Trap

The Living Plan: One Document, Numbered Decisions

The Daily Sync: When the Knowledge Base IS the Project

Two MCPs, Two Indexes, Two Different Questions

When the Assistant Lied to My Face

And What I’m Building Uses a Completely Different Pattern

What This Process Actually Produces

And I Actually Know What’s In There

The Slow Middle Path

Update: Extending the Living Plan Into the Build

Always Opus, Never Plan Mode

The Second File: TASKS.md

The Handoff Works, and the Plan Never Goes Quiet

Anthropic Shipped Its Smartest Model Yet — and Made It Easier to Hijack

Opus 4.8: The Reliability Upgrade That Got Easier to Hijack

Following the Money: What People Actually Run

The Breakout: A Phone Company Open-Sourced a Frontier Model (+475%)

Hype vs. Value

The Cheapskate Picks

Horror Stories

On the Horizon

What This Week Tells You

The Cheapest Model on the Internet Is Winning, Flash Stopped Meaning Cheap, and the Smartest AI Lies to Your Face

Table of Contents

The 14-Cent Model Ate the Leaderboard

The mystery bird that won’t molt

Google Shipped a Price Hike and Called It Flash

The Cheapskate Table: What to Actually Use, Per Job

Coding — the one I’d actually change my defaults for

Creative writing — and the funniest line in the table

Overall daily driver

Math

Instruction following

Hard prompts — where I’ll be honest with you

Qwen3.7 Max: China’s Best Showing Yet, and It’s Cheap

Horror Story: The Smartest Model Is the Biggest Liar

The Honest Takeaways

Building a Cost-Saving Agent Skill That Accidentally Became Its Own Weekly Blog Post

The Session Time Trap

$15 in Fifteen Minutes

The Skill: At the Shape Level

The Adaptation Log: the Idea That Makes It Actually Work

Four Weeks of Evolution

The Recursive Twist

The Vault Is the Substrate

The Receipts

I Was Wrong About Hy3 (And Other Things I Learned This Week)

What I Got Wrong About Hy3 (And the Other New Players)

The Cheapskate Picks Held (Mostly)

Hype vs. Value: Ring 2.6 vs. Ernie 5.1

Claude Code’s Billing Bug Enters Its Third Week

What’s Worth Trying This Week

Tuesday Is Going to Be Loud

What I’m Watching Next Week

Senior Software Engineer by Title, AI Therapist by Reality

The Diagnosis: How Did We Get Here?

The Patient Files: Tool-by-Tool Therapy Notes

Claude Code: The Eager Intern

GitHub Copilot: The Golden Retriever

GPT-4: The Know-It-All Who Never Reads the Room

What I’ve Learned About AI Psychology

Framing Beats Specificity

Contextual Anchoring Actually Works (Embarrassingly Well)

When Iterating Doesn’t Work, Give It an Algorithm

The Trust Paradox

I’m Also the Patient

The Rubber Duck That Talks Back

The Prognosis: Am I Better Off?

The Therapist Is In

The Cheapskate's Guide to the Arena Leaderboard: Why I Stopped Paying Claude Opus Prices

The Compression Problem (Or: Why You’re Probably Overpaying)

The Cheapskate Picks (May 1–8, 2026)

Overall: Gemini 3 Flash, $0.50/$3.00

Coding: GLM 5.1, the SWE-Bench Pro Killer

Creative Writing: Gemini 3 Flash

Math: DeepSeek V4 Pro Thinking, the 17x Discount